Test Qwen (https://huggingface.co/Qwen) models with vLLM on AWS Neuron.
- Neuron SDK 2.20.1
- vLLM v0.6.1.post2
- transformers-neuronx: https://github.com/bevhanno/transformers-neuronx.git@release2.20 (thanks to @bevhanno for the contribution)
Verified models
- Qwen/Qwen2.5-0.5B-Instruct
- Qwen/Qwen2.5-Coder-1.5B-Instruct
- Launch an inf2.8xl instance and ensure the Neuron driver is installed.
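The driver check can be sketched as a quick look for Neuron device nodes (a minimal sketch; the `neuron-ls` tool from aws-neuronx-tools gives fuller detail):

```python
# Minimal sketch: confirm the Neuron driver has created device nodes.
# On an inf2 instance with a working driver you should see /dev/neuron0, ...
import glob

def neuron_devices():
    """Return the Neuron device nodes exposed by the driver, sorted."""
    return sorted(glob.glob("/dev/neuron*"))

devices = neuron_devices()
if devices:
    print("Neuron devices found:", devices)
else:
    print("No Neuron devices found -- install or reload the Neuron driver")
```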
- Get the pre-built docker image
docker pull cszhzleo/qwen2-vllm-neuron:v2.20.1
docker tag cszhzleo/qwen2-vllm-neuron:v2.20.1 neuron-container:qwen2
- Run the container. Before running the test, make sure any host path you map into the container exists.
docker run --rm --name neuron_vllm --shm-size=50gb \
--device /dev/neuron0 \
-p 8000:8000 neuron-container:qwen2 python3 -m vllm.entrypoints.openai.api_server --model=Qwen/Qwen2.5-0.5B-Instruct --tensor-parallel-size=2 --max-num-seqs=24 --max-model-len=1024 --block-size=1024
- Test
curl -X POST -H "Content-Type: application/json" http://localhost:8000/v1/completions \
-d '{"model":"Qwen/Qwen2.5-0.5B-Instruct","prompt": "tell me a story about New York city","max_tokens": 100, "stream":false}'
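The same request can be issued from Python with only the standard library; the endpoint and payload mirror the curl call above (this sketch assumes the server from the previous step is listening on localhost:8000):

```python
import json
import urllib.request

# Same payload as the curl example above
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "tell me a story about New York city",
    "max_tokens": 100,
    "stream": False,
}

def complete(url="http://localhost:8000/v1/completions"):
    """POST the payload and return the generated text (requires a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```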
- Build the image yourself: navigate to the artifacts directory, then log in to ECR so the base image can be pulled.
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
mkdir -p ~/artifacts/install
cd ~/artifacts/install
git clone https://github.com/vllm-project/vllm --branch v0.6.1.post2 --single-branch
# Copy the patched files into the vLLM source tree
cd ~/artifacts
cp arg_utils.py ./install/vllm/vllm/engine/
cp setup.py ./install/vllm/
cp neuron.py ./install/vllm/vllm/model_executor/model_loader/
# Build docker container
docker build -t neuron-container:qwen2 .
To reduce model download time, you can download the model to a local directory in advance. Here we use /home/ec2-user/environment/work/models
to store the model files and map that directory into the container.
The compiled NEFF files are saved in the compiled directory, so reusing them avoids repeated downloads and recompilation at runtime.
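The pre-download step can be sketched with the huggingface_hub package (an assumption here: `pip install huggingface_hub`; the local path follows the example above):

```python
# Sketch: pre-download the model into the host directory that will be
# mounted into the container, so the server starts without downloading.
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
LOCAL_DIR = "/home/ec2-user/environment/work/models/Qwen/Qwen2.5-0.5B-Instruct"

def predownload(model_id=MODEL_ID, local_dir=LOCAL_DIR):
    """Fetch all model files for model_id into local_dir."""
    # Deferred import: requires huggingface_hub (pip install huggingface_hub)
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=model_id, local_dir=local_dir)
```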
docker run --rm --name neuron_vllm --shm-size=50gb \
--device /dev/neuron0 -v /home/ec2-user/environment/work/models/:/models \
-p 8000:8000 neuron-container:qwen2 python3 -m vllm.entrypoints.openai.api_server \
--model=/models/Qwen/Qwen2.5-0.5B-Instruct --tensor-parallel-size=2 --max-num-seqs=1 \
--max-model-len=1024 --block-size=1024
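Once the server has compiled the model, the cached artifacts can be checked from the host. This sketch simply lists NEFF files under the mounted directory (the path matches the example above; the exact layout of the compiled directory may vary by setup):

```python
import pathlib

def list_neff_files(root="/home/ec2-user/environment/work/models"):
    """Return compiled NEFF artifacts under root (empty list if nothing compiled yet)."""
    root = pathlib.Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.neff"))

print(list_neff_files())
```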