Test Qwen (https://huggingface.co/Qwen) models with vLLM on AWS Neuron.
- Neuron SDK 2.20.1
- vLLM v0.6.1.post2
- transformers-neuronx: https://github.com/bevhanno/transformers-neuronx.git@release2.20 (thanks to @bevhanno for the contribution)
Verified models
- Qwen/Qwen2.5-0.5B-Instruct
- Qwen/Qwen2.5-Coder-1.5B-Instruct
- Launch an inf2.8xl instance and ensure the Neuron driver is installed.
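The driver check can be sketched as a quick look for Neuron device nodes (a minimal sketch; the `neuron-ls` tool from aws-neuronx-tools gives fuller detail):

```python
# Minimal sketch: confirm the Neuron driver has created device nodes.
# On an inf2 instance with a working driver you should see /dev/neuron0, ...
import glob

def neuron_devices():
    """Return the Neuron device nodes exposed by the driver, sorted."""
    return sorted(glob.glob("/dev/neuron*"))

devices = neuron_devices()
if devices:
    print("Neuron devices found:", devices)
else:
    print("No Neuron devices found -- install or reload the Neuron driver")
```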
- Get the pre-built docker image
docker pull cszhzleo/qwen2-vllm-neuron:v2.20.1
docker tag cszhzleo/qwen2-vllm-neuron:v2.20.1 neuron-container:qwen2
- Run the container. Before running the test, make sure any host path you map into the container exists.
docker run --rm --name neuron_vllm --shm-size=50gb \
--device /dev/neuron0 \
-p 8000:8000 neuron-container:qwen2 python3 -m vllm.entrypoints.openai.api_server --model=Qwen/Qwen2.5-0.5B-Instruct --tensor-parallel-size=2 --max-num-seqs=24 --max-model-len=1024 --block-size=1024
- Test
curl -X POST -H "Content-Type: application/json" http://localhost:8000/v1/completions \
-d '{"model":"Qwen/Qwen2.5-0.5B-Instruct","prompt": "tell me a story about New York city","max_tokens": 100, "stream":false}'
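The same request can be issued from Python with only the standard library; the endpoint and payload mirror the curl call above (this sketch assumes the server from the previous step is listening on localhost:8000):

```python
import json
import urllib.request

# Same payload as the curl example above
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "tell me a story about New York city",
    "max_tokens": 100,
    "stream": False,
}

def complete(url="http://localhost:8000/v1/completions"):
    """POST the payload and return the generated text (requires a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```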
- Build the image yourself: navigate to the artifacts directory, then log in to ECR so the base image can be pulled.
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
mkdir -p ~/artifacts/install
cd ~/artifacts/install
git clone https://github.com/vllm-project/vllm --branch v0.6.1.post2 --single-branch
# Copy the patched files into the vLLM source tree
cd ~/artifacts
cp arg_utils.py ./install/vllm/vllm/engine/
cp setup.py ./install/vllm/
cp neuron.py ./install/vllm/vllm/model_executor/model_loader/
# Build docker container
docker build -t neuron-container:qwen2 .
To reduce model download time, you can download the model to a local directory in advance. Here we use /home/ec2-user/environment/work/models
to store the model files and map that directory into the container.
The compiled NEFF files are saved in the compiled directory, so reusing them avoids repeated downloads and recompilation at runtime.
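The pre-download step can be sketched with the huggingface_hub package (an assumption here: `pip install huggingface_hub`; the local path follows the example above):

```python
# Sketch: pre-download the model into the host directory that will be
# mounted into the container, so the server starts without downloading.
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
LOCAL_DIR = "/home/ec2-user/environment/work/models/Qwen/Qwen2.5-0.5B-Instruct"

def predownload(model_id=MODEL_ID, local_dir=LOCAL_DIR):
    """Fetch all model files for model_id into local_dir."""
    # Deferred import: requires huggingface_hub (pip install huggingface_hub)
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=model_id, local_dir=local_dir)
```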
docker run --rm --name neuron_vllm --shm-size=50gb \
--device /dev/neuron0 -v /home/ec2-user/environment/work/models/:/models \
-p 8000:8000 neuron-container:qwen2 python3 -m vllm.entrypoints.openai.api_server \
--model=/models/Qwen/Qwen2.5-0.5B-Instruct --tensor-parallel-size=2 --max-num-seqs=1 \
--max-model-len=1024 --block-size=1024
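Once the server has compiled the model, the cached artifacts can be checked from the host. This sketch simply lists NEFF files under the mounted directory (the path matches the example above; the exact layout of the compiled directory may vary by setup):

```python
import pathlib

def list_neff_files(root="/home/ec2-user/environment/work/models"):
    """Return compiled NEFF artifacts under root (empty list if nothing compiled yet)."""
    root = pathlib.Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.neff"))

print(list_neff_files())
```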