qwen2-vllm-neuron

Getting started

Test Qwen (https://huggingface.co/Qwen) models via vLLM on AWS Neuron.

Verified models

  • Qwen/Qwen2.5-0.5B-Instruct
  • Qwen/Qwen2.5-Coder-1.5B-Instruct

How to Test

  1. Launch an inf2.8xlarge instance and ensure the Neuron driver is installed.

  2. Pull the pre-built Docker image

docker pull cszhzleo/qwen2-vllm-neuron:v2.20.1
docker tag cszhzleo/qwen2-vllm-neuron:v2.20.1 neuron-container:qwen2
  3. Run the container. Before running the test, check that any paths you map into the container exist.
docker run --rm --name neuron_vllm --shm-size=50gb \
    --device /dev/neuron0 \
    -p 8000:8000 neuron-container:qwen2 python3 -m vllm.entrypoints.openai.api_server \
    --model=Qwen/Qwen2.5-0.5B-Instruct --tensor-parallel-size=2 --max-num-seqs=24 \
    --max-model-len=1024 --block-size=1024
  4. Test the endpoint
curl -X POST -H "Content-Type: application/json" http://localhost:8000/v1/completions \
 -d '{"model":"Qwen/Qwen2.5-0.5B-Instruct","prompt": "tell me a story about New York city","max_tokens": 100, "stream":false}'
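The same request can be issued from Python using only the standard library. The sketch below mirrors the curl command above; the endpoint and model name assume the server was launched as shown in the previous step.

```python
import json
import urllib.request

def build_payload(prompt: str,
                  model: str = "Qwen/Qwen2.5-0.5B-Instruct",
                  max_tokens: int = 100) -> dict:
    # Same fields as the curl example above.
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "stream": False}

def complete(prompt: str, base_url: str = "http://localhost:8000") -> str:
    # POST to the OpenAI-compatible completions endpoint exposed by vLLM.
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the generated text in choices[0].
    return body["choices"][0]["text"]
```

Calling `complete("tell me a story about New York city")` requires the server from step 3 to be running; `build_payload` can be inspected offline.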

Customize docker image

Log in to the AWS Deep Learning Containers registry and prepare the artifacts directory:

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
mkdir ~/artifacts/install
cd ~/artifacts/install
git clone https://github.com/vllm-project/vllm --branch v0.6.1.post2 --single-branch

cd ~/artifacts
cp arg_utils.py ./install/vllm/vllm/engine/
cp setup.py ./install/vllm/
cp neuron.py ./install/vllm/vllm/model_executor/model_loader/

# Build docker container
docker build -t neuron-container:qwen2 .
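The build step above assumes a Dockerfile in ~/artifacts. The repository's actual Dockerfile is not reproduced here; the sketch below only illustrates the general shape such a file might take. The base image tag and install command are assumptions, not the repo's real contents.

```dockerfile
# Illustrative sketch only -- the base image tag below is an assumption.
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.1-ubuntu20.04

# Install the patched vLLM checkout prepared in ~/artifacts/install.
COPY install/vllm /opt/vllm
RUN pip install -e /opt/vllm
```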

Tips

To reduce model download time, you can download the model to a local directory in advance. Here we use /home/ec2-user/environment/work/models to store the model files and map it into the container. The compiled NEFF files are saved in the compiled directory, which reduces model download and compile time at runtime.

docker run --rm --name neuron_vllm --shm-size=50gb \
    --device /dev/neuron0 -v /home/ec2-user/environment/work/models/:/models \
    -p 8000:8000 neuron-container:qwen2 python3 -m vllm.entrypoints.openai.api_server \
    --model=/models/Qwen/Qwen2.5-0.5B-Instruct --tensor-parallel-size=2 --max-num-seqs=1 \
    --max-model-len=1024 --block-size=1024
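A small helper can decide which value to pass as vLLM's --model flag: the local path when the model was pre-downloaded under the mapped directory, or the hub id otherwise (in which case vLLM downloads it at startup). `resolve_model_arg` is a hypothetical helper for illustration, not part of the repository.

```python
from pathlib import Path

def resolve_model_arg(local_root: str, repo_id: str) -> str:
    """Return the value for vLLM's --model flag.

    If the model exists under local_root (e.g. the /models directory
    mapped into the container above), use the local path; otherwise
    fall back to the Hugging Face hub id.
    """
    local = Path(local_root) / repo_id
    return str(local) if local.is_dir() else repo_id
```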
