1.1 Install Docker
Note: This assumes all other necessary Habana SW is already installed.
# install
sudo apt install -y habanalabs-container-runtime
# register
sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "habana": {
            "path": "/usr/bin/habana-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

Reference: https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Docker_Installation.html
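Docker only reads daemon.json on startup, so a restart is needed after registering the runtime. A minimal sketch for sanity-checking the JSON before it goes live (the `grep` pattern for verifying registration is an assumption about `docker info` output):

```shell
# Validate the daemon.json snippet locally before writing it to /etc/docker/:
# python3 -m json.tool exits non-zero on malformed JSON.
cat > /tmp/daemon.json <<'EOF'
{
    "runtimes": {
        "habana": {
            "path": "/usr/bin/habana-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
python3 -m json.tool /tmp/daemon.json

# Then restart Docker and confirm the runtime is registered (requires root):
#   sudo systemctl restart docker
#   docker info | grep -i habana
```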
Follow-up to step 3 in this README: https://github.com/HabanaAI/Gaudi-tutorials/tree/main/PyTorch/vLLM_Tutorials/Deploying_vLLM
git clone https://github.com/vllm-project/vllm.git && cd vllm
git checkout v0.9.1
docker build -f docker/Dockerfile.cpu -t vllm-cpu-perf --shm-size=4g .

Other Benchmarking BKMs: Benchmark Guide for CPU
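CPU benchmarking BKMs generally recommend pinning the container to the cores of a single socket to avoid cross-NUMA traffic. A hedged sketch for deriving a `--cpuset-cpus` string (the core count and the model in the example run are assumptions; check `lscpu` on your host):

```shell
# Hedged sketch: pin the benchmark container to one socket's cores.
# CORES_PER_SOCKET is an assumption for your host -- verify with `lscpu`.
CORES_PER_SOCKET=40
CPUSET="0-$((CORES_PER_SOCKET - 1))"
echo "$CPUSET"   # e.g. 0-39 for a 40-core socket

# Example run (not executed here; the model choice is illustrative):
#   docker run --rm --cpuset-cpus="$CPUSET" --shm-size=4g vllm-cpu-perf \
#       --model facebook/opt-125m
```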
RouteLLM Public Repo: https://github.com/lm-sys/RouteLLM/tree/main
RouteLLM Blog: https://lmsys.org/blog/2024-07-01-routellm/
RouteLLM Internal Repo: https://github.com/intel-innersource/applications.ai.iaas.dse-iaas/tree/main
Clone the internal repo and navigate to the RouteLLM directory:
git clone https://github.com/intel-innersource/applications.ai.iaas.dse-iaas.git
cd applications.ai.iaas.dse-iaas/routellm
The following has been tested on an IBM Cloud Gaudi3 system:
- 8xGaudi3
- EMR host node - 40c per socket, 2 socket, 1.7TB mem
- OS: Ubuntu 22.04.5 LTS
NOTE: Before launching, update endpoints_and_routing-compose.yaml with your Hugging Face auth token and your hf_cache and recipe_cache paths.
docker compose -f endpoints_and_routing-compose.yaml up --build
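Once the compose stack is up, the router typically fronts the endpoints with an OpenAI-compatible API. A sketch of building a chat-completions request against it; the port (8000) and model name (`router-mf`) are assumptions — check endpoints_and_routing-compose.yaml for the actual values:

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) an OpenAI-style chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


# Port and model name below are placeholders; adjust to your deployment.
req = build_chat_request("http://localhost:8000", "router-mf", "Hello")
print(req.full_url)  # → http://localhost:8000/v1/chat/completions
# Send with: urllib.request.urlopen(req) once the stack is serving.
```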