# Calling multiple vLLM inference servers using LiteLLM

In this tutorial we explain how to use a LiteLLM Proxy Server to call multiple LLM inference endpoints from a single interface. LiteLLM interacts with 100+ LLMs such as OpenAI, Cohere, NVIDIA Triton and NIM, etc. Here we will use two vLLM inference servers.

## Introduction

LiteLLM provides a proxy server to manage authentication, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.
vLLM is a fast and easy-to-use library for LLM inference and serving.
The first step is to deploy two vLLM inference servers on NVIDIA A10 powered virtual machine instances. In the second step, we create a LiteLLM Proxy Server on a third, CPU-only instance and explain how to use this interface to call the two LLMs from a single location. For the sake of simplicity, all three instances reside in the same public subnet here.

## vLLM inference servers deployment

For each of the inference server nodes, a VM.GPU.A10.2 instance (2 x NVIDIA A10 24 GB GPUs) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI Marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled.
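Before deploying the containers, it can be useful to check that the GPUs and drivers are visible on each node (a quick sanity check; the reported driver and CUDA versions depend on the VMI release):
```
nvidia-smi
```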
The vLLM inference server is deployed using the vLLM official container image.
```
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --tensor-parallel-size 2 \
    --load-format safetensors \
    --trust-remote-code \
    --enforce-eager
```
where `$HF_TOKEN` is a valid Hugging Face token. In this case we use the 7B Instruct version of the Mistral LLM. The vLLM endpoint can be called directly for verification with:
```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }' | jq
```
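The endpoint can also list the model it serves; the returned model `id` must match the name referenced later in the LiteLLM configuration. A quick check, assuming the server's standard OpenAI-compatible routes:
```
curl http://localhost:8000/v1/models | jq
```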

## LiteLLM server deployment

No GPU is required for LiteLLM. Therefore, a CPU-based VM.Standard.E4.Flex instance (4 OCPUs, 64 GB of memory) with a standard Ubuntu 22.04 image is used. Here LiteLLM is used as a proxy server calling the vLLM endpoints. Install LiteLLM using `pip`:
```
pip install 'litellm[proxy]'
```
Edit the `config.yaml` file (OpenAI-Compatible Endpoint):
```
model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_1:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_2:8000/v1
      api_key: sk-0123456789
```
where `public_ip_1` and `public_ip_2` are the public IP addresses of the two GPU instances and `sk-0123456789` is an API key placeholder (the vLLM servers above are not started with an API key, so any value works here). Because both entries share the same `model_name`, LiteLLM treats them as a single model and load balances requests between the two endpoints.
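Routing and retry behaviour can optionally be tuned in the same `config.yaml`. The snippet below is a minimal sketch based on the LiteLLM load-balancing documentation (`simple-shuffle` is the default strategy); verify the key names against your installed version:
```
router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
```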

Start the LiteLLM Proxy Server with the following command:
```
litellm --config /path/to/config.yaml
```
Once the Proxy Server is ready (it listens on port 4000 by default), call the vLLM endpoints through LiteLLM with:
```
curl http://localhost:4000/chat/completions \
  -H 'Authorization: Bearer sk-0123456789' \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }' | jq
```
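Since the proxy exposes an OpenAI-compatible API, the available models can also be listed; only the single `Mistral-7B-Instruct` alias should appear, even though it is backed by two endpoints (assuming the proxy's standard `/v1/models` route):
```
curl http://localhost:4000/v1/models \
  -H 'Authorization: Bearer sk-0123456789' | jq
```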

## Useful links

* [LiteLLM documentation](https://litellm.vercel.app/docs/providers/openai_compatible)
* [vLLM documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)
* [MistralAI](https://mistral.ai/)