# Calling multiple vLLM inference servers using LiteLLM

In this tutorial we explain how to use a LiteLLM Proxy Server to call multiple LLM inference endpoints from a single interface. LiteLLM can interact with 100+ LLM providers such as OpenAI, Cohere, and NVIDIA Triton and NIM. Here we will use two vLLM inference servers.

# When to use this asset?

To run this inference tutorial with local deployments of Mistral 7B Instruct v0.3 on vLLM inference servers powered by NVIDIA A10 GPUs, with a LiteLLM Proxy Server on top.

# How to use this asset?

These are the prerequisites to run this tutorial:
* An OCI tenancy with A10 quota
* A Hugging Face account with a valid Auth Token (exported below as an environment variable)
* A valid OpenAI API Key
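
On the GPU instances, the Hugging Face token is consumed through the `HF_TOKEN` environment variable referenced in the `docker run` command further down; a minimal sketch (the token value is a placeholder):
```
# Run on each GPU instance before starting the vLLM container
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx   # replace with your own Hugging Face token
```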

## Introduction

LiteLLM provides a proxy server to manage authentication, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.
vLLM is a fast and easy-to-use library for LLM inference and serving.
The first step is to deploy two vLLM inference servers on NVIDIA A10 powered virtual machine instances. In the second step, we create a LiteLLM Proxy Server on a third, CPU-only instance and explain how this single interface can be used to call both LLMs. For the sake of simplicity, all three instances reside in the same public subnet here.

## vLLM inference servers deployment

For each of the inference nodes, a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPUs with 24 GB of memory each) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI Marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled. It is good practice to deploy the two instances in two different fault domains to ensure higher availability.
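
The two instances can be provisioned from the OCI console or with the OCI CLI. The sketch below only illustrates the CLI path; all OCIDs, the availability domain, and the display name are placeholders, and the second node would be launched the same way in `FAULT-DOMAIN-2`:
```
oci compute instance launch \
  --compartment-id ocid1.compartment.oc1..aaaa... \
  --availability-domain "Uocm:EU-FRANKFURT-1-AD-1" \
  --fault-domain "FAULT-DOMAIN-1" \
  --shape VM.GPU.A10.2 \
  --image-id ocid1.image.oc1..aaaa... \
  --subnet-id ocid1.subnet.oc1..aaaa... \
  --assign-public-ip true \
  --ssh-authorized-keys-file ~/.ssh/id_rsa.pub \
  --display-name vllm-node-1
```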

The vLLM inference server is deployed using the vLLM official container image.
```
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --tensor-parallel-size 2 \
    --load-format safetensors \
    --trust-remote-code \
    --enforce-eager
```
where `$HF_TOKEN` is a valid Hugging Face token. In this case we use the 7B Instruct version of the Mistral LLM. The vLLM endpoint can be called directly for verification with:
```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```
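
The same `docker run` command is repeated on the second A10 instance. Each server can also be checked remotely, for example from the future LiteLLM host, through the OpenAI-compatible `/v1/models` route, assuming port 8000 is reachable from that host (the IP below is a placeholder):
```
curl http://xxx.xxx.xxx.xxx:8000/v1/models | jq
```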

## LiteLLM server deployment

No GPUs are required for LiteLLM. Therefore, a CPU-based VM.Standard.E4.Flex instance (4 OCPUs, 64 GB of memory) with a standard Ubuntu 22.04 image is used. Here LiteLLM is used as a proxy server calling the vLLM endpoints. Install LiteLLM using `pip`:
```
pip install 'litellm[proxy]'
```
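Optionally, the proxy can be installed inside a Python virtual environment to keep its dependencies isolated from the system Python; a minimal sketch (the environment path is arbitrary):
```
python3 -m venv ~/litellm-venv
source ~/litellm-venv/bin/activate
pip install 'litellm[proxy]'
```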
Edit the `config.yaml` file (OpenAI-compatible endpoints):
```
model_list:
  - model_name: Mistral-7B-Instruct               # alias exposed by the proxy
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://xxx.xxx.xxx.xxx:8000/v1    # public IP of the first vLLM instance
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct               # same alias, so requests are load balanced
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://xxx.xxx.xxx.xxx:8000/v1    # public IP of the second vLLM instance
      api_key: sk-0123456789
```
where `sk-0123456789` is a valid OpenAI API key and the two `xxx.xxx.xxx.xxx` placeholders are the public IP addresses of the two GPU instances (one per entry). Because both entries share the same `model_name`, LiteLLM load balances requests across the two vLLM servers.

Start the LiteLLM Proxy Server with the following command:
```
litellm --config /path/to/config.yaml
```
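By default the proxy listens on port 4000. To keep it running after the SSH session ends, it can be started in the background; a sketch (the log file path is arbitrary):
```
nohup litellm --config ~/config.yaml --port 4000 > ~/litellm.log 2>&1 &
```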
Once the Proxy Server is ready, call the vLLM endpoints through LiteLLM with:
```
curl http://localhost:4000/chat/completions \
    -H 'Authorization: Bearer sk-0123456789' \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Mistral-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```
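
The model aliases defined in `config.yaml` can also be listed through the proxy's OpenAI-compatible `/v1/models` route, which is a quick way to confirm the configuration was loaded (same placeholder key as above):
```
curl http://localhost:4000/v1/models \
    -H 'Authorization: Bearer sk-0123456789' | jq
```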

## Documentation

* [LiteLLM documentation](https://litellm.vercel.app/docs/providers/openai_compatible)
* [vLLM documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)
* [MistralAI](https://mistral.ai/)