
Commit cde0a9c

initial commit
1 parent 2f16960 commit cde0a9c

2 files changed: 94 additions & 0 deletions
Lines changed: 83 additions & 0 deletions
# Calling multiple vLLM inference servers using LiteLLM

In this tutorial we explain how to use a LiteLLM Proxy Server to call multiple LLM inference endpoints from a single interface. LiteLLM interacts with 100+ LLMs such as OpenAI, Cohere, NVIDIA Triton, and NVIDIA NIM. Here we will use two vLLM inference servers.
## Introduction

LiteLLM provides a proxy server to manage authentication, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.

vLLM is a fast and easy-to-use library for LLM inference and serving.

The first step is to deploy two vLLM inference servers on NVIDIA A10 powered virtual machine instances. In the second step, we create a LiteLLM Proxy Server on a third, GPU-free instance and show how this single interface can be used to call the two LLMs from one location. For the sake of simplicity, all 3 instances reside in the same public subnet here.
## vLLM inference servers deployment

For each of the inference server nodes, a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI Marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled.

The vLLM inference server is deployed using the official vLLM container image:
```
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --tensor-parallel-size 2 \
    --load-format safetensors \
    --trust-remote-code \
    --enforce-eager
```
where `$HF_TOKEN` is a valid Hugging Face token. In this case we use the 7B Instruct version of the Mistral LLM, and `--tensor-parallel-size 2` shards the model across the two A10 GPUs of the shape. The vLLM endpoint can be called directly for verification with:
```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```
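
Since vLLM exposes an OpenAI-compatible API, the same verification can be done from Python. Below is a minimal sketch, assuming the `openai` package (v1+) is installed and run on the GPU instance itself; because the server was started without `--api-key`, any placeholder key is accepted.

```python
# Minimal sketch: querying the vLLM server through the OpenAI Python SDK.
# Assumes `pip install openai` and the server listening on localhost:8000.
from openai import OpenAI

# vLLM was started without --api-key, so any placeholder value works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)
print(response.choices[0].message.content)
```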
## LiteLLM server deployment

No GPU is required for LiteLLM. Therefore, a CPU-based VM.Standard.E4.Flex instance (4 OCPUs, 64 GB memory) with a standard Ubuntu 22.04 image is used. Here LiteLLM acts as a proxy server calling the vLLM endpoints. Install LiteLLM using `pip`:
```
pip install 'litellm[proxy]'
```
Edit the `config.yaml` file (OpenAI-compatible endpoints):
```
model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_1:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_2:8000/v1
      api_key: sk-0123456789
```
where `public_ip_1` and `public_ip_2` are the public IP addresses of the two GPU instances, and `sk-0123456789` is an arbitrary API key string (the vLLM servers above were started without API key checking). Since both entries share the same `model_name`, LiteLLM load-balances requests for `Mistral-7B-Instruct` across the two endpoints.

Start the LiteLLM Proxy Server with the following command:
```
litellm --config /path/to/config.yaml
```
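
By default the proxy listens on port 4000. Before pointing clients at it, its liveness can be probed; below is a minimal sketch, assuming the `requests` package is installed and the proxy exposes LiteLLM's documented `/health/liveliness` route.

```python
# Minimal sketch: checking that the LiteLLM proxy is up before sending traffic.
# Assumes `pip install requests` and the proxy running on localhost:4000.
import requests

resp = requests.get("http://localhost:4000/health/liveliness")
print(resp.status_code, resp.text)  # expect HTTP 200 once the proxy is alive
```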
Once the Proxy Server is ready, call the vLLM endpoints through LiteLLM with:
```
curl http://localhost:4000/chat/completions \
    -H 'Authorization: Bearer sk-0123456789' \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Mistral-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```
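
Any OpenAI-compatible client can consume the proxy in the same way. Below is a minimal sketch using the `openai` Python SDK, assuming the proxy runs locally with the configuration above; note that only the `base_url`, API key, and model name differ from the direct vLLM call.

```python
# Minimal sketch: calling the two vLLM servers through the LiteLLM proxy.
# Assumes `pip install openai` and the proxy running on localhost:4000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-0123456789")

# LiteLLM resolves "Mistral-7B-Instruct" from config.yaml and load-balances
# the request across the two api_base entries.
response = client.chat.completions.create(
    model="Mistral-7B-Instruct",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)
print(response.choices[0].message.content)
```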

## Useful links

* [LiteLLM documentation](https://litellm.vercel.app/docs/providers/openai_compatible)
* [vLLM documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)
* [MistralAI](https://mistral.ai/)
Lines changed: 11 additions & 0 deletions
model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_1:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_2:8000/v1
      api_key: sk-0123456789
