
Commit fe48196

Merge pull request #1231 from oracle-devrel/litellm-tutorial
Litellm tutorial
2 parents 6439d2a + c32b6d5 commit fe48196

File tree

3 files changed: +110 -0 lines

Lines changed: 99 additions & 0 deletions
# Calling multiple vLLM inference servers using LiteLLM

In this tutorial we explain how to use a LiteLLM Proxy Server to call multiple LLM inference endpoints from a single interface. LiteLLM can interact with 100+ LLM providers such as OpenAI, Cohere, NVIDIA Triton, and NVIDIA NIM. Here we will use two vLLM inference servers.

<!-- ![Hybrid shards](assets/images/litellm.png "LiteLLM") -->

# When to use this asset?

Use this asset to run the inference tutorial with local deployments of Mistral 7B Instruct v0.3 served by vLLM inference servers powered by NVIDIA A10 GPUs, with a LiteLLM Proxy Server on top.

# How to use this asset?

These are the prerequisites to run this tutorial:

* An OCI tenancy with A10 quota
* A Hugging Face account with a valid Auth Token
* A valid OpenAI API Key

## Introduction

LiteLLM provides a proxy server to manage authentication, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.

vLLM is a fast and easy-to-use library for LLM inference and serving.

The first step will be to deploy two vLLM inference servers on NVIDIA A10 powered virtual machine instances. In the second step, we will create a LiteLLM Proxy Server on a third, CPU-only instance and explain how to use this interface to call the two LLMs from a single location. For the sake of simplicity, all 3 instances will reside in the same public subnet here.

![LiteLLM architecture](assets/images/litellm-architecture.png "LiteLLM")

## vLLM inference servers deployment

For each of the inference nodes, a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled. It is good practice to deploy the two instances in different fault domains to ensure higher availability.

Each vLLM inference server is deployed using the official vLLM container image:
```
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --tensor-parallel-size 2 \
    --load-format safetensors \
    --trust-remote-code \
    --enforce-eager
```
where `$HF_TOKEN` is a valid Hugging Face token. In this case we use the 7B Instruct version of the Mistral LLM. The vLLM endpoint can be called directly for verification with:

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```

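Since vLLM exposes an OpenAI-compatible API, the same check can also be run from Python. Below is a minimal sketch using the `openai` client (an extra dependency, installed with `pip install openai`); the placeholder key is only there because the client requires one, as this vLLM server runs without `--api-key`.

```
# Minimal sketch: query the vLLM endpoint with the OpenAI Python client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # placeholder; this server does not check keys
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)
print(response.choices[0].message.content)
```
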
## LiteLLM server deployment

No GPU is required for LiteLLM. Therefore, a CPU-based VM.Standard.E4.Flex instance (4 OCPUs, 64 GB memory) with a standard Ubuntu 22.04 image is used. Here LiteLLM is used as a proxy server calling the vLLM endpoints. Install LiteLLM using `pip`:

```
pip install 'litellm[proxy]'
```

Edit the `config.yaml` file (OpenAI-Compatible Endpoint):

```
model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://xxx.xxx.xxx.xxx:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://xxx.xxx.xxx.xxx:8000/v1
      api_key: sk-0123456789
```

where `sk-0123456789` is a valid OpenAI API key and `xxx.xxx.xxx.xxx` are the public IP addresses of the two GPU instances. Because both entries share the same `model_name`, LiteLLM load balances incoming requests across the two vLLM deployments.

Start the LiteLLM Proxy Server with the following command:

```
litellm --config /path/to/config.yaml
```

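As an optional sanity check, the models registered on the proxy can be listed through its OpenAI-compatible models route. Here is a minimal sketch with the `openai` Python client, assuming the proxy runs locally on its default port 4000 and accepts the placeholder key from `config.yaml`:

```
# Minimal sketch: list the models registered on the LiteLLM proxy.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-0123456789",  # same placeholder key as in config.yaml
)

for model in client.models.list():
    print(model.id)  # should include Mistral-7B-Instruct
```
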
Once the Proxy Server is ready, call the vLLM endpoints through LiteLLM with:

```
curl http://localhost:4000/chat/completions \
    -H 'Authorization: Bearer sk-0123456789' \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Mistral-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```

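The same request can be sent from Python. Here is a minimal sketch with the `openai` client pointed at the proxy, under the same assumptions about host, port, and key as above:

```
# Minimal sketch: send a chat completion through the LiteLLM proxy,
# which load balances the request across the two vLLM backends.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-0123456789",  # key accepted by the proxy (see config.yaml)
)

response = client.chat.completions.create(
    model="Mistral-7B-Instruct",  # the model_name defined in config.yaml
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)
print(response.choices[0].message.content)
```
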
## Documentation

* [LiteLLM documentation](https://litellm.vercel.app/docs/providers/openai_compatible)
* [vLLM documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)
* [MistralAI](https://mistral.ai/)
Binary file added (24.6 KB)
Lines changed: 11 additions & 0 deletions
model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_1:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_2:8000/v1
      api_key: sk-0123456789
