cloud-infrastructure/ai-infra-gpu/ai-infrastructure/litellm/README.md (9 additions, 7 deletions)
@@ -2,17 +2,19 @@
In this tutorial we explain how to use a LiteLLM Proxy Server to call multiple LLM inference endpoints from a single interface. LiteLLM interacts with 100+ LLMs such as OpenAI, Cohere, NVIDIA Triton, and NVIDIA NIM. Here we will use two vLLM inference servers.
LiteLLM provides a proxy server to manage authentication, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.
vLLM is a fast and easy-to-use library for LLM inference and serving.
-The first step will be to deploy two vLLM inference servers on NVIDIA A10 powered virtual machine instances. In the second step, we will create a LiteLLM Proxy Server on a third no-GPU instance and explain how we can use this interface to call the two LLM from a single location. For the sake of silplicity, all 3 instances will reside in the same public subnet here.
+The first step will be to deploy two vLLM inference servers on NVIDIA A10 powered virtual machine instances. In the second step, we will create a LiteLLM Proxy Server on a third no-GPU instance and explain how we can use this interface to call the two LLMs from a single location. For the sake of simplicity, all 3 instances will reside in the same public subnet here.
-For each of the inference servers nodes a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled.
+For each of the inference nodes a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled.
The vLLM inference server is deployed using the vLLM official container image.
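As a minimal sketch, assuming the official `vllm/vllm-openai` image and the Mistral-7B-Instruct-v0.3 model used in the configuration below, each GPU instance could launch its server along these lines (the Hugging Face token is a placeholder):

```
# Sketch: serve Mistral-7B-Instruct-v0.3 across the two A10 GPUs of the instance,
# exposing an OpenAI-compatible endpoint on port 8000.
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=<your_hf_token> \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --tensor-parallel-size 2 \
  --api-key sk-0123456789
```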
-No GPU are required for LiteLLM. Therefore, a CPU based VM.Standard.E4.flex instance (4 OCPUs, 64 GB Memory) with a standard Ubuntu 22.04 image is used. Here LiteLLM is used as a proxy server calling a vLLM endpoint. Install LiteLLM using `pip`:
+No GPU is required for LiteLLM. Therefore, a CPU based VM.Standard.E4.Flex instance (4 OCPUs, 64 GB Memory) with a standard Ubuntu 22.04 image is used. Here LiteLLM is used as a proxy server calling a vLLM endpoint. Install LiteLLM using `pip`:
```
pip install 'litellm[proxy]'
```
@@ -51,15 +53,15 @@ model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
-     api_base: http://public_ip_1:8000/v1
+     api_base: http://xxx.xxx.xxx.xxx:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
-     api_base: http://public_ip_2:8000/v1
+     api_base: http://xxx.xxx.xxx.xxx:8000/v1
      api_key: sk-0123456789
```
-where `sk-0123456789` is a valid OpenAI API key and `public_ip_1` and `public_ip_2` are the two GPU instances public IP addresses.
+where `sk-0123456789` is a valid OpenAI API key and `xxx.xxx.xxx.xxx` stands for the public IP addresses of the two GPU instances.
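Because both entries share the same `model_name`, LiteLLM load-balances requests between the two vLLM endpoints. Once the proxy has been started with the command below, a client reaches either backend through the single OpenAI-compatible interface. A sketch of such a request, assuming the proxy listens on its default port 4000 and `litellm_instance_ip` is a placeholder for the public IP of the CPU instance:

```
# Sketch: one request to the proxy; LiteLLM forwards it to one of the two
# vLLM backends registered under the model name Mistral-7B-Instruct.
curl http://litellm_instance_ip:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mistral-7B-Instruct",
        "messages": [{"role": "user", "content": "What is vLLM?"}]
      }'
```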
Start the LiteLLM Proxy Server with the following command:
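One typical invocation, assuming the configuration above has been saved as `config.yaml`:

```
# Sketch: start the proxy with the configuration file; it listens on port 4000 by default.
litellm --config config.yaml
```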