
Commit d5c5b30

Merge branch 'main' into mfioramo-patch-2
2 parents 2895a46 + 718b733 commit d5c5b30

155 files changed: +2105 −2204 lines changed

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Calling multiple vLLM inference servers using LiteLLM

In this tutorial we explain how to use a LiteLLM Proxy Server to call multiple LLM inference endpoints from a single interface. LiteLLM interacts with 100+ LLM providers such as OpenAI, Cohere, and NVIDIA Triton and NIM. Here we will use two vLLM inference servers.

<!-- ![Hybrid shards](assets/images/litellm.png "LiteLLM") -->

# When to use this asset?

Use this asset to run the inference tutorial with local deployments of Mistral 7B Instruct v0.3 on vLLM inference servers powered by NVIDIA A10 GPUs, with a LiteLLM Proxy Server on top.

# How to use this asset?

These are the prerequisites to run this tutorial:

* An OCI tenancy with A10 quota
* A Hugging Face account with a valid Auth Token
* A valid OpenAI API Key

## Introduction

LiteLLM provides a proxy server to manage authentication, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.
vLLM is a fast and easy-to-use library for LLM inference and serving.
The first step is to deploy two vLLM inference servers on NVIDIA A10 powered virtual machine instances. In the second step, we create a LiteLLM Proxy Server on a third, GPU-less instance and explain how this single interface can be used to call the two LLMs. For the sake of simplicity, all three instances reside in the same public subnet here.

![LiteLLM architecture](assets/images/litellm-architecture.png "LiteLLM")

## vLLM inference servers deployment

For each of the inference nodes a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI Marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled. It is good practice to deploy the two instances in different fault domains to ensure higher availability.

The vLLM inference server is deployed using the vLLM official container image:
```
docker run --gpus all \
    -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --host 0.0.0.0 \
    --port 8000 \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --tensor-parallel-size 2 \
    --load-format safetensors \
    --trust-remote-code \
    --enforce-eager
```
where `$HF_TOKEN` is a valid Hugging Face token. In this case we use the 7B Instruct version of the Mistral LLM. The vLLM endpoint can be called directly for verification with:
```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```
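Since vLLM exposes an OpenAI-compatible API, the same check can be scripted. Below is a minimal sketch using the OpenAI Python SDK, assuming the server started above is reachable on `localhost:8000`; the API key is a dummy value because the server was launched without `--api-key`:
```
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
# The key is a placeholder: vLLM ignores it unless --api-key was set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy-key")

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)
print(response.choices[0].message.content)
```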

## LiteLLM server deployment

No GPUs are required for LiteLLM. Therefore, a CPU-based VM.Standard.E4.Flex instance (4 OCPUs, 64 GB memory) with a standard Ubuntu 22.04 image is used. Here LiteLLM is used as a proxy server calling the vLLM endpoints. Install LiteLLM using `pip`:
```
pip install 'litellm[proxy]'
```
Edit the `config.yaml` file (OpenAI-Compatible Endpoint):
```
model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://xxx.xxx.xxx.xxx:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://xxx.xxx.xxx.xxx:8000/v1
      api_key: sk-0123456789
```
where `sk-0123456789` is a valid OpenAI API key and `xxx.xxx.xxx.xxx` are the public IP addresses of the two GPU instances (one per entry). Because both entries share the same `model_name`, LiteLLM load-balances requests across the two vLLM endpoints.

Start the LiteLLM Proxy Server with the following command:
```
litellm --config /path/to/config.yaml
```
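Before sending chat requests, you can optionally confirm which models the proxy exposes. This is a minimal sketch with the OpenAI Python SDK, assuming the proxy runs on its default port 4000 and accepts the placeholder key used in `config.yaml`:
```
from openai import OpenAI

# The LiteLLM proxy listens on port 4000 in this setup.
# The key below is the placeholder used throughout this tutorial.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-0123456789")

# List the models exposed by the proxy; Mistral-7B-Instruct should appear.
for model in client.models.list():
    print(model.id)
```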

Once the Proxy Server is ready, call the vLLM endpoints through LiteLLM with:
```
curl http://localhost:4000/chat/completions \
    -H 'Authorization: Bearer sk-0123456789' \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Mistral-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }' | jq
```
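The same request can be sent from Python instead of curl. A minimal sketch with the OpenAI SDK pointed at the proxy, using the URL and placeholder key assumed above:
```
from openai import OpenAI

# Talk to the LiteLLM proxy exactly as if it were the OpenAI API.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-0123456789")

response = client.chat.completions.create(
    model="Mistral-7B-Instruct",  # model_name defined in config.yaml
    messages=[{"role": "user", "content": "Who won the world series in 2020?"}],
)
print(response.choices[0].message.content)
```
LiteLLM forwards the call to one of the two vLLM servers registered under the shared `Mistral-7B-Instruct` model name.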

## Documentation

* [LiteLLM documentation](https://litellm.vercel.app/docs/providers/openai_compatible)
* [vLLM documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)
* [MistralAI](https://mistral.ai/)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
model_list:
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_1:8000/v1
      api_key: sk-0123456789
  - model_name: Mistral-7B-Instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://public_ip_2:8000/v1
      api_key: sk-0123456789
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Private Cloud and Edge

## Useful Links

- [Oracle Compute Cloud@Customer](https://www.oracle.com/uk/cloud/compute/cloud-at-customer/)
- [Roving Edge Infrastructure](https://www.oracle.com/uk/cloud/roving-edge-infrastructure/)

## License

Copyright (c) 2024 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](https://github.com/oracle-devrel/technology-engineering/blob/main/LICENSE) for more details.
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# C3 Hosting Service Provider - IAM Policies for Isolation

The Hosting Service Provider (HSP) model on Compute Cloud@Customer (C3) allows hosting for multiple end customers, each isolated in a dedicated compartment with separate VCN(s) per customer. To ensure that each end customer can create resources only in their own compartment, a set of IAM policies is required.

The HSP documentation suggests the following policies per end customer, based on an example with two hosting customers, A & B. They assume that each end customer will have two roles for their staff: Customer Administrator and Customer End User.

## Example Policies for Customer Administrator

The following allows the group specified to use all C3 services in the compartment listed:
```
Allow group CustA-Admin-grp to manage all-resources in compartment path:to:CustA

Allow group CustB-Admin-grp to manage all-resources in compartment path:to:CustB
```
Note that the above policies grant permissions in the CustA and CustB compartments of the C3 but **also in the same compartments in the OCI tenancy**! To prevent permissions being granted in the OCI tenancy, append a condition such as:

```
Allow group CustA-Admin-grp to manage all-resources in compartment path:to:CustA where all {request.region != 'LHR', request.region != 'FRA'}

Allow group CustB-Admin-grp to manage all-resources in compartment path:to:CustB where all {request.region != 'LHR', request.region != 'FRA'}
```
In the example above the condition prevents resource creation in the London and Frankfurt regions. Adjust the list to include all regions the tenancy is subscribed to.
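
To build that region list, the tenancy's subscriptions can be enumerated programmatically. This is a minimal sketch with the OCI Python SDK, assuming the standard `~/.oci/config` profile; it only prints a suggested condition string to paste into the policies:
```
import oci

# Load the standard ~/.oci/config profile and create an Identity client.
config = oci.config.from_file()
identity = oci.identity.IdentityClient(config)

# List every region the tenancy is subscribed to; the 3-letter region key
# (e.g. LHR, FRA) is what the policy condition compares against.
subscriptions = identity.list_region_subscriptions(config["tenancy"]).data
condition = " where all {" + ", ".join(
    f"request.region != '{sub.region_key}'" for sub in subscriptions
) + "}"
print(condition)
```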

The path to the end user compartment must be explicitly stated, using the colon-separated format (as in `path:to:CustA`), relative to the compartment where the policy is created.

## Example Policies for Customer End User
```
Allow group CustA-Users-grp to manage instance-family in compartment path:to:CustA
Allow group CustA-Users-grp to use volume-family in compartment path:to:CustA
Allow group CustA-Users-grp to use virtual-network-family in compartment path:to:CustA
Allow group CustB-Users-grp to manage instance-family in compartment path:to:CustB
Allow group CustB-Users-grp to use volume-family in compartment path:to:CustB
Allow group CustB-Users-grp to use virtual-network-family in compartment path:to:CustB
```
As above, append a condition to limit permissions to the C3 and prevent resource creation in OCI regions:
```
Allow group CustA-Users-grp to manage instance-family in compartment path:to:CustA where all {request.region != 'LHR', request.region != 'FRA'}
Allow group CustA-Users-grp to use volume-family in compartment path:to:CustA where all {request.region != 'LHR', request.region != 'FRA'}
Allow group CustA-Users-grp to use virtual-network-family in compartment path:to:CustA where all {request.region != 'LHR', request.region != 'FRA'}
Allow group CustB-Users-grp to manage instance-family in compartment path:to:CustB where all {request.region != 'LHR', request.region != 'FRA'}
Allow group CustB-Users-grp to use volume-family in compartment path:to:CustB where all {request.region != 'LHR', request.region != 'FRA'}
Allow group CustB-Users-grp to use virtual-network-family in compartment path:to:CustB where all {request.region != 'LHR', request.region != 'FRA'}
```
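These statements can also be created programmatically instead of through the console. Below is a minimal sketch with the OCI Python SDK; the compartment OCID, policy name, and the single statement shown are placeholders to adapt to your own customers and regions:
```
import oci

# Standard SDK/CLI configuration from ~/.oci/config.
config = oci.config.from_file()
identity = oci.identity.IdentityClient(config)

# Placeholder OCID of the compartment that should hold the policy.
hsp_compartment_ocid = "ocid1.compartment.oc1..exampleuniqueid"

details = oci.identity.models.CreatePolicyDetails(
    compartment_id=hsp_compartment_ocid,
    name="CustA-Users-policy",
    description="HSP isolation policy for Customer A end users",
    statements=[
        "Allow group CustA-Users-grp to manage instance-family in compartment "
        "path:to:CustA where all {request.region != 'LHR', request.region != 'FRA'}",
    ],
)
policy = identity.create_policy(details).data
print(policy.id)
```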

## Common Policy

Currently, any user of a C3 needs access to certain resources located at the tenancy level in order to use IaaS resources in the web UI. Backup policies, tag namespaces and platform images all reside at the tenancy level and need a further policy to allow normal use of C3 IaaS services. Note that this is a subtle difference from the behaviour on OCI.

An extra policy as below is required (where CommonGroup contains **all** HSP users on the C3):
```
allow group CommonGroup to read all-resources in tenancy where target.compartment.name='root-compartment-name'
```