Commit 433d606

Update README.md
1 parent 636372e commit 433d606

File tree

1 file changed: +14 -2 lines changed
  • cloud-infrastructure/ai-infra-gpu/ai-infrastructure/litellm


cloud-infrastructure/ai-infra-gpu/ai-infrastructure/litellm/README.md

Lines changed: 14 additions & 2 deletions
@@ -4,6 +4,17 @@ In this tutorial we explain how to use a LiteLLM Proxy Server to call multiple L
 
 <!-- ![Hybrid shards](assets/images/litellm.png "LiteLLM") -->
 
+# When to use this asset?
+
+To run the inference tutorial with local deployments of Mistral 7B Instruct v0.3, using a vLLM inference server powered by an NVIDIA A10 GPU and a LiteLLM Proxy Server on top.
+
+# How to use this asset?
+
+These are the prerequisites to run this tutorial:
+* An OCI tenancy with A10 quota
+* A Huggingface account with a valid Auth Token
+* A valid OpenAI API Key
+
 ## Introduction
 
 LiteLLM provides a proxy server to manage auth, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.
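(Not part of this commit.) As a rough sketch of what the proxy layer described above can look like in practice, the shell snippet below writes a minimal LiteLLM config that exposes two OpenAI-compatible vLLM endpoints under a single model alias and starts the proxy container; the image tag, IP addresses, ports, and model alias are assumptions, not taken from the file:

```
# Minimal sketch (assumed values): two vLLM nodes exposed as one model alias.
cat > config.yaml <<'EOF'
model_list:
  - model_name: mistral-7b-instruct
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://10.0.0.11:8000/v1   # assumed address of vLLM node 1
      api_key: "none"                      # vLLM started without auth
  - model_name: mistral-7b-instruct        # same alias -> proxy load balances
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.3
      api_base: http://10.0.0.12:8000/v1   # assumed address of vLLM node 2
      api_key: "none"
EOF

# Start the LiteLLM proxy on port 4000 with the config mounted.
docker run -d -p 4000:4000 \
    -v $(pwd)/config.yaml:/app/config.yaml \
    ghcr.io/berriai/litellm:main-latest \
    --config /app/config.yaml
```

Giving both deployments the same `model_name` is what lets the proxy spread requests across the two inference nodes.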
@@ -14,7 +25,8 @@ The first step will be to deploy two vLLM inference servers on NVIDIA A10 powere
 
 ## vLLM inference servers deployment
 
-For each of the inference nodes a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled.
+For each of the inference nodes, a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled. It is good practice to deploy the two instances in two different fault domains to ensure higher availability.
+
 The vLLM inference server is deployed using the vLLM official container image.
 ```
 docker run --gpus all \
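The rest of the `docker run` command falls outside this hunk. For orientation only, a typical launch of the official vLLM container for this model on a two-GPU A10 instance looks roughly like the sketch below; the image tag, flags, and the `HF_TOKEN` variable are assumptions based on the vLLM Docker documentation rather than the file's actual content:

```
# Sketch (assumed flags): serve Mistral 7B Instruct v0.3 with vLLM,
# using both A10 GPUs via tensor parallelism. HF_TOKEN must hold a
# valid Huggingface Auth Token with access to the model.
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --tensor-parallel-size 2
```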
@@ -80,7 +92,7 @@ curl http://localhost:4000/chat/completions \
 }' | jq
 ```
 
-## Useful links
+## Documentation
 
 * [LiteLLM documentation](https://litellm.vercel.app/docs/providers/openai_compatible)
 * [vLLM documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)
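The `curl` request whose tail (`}' | jq`) appears in the hunk above follows the OpenAI chat completions format exposed by the proxy. A self-contained sketch of such a call, with an assumed model alias and a placeholder key, would be:

```
# Sketch (assumed alias/key): query the LiteLLM proxy in OpenAI format
# and pretty-print the JSON response with jq.
curl http://localhost:4000/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "mistral-7b-instruct",
    "messages": [
      {"role": "user", "content": "Which country has the most inhabitants?"}
    ]
  }' | jq
```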
