This tutorial shows how to run inference against local deployments of Mistral 7B Instruct v0.3, using vLLM inference servers powered by NVIDIA A10 GPUs with a LiteLLM Proxy Server on top.
# How to use this asset?
These are the prerequisites to run this tutorial:
* An OCI tenancy with A10 quota
* A Huggingface account with a valid Auth Token
* A valid OpenAI API Key
## Introduction
LiteLLM provides a proxy server to manage authentication, load balancing, and spend tracking across 100+ LLMs, all in the OpenAI format.
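Once the proxy is running, any OpenAI-compatible client can talk to it. Here is a minimal sketch; the proxy URL, virtual key, and model alias below are assumptions to be replaced with the values from your own deployment:

```python
# Minimal sketch: calling the LiteLLM proxy with the standard OpenAI Python client.
# The endpoint, API key, and model alias are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",  # LiteLLM proxy endpoint (assumed)
    api_key="sk-litellm-placeholder",     # virtual key issued by the proxy (assumed)
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",          # model alias configured in the proxy (assumed)
    messages=[{"role": "user", "content": "What does a LiteLLM proxy do?"}],
)
print(response.choices[0].message.content)
```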
The first step will be to deploy two vLLM inference servers on NVIDIA A10 powered instances.
## vLLM inference servers deployment
For each of the inference nodes, a VM.GPU.A10.2 instance (2 x NVIDIA A10 GPU 24GB) is used in combination with the NVIDIA GPU-Optimized VMI image from the OCI Marketplace. This Ubuntu-based image comes with all the necessary libraries (Docker, NVIDIA Container Toolkit) preinstalled. As a good practice, deploy the two instances in two different fault domains for higher availability, as sketched below.
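A rough sketch of that layout with the OCI Python SDK follows; all OCIDs, the availability domain, and display names are placeholders to be replaced with your own tenancy values:

```python
# Rough sketch: launching the two A10 inference nodes in different fault domains
# with the OCI Python SDK. Every OCID and the availability domain are placeholders.
import oci

config = oci.config.from_file()           # default ~/.oci/config profile
compute = oci.core.ComputeClient(config)

for index, fault_domain in enumerate(["FAULT-DOMAIN-1", "FAULT-DOMAIN-2"], start=1):
    details = oci.core.models.LaunchInstanceDetails(
        compartment_id="ocid1.compartment.oc1..example",
        availability_domain="XXXX:EU-FRANKFURT-1-AD-1",
        fault_domain=fault_domain,
        shape="VM.GPU.A10.2",
        display_name=f"vllm-node-{index}",
        source_details=oci.core.models.InstanceSourceViaImageDetails(
            image_id="ocid1.image.oc1..example",   # NVIDIA GPU-Optimized VMI image OCID
        ),
        create_vnic_details=oci.core.models.CreateVnicDetails(
            subnet_id="ocid1.subnet.oc1..example",
        ),
    )
    compute.launch_instance(details)
```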
The vLLM inference server is deployed using the vLLM official container image.
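Because the official image exposes vLLM's OpenAI-compatible API (on port 8000 by default), each node can be sanity-checked directly once the container is running. A minimal sketch, assuming the endpoint below is replaced with the A10 instance's address:

```python
# Minimal sketch: sanity-checking one freshly deployed vLLM node through its
# OpenAI-compatible endpoint. Replace the address with the A10 instance's IP.
from openai import OpenAI

VLLM_ENDPOINT = "http://localhost:8000/v1"  # assumed node address and default vLLM port

client = OpenAI(base_url=VLLM_ENDPOINT, api_key="EMPTY")  # no key required by default

# List the models served by this node, then send a short chat completion.
print([m.id for m in client.models.list().data])

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(completion.choices[0].message.content)
```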