Enterprise customers may prefer to deploy Llama 3 on-prem and run Llama in their own servers. This tutorial shows how to use Llama 3 with [vLLM](https://github.com/vllm-project/vllm) and Hugging Face [TGI](https://github.com/huggingface/text-generation-inference), two leading open-source tools to deploy and serve LLMs, and how to create vLLM and TGI hosted Llama 3 instances with [LangChain](https://www.langchain.com/), an open-source LLM app development framework which we used for our other demo apps: [Getting to Know Llama](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Getting_to_know_Llama.ipynb), Running Llama 3 [locally](https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/Running_Llama2_Anywhere/Running_Llama_on_Mac_Windows_Linux.ipynb) and [in the cloud](https://github.com/meta-llama/llama-recipes/blob/main/recipes/use_cases/RAG/HelloLlamaCloud.ipynb). See [here](https://medium.com/@rohit.k/tgi-vs-vllm-making-informed-choices-for-llm-deployment-37c56d7ff705) for a detailed comparison of vLLM and TGI.
For [Ollama](https://ollama.com)-based on-prem inference with Llama 3, see the Running Llama 3 locally notebook above.
We'll use an Amazon EC2 instance running Ubuntu with an A10G 24GB GPU as an example of running vLLM and TGI with Llama 3, and you can replace this with your own server to implement on-prem Llama 3 deployment.
The Colab notebook to connect via LangChain with Llama 3 hosted as the vLLM and TGI API services is [here](https://colab.research.google.com/drive/1rYWLdgTGIU1yCHmRpAOB2D-84fPzmOJg), also shown in the sections below.
This tutorial assumes that you have been granted access to Meta Llama 3 on Hugging Face - you can open a Hugging Face Meta model page [here](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) to confirm that you see "Gated model You have been granted access to this model"; if you see "You need to agree to share your contact information to access this model" instead, simply complete and submit the form on the page.
You'll also need your Hugging Face access token which you can get at your Settings page [here](https://huggingface.co/settings/tokens).
## Setting up vLLM with Llama 3
On a terminal, run the following commands:
```
conda create -n llama3 python=3.11
conda activate llama3
pip install vllm
```
Then run `huggingface-cli login` and copy and paste your Hugging Face access token to complete the login.
<!-- markdown-link-check-disable -->
There are two ways to deploy Llama 3 via vLLM, as a general API server or an OpenAI-compatible server (see [here](https://platform.openai.com/docs/api-reference/authentication) on how the OpenAI API authenticates, but you won't need to provide a real OpenAI API key when running Llama 3 via vLLM in the OpenAI-compatible mode).
<!-- markdown-link-check-enable -->
### Deploying Llama 3 as an API Server
Run the command below to deploy vLLM as a general Llama 3 service:
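The exact command was elided here; the sketch below assumes vLLM's demo API server entrypoint, the Llama 3 8B instruct model, and port 5000 (the port referenced later in this tutorial):

```
python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
```

Then, on another terminal, you can run a `curl` request along the lines of this sketch (the `/generate` endpoint and payload fields are assumptions based on vLLM's demo API server)

```
curl http://localhost:5000/generate -d '{
    "prompt": "Who wrote the book Innovators dilemma?",
    "max_tokens": 300,
    "temperature": 0
}'
```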
to send a query (prompt) to Llama 3 via vLLM and get Llama 3's response:
> Who wrote the book Innovators dilemma? The book "Innovator's Dilemma" was written by Clayton M. Christensen. It was first published in 1997 and has since become a classic in the field of business and innovation. In the book, Christensen argues that successful companies often struggle to adapt to disruptive technologies and new market entrants, and that this struggle can lead to their downfall. He also introduces the concept of the "innovator's dilemma," which refers to the paradoxical situation in which a company's efforts to improve its existing products or services can actually lead to its own decline.
Now in your Llama 3 client app, you can make an HTTP request like the `curl` command above to send a query to Llama 3 and parse the response.
If you add port 5000 to your EC2 instance's security group's inbound rules with the TCP protocol, then you can run the query below from your Mac or Windows machine as a test:
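A sketch of that remote query, reusing the `/generate` payload above; `<vllm_server_ip_address>` is a placeholder (also used later in this tutorial) for your EC2 instance's public IP:

```
curl http://<vllm_server_ip_address>:5000/generate -d '{
    "prompt": "Who wrote the book Innovators dilemma?",
    "max_tokens": 300,
    "temperature": 0
}'
```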
Also, if you have multiple GPUs, you can add the `--tensor-parallel-size` argument when starting the server (see [here](https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html) for more info). For example, the command below runs the Llama 3 8b-instruct model on 4 GPUs:
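A sketch of that launch, assuming the same demo API server entrypoint, model id, and port as above:

```
python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 4
```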
With multiple GPUs, you can also run replicas of a model as long as the model fits into the targeted GPU memory. For example, if you have two A10Gs with 24GB memory each, you can run two Llama 3 8B models at the same time by launching two API servers, each pinned to specific CUDA devices, on different ports:
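A sketch of such a launch, assuming one GPU per server and ports 5000/5001 as examples:

```
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct &
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 5001 --model meta-llama/Meta-Llama-3-8B-Instruct &
```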
The benefit is that you can then balance incoming requests across both models, achieving higher overall batch-size processing at the cost of some generation latency.
### Deploying Llama 3 as an OpenAI-Compatible Server
You can also deploy the vLLM-hosted Llama 3 as an OpenAI-compatible service to easily replace code that uses the OpenAI API. First, run the command below:
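The exact command was elided here; the sketch below assumes vLLM's OpenAI-compatible entrypoint with the same model id and port as before:

```
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 5000 --model meta-llama/Meta-Llama-3-8B-Instruct
```

Then, on another terminal, you can send a completion request along these lines (the `/v1/completions` path and `model` field follow the OpenAI API convention; the `prompt`, `max_tokens`, and `temperature` values are the ones recovered from this section):

```
curl http://localhost:5000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "Who wrote the book Innovators dilemma?",
    "max_tokens": 300,
    "temperature": 0
}'
```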
"prompt": "Who wrote the book Innovators dilemma?",
90
95
"max_tokens": 300,
91
96
"temperature": 0
@@ -95,15 +100,15 @@ and you'll see the following result:
95
100
96
101
> The book "Innovator's Dilemma" was written by Clayton M. Christensen. It was first published in 1997 and has since become a classic in the field of business and innovation. In the book, Christensen argues that successful companies often struggle to adapt to disruptive technologies and new market entrants, and that this struggle can lead to their downfall. He also introduces the concept of the "innovator's dilemma," which refers to the paradoxical situation in which a company's efforts to improve its existing products or services can actually lead to its own decline.
## Querying with Llama 3 via vLLM
On a Google Colab notebook, first install two packages:
```
!pip install langchain openai
```
Note that you only need to install the `openai` package with an `EMPTY` OpenAI API key to complete the LangChain integration with the OpenAI-compatible vLLM deployment of Llama 3.
Then replace the `<vllm_server_ip_address>` below and run the code:
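The original code cell was elided here; below is a minimal sketch using LangChain's `VLLMOpenAI` wrapper, with parameter values assumed from the server settings above:

```
from langchain.llms import VLLMOpenAI

# Point the OpenAI-compatible client at the vLLM server; the API key is unused but required.
llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://<vllm_server_ip_address>:5000/v1",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
)

print(llm("Who wrote the book godfather?"))
```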
> The book "The Godfather" was written by Mario Puzo. It was first published in 1969 and has since become a classic of American literature. The book was later adapted into a successful film directed by Francis Ford Coppola, which was released in 1972.
You can now use the Llama 3 instance `llm` created this way in any of the demo apps, or in your own Llama 3 apps, to integrate seamlessly with LangChain and build powerful on-prem apps.
## Setting Up TGI with Llama 3
The easiest way to deploy Llama 3 with TGI is to use its official Docker image. First, replace `<your_hugging_face_access_token>` and set the three required shell variables (you may replace the `model` value below with another Llama 3 model):
```
model=meta-llama/Meta-Llama-3-8B-Instruct
volume=$PWD/data
token=<your_hugging_face_access_token>
```
Then run the command below to deploy the Llama 3 8B instruct model with TGI:
```
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
```
After this, you'll be able to run the command below on another terminal:
```
curl 127.0.0.1:8080/generate_stream -X POST -H 'Content-Type: application/json' -d '{
    "inputs": "Who wrote the book innovators dilemma?",
    "parameters": {
        "max_new_tokens":200
    }
}'
```
and see the answer generated by Llama 3 via TGI, like the one below:
> The book "The Innovator's Dilemma" was written by Clayton Christensen, a professor at Harvard Business School. It was first published in 1997 and has since become a widely recognized and influential book on the topic of disruptive innovation.
## Querying with Llama 3 via TGI
Using LangChain to integrate with TGI-hosted Llama 3 is also straightforward. In the Colab above, first add a new code cell to install the Hugging Face `text_generation` package:
```
!pip install text_generation
```
Then add and run the code below:
```
from langchain_community.llms import HuggingFaceTextGenInference
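
# The original notebook cell is truncated at this point; the lines below are a sketch
# of how a TGI endpoint is typically wired up with this wrapper. The URL placeholder
# <tgi_server_ip_address> and the parameter values are assumptions, not the original code.
llm = HuggingFaceTextGenInference(
    inference_server_url="http://<tgi_server_ip_address>:8080/",
    max_new_tokens=200,
)

print(llm("Who wrote the book innovators dilemma?"))
```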