
Commit 7fd6bfe

Merge pull request #356 from tsikiksr/master
LLM folder and quantization sample
2 parents 19ae8e7 + be1b533 commit 7fd6bfe

2 files changed (+340, -0 lines)

LLM/Quantization/readme.md

Lines changed: 325 additions & 0 deletions
# Quantizing Llama 2 70B

This sample provides a step-by-step walkthrough of quantizing a Llama 2 70B model to 4-bit weights so that it fits on 2 x A10 GPUs.
The sample uses GPTQ via the HuggingFace transformers library to reduce the weight parameters to 4 bits, so the model takes about 35GB of memory.

## Prerequisites
* [Create an object storage bucket](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#2-object-storage) - to save the quantized model to the model catalog
* [Set the policies](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#3-oci-policies) - to allow the OCI Data Science Service resources to access object storage buckets, networking, and other resources
* [Notebook session](https://docs.oracle.com/en-us/iaas/data-science/using/manage-notebook-sessions.htm) - to run this sample. Use the **VM.GPU.A10.2** shape for the notebook session.
* [Access token from HuggingFace](https://huggingface.co/docs/hub/security-tokens) to download the Llama 2 model. The pre-trained model can be obtained from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or [HuggingFace](https://huggingface.co/models?sort=trending&search=meta-llama%2Fllama-2). In this example, we will use the [HuggingFace access token](https://huggingface.co/docs/hub/security-tokens) to download the pre-trained model from HuggingFace (by setting the __HUGGING_FACE_HUB_TOKEN__ environment variable).
* Log in to HuggingFace with the auth token (a programmatic alternative is shown after this list):
  * Open a terminal window in the notebook session
  * Run `huggingface-cli login`
  * Paste the auth token
  * See more information [here](https://huggingface.co/docs/huggingface_hub/quick-start#login)
* Install the required Python libraries (from the terminal window):
```bash
pip install "transformers[sentencepiece]==4.32.1" "optimum==1.12.0" "auto-gptq==0.4.2" "accelerate==0.22.0" "safetensors>=0.3.1" --upgrade
```
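
As an alternative to the CLI login, you can authenticate from Python using the `huggingface_hub` library (installed alongside transformers). This is a minimal sketch, not part of the original sample; it assumes your token is available in the __HUGGING_FACE_HUB_TOKEN__ environment variable:

```python
import os
from huggingface_hub import login

# Assumes the HuggingFace access token is set as an environment variable.
# You can also pass the token string directly, e.g. login(token="hf_...").
login(token=os.environ["HUGGING_FACE_HUB_TOKEN"])
```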

## Load the full model
We can load the full model using the device_map="auto" argument. This will use CPU memory to store the weights that cannot be loaded into the GPUs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

model_full = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

By looking at the device map, we can see many layers are loaded into the CPU memory.
```python
model_full.hf_device_map
```

<details>
<summary>Full 70B model device map on A10.2</summary>

{'model.embed_tokens': 0,
'model.layers.0': 1,
'model.layers.1': 1,
'model.layers.2': 1,
'model.layers.3': 1,
'model.layers.4': 1,
'model.layers.5': 'cpu',
'model.layers.6': 'cpu',
'model.layers.7': 'cpu',
'model.layers.8': 'cpu',
'model.layers.9': 'cpu',
'model.layers.10': 'cpu',
'model.layers.11': 'cpu',
'model.layers.12': 'cpu',
'model.layers.13': 'cpu',
'model.layers.14': 'cpu',
'model.layers.15': 'cpu',
'model.layers.16': 'cpu',
'model.layers.17': 'cpu',
'model.layers.18': 'cpu',
'model.layers.19': 'cpu',
'model.layers.20': 'cpu',
'model.layers.21': 'cpu',
'model.layers.22': 'cpu',
'model.layers.23': 'cpu',
'model.layers.24': 'cpu',
'model.layers.25': 'cpu',
'model.layers.26': 'cpu',
'model.layers.27': 'cpu',
'model.layers.28': 'cpu',
'model.layers.29': 'cpu',
'model.layers.30': 'cpu',
'model.layers.31': 'cpu',
'model.layers.32': 'cpu',
'model.layers.33': 'cpu',
'model.layers.34': 'cpu',
'model.layers.35': 'cpu',
'model.layers.36': 'cpu',
'model.layers.37': 'cpu',
'model.layers.38': 'cpu',
'model.layers.39': 'cpu',
'model.layers.40': 'cpu',
'model.layers.41': 'cpu',
'model.layers.42': 'cpu',
'model.layers.43': 'cpu',
'model.layers.44': 'cpu',
'model.layers.45': 'cpu',
'model.layers.46': 'cpu',
'model.layers.47': 'cpu',
'model.layers.48': 'cpu',
'model.layers.49': 'cpu',
'model.layers.50': 'cpu',
'model.layers.51': 'cpu',
'model.layers.52': 'cpu',
'model.layers.53': 'cpu',
'model.layers.54': 'cpu',
'model.layers.55': 'cpu',
'model.layers.56': 'cpu',
'model.layers.57': 'cpu',
'model.layers.58': 'cpu',
'model.layers.59': 'cpu',
'model.layers.60': 'cpu',
'model.layers.61': 'cpu',
'model.layers.62': 'cpu',
'model.layers.63': 'cpu',
'model.layers.64': 'cpu',
'model.layers.65': 'cpu',
'model.layers.66': 'cpu',
'model.layers.67': 'cpu',
'model.layers.68': 'cpu',
'model.layers.69': 'cpu',
'model.layers.70': 'cpu',
'model.layers.71': 'cpu',
'model.layers.72': 'cpu',
'model.layers.73': 'cpu',
'model.layers.74': 'cpu',
'model.layers.75': 'cpu',
'model.layers.76': 'cpu',
'model.layers.77': 'cpu',
'model.layers.78': 'cpu',
'model.layers.79': 'cpu',
'model.norm': 'cpu',
'lm_head': 'cpu'}

</details>
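
As a quick sanity check (not part of the original sample), you can confirm that only a small portion of the full-precision model is resident on the GPUs and that most of the fp16 weights stay in CPU RAM. A minimal sketch, assuming `model_full` from above:

```python
import torch

# GPU memory currently allocated by PyTorch on each device.
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GB allocated")

# Approximate total size of the model weights (fp16 = 2 bytes per parameter).
print(f"Model footprint: {model_full.get_memory_footprint() / 1024**3:.1f} GB")
```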

## Quantize the model
It is possible to quantize the model on the A10.2 shape. However, the maximum sequence length is limited by the available GPU memory.
Quantization requires a dataset to calibrate the quantized model. In this example we use the 'wikitext2' dataset.

We need to specify the maximum memory the GPUs may use when loading the model, because some headroom must be kept free for the quantization itself. Here we specify a max_memory of 5GB for each GPU when loading the model.
Due to the size of the model and the limited memory on the A10s, we also need to limit the maximum sequence length used for quantization (model_seqlen) to 128. You may increase this number by reducing the max_memory used by each GPU when loading the model.
We also need to set max_split_size_mb for PyTorch to reduce memory fragmentation.

The following parameters have been found to work when quantizing on A10.2:
* max_split_size_mb = 512
* max_memory = 5GB (for each GPU)
* model_seqlen = 128

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

model_id = "meta-llama/Llama-2-70b-hf"
dataset_id = "wikitext2"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
gptq_config = GPTQConfig(bits=4, dataset=dataset_id, tokenizer=tokenizer, model_seqlen=128)

model_quantized = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "5GB", 1: "5GB", "cpu": "400GB"},
    quantization_config=gptq_config,
)
```

The process will show the progress over the 80 layers. It takes about 1 hour with the wikitext2 dataset.

Note that we cannot run inference on this particular "quantized model", as some "blocks" are loaded across multiple devices. For inferencing, we need to save the model and load it back.

Save the quantized model and the tokenizer:
```python
save_folder = "Llama-2-70b-hf-quantized"
model_quantized.save_pretrained(save_folder)
tokenizer.save_pretrained(save_folder)
```

Because the model was partially offloaded during quantization, disable_exllama was set to True to avoid an error. For inference and production loads we want to leverage the exllama kernels, so we need to change the config.json:
Edit the config.json file, find the key 'disable_exllama' and set it to false.
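
If you prefer to patch the file programmatically rather than editing it by hand, here is a minimal sketch (assuming `save_folder` from above and that the 'disable_exllama' key sits under the "quantization_config" section of config.json):

```python
import json
import os

config_path = os.path.join(save_folder, "config.json")

with open(config_path) as f:
    config = json.load(f)

# Assumption: the GPTQ settings are serialized under "quantization_config".
config["quantization_config"]["disable_exllama"] = False

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```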

## Working with the quantized model
Now that the model is saved to disk, we can see that its size is about 34.1GB. That aligns with our estimate (roughly 70B parameters × 0.5 bytes per parameter ≈ 35GB), and it fits on 2 x A10 GPUs, which together provide 48GB of GPU memory.
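
To check the size on disk from Python (a small helper, not part of the original sample), sum the file sizes under the save folder:

```python
import os

# Total size of all files under the quantized model folder, in GB.
total_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(save_folder)
    for name in files
)
print(f"{total_bytes / 1024**3:.1f} GB")
```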

We can load the quantized model back without the max_memory limit:
```python
tokenizer_q = AutoTokenizer.from_pretrained(save_folder)
model_quantized = AutoModelForCausalLM.from_pretrained(
    save_folder,
    device_map="auto",
)
```

By checking the device map, we see the entire model is loaded into the GPUs:
<details>
<summary>Quantized model device map on A10.2</summary>

{'model.embed_tokens': 0,
'model.layers.0': 0,
'model.layers.1': 0,
'model.layers.2': 0,
'model.layers.3': 0,
'model.layers.4': 0,
'model.layers.5': 0,
'model.layers.6': 0,
'model.layers.7': 0,
'model.layers.8': 0,
'model.layers.9': 0,
'model.layers.10': 0,
'model.layers.11': 0,
'model.layers.12': 0,
'model.layers.13': 0,
'model.layers.14': 0,
'model.layers.15': 0,
'model.layers.16': 0,
'model.layers.17': 0,
'model.layers.18': 0,
'model.layers.19': 0,
'model.layers.20': 0,
'model.layers.21': 0,
'model.layers.22': 0,
'model.layers.23': 0,
'model.layers.24': 0,
'model.layers.25': 0,
'model.layers.26': 0,
'model.layers.27': 0,
'model.layers.28': 0,
'model.layers.29': 0,
'model.layers.30': 0,
'model.layers.31': 0,
'model.layers.32': 0,
'model.layers.33': 0,
'model.layers.34': 0,
'model.layers.35': 0,
'model.layers.36': 0,
'model.layers.37': 0,
'model.layers.38': 1,
'model.layers.39': 1,
'model.layers.40': 1,
'model.layers.41': 1,
'model.layers.42': 1,
'model.layers.43': 1,
'model.layers.44': 1,
'model.layers.45': 1,
'model.layers.46': 1,
'model.layers.47': 1,
'model.layers.48': 1,
'model.layers.49': 1,
'model.layers.50': 1,
'model.layers.51': 1,
'model.layers.52': 1,
'model.layers.53': 1,
'model.layers.54': 1,
'model.layers.55': 1,
'model.layers.56': 1,
'model.layers.57': 1,
'model.layers.58': 1,
'model.layers.59': 1,
'model.layers.60': 1,
'model.layers.61': 1,
'model.layers.62': 1,
'model.layers.63': 1,
'model.layers.64': 1,
'model.layers.65': 1,
'model.layers.66': 1,
'model.layers.67': 1,
'model.layers.68': 1,
'model.layers.69': 1,
'model.layers.70': 1,
'model.layers.71': 1,
'model.layers.72': 1,
'model.layers.73': 1,
'model.layers.74': 1,
'model.layers.75': 1,
'model.layers.76': 1,
'model.layers.77': 1,
'model.layers.78': 1,
'model.layers.79': 1,
'model.norm': 1,
'lm_head': 1}

</details>

## Testing the model
We can use the HuggingFace pipeline to test inference with the model.

```python
import time
from transformers import pipeline

def generate(prompt, model, tokenizer, **kwargs):
    """Create a text generation pipeline, generate the completion, and track the time used for the generation."""
    generator = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, return_full_text=False)

    # warm up
    generator("How are you?")
    generator("Oracle is a great company.")

    time_started = time.time()
    completion = generator(prompt)[0]['generated_text']
    seconds_used = time.time() - time_started
    print(completion)
    num_tokens = len(completion.split())
    latency = seconds_used * 1000 / num_tokens
    token_per_sec = len(generator.tokenizer(completion)["input_ids"]) / seconds_used
    print(f"******\nTotal time: {seconds_used:.3f} seconds \nNumber of tokens: {num_tokens} \nThroughput: {token_per_sec:.2f} Tokens/sec \nLatency: {latency:.2f} ms/token")
```

Test the quantized model:

`generate("What's AI?", model_quantized, tokenizer_q)`

Output:

The AI 101 series is a collection of articles that will introduce you to the basics of artificial intelligence (AI). In this first article, we're going to talk about the history of AI, and how it has evolved over the years.

The first AI system was created in the 1950s, and it was called the Logic Theorist. This system was able to solve mathematical problems using a set of rules. The Logic Theorist was followed by other AI systems, such as the General Problem Solver and the Game of Checkers.

In the 1960s, AI researchers began to focus on developing systems that could understand natural language. This led to the development of the first chatbot, named ELIZA. ELIZA was able to hold a conversation with a human user by responding to their questions with pre-programmed responses.

In the 1970s, AI researchers began to focus on developing systems that could learn from data. This led to the development of the first expert system, named MYCIN. MYCIN was able to diagnose diseases by analyzing data from medical records.

******

Time used: 28.659 seconds

Number of tokens: 176

Throughput: 9.00 Tokens/sec

Latency: 111.11 ms/token

LLM/readme.md

Lines changed: 15 additions & 0 deletions
# Large Language Models in OCI Data Science

OCI Data Science can be used to fine-tune, deploy, and manage Large Language Models (LLMs) effectively, efficiently, and easily.
This page curates links to some common use cases for LLMs.

[Fine-tune Llama 2 with a distributed multi-node, multi-GPU job](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training/llama2)

[Quantize Llama 2 70B to 4 bits and deploy on 2xA10s](tbd)

[Deploy Llama 2 on a fully service-managed deployment using TGI or vLLM](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/master/model-deployment/containers/llama2)

[Deploy Mistral 7B](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/model-deployment/containers/llm/mistral)

[Deploy GPT-2 using NVIDIA Triton Inference Server](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/model-deployment/containers/Triton/gpt2_ensemble/Deploy_GPT2_Ensemble.md)
