Commit 3516a5c

Merge pull request meta-llama#2 from meta-llama/inference_changes
Inference/Finetuning changes
2 parents b15bad9 + 935f66f commit 3516a5c

11 files changed: +335, -171 lines


.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 8 additions & 1 deletion
@@ -1406,5 +1406,12 @@ DLAI
 agentic
 containts
 dlai
+Prerequirements
+tp
+QLoRA
+ntasks
+srun
+xH
+unquantized
 eom
-ipython
+ipython

docs/LLM_finetuning.md

Lines changed: 1 addition & 3 deletions
@@ -1,6 +1,6 @@
 ## LLM Fine-Tuning

-Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
+Here we discuss fine-tuning Meta Llama with a couple of different recipes. We will cover two scenarios here:


 ## 1. **Parameter Efficient Model Fine-Tuning**
@@ -18,8 +18,6 @@ These methods will address three aspects:

 HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).

-
-
 ## 2. **Full/ Partial Parameter Fine-Tuning**

 Full parameter fine-tuning has its own advantages, in this method there are multiple strategies that can help:
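
Note: since the hunk above points to the HF PEFT library without a concrete snippet, the following minimal sketch illustrates what a LoRA setup with PEFT typically looks like. It is not part of this commit; the model id and hyperparameters are placeholder assumptions.

```python
# Minimal LoRA setup with HF PEFT (sketch; model id and hyperparameters are assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B")  # placeholder checkpoint

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Only the adapter parameters are updated during training, which is what keeps the memory footprint small enough for the single-node setups described in the recipes.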

docs/multi_gpu.md

Lines changed: 3 additions & 4 deletions
@@ -6,13 +6,12 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:

 2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).

-Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 3 8B model on multiple GPUs in one node or multi-node.
+Given the combination of PEFT and FSDP, we would be able to fine-tune a Meta Llama 8B model on multiple GPUs in one node.
+For big models like the 405B we will need to fine-tune in a multi-node setup even if 4bit quantization is enabled.

 ## Requirements
 To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/quickstart/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).

-**Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
-
 ## How to run it

 Get access to a machine with multiple GPUs ( in this case we tested with 4 A100 and A10s).
@@ -61,7 +60,7 @@ torchrun --nnodes 1 --nproc_per_node 8 recipes/quickstart/finetuning/finetuning
 This has been tested on 4 H100s GPUs.

 ```bash
-FSDP_CPU_RAM_EFFICIENT_LOADING=1 ACCELERATE_USE_FSDP=1 torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --quantization int4 --model_name /path_of_model_folder/70B --mixed_precision False --low_cpu_fsdp --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
+FSDP_CPU_RAM_EFFICIENT_LOADING=1 ACCELERATE_USE_FSDP=1 torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --quantization 4bit --model_name /path_of_model_folder/70B --mixed_precision False --low_cpu_fsdp --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
 ```

 ### Fine-tuning using FSDP on 70B Model
recipes/3p_integrations/vllm/README.md (new file)

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+# Llama inference with vLLM
+
+This folder contains an example for running Llama inference on multiple GPUs in single-node as well as multi-node scenarios using vLLM.
+
+## Prerequisites
+
+To run this example we need to install vLLM, as well as ray if multi-node inference is the goal.
+
+```bash
+pip install vllm
+
+# For multi-node inference we also need to install ray
+pip install ray[default]
+```
+
+For the following examples we assume that we fine-tuned a base model using the LoRA method and have set up the following environment variables pointing to the base model as well as the LoRA adapter:
+
+```bash
+export MODEL_PATH=/path/to/out/base/model
+export PEFT_MODEL_PATH=/path/to/out/peft/model
+```
+
+## Single-node multi-GPU inference
+To launch the inference, simply execute the following command, changing the tp_size parameter to the number of GPUs you have available:
+
+```bash
+python inference.py --model_name $MODEL_PATH --peft_model_name $PEFT_MODEL_PATH --tp_size 8 --user_prompt "Hello my name is"
+```
+After completing a generation the script will ask for another prompt in a loop; you can exit by pressing Enter and leaving the prompt empty.
+When using multiple GPUs the model will automatically be split across the available GPUs using tensor parallelism.
+
+## Multi-node multi-GPU inference
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the script located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need multi-node inference.
+vLLM allows this by leveraging pipeline parallelism across nodes while still applying tensor parallelism inside each node.
+To start a multi-node inference we first need to set up a ray cluster, which will be leveraged by vLLM to execute the model across node boundaries.
+
+```bash
+# On the head node we start the cluster as follows
+ray start --head
+
+# After the server starts it prints out a couple of lines including the command to add nodes to the cluster e.g.:
+# To add another node to this Ray cluster, run
+#   ray start --address='<head-node-ip-address>:6379'
+# Where the head node ip address will depend on your environment
+
+# We can then add the worker nodes by executing the command in a shell on each worker node
+ray start --address='<head-node-ip-address>:6379'
+
+# We can check if the cluster was launched successfully by executing this on any node
+ray status
+
+# It should show the number of nodes we have added as well as the head node
+# Node status
+# ---------------------------------------------------------------
+# Active:
+#  1 node_82143b740a25228c24dc8bb3a280b328910b2fcb1987eee52efb838b
+#  1 node_3f2c673530de5de86f953771538f35437ab60e3cacd7730dbca41719
+```
+
+To launch the inference we can then execute the inference script, adapting pp_size and tp_size to our environment:
+
+```
+pp_size - number of worker + head nodes
+
+tp_size - number of GPUs per node
+```
+
+If our environment consists of two nodes with 8 GPUs each we would execute:
+```bash
+python inference.py --model_name $MODEL_PATH --peft_model_name $PEFT_MODEL_PATH --pp_size 2 --tp_size 8 --user_prompt "Hello my name is"
+```
+
+The launch of the vLLM engine will take some time depending on your environment, as each worker needs to load the checkpoint files to extract its fraction of the weights,
+so do not worry even if it seems to hang for a while.
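
Note: beyond `ray status`, the cluster can also be checked from Python before the vLLM engine is launched. The snippet below is an illustrative sketch, not part of the committed README; it only assumes the `ray[default]` install and the `ray start` commands shown above.

```python
# Sketch: verify the ray cluster from Python before starting multi-node inference.
import ray

ray.init(address="auto")  # attach to the already running cluster instead of starting a local one

alive = [node for node in ray.nodes() if node["Alive"]]
total_gpus = sum(int(node["Resources"].get("GPU", 0)) for node in alive)

print(f"nodes: {len(alive)}, total GPUs: {total_gpus}")
# For a homogeneous cluster, pp_size should match the number of nodes and
# tp_size the GPUs per node, e.g. pp_size=len(alive), tp_size=total_gpus // len(alive).
```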
recipes/3p_integrations/vllm/inference.py

Lines changed: 35 additions & 12 deletions
@@ -1,11 +1,13 @@
 # Copyright (c) Meta Platforms, Inc. and affiliates.
 # This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

+import uuid
+import asyncio
 import fire

 import torch
-from vllm import LLM
-from vllm import LLM, SamplingParams
+from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
+from vllm.lora.request import LoRARequest
 from accelerate.utils import is_xpu_available

 if is_xpu_available():
@@ -15,13 +17,24 @@

 torch.manual_seed(42)

-def load_model(model_name, tp_size=1):
+def load_model(model_name, peft_model=None, pp_size=1, tp_size=1):
+    additional_configs = {}
+    if peft_model:
+        additional_configs["enable_lora"] = True
+
+    engine_config = AsyncEngineArgs(
+        model=model_name,
+        pipeline_parallel_size=pp_size,
+        tensor_parallel_size=tp_size,
+        max_loras=1,
+        **additional_configs)

-    llm = LLM(model_name, tensor_parallel_size=tp_size)
+    llm = AsyncLLMEngine.from_engine_args(engine_config)
     return llm

-def main(
+async def main(
     model,
+    peft_model_name=None,
     max_new_tokens=100,
     user_prompt=None,
     top_p=0.9,
@@ -35,26 +48,36 @@ def main(

         print(f"sampling params: top_p {top_p} and temperature {temperature} for this inference request")
         sampling_param = SamplingParams(top_p=top_p, temperature=temperature, max_tokens=max_new_tokens)
-

-        outputs = model.generate(user_prompt, sampling_params=sampling_param)
+        lora_request = None
+        if peft_model_name:
+            lora_request = LoRARequest("lora", 0, peft_model_name)
+
+        req_id = str(uuid.uuid4())
+
+        generator = model.generate(user_prompt, sampling_param, req_id, lora_request=lora_request)
+        output = None
+        async for request_output in generator:
+            output = request_output

-        print(f"model output:\n {user_prompt} {outputs[0].outputs[0].text}")
+        print(f"model output:\n {user_prompt} {output.outputs[0].text}")
         user_prompt = input("Enter next prompt (press Enter to exit): ")
         if not user_prompt:
            break

 def run_script(
     model_name: str,
-    peft_model=None,
-    tp_size=1,
+    peft_model_name=None,
+    pp_size: int = 1,
+    tp_size: int = 1,
     max_new_tokens=100,
     user_prompt=None,
     top_p=0.9,
     temperature=0.8
 ):
-    model = load_model(model_name, tp_size)
-    main(model, max_new_tokens, user_prompt, top_p, temperature)
+    model = load_model(model_name, peft_model_name, pp_size, tp_size)
+
+    asyncio.get_event_loop().run_until_complete(main(model, peft_model_name, max_new_tokens, user_prompt, top_p, temperature))

 if __name__ == "__main__":
     fire.Fire(run_script)
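
Note: for context on this change, the script previously used vLLM's synchronous `LLM` entry point. A roughly equivalent single-prompt sketch with the synchronous API is shown below for comparison; it is not part of the commit, the paths and sampling values are placeholders, and exact keyword support may vary between vLLM versions.

```python
# Sketch of the older, synchronous vLLM path for a single offline prompt.
# Paths are placeholders; LoRA support requires a vLLM build with enable_lora.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="/path/to/base/model", enable_lora=True, tensor_parallel_size=8)
params = SamplingParams(top_p=0.9, temperature=0.8, max_tokens=100)

outputs = llm.generate(
    "Hello my name is",
    params,
    lora_request=LoRARequest("lora", 1, "/path/to/peft/model"),  # adapter name, id, path
)
print(outputs[0].outputs[0].text)
```

The switch to `AsyncLLMEngine` in the diff is what exposes `pipeline_parallel_size` and per-request LoRA adapters for the multi-node example in the README.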

recipes/quickstart/finetuning/LLM_finetuning_overview.md

Lines changed: 1 addition & 3 deletions
@@ -1,6 +1,6 @@
 ## LLM Fine-Tuning

-Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
+Here we discuss fine-tuning Meta Llama with a couple of different recipes. We will cover two scenarios here:


 ## 1. **Parameter Efficient Model Fine-Tuning**
@@ -18,8 +18,6 @@ These methods will address three aspects:

 HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).

-
-
 ## 2. **Full/ Partial Parameter Fine-Tuning**

 Full parameter fine-tuning has its own advantages, in this method there are multiple strategies that can help:

recipes/quickstart/finetuning/multigpu_finetuning.md

Lines changed: 25 additions & 0 deletions
@@ -68,7 +68,32 @@ If you are running full parameter fine-tuning on the 70B model, you can enable `
 torchrun --nnodes 1 --nproc_per_node 8 finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
 ```

+**Multi-GPU multi-node**:

+Here we use a Slurm script to schedule a job over multiple nodes.
+
+```bash
+
+sbatch recipes/quickstart/finetuning/multi_node.slurm
+# Adjust the number of nodes and GPUs per node in the script before running.
+
+```
+
+To fine-tune the Meta Llama 405B model with LoRA on 32x H100 (80 GB) GPUs we need to combine 4bit quantization (QLoRA) and FSDP.
+We can achieve this by adding the following environment variables to the slurm script (before the srun command at the bottom).
+
+```bash
+export FSDP_CPU_RAM_EFFICIENT_LOADING=1
+export ACCELERATE_USE_FSDP=1
+```
+
+Then we need to replace the bottom srun command with the following:
+
+```bash
+srun torchrun --nproc_per_node 8 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 ./finetuning.py --enable_fsdp --use_peft --peft_method lora --quantization 4bit --quantization_config.quant_type nf4 --mixed_precision False --low_cpu_fsdp
+```
+
+Do not forget to adjust the number of nodes, ntasks and gpus-per-task at the top.

 ## Running with different datasets
 Currently 3 open source datasets are supported that can be found in [Datasets config file](../../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
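
Note: for readers wondering what `--quantization 4bit --quantization_config.quant_type nf4` maps to, the sketch below shows an equivalent 4-bit NF4 (QLoRA-style) load via `BitsAndBytesConfig`. Whether llama-recipes builds exactly this config from those flags is an assumption here, and the model path is a placeholder.

```python
# Sketch: loading a model in 4-bit NF4 with bitsandbytes, roughly what a
# "4bit" + "nf4" quantization setting configures. Model path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "/path_of_model_folder/405B",           # placeholder path, as in the commands above
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```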

recipes/quickstart/inference/local_inference/README.md

Lines changed: 7 additions & 4 deletions
@@ -27,8 +27,8 @@ samsum_prompt.txt
 ...
 ```

-**Note**
-Currently pad token by default in [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). We add the padding token as a special token to the tokenizer, which in this case requires to resize the token_embeddings as shown below:
+**Note on Llama version < 3.1**
+The default padding token in [HuggingFace Tokenizer is `None`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L110). To use padding, the padding token needs to be added as a special token to the tokenizer, which in this case requires resizing the token_embeddings as shown below:

 ```python
 tokenizer.add_special_tokens(
@@ -39,8 +39,7 @@ tokenizer.add_special_tokens(
     )
 model.resize_token_embeddings(model.config.vocab_size + 1)
 ```
-Padding would be required for batch inference. In this this [example](inference.py), batch size = 1 so essentially padding is not required. However,We added the code pointer as an example in case of batch inference.
-
+Padding would be required for batched inference. In this [example](inference.py), batch size = 1 so essentially padding is not required. However, we added the code pointer as an example in case of batch inference. For Llama version 3.1 use the special token `<|finetune_right_pad_id|>` (128004) for padding.

 ## Chat completion
 The inference folder also includes a chat completion example, that adds built-in safety features in fine-tuned models to the prompt tokens. To run the example:
@@ -85,3 +84,7 @@ Then run inference using:
 python inference.py --model_name <training_config.output_dir> --prompt_file <test_prompt_file>

 ```
+
+## Inference on large models like Meta Llama 405B
+The FP8 quantized variants of Meta Llama (i.e. meta-llama/Meta-Llama-3.1-405B-FP8 and meta-llama/Meta-Llama-3.1-405B-Instruct-FP8) can be executed on a single node with 8x80GB H100 using the scripts located in this folder.
+To run the unquantized Meta Llama 405B variants (i.e. meta-llama/Meta-Llama-3.1-405B and meta-llama/Meta-Llama-3.1-405B-Instruct) we need to use a multi-node setup for inference. The llama-recipes inference script currently does not allow multi-node inference. To run this model you can use vLLM with pipeline and tensor parallelism as shown in [this example](../../../3p_integrations/vllm/README.md).
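
Note: relating to the padding note above, Llama 3.1 tokenizers already ship the reserved `<|finetune_right_pad_id|>` token, so batched inference can reuse it instead of adding a `<PAD>` token and resizing embeddings. The snippet below is an illustrative sketch, not part of the commit; the model id is a placeholder.

```python
# Sketch: batched-inference padding for Llama 3.1 using the reserved pad token.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.pad_token = "<|finetune_right_pad_id|>"  # token id 128004 in the Llama 3.1 vocab

batch = tokenizer(
    ["Hello my name is", "A much longer prompt that also goes into the same batch"],
    padding=True,          # pads the shorter prompt with <|finetune_right_pad_id|>
    return_tensors="pt",
)
# batch["attention_mask"] marks the padded positions so generate() can ignore them.
```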

recipes/quickstart/inference/local_inference/chat_completion/chat_completion.py

Lines changed: 13 additions & 11 deletions
@@ -4,6 +4,7 @@
 # from accelerate import init_empty_weights, load_checkpoint_and_dispatch

 import fire
+import json
 import os
 import sys

@@ -18,7 +19,7 @@
 def main(
     model_name,
     peft_model: str=None,
-    quantization: bool=False,
+    quantization: str = None, # Options: 4bit, 8bit
     max_new_tokens =256, #The maximum numbers of tokens to generate
     min_new_tokens:int=0, #The minimum numbers of tokens to generate
     prompt_file: str=None,
@@ -47,33 +48,32 @@ def main(

     elif not sys.stdin.isatty():
         dialogs = "\n".join(sys.stdin.readlines())
+        try:
+            dialogs = json.loads(dialogs)
+        except:
+            print("Could not parse json from stdin. Please provide a json file with the user prompts. Exiting.")
+            sys.exit(1)
     else:
         print("No user prompt provided. Exiting.")
         sys.exit(1)

     print(f"User dialogs:\n{dialogs}")
     print("\n==================================\n")
-
-
+
     # Set the seeds for reproducibility
     if is_xpu_available():
         torch.xpu.manual_seed(seed)
     else:
         torch.cuda.manual_seed(seed)
     torch.manual_seed(seed)
-    model = load_model(model_name, quantization, use_fast_kernels)
+
+    model = load_model(model_name, quantization, use_fast_kernels, **kwargs)
     if peft_model:
         model = load_peft_model(model, peft_model)

     tokenizer = AutoTokenizer.from_pretrained(model_name)
-    tokenizer.add_special_tokens(
-        {
-
-            "pad_token": "<PAD>",
-        }
-    )

-    chats = tokenizer.apply_chat_template(dialogs)
+    chats = [tokenizer.apply_chat_template(dialog) for dialog in dialogs]

     with torch.no_grad():
         for idx, chat in enumerate(chats):
@@ -99,12 +99,14 @@ def main(
                 sys.exit(1) # Exit the program with an error status
             tokens= torch.tensor(chat).long()
             tokens= tokens.unsqueeze(0)
+            attention_mask = torch.ones_like(tokens)
             if is_xpu_available():
                 tokens= tokens.to("xpu:0")
             else:
                 tokens= tokens.to("cuda:0")
             outputs = model.generate(
                 input_ids=tokens,
+                attention_mask=attention_mask,
                 max_new_tokens=max_new_tokens,
                 do_sample=do_sample,
                 top_p=top_p,
