Commit 0ab53c2

Added a feature that allows users to use the PyTorch profiler or flop_counter to measure performance during fine-tuning. (meta-llama#433)
2 parents 6ceb8b2 + 26e877f commit 0ab53c2

9 files changed: +363 -162 lines changed

docs/multi_gpu.md

Lines changed: 48 additions & 27 deletions
@@ -8,7 +8,7 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:

 Given the combination of PEFT and FSDP, we can fine-tune a Llama 2 model on multiple GPUs, in one node or across multiple nodes.

-## Requirements
+## Requirements
 To run the examples, make sure to install the llama-recipes package and clone the GitHub repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (see [README.md](../README.md) for details).

 **Please note that the llama_recipes package will install PyTorch 2.0.1. In case you want to run FSDP + PEFT, please make sure to install the PyTorch nightlies.**
@@ -115,32 +115,47 @@ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --m
 It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:

 ```python
-
-model_name: str="PATH/to/LLAMA 2/7B"
-enable_fsdp: bool= False
-run_validation: bool=True
-batch_size_training: int=4
-gradient_accumulation_steps: int=1
-num_epochs: int=3
-num_workers_dataloader: int=2
-lr: float=2e-4
-weight_decay: float=0.0
-gamma: float= 0.85
-use_fp16: bool=False
-mixed_precision: bool=True
-val_batch_size: int=4
-dataset = "samsum_dataset" # alpaca_dataset, grammar_dataset
-peft_method: str = "lora" # None , llama_adapter, prefix
-use_peft: bool=False
-output_dir: str = "./ft-output"
-freeze_layers: bool = False
-num_freeze_layers: int = 1
-quantization: bool = False
-save_model: bool = False
-dist_checkpoint_root_folder: str="model_checkpoints"
-dist_checkpoint_folder: str="fine-tuned"
-save_optimizer: bool=False
-
+model_name: str="PATH/to/Model"
+tokenizer_name: str=None
+enable_fsdp: bool=False
+low_cpu_fsdp: bool=False
+run_validation: bool=True
+batch_size_training: int=4
+batching_strategy: str="packing" #alternative: padding
+context_length: int=4096
+gradient_accumulation_steps: int=1
+gradient_clipping: bool = False
+gradient_clipping_threshold: float = 1.0
+num_epochs: int=3
+max_train_step: int=0
+max_eval_step: int=0
+num_workers_dataloader: int=1
+lr: float=1e-4
+weight_decay: float=0.0
+gamma: float= 0.85
+seed: int=42
+use_fp16: bool=False
+mixed_precision: bool=True
+val_batch_size: int=1
+dataset = "samsum_dataset"
+peft_method: str = "lora" # None,llama_adapter, prefix
+use_peft: bool=False
+output_dir: str = "PATH/to/save/PEFT/model"
+freeze_layers: bool = False
+num_freeze_layers: int = 1
+quantization: bool = False
+one_gpu: bool = False
+save_model: bool = True
+dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+save_optimizer: bool=False # will be used if using FSDP
+use_fast_kernels: bool = False # Enable SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformer memory-efficient kernels
+use_wandb: bool = False # Enable wandb for experiment tracking
+save_metrics: bool = False # saves training metrics to a json file for later plotting
+flop_counter: bool = False # Enable flop counter to measure model throughput; cannot be used with the PyTorch profiler at the same time.
+flop_counter_start: int = 3 # The step at which flop counting starts; the default of 3 means counting starts after 3 warm-up steps.
+use_profiler: bool = False # Enable PyTorch profiler; cannot be used with the flop counter at the same time.
+profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
 ```

 * [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
@@ -167,3 +182,9 @@ save_optimizer: bool=False
 * `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory, with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase throughput. We recommend you use this option.

 * `pure_bf16` moves the model to `BFloat16`, and if `optimizer` is set to `anyprecision`, optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we are adding support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before using the FLOPS counter.
+
+Similarly, you can set the `--use_profiler` flag and pass a profiling output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage. The current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler is helpful for debugging. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
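
The wait/warmup/active split described above can be sketched in plain Python. This is a simplified model for illustration, not the `torch.profiler` implementation: the function name `profiler_phases` and the `idle` label are hypothetical, steps are counted from 1 as in the text, and the cycle is assumed to run only once.

```python
from typing import List


def profiler_phases(num_steps: int, wait: int = 1, warmup: int = 2, active: int = 3) -> List[str]:
    """Label each training step of one wait -> warmup -> active cycle.

    Hypothetical helper: it only mirrors the wait=1, warmup=2, active=3
    description in the text and does not call the PyTorch profiler.
    """
    phases = []
    for step in range(1, num_steps + 1):  # steps are counted from 1, as in the text
        if step <= wait:
            phases.append("wait")      # profiler does nothing
        elif step <= wait + warmup:
            phases.append("warmup")    # profiler runs, results are discarded
        elif step <= wait + warmup + active:
            phases.append("active")    # profiler records these steps
        else:
            phases.append("idle")      # assumption: the cycle is not repeated
    return phases


for step, phase in enumerate(profiler_phases(8), start=1):
    print(f"step {step}: {phase}")
```

Under these assumptions, recording covers steps 4 to 6, which is why the training run needs enough steps to get through the warm-up and the recording window.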

docs/single_gpu.md

Lines changed: 48 additions & 28 deletions
@@ -8,7 +8,7 @@ To run fine-tuning on a single GPU, we will make use of two packages

 Given the combination of PEFT and Int8 quantization, we would be able to fine-tune a Llama 2 7B model on one consumer-grade GPU such as A10.

-## Requirements
+## Requirements
 To run the examples, make sure to install the llama-recipes package (See [README.md](../README.md) for details).

 **Please note that the llama-recipes package will install PyTorch 2.0.1. In case you want to run FSDP + PEFT, please make sure to install the PyTorch nightlies.**
@@ -71,35 +71,55 @@ python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization
 It lets us specify the training settings; everything from `model_name` to `dataset_name`, `batch_size`, etc. can be set here. Below is the list of supported settings:

 ```python
-
-model_name: str="PATH/to/LLAMA 2/7B"
-enable_fsdp: bool= False
-run_validation: bool=True
-batch_size_training: int=4
-gradient_accumulation_steps: int=1
-num_epochs: int=3
-num_workers_dataloader: int=2
-lr: float=2e-4
-weight_decay: float=0.0
-gamma: float= 0.85
-use_fp16: bool=False
-mixed_precision: bool=True
-val_batch_size: int=4
-dataset = "samsum_dataset" # alpaca_dataset,grammar_dataset
-peft_method: str = "lora" # None , llama_adapter, prefix
-use_peft: bool=False
-output_dir: str = "./ft-output"
-freeze_layers: bool = False
-num_freeze_layers: int = 1
-quantization: bool = False
-one_gpu: bool = False
-save_model: bool = False
-dist_checkpoint_root_folder: str="model_checkpoints"
-dist_checkpoint_folder: str="fine-tuned"
-save_optimizer: bool=False
-
+model_name: str="PATH/to/Model"
+tokenizer_name: str=None
+enable_fsdp: bool=False
+low_cpu_fsdp: bool=False
+run_validation: bool=True
+batch_size_training: int=4
+batching_strategy: str="packing" #alternative: padding
+context_length: int=4096
+gradient_accumulation_steps: int=1
+gradient_clipping: bool = False
+gradient_clipping_threshold: float = 1.0
+num_epochs: int=3
+max_train_step: int=0
+max_eval_step: int=0
+num_workers_dataloader: int=1
+lr: float=1e-4
+weight_decay: float=0.0
+gamma: float= 0.85
+seed: int=42
+use_fp16: bool=False
+mixed_precision: bool=True
+val_batch_size: int=1
+dataset = "samsum_dataset"
+peft_method: str = "lora" # None,llama_adapter, prefix
+use_peft: bool=False
+output_dir: str = "PATH/to/save/PEFT/model"
+freeze_layers: bool = False
+num_freeze_layers: int = 1
+quantization: bool = False
+one_gpu: bool = False
+save_model: bool = True
+dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+save_optimizer: bool=False # will be used if using FSDP
+use_fast_kernels: bool = False # Enable SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformer memory-efficient kernels
+use_wandb: bool = False # Enable wandb for experiment tracking
+save_metrics: bool = False # saves training metrics to a json file for later plotting
+flop_counter: bool = False # Enable flop counter to measure model throughput; cannot be used with the PyTorch profiler at the same time.
+flop_counter_start: int = 3 # The step at which flop counting starts; the default of 3 means counting starts after 3 warm-up steps.
+use_profiler: bool = False # Enable PyTorch profiler; cannot be used with the flop counter at the same time.
+profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
 ```

 * [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.

 * [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we are adding support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before using the FLOPS counter.
+
+Similarly, you can set the `--use_profiler` flag and pass a profiling output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage. The current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler is helpful for debugging. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
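
The two constraints stated above (the flags are mutually exclusive, and the profiler needs more than 6 training steps) could be checked before launching a run. The sketch below is a hypothetical helper, not code from llama-recipes: only the field names come from the settings list, the class `ProfilingOptions` and the function `check_profiling_options` are invented for illustration, and it assumes that `max_train_step = 0` means "no step limit", which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class ProfilingOptions:
    """Hypothetical container; only the field names come from the settings list."""
    flop_counter: bool = False
    flop_counter_start: int = 3
    use_profiler: bool = False
    profiler_dir: str = "PATH/to/save/profiler/results"
    max_train_step: int = 0


def check_profiling_options(opts: ProfilingOptions, min_steps: int = 6) -> None:
    """Raise ValueError if the options contradict the rules described in the text."""
    if opts.flop_counter and opts.use_profiler:
        raise ValueError("flop_counter and use_profiler cannot be enabled at the same time")
    # Assumption: max_train_step == 0 means "no step limit", so only
    # positive values are compared against the profiler's requirement.
    if opts.use_profiler and 0 < opts.max_train_step <= min_steps:
        raise ValueError(
            f"the profiler needs max_train_step > {min_steps}, got {opts.max_train_step}"
        )


# A profiler run with 10 training steps passes the check.
check_profiling_options(ProfilingOptions(use_profiler=True, max_train_step=10))
```

Failing early like this avoids discovering a conflicting flag combination only after the model has been loaded.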

recipes/finetuning/README.md

Lines changed: 50 additions & 29 deletions
@@ -1,8 +1,8 @@
 # Finetuning Llama

-This folder contains instructions to fine-tune Llama 2 on a
+This folder contains instructions to fine-tune Llama 2 on a
 * [single-GPU setup](./singlegpu_finetuning.md)
-* [multi-GPU setup](./multigpu_finetuning.md)
+* [multi-GPU setup](./multigpu_finetuning.md)

 using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.

@@ -23,32 +23,47 @@ If you are new to fine-tuning techniques, check out an overview: [](./LLM_finetu
 It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:

 ```python
-
-model_name: str="PATH/to/LLAMA 2/7B"
-enable_fsdp: bool= False
-run_validation: bool=True
-batch_size_training: int=4
-gradient_accumulation_steps: int=1
-num_epochs: int=3
-num_workers_dataloader: int=2
-lr: float=2e-4
-weight_decay: float=0.0
-gamma: float= 0.85
-use_fp16: bool=False
-mixed_precision: bool=True
-val_batch_size: int=4
-dataset = "samsum_dataset" # alpaca_dataset, grammar_dataset
-peft_method: str = "lora" # None , llama_adapter, prefix
-use_peft: bool=False
-output_dir: str = "./ft-output"
-freeze_layers: bool = False
-num_freeze_layers: int = 1
-quantization: bool = False
-save_model: bool = False
-dist_checkpoint_root_folder: str="model_checkpoints"
-dist_checkpoint_folder: str="fine-tuned"
-save_optimizer: bool=False
-
+model_name: str="PATH/to/Model"
+tokenizer_name: str=None
+enable_fsdp: bool=False
+low_cpu_fsdp: bool=False
+run_validation: bool=True
+batch_size_training: int=4
+batching_strategy: str="packing" #alternative: padding
+context_length: int=4096
+gradient_accumulation_steps: int=1
+gradient_clipping: bool = False
+gradient_clipping_threshold: float = 1.0
+num_epochs: int=3
+max_train_step: int=0
+max_eval_step: int=0
+num_workers_dataloader: int=1
+lr: float=1e-4
+weight_decay: float=0.0
+gamma: float= 0.85
+seed: int=42
+use_fp16: bool=False
+mixed_precision: bool=True
+val_batch_size: int=1
+dataset = "samsum_dataset"
+peft_method: str = "lora" # None,llama_adapter, prefix
+use_peft: bool=False
+output_dir: str = "PATH/to/save/PEFT/model"
+freeze_layers: bool = False
+num_freeze_layers: int = 1
+quantization: bool = False
+one_gpu: bool = False
+save_model: bool = True
+dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+save_optimizer: bool=False # will be used if using FSDP
+use_fast_kernels: bool = False # Enable SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and Xformer memory-efficient kernels
+use_wandb: bool = False # Enable wandb for experiment tracking
+save_metrics: bool = False # saves training metrics to a json file for later plotting
+flop_counter: bool = False # Enable flop counter to measure model throughput; cannot be used with the PyTorch profiler at the same time.
+flop_counter_start: int = 3 # The step at which flop counting starts; the default of 3 means counting starts after 3 warm-up steps.
+use_profiler: bool = False # Enable PyTorch profiler; cannot be used with the flop counter at the same time.
+profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
 ```

 * [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
@@ -84,7 +99,13 @@ You can enable [W&B](https://wandb.ai/) experiment tracking by using `use_wandb`
 ```bash
 python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model --use_wandb
 ```
-You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below.
+You'll be able to access a dedicated project or run link on [wandb.ai](https://wandb.ai) and see your dashboard like the one below.
 <div style="display: flex;">
 <img src="../../docs/images/wandb_screenshot.png" alt="wandb screenshot" width="500" />
 </div>
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we are adding support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before using the FLOPS counter.
+
+Similarly, you can set the `--use_profiler` flag and pass a profiling output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage. The current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler is helpful for debugging. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
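
Since the FLOPS counter is meant to measure throughput, it may help to see the arithmetic that turns a FLOP count and a wall-clock time into TFLOPS per second. The helper below is a hypothetical sketch with made-up input numbers. The commit text does not show how llama-recipes computes or reports this value.

```python
def tflops_per_second(total_flops: float, elapsed_seconds: float) -> float:
    """Convert a FLOP count and a wall-clock duration into TFLOPS/s.

    Hypothetical helper for illustration; 1 TFLOP = 1e12 floating-point operations.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_flops / elapsed_seconds / 1e12


# Illustrative numbers only: 2.5e14 FLOPs counted over a 2-second step.
print(f"{tflops_per_second(2.5e14, 2.0):.1f} TFLOPS/s")
```

Measuring after a warm-up stage, as recommended above, keeps one-off start-up costs out of the elapsed time used in this division.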

recipes/finetuning/multigpu_finetuning.md

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ We will also need 2 packages:
99
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
1010
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
1111

12-
> [!NOTE]
12+
> [!NOTE]
1313
> The llama-recipes package will install PyTorch 2.0.1 version. In case you want to use FSDP with PEFT for multi GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies))
1414
>
1515
> INT8 quantization is not currently supported in FSDP
@@ -30,7 +30,7 @@ Get access to a machine with multiple GPUs (in this case we tested with 4 A100 a
3030
<details>
3131
<summary>Multi-node Multi-GPU</summary>
3232
Here we use a slurm script to schedule a job with slurm over multiple nodes.
33-
33+
3434
# Change the num nodes and GPU per nodes in the script before running.
3535
sbatch ./multi_node.slurm
3636

@@ -95,7 +95,7 @@ torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name
9595

9696

9797
## [TIP] Slow interconnect between nodes?
98-
In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag.
98+
In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag.
9999

100100
HSDP (Hybrid sharding Data Parallel) helps to define a hybrid sharding strategy where you can have FSDP within `sharding_group_size` which can be the minimum number of GPUs you can fit your model and DDP between the replicas of the model specified by `replica_group_size`.
101101

@@ -107,5 +107,8 @@ torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_f

 ```

+## FLOPS Counting and PyTorch Profiling

+To help with benchmarking, we are adding support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before using the FLOPS counter.

+Similarly, you can set the `--use_profiler` flag and pass a profiling output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage. The current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max_train_step` must be greater than 6. The PyTorch profiler is helpful for debugging. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
