Commit 7a08c27

Merge branch 'main' into fix/finetune_readme
2 parents e56356b + 0ab53c2 commit 7a08c27

9 files changed: +359 -152 lines changed

docs/multi_gpu.md

Lines changed: 47 additions & 26 deletions
@@ -115,32 +115,47 @@ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --m
 It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:

 ```python
-
-model_name: str="PATH/to/Model"
-enable_fsdp: bool= False
-run_validation: bool=True
-batch_size_training: int=4
-gradient_accumulation_steps: int=1
-num_epochs: int=3
-num_workers_dataloader: int=2
-lr: float=2e-4
-weight_decay: float=0.0
-gamma: float= 0.85
-use_fp16: bool=False
-mixed_precision: bool=True
-val_batch_size: int=4
-dataset = "samsum_dataset" # alpaca_dataset, grammar_dataset
-peft_method: str = "lora" # None , llama_adapter, prefix
-use_peft: bool=False
-output_dir: str = "./ft-output"
-freeze_layers: bool = False
-num_freeze_layers: int = 1
-quantization: bool = False
-save_model: bool = False
-dist_checkpoint_root_folder: str="model_checkpoints"
-dist_checkpoint_folder: str="fine-tuned"
-save_optimizer: bool=False
-
+model_name: str="PATH/to/Model"
+tokenizer_name: str=None
+enable_fsdp: bool=False
+low_cpu_fsdp: bool=False
+run_validation: bool=True
+batch_size_training: int=4
+batching_strategy: str="packing" #alternative: padding
+context_length: int=4096
+gradient_accumulation_steps: int=1
+gradient_clipping: bool = False
+gradient_clipping_threshold: float = 1.0
+num_epochs: int=3
+max_train_step: int=0
+max_eval_step: int=0
+num_workers_dataloader: int=1
+lr: float=1e-4
+weight_decay: float=0.0
+gamma: float= 0.85
+seed: int=42
+use_fp16: bool=False
+mixed_precision: bool=True
+val_batch_size: int=1
+dataset = "samsum_dataset"
+peft_method: str = "lora" # None,llama_adapter, prefix
+use_peft: bool=False
+output_dir: str = "PATH/to/save/PEFT/model"
+freeze_layers: bool = False
+num_freeze_layers: int = 1
+quantization: bool = False
+one_gpu: bool = False
+save_model: bool = True
+dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+save_optimizer: bool=False # will be used if using FSDP
+use_fast_kernels: bool = False # Enable SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels
+use_wandb: bool = False # Enable wandb for experiment tracking
+save_metrics: bool = False # saves training metrics to a json file for later plotting
+flop_counter: bool = False # Enable the FLOP counter to measure model throughput; cannot be used together with the PyTorch profiler.
+flop_counter_start: int = 3 # Step at which FLOP counting starts; the default of 3 means counting begins after 3 warm-up steps.
+use_profiler: bool = False # Enable the PyTorch profiler; cannot be used together with the FLOP counter.
+profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
 ```

 * [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
@@ -167,3 +182,9 @@ save_optimizer: bool=False
 * `fsdp_activation_checkpointing` enables activation checkpointing for FSDP; this saves a significant amount of memory at the cost of recomputing intermediate activations during the backward pass. The saved memory can be reinvested in larger batch sizes to increase throughput. We recommend using this option.

 * `pure_bf16` moves the model to `BFloat16`, and if `optimizer` is set to `anyprecision`, optimizer states are kept in `BFloat16` as well. You can use this option if necessary.
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we have added support for counting FLOPS during fine-tuning. Enable it by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before the FLOPS counter starts.
+
+Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To produce accurate results, the PyTorch profiler requires a warm-up stage. The current configuration is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, `--max_train_step` must be greater than 6 when the PyTorch profiler is used. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time; this restriction keeps the measurements accurate.
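
The wait/warmup/active schedule described in the added section can be modeled in a few lines of plain Python. This is only an illustrative sketch: `profiler_phase` is a hypothetical helper that is not part of llama-recipes or PyTorch, and it assumes 0-indexed steps and a single schedule cycle that does not repeat.

```python
def profiler_phase(step: int, wait: int = 1, warmup: int = 2, active: int = 3) -> str:
    """Classify a 0-indexed training step under one wait/warmup/active cycle.

    Hypothetical helper for illustration only; the defaults mirror the
    wait=1, warmup=2, active=3 configuration quoted in the docs.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < wait:
        return "wait"      # profiler is idle
    if step < wait + warmup:
        return "warmup"    # profiler runs, results are discarded
    if step < wait + warmup + active:
        return "active"    # steps that end up in the trace
    return "done"          # assumption: the schedule does not repeat


phases = [profiler_phase(step) for step in range(7)]
for step, phase in enumerate(phases):
    print(step, phase)
```

Under these assumptions the first three steps are spent waiting and warming up, the next three are recorded, and six steps must complete before the trace is finished, which is consistent with the documented requirement that the step limit exceed 6.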

docs/single_gpu.md

Lines changed: 47 additions & 27 deletions
@@ -71,35 +71,55 @@ python -m llama_recipes.finetuning --use_peft --peft_method lora --quantization
 It lets us specify the training settings; everything from `model_name` to `dataset_name`, `batch_size`, etc. can be set here. Below is the list of supported settings:

 ```python
-
-model_name: str="PATH/to/Model"
-enable_fsdp: bool= False
-run_validation: bool=True
-batch_size_training: int=4
-gradient_accumulation_steps: int=1
-num_epochs: int=3
-num_workers_dataloader: int=2
-lr: float=2e-4
-weight_decay: float=0.0
-gamma: float= 0.85
-use_fp16: bool=False
-mixed_precision: bool=True
-val_batch_size: int=4
-dataset = "samsum_dataset" # alpaca_dataset,grammar_dataset
-peft_method: str = "lora" # None , llama_adapter, prefix
-use_peft: bool=False
-output_dir: str = "./ft-output"
-freeze_layers: bool = False
-num_freeze_layers: int = 1
-quantization: bool = False
-one_gpu: bool = False
-save_model: bool = False
-dist_checkpoint_root_folder: str="model_checkpoints"
-dist_checkpoint_folder: str="fine-tuned"
-save_optimizer: bool=False
-
+model_name: str="PATH/to/Model"
+tokenizer_name: str=None
+enable_fsdp: bool=False
+low_cpu_fsdp: bool=False
+run_validation: bool=True
+batch_size_training: int=4
+batching_strategy: str="packing" #alternative: padding
+context_length: int=4096
+gradient_accumulation_steps: int=1
+gradient_clipping: bool = False
+gradient_clipping_threshold: float = 1.0
+num_epochs: int=3
+max_train_step: int=0
+max_eval_step: int=0
+num_workers_dataloader: int=1
+lr: float=1e-4
+weight_decay: float=0.0
+gamma: float= 0.85
+seed: int=42
+use_fp16: bool=False
+mixed_precision: bool=True
+val_batch_size: int=1
+dataset = "samsum_dataset"
+peft_method: str = "lora" # None,llama_adapter, prefix
+use_peft: bool=False
+output_dir: str = "PATH/to/save/PEFT/model"
+freeze_layers: bool = False
+num_freeze_layers: int = 1
+quantization: bool = False
+one_gpu: bool = False
+save_model: bool = True
+dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+save_optimizer: bool=False # will be used if using FSDP
+use_fast_kernels: bool = False # Enable SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels
+use_wandb: bool = False # Enable wandb for experiment tracking
+save_metrics: bool = False # saves training metrics to a json file for later plotting
+flop_counter: bool = False # Enable the FLOP counter to measure model throughput; cannot be used together with the PyTorch profiler.
+flop_counter_start: int = 3 # Step at which FLOP counting starts; the default of 3 means counting begins after 3 warm-up steps.
+use_profiler: bool = False # Enable the PyTorch profiler; cannot be used together with the FLOP counter.
+profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
 ```

 * [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.

 * [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we have added support for counting FLOPS during fine-tuning. Enable it by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before the FLOPS counter starts.
+
+Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To produce accurate results, the PyTorch profiler requires a warm-up stage. The current configuration is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, `--max_train_step` must be greater than 6 when the PyTorch profiler is used. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time; this restriction keeps the measurements accurate.
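
The two constraints stated in the added section (the FLOP counter and the profiler are mutually exclusive, and the step limit must exceed the profiler schedule) could be checked up front along the following lines. This is a sketch under stated assumptions, not the check llama-recipes performs: `ProfilingOptions` and `check_profiling_options` are hypothetical names, and treating `max_train_step == 0` as "no step limit" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ProfilingOptions:
    """Illustrative subset of the training settings that affect measurement."""
    flop_counter: bool = False
    flop_counter_start: int = 3
    use_profiler: bool = False
    max_train_step: int = 0


def check_profiling_options(opts: ProfilingOptions,
                            wait: int = 1, warmup: int = 2, active: int = 3) -> None:
    """Raise ValueError if the options contradict the documented constraints."""
    if opts.flop_counter and opts.use_profiler:
        raise ValueError("flop_counter and use_profiler cannot be enabled together")
    if opts.use_profiler:
        scheduled_steps = wait + warmup + active  # 6 with the documented schedule
        # Assumption: max_train_step == 0 means "no step limit", so only
        # positive limits are checked here.
        if 0 < opts.max_train_step <= scheduled_steps:
            raise ValueError(
                f"max_train_step must be greater than {scheduled_steps} "
                "when use_profiler is set"
            )


# A profiler run with a step limit of 10 passes the check silently.
check_profiling_options(ProfilingOptions(use_profiler=True, max_train_step=10))
```

Failing early like this keeps a misconfigured run from starting and then producing an incomplete trace.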

recipes/finetuning/README.md

Lines changed: 49 additions & 26 deletions
@@ -1,6 +1,8 @@
 # Finetuning Llama

+
 This folder contains instructions to fine-tune Meta Llama 3 on a
+
 * [single-GPU setup](./singlegpu_finetuning.md)
 * [multi-GPU setup](./multigpu_finetuning.md)

@@ -23,32 +25,47 @@ If you are new to fine-tuning techniques, check out an overview: [](./LLM_finetu
 It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:

 ```python
-
-model_name: str="PATH/to/Model"
-enable_fsdp: bool= False
-run_validation: bool=True
-batch_size_training: int=4
-gradient_accumulation_steps: int=1
-num_epochs: int=3
-num_workers_dataloader: int=2
-lr: float=2e-4
-weight_decay: float=0.0
-gamma: float= 0.85
-use_fp16: bool=False
-mixed_precision: bool=True
-val_batch_size: int=4
-dataset = "samsum_dataset" # alpaca_dataset, grammar_dataset
-peft_method: str = "lora" # None , llama_adapter, prefix
-use_peft: bool=False
-output_dir: str = "./ft-output"
-freeze_layers: bool = False
-num_freeze_layers: int = 1
-quantization: bool = False
-save_model: bool = False
-dist_checkpoint_root_folder: str="model_checkpoints"
-dist_checkpoint_folder: str="fine-tuned"
-save_optimizer: bool=False
-
+model_name: str="PATH/to/Model"
+tokenizer_name: str=None
+enable_fsdp: bool=False
+low_cpu_fsdp: bool=False
+run_validation: bool=True
+batch_size_training: int=4
+batching_strategy: str="packing" #alternative: padding
+context_length: int=4096
+gradient_accumulation_steps: int=1
+gradient_clipping: bool = False
+gradient_clipping_threshold: float = 1.0
+num_epochs: int=3
+max_train_step: int=0
+max_eval_step: int=0
+num_workers_dataloader: int=1
+lr: float=1e-4
+weight_decay: float=0.0
+gamma: float= 0.85
+seed: int=42
+use_fp16: bool=False
+mixed_precision: bool=True
+val_batch_size: int=1
+dataset = "samsum_dataset"
+peft_method: str = "lora" # None,llama_adapter, prefix
+use_peft: bool=False
+output_dir: str = "PATH/to/save/PEFT/model"
+freeze_layers: bool = False
+num_freeze_layers: int = 1
+quantization: bool = False
+one_gpu: bool = False
+save_model: bool = True
+dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+save_optimizer: bool=False # will be used if using FSDP
+use_fast_kernels: bool = False # Enable SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels
+use_wandb: bool = False # Enable wandb for experiment tracking
+save_metrics: bool = False # saves training metrics to a json file for later plotting
+flop_counter: bool = False # Enable the FLOP counter to measure model throughput; cannot be used together with the PyTorch profiler.
+flop_counter_start: int = 3 # Step at which FLOP counting starts; the default of 3 means counting begins after 3 warm-up steps.
+use_profiler: bool = False # Enable the PyTorch profiler; cannot be used together with the FLOP counter.
+profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
 ```

 * [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
@@ -88,3 +105,9 @@ You'll be able to access a dedicated project or run link on [wandb.ai](https://w
 <div style="display: flex;">
 <img src="../../docs/images/wandb_screenshot.png" alt="wandb screenshot" width="500" />
 </div>
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we have added support for counting FLOPS during fine-tuning. Enable it by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before the FLOPS counter starts.
+
+Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To produce accurate results, the PyTorch profiler requires a warm-up stage. The current configuration is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, `--max_train_step` must be greater than 6 when the PyTorch profiler is used. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time; this restriction keeps the measurements accurate.
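
To make the link between counted FLOPS and throughput concrete, here is a small arithmetic sketch. It is an assumption-laden illustration rather than the counter added by this commit: `throughput_tflops` is a hypothetical function, the per-step records are made-up numbers, and averaging over every step from `flop_counter_start` onward is a choice made here for clarity.

```python
from typing import Sequence, Tuple


def throughput_tflops(step_records: Sequence[Tuple[float, float]],
                      flop_counter_start: int = 3) -> float:
    """Average TFLOPS over the steps at or after flop_counter_start (0-indexed).

    Each record is (floating-point operations, elapsed seconds) for one step.
    Earlier steps are treated as warm-up and ignored.
    """
    measured = step_records[flop_counter_start:]
    if not measured:
        raise ValueError("no steps left to measure after the warm-up stage")
    total_flops = sum(flops for flops, _ in measured)
    total_seconds = sum(seconds for _, seconds in measured)
    if total_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return total_flops / total_seconds / 1e12


# Made-up example: the first three steps are slow because of warm-up.
records = [(2e12, 4.0), (2e12, 3.0), (2e12, 2.5), (2e12, 1.0), (2e12, 1.0)]
result = throughput_tflops(records)
print(result)  # 2.0
```

Including the slow warm-up steps would drag the average down, which is why the docs recommend a warm-up stage before counting starts.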

recipes/finetuning/multigpu_finetuning.md

Lines changed: 7 additions & 0 deletions
@@ -106,3 +106,10 @@ This will require to set the Sharding strategy in [fsdp config](../../src/llama_
 torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_fsdp --fsdp_config.pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --hsdp --sharding_group_size n --replica_group_size world_size/n

 ```
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we have added support for counting FLOPS during fine-tuning. Enable it by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before the FLOPS counter starts.
+
+Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To produce accurate results, the PyTorch profiler requires a warm-up stage. The current configuration is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, `--max_train_step` must be greater than 6 when the PyTorch profiler is used. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time; this restriction keeps the measurements accurate.
+

recipes/finetuning/singlegpu_finetuning.md

Lines changed: 6 additions & 0 deletions
@@ -60,3 +60,9 @@ python -m finetuning.py --use_peft --peft_method lora --quantization --dataset
 python -m finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /patht_of_model_folder/8B --output_dir Path/to/save/PEFT/model

 ```
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking, we have added support for counting FLOPS during fine-tuning. Enable it by setting `--flop_counter` when launching single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. We recommend allowing a warm-up stage before the FLOPS counter starts.
+
+Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To produce accurate results, the PyTorch profiler requires a warm-up stage. The current configuration is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, `--max_train_step` must be greater than 6 when the PyTorch profiler is used. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time; this restriction keeps the measurements accurate.

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -18,3 +18,4 @@ gradio
 chardet
 openai
 typing-extensions==4.8.0
+tabulate

src/llama_recipes/configs/training.py

Lines changed: 6 additions & 2 deletions
@@ -6,7 +6,7 @@

 @dataclass
 class train_config:
-    model_name: str="PATH/to/LLAMA/7B"
+    model_name: str="PATH/to/Model"
     tokenizer_name: str=None
     enable_fsdp: bool=False
     low_cpu_fsdp: bool=False
@@ -29,7 +29,7 @@ class train_config:
     mixed_precision: bool=True
     val_batch_size: int=1
     dataset = "samsum_dataset"
-    peft_method: str = "lora" # None , llama_adapter, prefix
+    peft_method: str = "lora" # None,llama_adapter, prefix
     use_peft: bool=False
     output_dir: str = "PATH/to/save/PEFT/model"
     freeze_layers: bool = False
@@ -43,3 +43,7 @@ class train_config:
     use_fast_kernels: bool = False # Enable SDPA from PyTorch Accelerated Transformers, which makes use of Flash Attention and xFormers memory-efficient kernels
     use_wandb: bool = False # Enable wandb for experiment tracking
     save_metrics: bool = False # saves training metrics to a json file for later plotting
+    flop_counter: bool = False # Enable the FLOP counter to measure model throughput; cannot be used together with the PyTorch profiler.
+    flop_counter_start: int = 3 # Step at which FLOP counting starts; the default of 3 means counting begins after 3 warm-up steps.
+    use_profiler: bool = False # Enable the PyTorch profiler; cannot be used together with the FLOP counter.
+    profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
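
Because the settings live in a dataclass, defaults such as the new profiling fields can be overridden by name. The following is a minimal sketch of that idea, not the mechanism the repository uses: `TrainConfigSketch` is a cut-down stand-in for `train_config`, and `apply_overrides` is a hypothetical helper.

```python
from dataclasses import dataclass, fields


@dataclass
class TrainConfigSketch:
    """Cut-down stand-in for train_config with a few of its fields."""
    model_name: str = "PATH/to/Model"
    lr: float = 1e-4
    use_profiler: bool = False
    profiler_dir: str = "PATH/to/save/profiler/results"


def apply_overrides(config, **overrides):
    """Set known fields from keyword arguments and reject unknown names."""
    known = {f.name for f in fields(config)}
    for key, value in overrides.items():
        if key not in known:
            raise KeyError(f"unknown setting: {key}")
        setattr(config, key, value)
    return config


cfg = apply_overrides(TrainConfigSketch(), use_profiler=True, profiler_dir="./traces")
print(cfg)
```

Rejecting unknown names is a design choice in this sketch: a mistyped option fails immediately instead of being silently ignored.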
