**docs/multi_gpu.md** (48 additions, 27 deletions)
To run fine-tuning on multiple GPUs, we will make use of two packages:
With the combination of PEFT and FSDP, we can fine-tune a Llama 2 model on multiple GPUs in one node or across multiple nodes.
## Requirements
To run the examples, make sure to install the llama-recipes package and clone the GitHub repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (see [README.md](../README.md) for details).
**Please note that the llama_recipes package installs PyTorch 2.0.1; if you want to run FSDP + PEFT, make sure to install the PyTorch nightlies.**
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
```python
dist_checkpoint_root_folder: str = "PATH/to/save/FSDP/model"  # will be used if using FSDP
dist_checkpoint_folder: str = "fine-tuned"  # will be used if using FSDP
save_optimizer: bool = False  # will be used if using FSDP
use_fast_kernels: bool = False  # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
use_wandb: bool = False  # Enable wandb for experiment tracking
save_metrics: bool = False  # saves training metrics to a json file for later plotting
flop_counter: bool = False  # Enable FLOP counter to measure model throughput; cannot be used with the PyTorch profiler at the same time
flop_counter_start: int = 3  # The step to start counting FLOPs; the default of 3 allows a 3-step warm-up stage
use_profiler: bool = False  # Enable the PyTorch profiler; cannot be used with the FLOP counter at the same time
profiler_dir: str = "PATH/to/save/profiler/results"  # will be used if using the profiler
```
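To illustrate how dataclass-style settings like these behave, here is a minimal, hypothetical sketch. The field names mirror the listing above, but this is not the llama-recipes config module itself:

```python
from dataclasses import dataclass, replace

@dataclass
class TrainConfig:
    # hypothetical subset of the settings listed above
    save_metrics: bool = False
    flop_counter: bool = False
    flop_counter_start: int = 3

# a command-line override such as --flop_counter maps onto the matching field
cfg = replace(TrainConfig(), flop_counter=True)
print(cfg.flop_counter, cfg.flop_counter_start)  # True 3
```

Unset fields keep their defaults, so only the flags you pass on the command line change the run.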
* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory at the cost of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase throughput. We recommend using this option.
* `pure_bf16` moves the model to `BFloat16`; if `optimizer` is set to `anyprecision`, the optimizer states are kept in `BFloat16` as well. You can use this option if necessary.
## FLOPS Counting and PyTorch Profiling
To help with benchmarking, we added support for counting FLOPs during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to start counting FLOPs; it is recommended to allow a warm-up stage before using the FLOPS counter.
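The interaction between `flop_counter` and `flop_counter_start` can be sketched as a simple step gate. This is a hypothetical illustration of the behavior described above, not the actual counter implementation:

```python
def should_count_flops(step, flop_counter=True, flop_counter_start=3):
    # FLOP counting is skipped during the warm-up stage (step < flop_counter_start)
    return flop_counter and step >= flop_counter_start

counted = [step for step in range(6) if should_count_flops(step)]
print(counted)  # with the default start of 3, steps 3, 4 and 5 are counted
```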
Similarly, you can set the `--use_profiler` flag and pass an output path via `--profiler_dir` to capture the profile traces of your model with the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate results, the profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max-train-step` has to be greater than 6. The profiler can be helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
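The schedule arithmetic behind the "greater than 6" requirement can be sketched as follows. This is a hypothetical helper, not part of the profiler API:

```python
def recorded_steps(wait, warmup, active):
    # The profiler idles for `wait` steps, warms up for `warmup` steps,
    # then records `active` steps (steps counted from 1 here).
    first = wait + warmup + 1
    return list(range(first, first + active))

print(recorded_steps(wait=1, warmup=2, active=3))  # [4, 5, 6]
```

Since steps 4 through 6 are the ones recorded, the run needs more than 6 training steps to produce a complete trace.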
**docs/single_gpu.md** (48 additions, 28 deletions)
To run fine-tuning on a single GPU, we will make use of two packages:
With the combination of PEFT and int8 quantization, we can fine-tune a Llama 2 7B model on one consumer-grade GPU such as an A10.
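A rough back-of-envelope calculation shows why int8 quantization matters here. This counts weights only, ignoring activations, gradients, and optimizer state, and approximates 1e9 bytes as 1 GB:

```python
def weight_memory_gb(params_in_billions, bytes_per_param):
    # 1B parameters at 1 byte each occupy roughly 1 GB
    return params_in_billions * bytes_per_param

print(weight_memory_gb(7, 2))  # fp16/bf16 weights: ~14 GB
print(weight_memory_gb(7, 1))  # int8 weights: ~7 GB, leaving headroom on a 24 GB A10
```

With PEFT, only a small adapter is trained, so the large frozen base model can stay in int8.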
## Requirements
To run the examples, make sure to install the llama-recipes package (See [README.md](../README.md) for details).
**Please note that the llama-recipes package installs PyTorch 2.0.1; if you want to run FSDP + PEFT, make sure to install the PyTorch nightlies.**
It lets us specify the training settings; everything from `model_name` to `dataset_name`, `batch_size`, etc. can be set here. Below is the list of supported settings:
```python
dist_checkpoint_root_folder: str = "PATH/to/save/FSDP/model"  # will be used if using FSDP
dist_checkpoint_folder: str = "fine-tuned"  # will be used if using FSDP
save_optimizer: bool = False  # will be used if using FSDP
use_fast_kernels: bool = False  # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
use_wandb: bool = False  # Enable wandb for experiment tracking
save_metrics: bool = False  # saves training metrics to a json file for later plotting
flop_counter: bool = False  # Enable FLOP counter to measure model throughput; cannot be used with the PyTorch profiler at the same time
flop_counter_start: int = 3  # The step to start counting FLOPs; the default of 3 allows a 3-step warm-up stage
use_profiler: bool = False  # Enable the PyTorch profiler; cannot be used with the FLOP counter at the same time
profiler_dir: str = "PATH/to/save/profiler/results"  # will be used if using the profiler
```
* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
* [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
## FLOPS Counting and PyTorch Profiling
To help with benchmarking, we added support for counting FLOPs during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to start counting FLOPs; it is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass an output path via `--profiler_dir` to capture the profile traces of your model with the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate results, the profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max-train-step` has to be greater than 6. The profiler can be helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
**recipes/finetuning/README.md** (50 additions, 29 deletions)
# Finetuning Llama

This folder contains instructions to fine-tune Llama 2 on a
* [single-GPU setup](./singlegpu_finetuning.md)
* [multi-GPU setup](./multigpu_finetuning.md)
using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) in the llama-recipes package.
If you are new to fine-tuning techniques, check out the overview: [LLM fine-tuning overview](./LLM_finetuning_overview.md)
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
```python
dist_checkpoint_root_folder: str = "PATH/to/save/FSDP/model"  # will be used if using FSDP
dist_checkpoint_folder: str = "fine-tuned"  # will be used if using FSDP
save_optimizer: bool = False  # will be used if using FSDP
use_fast_kernels: bool = False  # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
use_wandb: bool = False  # Enable wandb for experiment tracking
save_metrics: bool = False  # saves training metrics to a json file for later plotting
flop_counter: bool = False  # Enable FLOP counter to measure model throughput; cannot be used with the PyTorch profiler at the same time
flop_counter_start: int = 3  # The step to start counting FLOPs; the default of 3 allows a 3-step warm-up stage
use_profiler: bool = False  # Enable the PyTorch profiler; cannot be used with the FLOP counter at the same time
profiler_dir: str = "PATH/to/save/profiler/results"  # will be used if using the profiler
```
* [Datasets config file](../../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
You can enable [W&B](https://wandb.ai/) experiment tracking by using `use_wandb`.
To help with benchmarking, we added support for counting FLOPs during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to start counting FLOPs; it is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass an output path via `--profiler_dir` to capture the profile traces of your model with the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate results, the profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max-train-step` has to be greater than 6. The profiler can be helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
**recipes/finetuning/multigpu_finetuning.md** (6 additions, 3 deletions)
We will also need 2 packages:
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
> [!NOTE]
> The llama-recipes package installs PyTorch 2.0.1. If you want to use FSDP with PEFT for multi-GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies)).
>
> INT8 quantization is not currently supported in FSDP
Get access to a machine with multiple GPUs (in this case we tested with 4 A100 and A10s).
<details>
<summary>Multi-node Multi-GPU</summary>
Here we use a Slurm script to schedule a job over multiple nodes.
# Change the number of nodes and GPUs per node in the script before running.
In case you are dealing with a slower interconnect between nodes, you can make use of the `--hsdp` flag to reduce the communication overhead.
HSDP (Hybrid Sharded Data Parallel) defines a hybrid sharding strategy: FSDP shards the model within groups of `sharding_group_size` GPUs (which can be the minimum number of GPUs that fit your model), and DDP runs between the replicas of the model, whose number is specified by `replica_group_size`.
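The relationship between the two group sizes can be sketched as follows; this hypothetical helper only illustrates the arithmetic, it is not part of the training code:

```python
def replica_group_size(world_size, sharding_group_size):
    # FSDP shards the model within each group of `sharding_group_size` GPUs;
    # DDP then replicates across world_size // sharding_group_size such groups.
    assert world_size % sharding_group_size == 0, "sharding groups must tile the world size"
    return world_size // sharding_group_size

print(replica_group_size(world_size=16, sharding_group_size=8))  # 2 replicas of an 8-way sharded model
```

Keeping the sharding group within one node confines the heaviest all-gather traffic to the fast intra-node links, which is why HSDP helps on slow inter-node interconnects.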
To help with benchmarking, we added support for counting FLOPs during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single- or multi-GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to start counting FLOPs; it is recommended to allow a warm-up stage before using the FLOPS counter.
Similarly, you can set the `--use_profiler` flag and pass an output path via `--profiler_dir` to capture the profile traces of your model with the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate results, the profiler requires a warm-up stage; the current config is wait=1, warmup=2, active=3, so the profiler starts after step 3 and records the next 3 steps. Therefore, to use the PyTorch profiler, `--max-train-step` has to be greater than 6. The profiler can be helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.