File: docs/LLM_finetuning.md (4 additions & 4 deletions)
@@ -1,10 +1,10 @@
## LLM Fine-Tuning
-Here we discuss fine-tuning Llama 2 with a couple of different recipes. We will cover two scenarios here:
+Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
## 1. **Parameter Efficient Model Fine-Tuning**
-This helps make the fine-tuning process more affordable even on 1 consumer grade GPU. These methods enable us to keep the whole model frozen and to just add tiny learnable parameters/ layers into the model. In this way, we just train a very tiny portion of the parameters. The most famous method in this category is [LORA](https://arxiv.org/pdf/2106.09685.pdf), LLaMA Adapter and Prefix-tuning.
+This helps make the fine-tuning process more affordable even on 1 consumer grade GPU. These methods enable us to keep the whole model frozen and to just add tiny learnable parameters/ layers into the model. In this way, we just train a very tiny portion of the parameters. The most famous method in this category is [LORA](https://arxiv.org/pdf/2106.09685.pdf), Llama Adapter and Prefix-tuning.
These methods will address three aspects:
@@ -14,7 +14,7 @@ These methods will address three aspects:
- **Cost of deployment** – for each fine-tuned downstream model we need to deploy a separate model; however, when using these methods, only a small set of parameters (few MB instead of several GBs) of the pretrained model can do the job. In this case, for each task we only add these extra parameters on top of the pretrained model so pretrained models can be assumed as backbone and these parameters as heads for the model on different tasks.
-- **Catastrophic forgetting** – these methods also help with forgetting the first task that can happen in fine-tunings.
+- **Catastrophic forgetting** – these methods also help with forgetting the first task that can happen in fine-tuning.
HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).
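
To make the "few MB instead of several GBs" point concrete, here is a minimal sketch of the LoRA parameter arithmetic. It assumes a single 4096 x 4096 projection matrix and a rank of 8; those numbers and the helper name `lora_extra_params` are illustrative assumptions, not values taken from this repository.

```python
def lora_extra_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors, B (d_out x rank) and A (rank x d_in),
    while the original weight matrix stays frozen.
    """
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8

full_params = d_out * d_in
extra_params = lora_extra_params(d_out, d_in, rank)
trainable_fraction = extra_params / full_params
adapter_kib = extra_params * 2 / 1024  # 2 bytes per parameter in bf16/fp16

print(f"frozen weights: {full_params:,}")
print(f"LoRA weights:   {extra_params:,} ({trainable_fraction:.2%})")
print(f"adapter size:   {adapter_kib:.0f} KiB for this one matrix")
```

Summed over every adapted matrix in a model, the adapter typically stays in the megabyte range, which is why one frozen backbone can serve many tasks.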
@@ -42,7 +42,7 @@ You can also keep most of the layers frozen and only fine-tune a few layers. The
-In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Llama 2 7B parameter won't fit into one gpu.
+In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Meta Llama 3 8B parameter won't fit into one gpu.
The way to think about it is that you need enough GPU memory to hold the model parameters, gradients and optimizer states. Each of these, depending on the precision you are training in, can take up multiple times your parameter count x precision (fp32: 4 bytes, fp16: 2 bytes, bf16: 2 bytes).
For example AdamW optimizer keeps 2 parameters for each of your parameters and in many cases these are kept in fp32. This implies that depending on how many layers you are training/ unfreezing your GPU memory can grow beyond one GPU.
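
The rule of thumb above can be turned into a rough calculation. This is a back-of-the-envelope sketch under stated assumptions: a round 8 billion parameters, two optimizer states per parameter as described for AdamW, and no accounting for activations, buffers, or memory fragmentation. The function name `training_memory_gb` is hypothetical.

```python
def training_memory_gb(
    num_params: int,
    param_bytes: int,
    grad_bytes: int,
    optimizer_state_bytes: int,
    optimizer_states_per_param: int = 2,  # AdamW keeps two states per parameter
) -> float:
    """Rough memory for parameters, gradients and optimizer states, in GB (10^9 bytes)."""
    bytes_per_param = (
        param_bytes
        + grad_bytes
        + optimizer_states_per_param * optimizer_state_bytes
    )
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 8_000_000_000  # assumed round figure for an 8B model

# bf16 parameters and gradients, fp32 optimizer states.
mixed = training_memory_gb(NUM_PARAMS, param_bytes=2, grad_bytes=2, optimizer_state_bytes=4)

# Everything kept in fp32.
full_fp32 = training_memory_gb(NUM_PARAMS, param_bytes=4, grad_bytes=4, optimizer_state_bytes=4)

print(f"bf16 params/grads + fp32 optimizer states: {mixed:.0f} GB")
print(f"all fp32: {full_fp32:.0f} GB")
```

Either figure is larger than the memory of a single common GPU before activations are counted, which is the motivation for sharding across GPUs or freezing layers.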
File: docs/multi_gpu.md (58 additions & 36 deletions)
@@ -6,9 +6,9 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
-Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node.
+Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 3 8B model on multiple GPUs in one node or multi-node.
-## Requirements
+## Requirements
To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
**Please note that the llama_recipes package will install PyTorch 2.0.1 version, in case you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
@@ -24,7 +24,7 @@ This runs with the `samsum_dataset` for summarization application by default.
@@ -34,7 +34,7 @@ The args used in the command above are:
* `--use_peft` boolean flag to enable PEFT methods in the script
-* `--peft_method` to specify the PEFT method, here we use `lora` other options are `llama_adapter`, `prefix`.
+* `--peft_method` to specify the PEFT method, here we use `lora` other options are `llama_adapter`.
We use `torchrun` here to spawn multiple processes for FSDP.
@@ -43,7 +43,7 @@ We use `torchrun` here to spawn multiple processes for FSDP.
Setting `use_fast_kernels` will enable using of Flash Attention or Xformer memory-efficient kernels based on the hardware being used. This would speed up the fine-tuning job. This has been enabled in `optimum` library from HuggingFace as a one-liner API, please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
peft_method: str="lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
+use_peft: bool=False
+from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
+output_dir: str="PATH/to/save/PEFT/model"
+freeze_layers: bool=False
+num_freeze_layers: int=1
+quantization: bool=False
+one_gpu: bool=False
+save_model: bool=True
+dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
+dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
+save_optimizer: bool=False # will be used if using FSDP
+use_fast_kernels: bool=False # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
+use_wandb: bool=False # Enable wandb for experiment tracking
+save_metrics: bool=False # saves training metrics to a json file for later plotting
+flop_counter: bool=False # Enable flop counter to measure model throughput, cannot be used with pytorch profiler at the same time.
+flop_counter_start: int=3 # The step to start profiling, default is 3, which means after 3 steps of warmup stage, the profiler will start to count flops.
+use_profiler: bool=False # Enable pytorch profiler, cannot be used with flop counter at the same time.
+profiler_dir: str="PATH/to/save/profiler/results" # will be used if using profiler
```
* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
@@ -167,3 +183,9 @@ save_optimizer: bool=False
* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory, with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommend you use this option.
* `pure_bf16` it moves the model to `BFloat16` and if `optimizer` is set to `anyprecision` then optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
+
+## FLOPS Counting and PyTorch Profiling
+
+To help with benchmarking efforts, we are adding support for counting the FLOPS during the fine-tuning process. You can achieve this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. It is recommended to allow a warm-up stage before using the FLOPS counter.
+
+Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage. The current config is wait=1, warmup=2, active=3, so the profiler will start profiling after step 3 and will record the next 3 steps. Therefore, in order to use the PyTorch profiler, `--max-train-step` must be greater than 6. The PyTorch profiler is helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
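
The wait/warmup/active schedule described above can be illustrated with a small plain-Python model. This is only a sketch: it does not call the PyTorch profiler, it counts steps from 0, and the names `profiler_phase` and `check_measurement_flags` are hypothetical helpers rather than functions from the fine-tuning script.

```python
def profiler_phase(step: int, wait: int = 1, warmup: int = 2, active: int = 3) -> str:
    """Return the schedule phase for a 0-indexed training step."""
    if step < wait:
        return "wait"  # profiler idle
    if step < wait + warmup:
        return "warmup"  # profiler running, results discarded
    if step < wait + warmup + active:
        return "active"  # results recorded
    return "done"


def check_measurement_flags(flop_counter: bool, use_profiler: bool) -> None:
    """Reject the flag combination that the text above rules out."""
    if flop_counter and use_profiler:
        raise ValueError("flop_counter and use_profiler cannot both be enabled")


# With wait=1, warmup=2, active=3 the first six steps make up one full cycle.
phases = [profiler_phase(step) for step in range(7)]
print(phases)

check_measurement_flags(flop_counter=False, use_profiler=True)  # accepted
```

Under this model the recorded steps are the fourth, fifth and sixth, so a run shorter than that never completes the active window.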