docs/LLM_finetuning.md
+4 -4 (4 additions & 4 deletions)
@@ -1,10 +1,10 @@
 ## LLM Fine-Tuning

-Here we discuss fine-tuning Llama 2 with a couple of different recipes. We will cover two scenarios here:
+Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:


 ## 1. **Parameter Efficient Model Fine-Tuning**
-This helps make the fine-tuning process more affordable even on 1 consumer grade GPU. These methods enable us to keep the whole model frozen and to just add tiny learnable parameters/ layers into the model. In this way, we just train a very tiny portion of the parameters. The most famous method in this category is [LORA](https://arxiv.org/pdf/2106.09685.pdf), LLaMA Adapter and Prefix-tuning.
+This helps make the fine-tuning process more affordable even on 1 consumer grade GPU. These methods enable us to keep the whole model frozen and to just add tiny learnable parameters/ layers into the model. In this way, we just train a very tiny portion of the parameters. The most famous method in this category is [LORA](https://arxiv.org/pdf/2106.09685.pdf), Llama Adapter and Prefix-tuning.


 These methods will address three aspects:
@@ -14,7 +14,7 @@ These methods will address three aspects:

 - **Cost of deployment** – for each fine-tuned downstream model we need to deploy a separate model; however, when using these methods, only a small set of parameters (few MB instead of several GBs) of the pretrained model can do the job. In this case, for each task we only add these extra parameters on top of the pretrained model so pretrained models can be assumed as backbone and these parameters as heads for the model on different tasks.

-- **Catastrophic forgetting** – these methods also help with forgetting the first task that can happen in fine-tunings.
+- **Catastrophic forgetting** – these methods also help with forgetting the first task that can happen in fine-tuning.

 HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).

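The "few MB instead of several GBs" point can be made concrete with a little arithmetic. The sketch below assumes a LoRA-style adapter, in which a frozen weight matrix of shape `d_out x d_in` is complemented by two small trainable matrices of shapes `d_out x rank` and `rank x d_in`. The helper name `lora_trainable_params`, the 4096 x 4096 layer size, and the rank of 8 are illustrative assumptions, not values taken from this repository.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values added by one LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_in + d_out)


# Illustrative numbers only (assumed, not read from any model config).
d_in = d_out = 4096
rank = 8

adapter = lora_trainable_params(d_in, d_out, rank)
frozen = d_in * d_out  # size of the frozen weight matrix the adapter sits next to

print(f"{adapter:,} trainable vs {frozen:,} frozen ({100 * adapter / frozen:.2f}%)")
# prints: 65,536 trainable vs 16,777,216 frozen (0.39%)
```

Under these assumptions the adapter trains well under one percent of the values in the layer it augments, which is why only a small file has to be stored per downstream task.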
@@ -42,7 +42,7 @@ You can also keep most of the layers frozen and only fine-tune a few layers. The



-In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Llama 2 7B parameter won't fit into one gpu.
+In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Meta Llama 3 8B parameter won't fit into one gpu.
 The way to think about it is that you need enough GPU memory to hold the model parameters, gradients, and optimizer states. Each of these can take up a multiple of your parameter count times the bytes per value, depending on the precision you are training in (fp32: 4 bytes; fp16 or bf16: 2 bytes).
 For example, the AdamW optimizer keeps 2 state values for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training or unfreezing, your memory requirement can grow beyond one GPU.
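As a rough worked example of the memory reasoning above, the sketch below adds up parameters, gradients, and optimizer states for a full fine-tune. It assumes a round 8 billion parameters, bf16 weights and gradients, and two fp32 AdamW state values per parameter; it deliberately ignores activations, temporary buffers, and any fp32 master copy of the weights, so real usage would be higher. The function name `training_memory_gb` is hypothetical.

```python
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2}


def training_memory_gb(
    num_params: int,
    param_dtype: str = "bf16",
    grad_dtype: str = "bf16",
    optimizer_dtype: str = "fp32",
    optimizer_states_per_param: int = 2,  # AdamW keeps two state values per parameter
) -> dict:
    """Lower-bound estimate in GB (1e9 bytes); activations are not counted."""
    gb = 1e9
    weights = num_params * BYTES_PER_VALUE[param_dtype] / gb
    gradients = num_params * BYTES_PER_VALUE[grad_dtype] / gb
    optimizer = num_params * optimizer_states_per_param * BYTES_PER_VALUE[optimizer_dtype] / gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = training_memory_gb(8_000_000_000)  # assumed round parameter count
for name, size in estimate.items():
    print(f"{name:>10}: {size:6.1f} GB")
```

With these assumptions the estimate comes to 16 GB of weights, 16 GB of gradients, and 64 GB of optimizer state, about 96 GB in total, which is more than a single 40 GB or 80 GB accelerator provides even before activations are counted.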
docs/multi_gpu.md
+7 -7 (7 additions & 7 deletions)
@@ -6,7 +6,7 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:

 2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).

-Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node.
+Given the combination of PEFT and FSDP, we would be able to fine tune a Meta Llama 3 8B model on multiple GPUs in one node or multi-node.

 ## Requirements
 To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
@@ -24,7 +24,7 @@ This runs with the `samsum_dataset` for summarization application by default.
@@ -43,7 +43,7 @@ We use `torchrun` here to spawn multiple processes for FSDP.
 Setting `use_fast_kernels` enables the use of Flash Attention or Xformer memory-efficient kernels, depending on the hardware being used. This speeds up the fine-tuning job. This has been enabled in the `optimum` library from HuggingFace as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
recipes/finetuning/LLM_finetuning_overview.md
+2 -2 (2 additions & 2 deletions)
@@ -1,6 +1,6 @@
 ## LLM Fine-Tuning

-Here we discuss fine-tuning Llama 2 with a couple of different recipes. We will cover two scenarios here:
+Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:


 ## 1. **Parameter Efficient Model Fine-Tuning**
@@ -42,7 +42,7 @@ You can also keep most of the layers frozen and only fine-tune a few layers. The



-In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Llama 2 7B parameter won't fit into one gpu.
+In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Meta Llama 3 8B parameter won't fit into one gpu.
 The way to think about it is that you need enough GPU memory to hold the model parameters, gradients, and optimizer states. Each of these can take up a multiple of your parameter count times the bytes per value, depending on the precision you are training in (fp32: 4 bytes; fp16 or bf16: 2 bytes).
 For example, the AdamW optimizer keeps 2 state values for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training or unfreezing, your memory requirement can grow beyond one GPU.
recipes/finetuning/README.md
+5 -3 (5 additions & 3 deletions)
@@ -1,6 +1,8 @@
 # Finetuning Llama

-This folder contains instructions to fine-tune Llama 2 on a
+
+This folder contains instructions to fine-tune Meta Llama 3 on a
+
 * [single-GPU setup](./singlegpu_finetuning.md)
 * [multi-GPU setup](./multigpu_finetuning.md)

@@ -9,7 +11,7 @@ using the canonical [finetuning script](../../src/llama_recipes/finetuning.py) i
 If you are new to fine-tuning techniques, check out an overview: [](./LLM_finetuning_overview.md)

 > [!TIP]
-> If you want to try finetuning Llama 2 with Huggingface's trainer, here is a Jupyter notebook with an [example](./huggingface_trainer/peft_finetuning.ipynb)
+> If you want to try finetuning Meta Llama 3 with Huggingface's trainer, here is a Jupyter notebook with an [example](./huggingface_trainer/peft_finetuning.ipynb)


 ## How to configure finetuning settings?
@@ -97,7 +99,7 @@ It lets us specify the training settings for everything from `model_name` to `da
 You can enable [W&B](https://wandb.ai/) experiment tracking by using the `use_wandb` flag as below. You can change the project name, entity and other `wandb.init` arguments in `wandb_config`.
recipes/finetuning/multigpu_finetuning.md
+7 -6 (7 additions & 6 deletions)
@@ -1,5 +1,5 @@
 # Fine-tuning with Multi GPU
-This recipe steps you through how to finetune a Llama 2 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single or across multiple nodes.
+This recipe steps you through how to finetune a Meta Llama 3 model on the text summarization task using the [samsum](https://huggingface.co/datasets/samsum) dataset on multiple GPUs in a single or across multiple nodes.


 ## Requirements
@@ -23,7 +23,7 @@ Get access to a machine with multiple GPUs (in this case we tested with 4 A100 a
@@ -49,7 +49,7 @@ The args used in the command above are:
 If interested in running full parameter finetuning without making use of PEFT methods, please use the following command. Make sure to change `nproc_per_node` to match your available GPUs. This has been tested with `BF16` on 8x A100 40GB GPUs.
 To help with benchmarking efforts, we are adding support for counting FLOPS during the fine-tuning process. You can achieve this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose the step at which to count the FLOPS. It is recommended to allow a warm-up stage before using the FLOPS counter.

 Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage, and the current config is wait=1, warmup=2, active=3; thus the profiler will start profiling after step 3 and will record the next 3 steps. Therefore, in order to use the PyTorch profiler, `--max-train-step` must be greater than 6. The PyTorch profiler is helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
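To see why the quoted schedule needs at least six training steps, the sketch below walks through one profiling cycle in plain Python. It mirrors the wait/warmup/active phases as they are usually described for `torch.profiler.schedule`, but it does not import PyTorch; the helper `profiler_phase` is hypothetical, steps are counted from zero, and only the first cycle is modeled.

```python
def profiler_phase(step: int, wait: int = 1, warmup: int = 2, active: int = 3) -> str:
    """Phase of a zero-based training step within a single wait/warmup/active cycle."""
    if step < wait:
        return "wait"      # profiler idle
    if step < wait + warmup:
        return "warmup"    # tracing runs, results are discarded
    if step < wait + warmup + active:
        return "active"    # steps that are actually recorded
    return "done"          # cycle finished


phases = [profiler_phase(step) for step in range(7)]
for step, phase in enumerate(phases):
    print(step, phase)
# 0 wait, 1-2 warmup, 3-5 active, 6 done
```

With wait=1, warmup=2, active=3 the recorded steps are the fourth through sixth, so a run that stops earlier never completes the active window.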