Commit 61cdf88

2 parents 492eac7 + 7ef694c commit 61cdf88

87 files changed

+21414
-1948
lines changed

scripts/spellcheck.sh renamed to .github/scripts/spellcheck.sh

Lines changed: 1 addition & 1 deletion
@@ -19,5 +19,5 @@ done
1919
if [ ! "$sources_arg" ]; then
2020
echo "No files to spellcheck"
2121
else
22-
pyspelling -c scripts/spellcheck_conf/spellcheck.yaml --name Markdown $sources_arg
22+
pyspelling -c .github/scripts/spellcheck_conf/spellcheck.yaml --name Markdown $sources_arg
2323
fi

scripts/spellcheck_conf/spellcheck.yaml renamed to .github/scripts/spellcheck_conf/spellcheck.yaml

Lines changed: 2 additions & 2 deletions
@@ -5,8 +5,8 @@ matrix:
55
d: en_US
66
dictionary:
77
wordlists:
8-
- scripts/spellcheck_conf/wordlist.txt
9-
output: scripts/spellcheck_conf/wordlist.dic
8+
- .github/scripts/spellcheck_conf/wordlist.txt
9+
output: .github/scripts/spellcheck_conf/wordlist.dic
1010
encoding: utf-8
1111
pipeline:
1212
- pyspelling.filters.context:

scripts/spellcheck_conf/wordlist.txt renamed to .github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 41 additions & 0 deletions
@@ -1310,3 +1310,44 @@ leaderboards
13101310
txn
13111311
ollama
13121312
tavily
1313+
AgentExecutor
1314+
LangGraph
1315+
langgraph
1316+
vectorstore
1317+
CMake
1318+
Chipset
1319+
JBR
1320+
JNI
1321+
MLCChat
1322+
MTP
1323+
MacBook
1324+
Moreau
1325+
NDK
1326+
NDK's
1327+
OSX
1328+
OnePlus
1329+
OxygenOS
1330+
SoC
1331+
Sonoma
1332+
TVM
1333+
Thierry
1334+
Wifi
1335+
chipset
1336+
feb
1337+
moreau
1338+
octo
1339+
rustc
1340+
rustup
1341+
sha
1342+
tmoreau
1343+
toolchain
1344+
wifi
1345+
AgentFinish
1346+
ReAct
1347+
customizable
1348+
Kaggle
1349+
SalesBot
1350+
Weaviate
1351+
MediaGen
1352+
SDXL
1353+
SVD

.github/workflows/spellcheck.yml

Lines changed: 5 additions & 5 deletions
@@ -20,11 +20,11 @@ jobs:
2020
uses: gaurav-nelson/github-action-markdown-link-check@[version obscured in source]
2121
with:
2222
use-verbose-mode: 'yes'
23-
config-file: "scripts/markdown_link_check_config.json"
23+
config-file: ".github/scripts/markdown_link_check_config.json"
2424

2525
- name: Get changed files
2626
id: changed-files
27-
uses: tj-actions/changed-files@v29.0.4
27+
uses: tj-actions/changed-files@v41.0.0
2828
with:
2929

3030
files: |
@@ -42,7 +42,7 @@ jobs:
4242
4343
- name: Get changed files
4444
id: changed-files
45-
uses: tj-actions/changed-files@v29.0.4
45+
uses: tj-actions/changed-files@v41.0.0
4646
with:
4747
files: |
4848
**/*.md
@@ -56,11 +56,11 @@ jobs:
5656
if [ ! "$sources" ]; then
5757
echo "No files to spellcheck"
5858
else
59-
pyspelling -c $GITHUB_WORKSPACE/scripts/spellcheck_conf/spellcheck.yaml --name Markdown $sources
59+
pyspelling -c $GITHUB_WORKSPACE/.github/scripts/spellcheck_conf/spellcheck.yaml --name Markdown $sources
6060
fi
6161
6262
- name: In the case of misspellings
6363
if: ${{ failure() }}
6464
run: |
6565
echo "Please fix the misspellings. If you are sure about some of them, "
66-
echo "so append those to scripts/spellcheck_conf/wordlist.txt"
66+
echo "so append those to .github/scripts/spellcheck_conf/wordlist.txt"

CONTRIBUTING.md

Lines changed: 4 additions & 4 deletions
@@ -43,17 +43,17 @@ For development and contributing to llama-recipes please install from source wit
4343
pip install -U pip setuptools
4444
pip install --extra-index-url https://download.pytorch.org/whl/test/cu118 -e .[tests,auditnlg,vllm]
4545
```
46-
The unit tests can be found in the [tests](./tests/) folder and you can run them from the main directory using:
46+
The unit tests can be found in the [src/tests](./src/tests/) folder and you can run them from the main directory using:
4747
```
48-
python -m pytest tests/
48+
python -m pytest src/tests/
4949
```
5050
To run all tests of a single file you can give the filename directly:
5151
```
52-
python -m pytest tests/test_finetuning.py
52+
python -m pytest src/tests/test_finetuning.py
5353
```
5454
To run a specific test you can filter for its name with
5555
```
56-
python -m pytest tests/test_finetuning.py -k test_finetuning_peft
56+
python -m pytest src/tests/test_finetuning.py -k test_finetuning_peft
5757
```
5858
To add a new test simply create a new test file under the tests folder (filename has to start with `test_`).
5959
Group tests spanning the same feature in the same file and create a subfolder if the tests are very extensive.

README.md

Lines changed: 4 additions & 0 deletions
@@ -64,6 +64,10 @@ If you want to use PyTorch nightlies instead of the stable release, go to [this
6464
### Installing
6565
Llama-recipes provides a pip distribution for easy install and usage in other projects. Alternatively, it can be installed from source.
6666
67+
> [!NOTE]
68+
> Ensure you use the correct CUDA version (from `nvidia-smi`) when installing the PyTorch wheels. Here we are using 11.8 as `cu118`.
69+
> H100 GPUs work better with CUDA >12.0
70+
6771
#### Install with pip
6872
```
6973
pip install llama-recipes

docs/LLM_finetuning.md

Lines changed: 4 additions & 4 deletions
@@ -1,10 +1,10 @@
11
## LLM Fine-Tuning
22

3-
Here we discuss fine-tuning Llama 2 with a couple of different recipes. We will cover two scenarios here:
3+
Here we discuss fine-tuning Meta Llama 3 with a couple of different recipes. We will cover two scenarios here:
44

55

66
## 1. **Parameter Efficient Model Fine-Tuning**
7-
This helps make the fine-tuning process more affordable even on 1 consumer grade GPU. These methods enable us to keep the whole model frozen and to just add tiny learnable parameters/ layers into the model. In this way, we just train a very tiny portion of the parameters. The most famous method in this category is [LORA](https://arxiv.org/pdf/2106.09685.pdf), LLaMA Adapter and Prefix-tuning.
7+
This helps make the fine-tuning process more affordable, even on a single consumer-grade GPU. These methods enable us to keep the whole model frozen and to add only tiny learnable parameters/layers to the model. In this way, we train only a very small portion of the parameters. The most famous methods in this category are [LORA](https://arxiv.org/pdf/2106.09685.pdf), Llama Adapter and Prefix-tuning.
88

99

1010
These methods will address three aspects:
@@ -14,7 +14,7 @@ These methods will address three aspects:
1414

1515
- **Cost of deployment** – for each fine-tuned downstream model we need to deploy a separate model; however, when using these methods, only a small set of parameters (few MB instead of several GBs) of the pretrained model can do the job. In this case, for each task we only add these extra parameters on top of the pretrained model so pretrained models can be assumed as backbone and these parameters as heads for the model on different tasks.
1616

17-
- **Catastrophic forgetting** — these methods also help with forgetting the first task that can happen in fine-tunings.
17+
- **Catastrophic forgetting** – these methods also help mitigate forgetting of the first task, which can happen during fine-tuning.
1818

1919
HF [PEFT](https://github.com/huggingface/peft) library provides an easy way of using these methods which we make use of here. Please read more [here](https://huggingface.co/blog/peft).
2020

@@ -42,7 +42,7 @@ You can also keep most of the layers frozen and only fine-tune a few layers. The
4242

4343

4444

45-
In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Llama 2 7B parameter won't fit into one gpu.
45+
In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case, the Meta Llama 3 8B model won't fit into one GPU.
4646
A useful way to think about it: you need enough GPU memory to hold the model parameters, gradients and optimizer states. Depending on the precision you are training in, each of these can take up a multiple of your parameter count × bytes per value (fp32 is 4 bytes; fp16 and bf16 are 2 bytes).
4747
For example, the AdamW optimizer keeps 2 states for each of your parameters, and in many cases these are kept in fp32. This implies that, depending on how many layers you are training/unfreezing, your GPU memory requirement can grow beyond one GPU.
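
The memory arithmetic described above can be sketched in a few lines of Python. This is a rough estimate only, not part of the commit. The function name and the choice of bf16 weights and gradients with fp32 AdamW states are illustrative assumptions. Activations, temporary buffers and framework overhead are ignored, so real usage will be higher.

```python
# Bytes per value for the precisions mentioned in the text.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2}


def estimate_training_memory_bytes(
    num_params: int,
    param_dtype: str = "bf16",            # assumption: weights kept in bf16
    grad_dtype: str = "bf16",             # assumption: gradients kept in bf16
    optimizer_state_dtype: str = "fp32",  # text: AdamW states are often fp32
    optimizer_states_per_param: int = 2,  # text: AdamW keeps 2 values per parameter
) -> dict:
    """Rough lower bound for parameters + gradients + optimizer states."""
    weights = num_params * BYTES_PER_VALUE[param_dtype]
    grads = num_params * BYTES_PER_VALUE[grad_dtype]
    opt = num_params * optimizer_states_per_param * BYTES_PER_VALUE[optimizer_state_dtype]
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer_states": opt,
        "total": weights + grads + opt,
    }


if __name__ == "__main__":
    # Full-parameter fine-tuning of a model with 8 billion parameters.
    estimate = estimate_training_memory_bytes(8_000_000_000)
    for name, size in estimate.items():
        print(f"{name}: {size / 1e9:.1f} GB")
```

Comparing the `total` figure with the memory of a single GPU gives a quick indication of whether sharding across GPUs (or a parameter-efficient method) is needed.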
4848

docs/multi_gpu.md

Lines changed: 58 additions & 36 deletions
@@ -6,9 +6,9 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:
66

77
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](LLM_finetuning.md/#2-full-partial-parameter-finetuning).
88

9-
Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node.
9+
Given the combination of PEFT and FSDP, we can fine-tune a Meta Llama 3 8B model on multiple GPUs in one node or across multiple nodes.
1010

11-
## Requirements
11+
## Requirements
1212
To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
1313

1414
**Please note that the llama_recipes package will install PyTorch 2.0.1. If you want to run FSDP + PEFT, please make sure to install PyTorch nightlies.**
@@ -24,7 +24,7 @@ This runs with the `samsum_dataset` for summarization application by default.
2424

2525
```bash
2626

27-
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
27+
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model
2828

2929
```
3030

@@ -34,7 +34,7 @@ The args used in the command above are:
3434

3535
* `--use_peft` boolean flag to enable PEFT methods in the script
3636

37-
* `--peft_method` to specify the PEFT method, here we use `lora` other options are `llama_adapter`, `prefix`.
37+
* `--peft_method` specifies the PEFT method; here we use `lora`, and the other option is `llama_adapter`.
3838

3939
We use `torchrun` here to spawn multiple processes for FSDP.
4040

@@ -43,7 +43,7 @@ We use `torchrun` here to spawn multiple processes for FSDP.
4343
Setting `use_fast_kernels` enables Flash Attention or Xformer memory-efficient kernels, depending on the hardware being used, which speeds up the fine-tuning job. This has been enabled in the `optimum` library from HuggingFace as a one-liner API; please read more [here](https://pytorch.org/blog/out-of-the-box-acceleration/).
4444

4545
```bash
46-
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model --use_fast_kernels
46+
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --output_dir Path/to/save/PEFT/model --use_fast_kernels
4747
```
4848

4949
### Fine-tuning using FSDP Only
@@ -52,7 +52,7 @@ If interested in running full parameter finetuning without making use of PEFT me
5252

5353
```bash
5454

55-
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
55+
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --use_fast_kernels
5656

5757
```
5858

@@ -62,7 +62,7 @@ If you are interested in running full parameter fine-tuning on the 70B model, yo
6262

6363
```bash
6464

65-
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /patht_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
65+
torchrun --nnodes 1 --nproc_per_node 8 examples/finetuning.py --enable_fsdp --low_cpu_fsdp --pure_bf16 --model_name /path_of_model_folder/70B --batch_size_training 1 --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned
6666

6767
```
6868

@@ -95,16 +95,16 @@ To run with each of the datasets set the `dataset` flag in the command as shown
9595

9696
```bash
9797
# grammar_dataset
98-
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
98+
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset grammar_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
9999

100100
# alpaca_dataset
101101

102-
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
102+
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset alpaca_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
103103

104104

105105
# samsum_dataset
106106

107-
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /patht_of_model_folder/7B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
107+
torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --model_name /path_of_model_folder/8B --use_peft --peft_method lora --dataset samsum_dataset --save_model --dist_checkpoint_root_folder model_checkpoints --dist_checkpoint_folder fine-tuned --pure_bf16 --output_dir Path/to/save/PEFT/model
108108

109109
```
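
Since the three commands above differ only in the `--dataset` value, they could be generated from one template. The sketch below is a hypothetical helper, not something shipped with the repository. The function name `build_finetune_command` and its defaults are assumptions, while the flags and placeholder paths are copied from the commands shown above.

```python
import shlex


def build_finetune_command(
    dataset: str,
    model_path: str = "/path_of_model_folder/8B",  # placeholder path from the docs
    output_dir: str = "Path/to/save/PEFT/model",   # placeholder path from the docs
    nproc_per_node: int = 4,
    script: str = "examples/finetuning.py",
) -> list:
    """Return the torchrun argument list for FSDP + LoRA fine-tuning on one node."""
    return [
        "torchrun", "--nnodes", "1", "--nproc_per_node", str(nproc_per_node),
        script,
        "--enable_fsdp",
        "--model_name", model_path,
        "--use_peft", "--peft_method", "lora",
        "--dataset", dataset,
        "--save_model",
        "--dist_checkpoint_root_folder", "model_checkpoints",
        "--dist_checkpoint_folder", "fine-tuned",
        "--pure_bf16",
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    for name in ("grammar_dataset", "alpaca_dataset", "samsum_dataset"):
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(build_finetune_command(name)))
```

The argument list can also be passed directly to `subprocess.run` if you prefer to launch the jobs from Python.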
110110

@@ -115,32 +115,48 @@ torchrun --nnodes 1 --nproc_per_node 4 examples/finetuning.py --enable_fsdp --m
115115
It lets us specify the training settings for everything from `model_name` to `dataset_name`, `batch_size` and so on. Below is the list of supported settings:
116116

117117
```python
118-
119-
model_name: str="PATH/to/LLAMA 2/7B"
120-
enable_fsdp: bool= False
121-
run_validation: bool=True
122-
batch_size_training: int=4
123-
gradient_accumulation_steps: int=1
124-
num_epochs: int=3
125-
num_workers_dataloader: int=2
126-
lr: float=2e-4
127-
weight_decay: float=0.0
128-
gamma: float= 0.85
129-
use_fp16: bool=False
130-
mixed_precision: bool=True
131-
val_batch_size: int=4
132-
dataset = "samsum_dataset" # alpaca_dataset, grammar_dataset
133-
peft_method: str = "lora" # None , llama_adapter, prefix
134-
use_peft: bool=False
135-
output_dir: str = "./ft-output"
136-
freeze_layers: bool = False
137-
num_freeze_layers: int = 1
138-
quantization: bool = False
139-
save_model: bool = False
140-
dist_checkpoint_root_folder: str="model_checkpoints"
141-
dist_checkpoint_folder: str="fine-tuned"
142-
save_optimizer: bool=False
143-
118+
model_name: str="PATH/to/Model"
119+
tokenizer_name: str=None
120+
enable_fsdp: bool=False
121+
low_cpu_fsdp: bool=False
122+
run_validation: bool=True
123+
batch_size_training: int=4
124+
batching_strategy: str="packing" #alternative: padding
125+
context_length: int=4096
126+
gradient_accumulation_steps: int=1
127+
gradient_clipping: bool = False
128+
gradient_clipping_threshold: float = 1.0
129+
num_epochs: int=3
130+
max_train_step: int=0
131+
max_eval_step: int=0
132+
num_workers_dataloader: int=1
133+
lr: float=1e-4
134+
weight_decay: float=0.0
135+
gamma: float= 0.85
136+
seed: int=42
137+
use_fp16: bool=False
138+
mixed_precision: bool=True
139+
val_batch_size: int=1
140+
dataset = "samsum_dataset"
141+
peft_method: str = "lora" # None, llama_adapter (Caution: llama_adapter is currently not supported with FSDP)
142+
use_peft: bool=False
143+
from_peft_checkpoint: str="" # if not empty and use_peft=True, will load the peft checkpoint and resume the fine-tuning on that checkpoint
144+
output_dir: str = "PATH/to/save/PEFT/model"
145+
freeze_layers: bool = False
146+
num_freeze_layers: int = 1
147+
quantization: bool = False
148+
one_gpu: bool = False
149+
save_model: bool = True
150+
dist_checkpoint_root_folder: str="PATH/to/save/FSDP/model" # will be used if using FSDP
151+
dist_checkpoint_folder: str="fine-tuned" # will be used if using FSDP
152+
save_optimizer: bool=False # will be used if using FSDP
153+
use_fast_kernels: bool = False # Enable using SDPA from PyTorch Accelerated Transformers, making use of Flash Attention and Xformer memory-efficient kernels
154+
use_wandb: bool = False # Enable wandb for experiment tracking
155+
save_metrics: bool = False # saves training metrics to a json file for later plotting
156+
flop_counter: bool = False # Enable flop counter to measure model throughput, can not be used with pytorch profiler at the same time.
157+
flop_counter_start: int = 3 # The step to start profiling, default is 3, which means after 3 steps of warmup stage, the profiler will start to count flops.
158+
use_profiler: bool = False # Enable pytorch profiler, can not be used with flop counter at the same time.
159+
profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler
144160
```
145161

146162
* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
@@ -167,3 +183,9 @@ save_optimizer: bool=False
167183
* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory, with the trade-off of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase the throughput. We recommend you use this option.
168184

169185
* `pure_bf16` it moves the model to `BFloat16` and if `optimizer` is set to `anyprecision` then optimizer states will be kept in `BFloat16` as well. You can use this option if necessary.
186+
187+
## FLOPS Counting and Pytorch Profiling
188+
189+
To help with benchmarking, we are adding support for counting FLOPS during the fine-tuning process. You can enable this by setting `--flop_counter` when launching your single/multi GPU fine-tuning. Use `--flop_counter_start` to choose the step at which FLOPS counting starts. It is recommended to allow a warm-up stage before using the FLOPS counter.
190+
191+
Similarly, you can set the `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). To get accurate profiling results, the PyTorch profiler requires a warm-up stage. The current config is wait=1, warmup=2, active=3, so the profiler starts profiling after step 3 and records the next 3 steps. Therefore, in order to use the PyTorch profiler, `--max_train_step` has to be greater than 6. The PyTorch profiler is helpful for debugging purposes. However, `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
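
The constraints described above can be illustrated with a small standalone check. This is a sketch under stated assumptions, not the repository's validation code. `ProfilingOptions`, `recorded_steps` and `validate_profiling_options` are hypothetical names, steps are counted from zero, and `max_train_step=0` is assumed to mean "no step limit".

```python
from dataclasses import dataclass

# Profiler schedule quoted in the text: wait=1, warmup=2, active=3.
WAIT, WARMUP, ACTIVE = 1, 2, 3


@dataclass
class ProfilingOptions:
    """Hypothetical subset of the training config relevant to profiling."""
    flop_counter: bool = False
    flop_counter_start: int = 3
    use_profiler: bool = False
    max_train_step: int = 0  # assumption: 0 means "no step limit"


def recorded_steps(wait: int = WAIT, warmup: int = WARMUP, active: int = ACTIVE) -> list:
    """Zero-based step indices that fall into the profiler's active window."""
    start = wait + warmup
    return list(range(start, start + active))


def validate_profiling_options(opts: ProfilingOptions) -> None:
    """Raise ValueError if the options contradict the rules stated in the text."""
    if opts.flop_counter and opts.use_profiler:
        raise ValueError("flop_counter and use_profiler cannot be enabled together")
    required = WAIT + WARMUP + ACTIVE
    # The text asks for max_train_step to be greater than wait + warmup + active.
    if opts.use_profiler and 0 < opts.max_train_step <= required:
        raise ValueError(
            f"max_train_step must be greater than {required} when use_profiler is set"
        )


if __name__ == "__main__":
    print("recorded steps:", recorded_steps())
    validate_profiling_options(ProfilingOptions(use_profiler=True, max_train_step=10))
```

Running a check like this before launching a job surfaces an invalid flag combination early, rather than part-way through training.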
