Commit d9558c1

changed context name and add more docs
1 parent fe51935 commit d9558c1

7 files changed, with 46 additions and 19 deletions.

docs/multi_gpu.md

Lines changed: 11 additions & 2 deletions
@@ -8,7 +8,7 @@ To run fine-tuning on multi-GPUs, we will make use of two packages:
88

99
Given the combination of PEFT and FSDP, we would be able to fine tune a Llama 2 model on multiple GPUs in one node or multi-node.
1010

11-
## Requirements
11+
## Requirements
1212
To run the examples, make sure to install the llama-recipes package and clone the github repository in order to use the provided [`finetuning.py`](../recipes/finetuning/finetuning.py) script with torchrun (See [README.md](../README.md) for details).
1313

1414
**Please note that the llama-recipes package installs PyTorch 2.0.1. If you want to run FSDP + PEFT, make sure to install the PyTorch nightlies.**
@@ -140,7 +140,10 @@ save_model: bool = False
140140
dist_checkpoint_root_folder: str="model_checkpoints"
141141
dist_checkpoint_folder: str="fine-tuned"
142142
save_optimizer: bool=False
143-
143+
flop_counter: bool=False # Enable the FLOPS counter to measure model throughput; cannot be used together with the PyTorch profiler.
144+
flop_counter_start: int=3 # The step at which FLOPS counting starts (default 3); the steps before it serve as warm-up.
145+
use_profiler: bool=False # Enable the PyTorch profiler; cannot be used together with the FLOPS counter.
146+
profiler_dir: str="PATH/to/save/profiler/results" # Output directory for profiler traces; used only when use_profiler is set.
144147
```
145148

146149
* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
@@ -167,3 +170,9 @@ save_optimizer: bool=False
167170
* `fsdp_activation_checkpointing` enables activation checkpointing for FSDP. This saves a significant amount of memory at the cost of recomputing intermediate activations during the backward pass. The saved memory can be re-invested in higher batch sizes to increase throughput. We recommend using this option.
168171

169172
* `pure_bf16` moves the model to `BFloat16`; if `optimizer` is set to `anyprecision`, the optimizer states are kept in `BFloat16` as well. You can use this option if necessary.
173+
174+
## FLOPS Counting and Pytorch Profiling
175+
176+
To help with benchmarking, the fine-tuning script can count FLOPS during the fine-tuning process. Enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning, and use `--flop_counter_start` to choose the step at which counting starts. It is recommended to allow a warm-up stage before the FLOPS counter starts.
177+
178+
Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). For accurate results, the PyTorch profiler needs a warm-up stage. The current schedule is wait=1, warmup=2, active=3, so the profiler starts recording after step 3 and records the next 3 steps. As a result, `--max_train_step` must be greater than 6 when the profiler is enabled. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
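The schedule arithmetic in this paragraph can be checked with a small sketch. It is a simplified, PyTorch-free model of a single wait/warmup/active cycle, not the profiler's own implementation. The helper names `profiler_phase` and `min_train_steps` are hypothetical, and the `+ 1` mirrors the `min_step` computation in `train_utils.py`.

```python
def profiler_phase(step: int, wait: int = 1, warmup: int = 2, active: int = 3) -> str:
    """Return the schedule phase for a zero-based training step (single cycle assumed)."""
    if step < wait:
        return "wait"      # profiler is idle
    if step < wait + warmup:
        return "warmup"    # profiler runs, results are discarded
    if step < wait + warmup + active:
        return "active"    # profiler records this step
    return "done"          # the cycle has finished


def min_train_steps(wait: int = 1, warmup: int = 2, active: int = 3) -> int:
    """Smallest accepted number of train steps: wait + warmup + active + 1."""
    return wait + warmup + active + 1


phases = [profiler_phase(step) for step in range(min_train_steps())]
print(phases)             # ['wait', 'warmup', 'warmup', 'active', 'active', 'active', 'done']
print(min_train_steps())  # 7
```

With the default values the minimum is 7 steps, which is why a step limit of 6 or fewer is rejected.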

docs/single_gpu.md

Lines changed: 11 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ To run fine-tuning on a single GPU, we will make use of two packages
88

99
Given the combination of PEFT and Int8 quantization, we would be able to fine-tune a Llama 2 7B model on one consumer grade GPU such as A10.
1010

11-
## Requirements
11+
## Requirements
1212
To run the examples, make sure to install the llama-recipes package (See [README.md](../README.md) for details).
1313

1414
**Please note that the llama-recipes package installs PyTorch 2.0.1. If you want to run FSDP + PEFT, make sure to install the PyTorch nightlies.**
@@ -97,9 +97,18 @@ save_model: bool = False
9797
dist_checkpoint_root_folder: str="model_checkpoints"
9898
dist_checkpoint_folder: str="fine-tuned"
9999
save_optimizer: bool=False
100-
100+
flop_counter: bool=False # Enable the FLOPS counter to measure model throughput; cannot be used together with the PyTorch profiler.
101+
flop_counter_start: int=3 # The step at which FLOPS counting starts (default 3); the steps before it serve as warm-up.
102+
use_profiler: bool=False # Enable the PyTorch profiler; cannot be used together with the FLOPS counter.
103+
profiler_dir: str="PATH/to/save/profiler/results" # Output directory for profiler traces; used only when use_profiler is set.
101104
```
102105

103106
* [Datasets config file](../src/llama_recipes/configs/datasets.py) provides the available options for datasets.
104107

105108
* [peft config file](../src/llama_recipes/configs/peft.py) provides the supported PEFT methods and respective settings that can be modified.
109+
110+
## FLOPS Counting and Pytorch Profiling
111+
112+
To help with benchmarking, the fine-tuning script can count FLOPS during the fine-tuning process. Enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning, and use `--flop_counter_start` to choose the step at which counting starts. It is recommended to allow a warm-up stage before the FLOPS counter starts.
113+
114+
Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). For accurate results, the PyTorch profiler needs a warm-up stage. The current schedule is wait=1, warmup=2, active=3, so the profiler starts recording after step 3 and records the next 3 steps. As a result, `--max_train_step` must be greater than 6 when the profiler is enabled. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
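As a rough illustration of delaying the FLOPS counter until after warm-up, the sketch below simulates a training loop in plain Python. `ToyFlopCounter` and `run_steps` are hypothetical names and the per-step FLOPS figure is made up. The one-based step comparison mirrors the `total_train_steps == train_config.flop_counter_start` check in `train_utils.py`, but the loop here keeps counting until the end, which may differ from how long the real counter runs.

```python
class ToyFlopCounter:
    """Hypothetical stand-in for a FLOPS counter that can be switched on late."""

    def __init__(self) -> None:
        self.counting = False
        self.total_flops = 0

    def start_counting(self) -> None:
        self.counting = True

    def record(self, flops: int) -> None:
        # Work done before start_counting() is ignored (the warm-up steps).
        if self.counting:
            self.total_flops += flops


def run_steps(num_steps: int, flops_per_step: int, flop_counter_start: int = 3) -> int:
    """Simulate num_steps training steps and return the FLOPS that were counted."""
    counter = ToyFlopCounter()
    total_train_steps = 0
    for _ in range(num_steps):
        total_train_steps += 1
        if total_train_steps == flop_counter_start:
            counter.start_counting()
        counter.record(flops_per_step)
    return counter.total_flops


counted = run_steps(num_steps=5, flops_per_step=2 * 10**12, flop_counter_start=3)
print(counted / 1e12)  # 6.0 (steps 3, 4 and 5 are counted; steps 1 and 2 are warm-up)
```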

recipes/finetuning/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -98,4 +98,4 @@ You'll be able to access a dedicated project or run link on [wandb.ai](https://w
9898

9999
To help with benchmarking, the fine-tuning script can count FLOPS during the fine-tuning process. Enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning, and use `--flop_counter_start` to choose the step at which counting starts. It is recommended to allow a warm-up stage before the FLOPS counter starts.
100100

101-
Similarly, you can set `--use_profiler` flag and pass a profiling output path using `--profiler_dir` to capture the profile traces of your model using [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). This would be helpful for debugging purposes. However, the `--flop_counter` and `--use_profiler` can not be used in the same time to ensure the measurement accuracy.
101+
Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). For accurate results, the PyTorch profiler needs a warm-up stage. The current schedule is wait=1, warmup=2, active=3, so the profiler starts recording after step 3 and records the next 3 steps. As a result, `--max_train_step` must be greater than 6 when the profiler is enabled. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.

recipes/finetuning/multigpu_finetuning.md

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ We will also need 2 packages:
99
1. [PEFT](https://github.com/huggingface/peft) to use parameter-efficient finetuning.
1010
2. [FSDP](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) which helps us parallelize the training over multiple GPUs. [More details](./LLM_finetuning_overview.md#2-full-partial-parameter-finetuning).
1111

12-
> [!NOTE]
12+
> [!NOTE]
1313
> The llama-recipes package will install PyTorch 2.0.1 version. In case you want to use FSDP with PEFT for multi GPU finetuning, please install the PyTorch nightlies ([details](../../README.md#pytorch-nightlies))
1414
>
1515
> INT8 quantization is not currently supported in FSDP
@@ -30,7 +30,7 @@ Get access to a machine with multiple GPUs (in this case we tested with 4 A100 a
3030
<details>
3131
<summary>Multi-node Multi-GPU</summary>
3232
Here we use a slurm script to schedule a job with slurm over multiple nodes.
33-
33+
3434
# Change the num nodes and GPU per nodes in the script before running.
3535
sbatch ./multi_node.slurm
3636

@@ -95,7 +95,7 @@ torchrun --nnodes 1 --nproc_per_node 4 finetuning.py --enable_fsdp --model_name
9595

9696

9797
## [TIP] Slow interconnect between nodes?
98-
In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag.
98+
In case you are dealing with slower interconnect network between nodes, to reduce the communication overhead you can make use of `--hsdp` flag.
9999

100100
HSDP (Hybrid Sharding Data Parallel) lets you define a hybrid sharding strategy: FSDP runs within each group of `sharding_group_size` GPUs (this can be the minimum number of GPUs that fits your model), and DDP runs between the model replicas, whose number is set by `replica_group_size`.
101101

@@ -107,5 +107,8 @@ torchrun --nnodes 4 --nproc_per_node 8 ./finetuning.py --enable_fsdp --low_cpu_f
107107

108108
```
109109

110+
## FLOPS Counting and Pytorch Profiling
110111

112+
To help with benchmarking, the fine-tuning script can count FLOPS during the fine-tuning process. Enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning, and use `--flop_counter_start` to choose the step at which counting starts. It is recommended to allow a warm-up stage before the FLOPS counter starts.
111113

114+
Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). For accurate results, the PyTorch profiler needs a warm-up stage. The current schedule is wait=1, warmup=2, active=3, so the profiler starts recording after step 3 and records the next 3 steps. As a result, `--max_train_step` must be greater than 6 when the profiler is enabled. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.

recipes/finetuning/singlegpu_finetuning.md

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -24,10 +24,10 @@ The args used in the command above are:
2424
* `--peft_method` to specify the PEFT method, here we use `lora` other options are `llama_adapter`, `prefix`.
2525
* `--quantization` boolean flag to enable int8 quantization
2626

27-
> [!NOTE]
27+
> [!NOTE]
2828
> In case you are using a multi-GPU machine please make sure to only make one of them visible using `export CUDA_VISIBLE_DEVICES=GPU:id`.
2929
30-
30+
3131
### How to run with different datasets?
3232

3333
Currently 3 open source datasets are supported that can be found in [Datasets config file](../../src/llama_recipes/configs/datasets.py). You can also use your custom dataset (more info [here](./datasets/README.md)).
@@ -60,3 +60,9 @@ python -m finetuning.py --use_peft --peft_method lora --quantization --dataset
6060
python -m finetuning.py --use_peft --peft_method lora --quantization --dataset samsum_dataset --model_name /patht_of_model_folder/7B --output_dir Path/to/save/PEFT/model
6161

6262
```
63+
64+
## FLOPS Counting and Pytorch Profiling
65+
66+
To help with benchmarking, the fine-tuning script can count FLOPS during the fine-tuning process. Enable this by setting `--flop_counter` when launching single- or multi-GPU fine-tuning, and use `--flop_counter_start` to choose the step at which counting starts. It is recommended to allow a warm-up stage before the FLOPS counter starts.
67+
68+
Similarly, you can set the `--use_profiler` flag and pass an output path with `--profiler_dir` to capture profile traces of your model using the [PyTorch profiler](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). For accurate results, the PyTorch profiler needs a warm-up stage. The current schedule is wait=1, warmup=2, active=3, so the profiler starts recording after step 3 and records the next 3 steps. As a result, `--max_train_step` must be greater than 6 when the profiler is enabled. The PyTorch profiler is helpful for debugging. Note that `--flop_counter` and `--use_profiler` cannot be used at the same time, to ensure measurement accuracy.
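The mutual exclusion between the two flags can be pictured as a guard at the top of a context manager. The following is a minimal sketch under stated assumptions: `DemoConfig` and `measurement_mode` are hypothetical names, and raising `ValueError` with this message is an assumption, since the commit only shows the `if use_flop_counter and use_profiler:` condition and not what follows it.

```python
import contextlib
from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical config holding only the two flags the guard inspects."""
    flop_counter: bool = False
    use_profiler: bool = False


@contextlib.contextmanager
def measurement_mode(cfg: DemoConfig):
    """Yield the name of the single active measurement tool, or None."""
    # Assumed behavior: reject the combination before any measurement starts.
    if cfg.flop_counter and cfg.use_profiler:
        raise ValueError("flop_counter and use_profiler cannot be enabled together")
    if cfg.use_profiler:
        yield "profiler"
    elif cfg.flop_counter:
        yield "flop_counter"
    else:
        yield None


with measurement_mode(DemoConfig(flop_counter=True)) as mode:
    print(mode)  # flop_counter

try:
    with measurement_mode(DemoConfig(flop_counter=True, use_profiler=True)):
        pass
except ValueError as err:
    print(err)  # flop_counter and use_profiler cannot be enabled together
```

Putting the check before the first `yield` means the error surfaces when the `with` block is entered, before any training step runs.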

requirements.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,3 +18,4 @@ gradio
1818
chardet
1919
openai
2020
typing-extensions==4.8.0
21+
tabulate

src/llama_recipes/utils/train_utils.py

Lines changed: 8 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -31,7 +31,7 @@ def set_tokenizer_params(tokenizer: LlamaTokenizer):
3131
tokenizer.padding_side = "left"
3232

3333
@contextlib.contextmanager
34-
def throughput_measure_context(cfg, local_rank=None):
34+
def profile(cfg, local_rank=None):
3535
use_profiler: bool = cfg.use_profiler
3636
use_flop_counter: bool = cfg.flop_counter
3737
if use_flop_counter and use_profiler:
@@ -41,7 +41,7 @@ def throughput_measure_context(cfg, local_rank=None):
4141
wait_step, warmup_step, active_step = 1, 2, 3
4242
min_step = wait_step + warmup_step + active_step + 1
4343
if cfg.max_train_step > 0 and cfg.max_train_step < min_step:
44-
raise ValueError(f"pytorch profiler requires at least {min_step} train steps, please increase the max_train_step, current max_train_step {cfg.max_train_step}")
44+
raise ValueError(f"pytorch profiler requires at least {min_step} train steps to finish the warm-up and recording stage, {wait_step} for wait_step, {warmup_step} for warmup_step, {active_step} for profiling step, please increase the max_train_step, current max_train_step {cfg.max_train_step}")
4545
print(f"pytorch profiling is activated and results will be saved in {cfg.profiler_dir}")
4646
with torch.profiler.profile(
4747
activities=[
@@ -97,7 +97,6 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
9797

9898

9999
autocast = torch.cuda.amp.autocast if train_config.use_fp16 else nullcontext
100-
101100
train_prep = []
102101
train_loss = []
103102
val_prep = []
@@ -127,7 +126,7 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
127126
total_loss = 0.0
128127
total_length = len(train_dataloader)//gradient_accumulation_steps
129128
pbar = tqdm(colour="blue", desc=f"Training Epoch: {epoch+1}", total=total_length, dynamic_ncols=True)
130-
with throughput_measure_context(train_config,local_rank) as measure_context:
129+
with profile(train_config,local_rank) as profile_context:
131130
for step, batch in enumerate(train_dataloader):
132131
total_train_steps += 1
133132
# stop when the maximum number of training steps is reached
@@ -138,7 +137,7 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
138137
break
139138
if train_config.flop_counter and total_train_steps == train_config.flop_counter_start:
140139
print("start flop counting at the step: ", total_train_steps)
141-
measure_context.start_counting()
140+
profile_context.start_counting()
142141
for key in batch.keys():
143142
if train_config.enable_fsdp:
144143
if is_xpu_available():
@@ -185,10 +184,10 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
185184
optimizer.zero_grad()
186185
pbar.update(1)
187186
if train_config.use_profiler:
188-
measure_context.step()
189-
if train_config.flop_counter and measure_context.is_ready():
190-
TFlops = measure_context.get_total_flops() / 1e12
191-
measure_context.stop_counting()
187+
profile_context.step()
188+
if train_config.flop_counter and profile_context.is_ready():
189+
TFlops = profile_context.get_total_flops() / 1e12
190+
profile_context.stop_counting()
192191
if wandb_run:
193192
if not train_config.enable_fsdp or rank==0:
194193
wandb_run.log({
