
# Quantized Peft Benchmark Experiments Run Out of Memory with Non-Zero Lora Dropout #50

@achew010

## Description

Update: the OOM was previously reported only for BNB, but it is now observed for quantized PEFT in general, even for GPTQ. See #106.

[Figure: "Outliers" plot of the benchmark memory measurements]

The previous description below describes the issue for BNB only.

BNB experiments run out of memory in the new benchmarks that set `lora_dropout=0.1`.

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora_dropout | Peak Memory (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reference | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 72.39 |
| New | accelerated-peft-bnb | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 0. (OOM) |

In comparison, we do not observe this issue with AutoGPTQ:

| Benchmark | framework_config | peft_method | model_name_or_path | num_gpus | per_device_train_batch_size | lora_dropout | Peak Memory (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reference | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0. | 70.14 |
| New | accelerated-peft-autogptq | lora | NousResearch/Llama-2-70b-hf | 2 | 4 | 0.1 | 71.7 |

There may be additional memory overhead in the dropout implementation that causes the experiment to run out of memory for large models.
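
For context, here is a minimal sketch of how a PEFT-style LoRA linear layer wires in `lora_dropout` (class and variable names are illustrative, not the PEFT source verbatim). With `lora_dropout=0.` the dropout slot is an `nn.Identity`, which allocates nothing; with a non-zero value an `nn.Dropout` materializes a dropped copy of the input at every adapted projection, and autograd keeps it for backward:

```python
import torch
import torch.nn as nn

class LoraLinearSketch(nn.Module):
    """Illustrative sketch of PEFT-style LoRA semantics; not the actual PEFT code."""

    def __init__(self, base: nn.Linear, r: int = 16, lora_alpha: int = 16,
                 lora_dropout: float = 0.0):
        super().__init__()
        self.base = base
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        self.scaling = lora_alpha / r
        # With p == 0 an nn.Identity is used, so the zero-dropout path creates
        # no extra tensor; with p > 0, nn.Dropout produces a dropped copy of x.
        self.lora_dropout = (nn.Dropout(p=lora_dropout)
                             if lora_dropout > 0.0 else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # When dropout is active, the dropped copy of x (batch x seq x hidden)
        # is an extra activation saved for backward at every adapted module.
        return (self.base(x)
                + self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling)
```

With four target modules (`q_proj k_proj v_proj o_proj`) adapted across all layers of a 70B model at `max_seq_len 4096`, that per-module extra activation could plausibly account for the memory gap, though this remains a hypothesis.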

## Reproduce Issue

With `lora_dropout=0.`, training proceeds normally:

```bash
export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0. --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False
```

With `lora_dropout=0.1`, the run goes out of memory:

```bash
export CUDA_VISIBLE_DEVICES=0,1
export ACCELERATION_FRAMEWORK_CONFIG_FILE=/workspace/fms-acceleration/scripts/benchmarks/../../sample-configurations/baseline-peft-bnb-nf4-sample-configuration.yaml
accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path NousResearch/Llama-2-70b-hf --packing True --max_seq_len 4096 --fp16 True --learning_rate 2e-4 --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.1 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '
### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path benchmark_outputs/data/cache.json --per_device_train_batch_size 4 --output_dir benchmark_outputs/exp_35/hf --skip_memory_metrics False
```
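
To sanity-check peak usage independently of the trainer's `--skip_memory_metrics False` reporting, the standard PyTorch CUDA memory counters can be wrapped around a training step. This is a generic snippet, not part of the benchmark scripts:

```python
import torch

# Clear any peak recorded during model loading so the counter
# reflects the training step alone.
torch.cuda.reset_peak_memory_stats()

# ... run one forward/backward training step here ...

# max_memory_allocated reports the high-water mark of allocated bytes.
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"peak allocated: {peak_gb:.2f} GB")
```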
