Description
I’m fine-tuning a 7B RAG LLM and running into some issues with training speed and CUDA memory constraints. Here are my training parameters:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/fine_tuned_2001",
    overwrite_output_dir=True,
    num_train_epochs=3,             # overridden by max_steps below
    warmup_steps=20,
    logging_strategy="steps",
    logging_steps=10,
    evaluation_strategy="no",       # no evaluation during training
    optim="adamw_torch",
    gradient_accumulation_steps=4,  # effective batch size of 4 with batch size 1
    save_steps=100,
    save_total_limit=2,
    learning_rate=1e-5,
    per_device_train_batch_size=1,  # reduced to 1 to avoid CUDA OOM
    max_steps=1000,                 # takes precedence over num_train_epochs
    report_to="wandb",
)
```
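For reference: with `per_device_train_batch_size=1` and `gradient_accumulation_steps=4` on a single GPU, each optimizer step processes 4 sequences, so `max_steps=1000` covers about 4,000 examples, roughly 2 passes over the ~2K pairs (note that `max_steps` takes precedence over `num_train_epochs=3`).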
Setup & Issues:
- Hardware: 22GB GPU
- Input length: `MAX_LENGTH=10154`, since the model takes the query, answer, and retrieved chunks as input (see the tokenization sketch after this list).
- Dataset: ~2K pairs.
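Roughly, each example is assembled and tokenized like this (a simplified sketch, not my exact code; the model name, prompt template, and padding strategy are illustrative):

```python
from transformers import AutoTokenizer

MAX_LENGTH = 10154  # query + answer + retrieved chunks

# Tokenizer name is a placeholder; the template below is a simplified illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def build_example(query: str, chunks: str, answer: str):
    prompt = f"Context:\n{chunks}\n\nQuestion: {query}\n\nAnswer: {answer}"
    return tokenizer(
        prompt,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",  # if padding to the full 10,154 tokens, every
                               # batch pays the worst-case sequence cost
        return_tensors="pt",
    )
```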
Problem:
- Training is extremely slow: around 1 minute per step, so 1000 steps take ~16 hours.
- I expected some slowness given the large input length, but this seems excessive.
Am I overlooking something? Any tips on improving training speed without exceeding memory limits?
Batch size: I had to reduce `per_device_train_batch_size` to 1 due to CUDA OOM errors, and also reduced the LoRA settings to r=64, lora_alpha=16.
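For reference, the LoRA config now looks roughly like this (the base model name, dropout, and target modules are illustrative, not from my exact setup):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model name is a placeholder -- the issue only concerns "a 7B RAG LLM".
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=64,                                 # reduced rank, as noted above
    lora_alpha=16,                        # as noted above
    lora_dropout=0.05,                    # illustrative value
    target_modules=["q_proj", "v_proj"],  # illustrative; depends on the base model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # sanity check on trainable param count
```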