
Feedback on Training Parameters for Fine-Tuning a 7B RAG LLM, slow training:/ #191

@linahadhri

Description

I’m fine-tuning a 7B RAG LLM and running into some issues with training speed and CUDA memory constraints. Here are my training parameters:

training_args = TrainingArguments(
    output_dir="models/fine_tuned_2001",
    overwrite_output_dir=True,
    num_train_epochs=3,
    warmup_steps=20,
    logging_strategy="steps",
    logging_steps=10,
    evaluation_strategy="no",
    optim="adamw_torch",
    gradient_accumulation_steps=4,
    save_steps=100,
    save_total_limit=2,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    max_steps=1000,
    report_to="wandb",
)
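
For reference, gradient checkpointing and mixed precision can be switched on directly in TrainingArguments; below is a minimal sketch of the same arguments with those two flags added (bf16 assumes an Ampere-or-newer GPU, otherwise fp16=True would be the equivalent):

from transformers import TrainingArguments

# Same schedule as above, plus two standard TrainingArguments flags aimed at
# memory and throughput. bf16 assumes an Ampere-or-newer GPU; use fp16=True otherwise.
training_args = TrainingArguments(
    output_dir="models/fine_tuned_2001",
    overwrite_output_dir=True,
    num_train_epochs=3,
    warmup_steps=20,
    logging_strategy="steps",
    logging_steps=10,
    evaluation_strategy="no",
    optim="adamw_torch",
    gradient_accumulation_steps=4,
    save_steps=100,
    save_total_limit=2,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    max_steps=1000,
    report_to="wandb",
    bf16=True,                    # mixed-precision matmuls: lower activation memory, faster steps
    gradient_checkpointing=True,  # recompute activations in the backward pass; large memory saving at long sequence lengths
)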
Setup & Issues:

- Hardware: 22 GB GPU
- Input length: MAX_LENGTH=10154, since the model takes the query, answer, and chunks as input (see the token-budget sketch after this list).
- Dataset: ~2K pairs.
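
To put the input length in perspective, here is the token budget these settings imply, using only the numbers above (note that max_steps=1000 overrides num_train_epochs=3 in the Trainer):

# Rough token budget implied by the settings above (no measurements, just arithmetic).
max_length = 10154       # tokens per example (query + answer + chunks)
per_device_batch = 1
grad_accum = 4
max_steps = 1000
dataset_pairs = 2000     # ~2K pairs

tokens_per_step = max_length * per_device_batch * grad_accum   # 40,616 tokens per optimizer step
sequences_seen = per_device_batch * grad_accum * max_steps     # 4,000 sequences over the whole run
epochs_covered = sequences_seen / dataset_pairs                # ~2 passes over the 2K pairs

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"sequences processed in 1000 steps: {sequences_seen:,}")
print(f"approx. epochs covered: {epochs_covered:.1f}")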
Problem:

- Training is extremely slow, around 1 minute per step, so 1000 steps take roughly 16 hours.
- I expected some slowness because of the large input length, but this seems excessive.
- Am I overlooking something? Any tips on improving training speed without exceeding memory limits?
- Batch size: I had to reduce per_device_train_batch_size to 1 because of CUDA OOM errors, and also reduced the LoRA settings to r=64, lora_alpha=16 (a minimal sketch of that adapter config follows below).
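
For completeness, a minimal sketch of the reduced adapter setup (r=64, lora_alpha=16); the base-model id and target_modules below are illustrative placeholders, since the exact model isn't spelled out above:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model_id = "mistralai/Mistral-7B-v0.1"  # placeholder: the issue doesn't name the base model

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision (bf16/fp16)
    device_map="auto",
)

# If gradient_checkpointing=True is used in TrainingArguments, inputs need gradients
# enabled so the frozen base model still backpropagates into the adapters.
model.enable_input_require_grads()

# Reduced adapter settings mentioned above: r=64, lora_alpha=16.
# target_modules lists the usual attention projections for Llama-style models (an assumption).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights should be trainable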
