Description
I’m fine-tuning a 7B RAG LLM and running into some issues with training speed and CUDA memory constraints. Here are my training parameters:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/fine_tuned_2001",
    overwrite_output_dir=True,
    num_train_epochs=3,             # overridden by max_steps below
    warmup_steps=20,
    logging_strategy="steps",
    logging_steps=10,
    evaluation_strategy="no",       # no evaluation during training
    optim="adamw_torch",
    gradient_accumulation_steps=4,  # effective batch size of 4 with batch size 1
    save_steps=100,
    save_total_limit=2,
    learning_rate=1e-5,
    per_device_train_batch_size=1,  # reduced to 1 to avoid CUDA OOM
    max_steps=1000,                 # takes precedence over num_train_epochs
    report_to="wandb",
)
```
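For reference: with `per_device_train_batch_size=1` and `gradient_accumulation_steps=4` on a single GPU, each optimizer step processes 4 sequences, so `max_steps=1000` covers about 4,000 examples, roughly 2 passes over the ~2K pairs (note that `max_steps` takes precedence over `num_train_epochs=3`).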
Setup & Issues:
- Hardware: 22GB GPU
- Input length: `MAX_LENGTH=10154`, since the model takes the query, answer, and retrieved chunks as input (see the tokenization sketch after this list).
- Dataset: ~2K pairs.
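Roughly, each example is assembled and tokenized like this (a simplified sketch, not my exact code; the model name, prompt template, and padding strategy are illustrative):

```python
from transformers import AutoTokenizer

MAX_LENGTH = 10154  # query + answer + retrieved chunks

# Tokenizer name is a placeholder; the template below is a simplified illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def build_example(query: str, chunks: str, answer: str):
    prompt = f"Context:\n{chunks}\n\nQuestion: {query}\n\nAnswer: {answer}"
    return tokenizer(
        prompt,
        max_length=MAX_LENGTH,
        truncation=True,
        padding="max_length",  # if padding to the full 10,154 tokens, every
                               # batch pays the worst-case sequence cost
        return_tensors="pt",
    )
```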
Problem:
- Training is extremely slow: around 1 minute per step, so 1000 steps take ~16 hours.
- I expected some slowness given the large input length, but this seems excessive.
Am I overlooking something? Any tips on improving training speed without exceeding memory limits?
Batch size: I had to reduce `per_device_train_batch_size` to 1 due to CUDA OOM errors, and also reduced the LoRA settings to r=64, lora_alpha=16.
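For reference, the LoRA config now looks roughly like this (the base model name, dropout, and target modules are illustrative, not from my exact setup):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model name is a placeholder -- the issue only concerns "a 7B RAG LLM".
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=64,                                 # reduced rank, as noted above
    lora_alpha=16,                        # as noted above
    lora_dropout=0.05,                    # illustrative value
    target_modules=["q_proj", "v_proj"],  # illustrative; depends on the base model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # sanity check on trainable param count
```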