
QLoRA Training

Complete guide to fine-tuning large language models using 4-bit quantization with QLoRA (Quantized Low-Rank Adaptation).

Overview

QLoRA enables fine-tuning very large models (70B+) on a single data-center GPU, and mid-sized models on consumer GPUs, by:

  • Loading the base model in 4-bit (75% weight-memory reduction vs. FP16)
  • Training only small LoRA adapters (tens to hundreds of MB, depending on rank)
  • Maintaining near-full-precision quality

Memory savings:

  • Llama 2 70B: 140GB (FP16) → ~35GB (4-bit weights) + small LoRA adapters ≈ 35GB total
  • Fits on a single A100 80GB, leaving headroom for optimizer states and activations

Accuracy: typically <1% degradation vs. full fine-tuning
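
As a rough sanity check, weight memory scales linearly with parameter count and bits per weight. A back-of-the-envelope sketch (weights only; gradients, activations, and optimizer states come on top):

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB, ignoring quantization metadata overhead."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # FP16: ~140 GB
print(weight_memory_gb(70e9, 4))   # NF4:  ~35 GB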

Quick Start

Basic QLoRA Fine-tuning

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Step 1: Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Step 2: Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Step 3: Add LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total parameter counts; with r=64 on all linear
# layers, roughly 0.5-1% of the 70B parameters are trainable

# Step 4: Train (assumes a `dataset` has been prepared; see the workflows below)
from transformers import AutoTokenizer
from trl import SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
tokenizer.pad_token = tokenizer.eos_token

training_args = TrainingArguments(
    output_dir="./qlora-70b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_strategy="epoch"
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
)

trainer.train()

Complete Training Workflows

Workflow 1: Single GPU Training (Consumer GPU)

Train Llama 2 13B on RTX 4090 (24GB).

Step 1: Prepare dataset

from datasets import load_dataset

# Load instruction dataset; its `text` column is already formatted as
# "### Human: ... ### Assistant: ...", so it can be used directly
dataset = load_dataset("timdettmers/openassistant-guanaco")

# For datasets with separate prompt/response columns, format them yourself:
def format_instruction(example):
    return {
        "text": f"### Human: {example['instruction']}\n### Assistant: {example['output']}"
    }

# dataset = dataset.map(format_instruction)  # only needed for such datasets

Step 2: Configure quantization

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 for stability
    bnb_4bit_quant_type="nf4",  # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True  # Nested quantization
)

Step 3: Load and prepare model

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Enable gradient checkpointing for further memory savings
# (prepare_model_for_kbit_training enables it when use_gradient_checkpointing=True)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

Step 4: Configure LoRA

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,  # LoRA rank (lower = less memory)
    lora_alpha=32,  # Scaling factor
    target_modules="all-linear",  # Apply to all linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

Step 5: Train

training_args = TrainingArguments(
    output_dir="./qlora-13b-results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 16
    warmup_steps=100,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    optim="paged_adamw_8bit",  # 8-bit optimizer
    max_grad_norm=0.3,
    max_steps=1000  # when set, overrides num_train_epochs
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    max_seq_length=512
)

trainer.train()

Memory usage: ~18GB on RTX 4090 (24GB)
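
To verify actual usage on your own hardware, you can query PyTorch's allocator around the training run (a minimal sketch):

import torch

torch.cuda.reset_peak_memory_stats()
trainer.train()
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")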

Workflow 2: Multi-GPU Training (FSDP + QLoRA)

Train Llama 2 70B on 8×A100 (80GB each).

Step 1: Configure FSDP-compatible quantization

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16  # CRITICAL for FSDP!
)

Important: bnb_4bit_quant_storage=torch.bfloat16 stores the packed 4-bit weights in the same dtype as the rest of the model, so FSDP can wrap and shard quantized and regular layers uniformly.

Step 2: Launch with accelerate

Create fsdp_config.yaml:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true  # load weights on rank 0 only, then shard
  fsdp_forward_prefetch: true
  fsdp_sharding_strategy: 1  # FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: false  # recommended for PEFT + FSDP
mixed_precision: bf16
num_processes: 8

Launch training:

accelerate launch --config_file fsdp_config.yaml train_qlora.py

train_qlora.py:

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)

# Rest same as single-GPU workflow
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(...)
trainer.train()

Memory per GPU: ~40GB (70B model sharded across 8 GPUs)

Workflow 3: Extremely Large Models (405B)

Train Llama 3.1 405B on 8×H100 (80GB each).

Requirements:

  • 8×H100 80GB GPUs
  • 256GB+ system RAM
  • FSDP + QLoRA

Configuration:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16
)

lora_config = LoraConfig(
    r=32,  # Higher rank for 405B
    lora_alpha=64,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./qlora-405b",
    per_device_train_batch_size=1,  # Small batch
    gradient_accumulation_steps=32,  # Effective batch = 256
    learning_rate=1e-4,  # Lower LR for large model
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True
)

Memory per GPU: ~70GB (405B in 4-bit / 8 GPUs)

Hyperparameter Tuning

LoRA Rank (r)

Controls adapter capacity:

| Model Size | Recommended r | Trainable Params | Use Case            |
|------------|---------------|------------------|---------------------|
| 7B         | 8-16          | ~4M              | Simple tasks        |
| 13B        | 16-32         | ~8M              | General fine-tuning |
| 70B        | 32-64         | ~80M             | Complex tasks       |
| 405B       | 64-128        | ~300M            | Maximum capacity    |

Trade-off: Higher r = more capacity but more memory and slower training
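
To see how rank drives adapter size: each adapted linear layer of shape (d_in, d_out) adds r × (d_in + d_out) LoRA parameters. A quick sketch (the layer shapes are illustrative, not tied to a specific checkpoint):

def lora_param_count(layer_shapes, r):
    """Total LoRA parameters: r * (d_in + d_out) per adapted linear layer."""
    return sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)

# e.g. 32 decoder layers, four 4096x4096 attention projections each
shapes = [(4096, 4096)] * 4 * 32
for r in (8, 16, 64):
    print(f"r={r}: {lora_param_count(shapes, r) / 1e6:.0f}M params")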

LoRA Alpha

Scaling factor for LoRA updates: the adapter output is multiplied by lora_alpha / r, so raising alpha amplifies the adapter's contribution. As a rough heuristic:

effective_learning_rate ≈ learning_rate × (lora_alpha / r)

Recommended: lora_alpha = 2 × r

  • r=16 → alpha=32
  • r=64 → alpha=128
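
A minimal sketch of where the scaling enters the standard LoRA forward pass, y = W x + (alpha / r) · B A x:

import torch

d, r, alpha = 512, 16, 32
W = torch.randn(d, d)          # frozen base weight
A = torch.randn(r, d) * 0.01   # trainable down-projection
B = torch.zeros(d, r)          # trainable up-projection (zero-init)

def lora_forward(x):
    # base output plus scaled low-rank update
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)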

Target Modules

Options:

  • "all-linear": All linear layers (recommended for QLoRA)
  • ["q_proj", "v_proj"]: Only attention (minimal)
  • ["q_proj", "k_proj", "v_proj", "o_proj"]: All attention
  • ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]: Attention + FFN

Trade-off: More modules = better performance but more memory
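
To list the module names actually available for target_modules on a loaded checkpoint, you can inspect its linear layers (a quick sketch assuming `model` is loaded as above; bnb.nn.Linear4bit covers layers already quantized by bitsandbytes):

import torch.nn as nn
import bitsandbytes as bnb

linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, (nn.Linear, bnb.nn.Linear4bit))
}
print(sorted(linear_names))  # e.g. ['down_proj', 'gate_proj', 'k_proj', ...]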

Learning Rate

| Model Size | Recommended LR |
|------------|----------------|
| 7-13B      | 2e-4 to 3e-4   |
| 70B        | 1e-4 to 2e-4   |
| 405B       | 5e-5 to 1e-4   |

Rule: Larger models need lower learning rates

Batch Size

effective_batch_size = per_device_batch_size × gradient_accumulation_steps × num_gpus

Recommended effective batch sizes:

  • Instruction tuning: 64-128
  • Continued pretraining: 256-512
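
For example, the quick-start settings (per-device batch 4, accumulation 4) run on 8 GPUs land in the instruction-tuning range:

per_device_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 8
print(per_device_batch_size * gradient_accumulation_steps * num_gpus)  # 128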

Quantization Dtype

| Dtype          | Speed   | Accuracy | Notes                     |
|----------------|---------|----------|---------------------------|
| torch.float32  | Slow    | Best     | Debugging                 |
| torch.bfloat16 | Fast    | Good     | Recommended               |
| torch.float16  | Fastest | Risky    | May have precision issues |

Advanced Techniques

Gradient Checkpointing

Save memory by recomputing activations:

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

Memory savings: ~30-40% of activation memory
Cost: ~20% slower training

Nested Quantization

Quantize the quantization constants themselves (the per-block absmax values), saving roughly 0.4 bits per parameter:

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True  # Enable nested (double) quantization
)

Memory savings: a few percent of total model memory
Accuracy: minimal impact

CPU Offloading

For models that still don't fit:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "40GB", "cpu": "100GB"}
)

Trade-off: Much slower but enables larger models

Paged Optimizers

Paged optimizers allocate optimizer states in paged (CUDA unified) memory that can be evicted to CPU RAM under memory pressure:

training_args = TrainingArguments(
    optim="paged_adamw_8bit"  # Or paged_adamw_32bit
)

Benefit: Prevents OOM from optimizer states

Deployment

Save LoRA Adapters

# Save only the adapters (small relative to the base model)
model.save_pretrained("./qlora-adapters")
tokenizer.save_pretrained("./qlora-adapters")

Load for Inference

from peft import PeftModel

# Load base model in 4-bit
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Load adapters and tokenizer
model = PeftModel.from_pretrained(base_model, "./qlora-adapters")
tokenizer = AutoTokenizer.from_pretrained("./qlora-adapters")

# Inference
inputs = tokenizer("Question here", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Merge Adapters (Optional)

# Merge LoRA into base weights
model = model.merge_and_unload()

# Save merged model
model.save_pretrained("./merged-model")

Note: Merged model loses 4-bit quantization (back to FP16/BF16)
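
Merging through a 4-bit base dequantizes the weights, which can cost a little fidelity. A common alternative (a sketch, reusing the adapter path from above) is to reload the base model at full precision before merging:

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
merged = PeftModel.from_pretrained(base, "./qlora-adapters").merge_and_unload()
merged.save_pretrained("./merged-model")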

Troubleshooting

OOM During Training

  1. Reduce batch size: per_device_train_batch_size=1
  2. Increase gradient accumulation: gradient_accumulation_steps=16
  3. Lower LoRA rank: r=8 instead of r=16
  4. Enable gradient checkpointing
  5. Use CPU offloading

A memory-lean configuration combining these tips is sketched below.
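
A hypothetical memory-lean setup combining the tips above (names and values are illustrative):

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                              # low rank
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./qlora-lowmem",
    per_device_train_batch_size=1,    # smallest per-device batch
    gradient_accumulation_steps=16,   # keep effective batch at 16
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    bf16=True
)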

Low Quality Results

  1. Increase LoRA rank: r=64 instead of r=16
  2. Train longer: num_train_epochs=3 instead of 1
  3. Use more target modules: target_modules="all-linear"
  4. Check the learning rate (try 1e-4 to 3e-4)

Slow Training

  1. Disable gradient checkpointing (if memory allows)
  2. Increase batch size
  3. Use BF16: bf16=True
  4. Use a paged optimizer

Best Practices

  1. Start small: Test on 7B before 70B
  2. Monitor loss: Should decrease steadily
  3. Use validation: Track eval loss to detect overfitting
  4. Save checkpoints: Every 100-500 steps
  5. Log hyperparameters: For reproducibility
  6. Test inference: Verify quality before full training

Example: Complete Training Script

See full working example at examples/qlora_training.py in the repository.
