Complete guide to fine-tuning large language models using 4-bit quantization with QLoRA (Quantized Low-Rank Adaptation).
QLoRA enables fine-tuning 70B+ parameter models on consumer GPUs by:
- Loading base model in 4-bit (75% memory reduction)
- Training only small LoRA adapters (~20MB)
- Maintaining near-full-precision quality
Memory savings:
- Llama 2 70B: 140GB → 35GB (4-bit) + 20MB (LoRA) = 35GB total
- Fits on single A100 80GB!
Accuracy: <1% degradation vs full fine-tuning
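As a rough sanity check, the headline numbers above follow from simple per-parameter byte counts (weights only; activations, gradients, and optimizer states come on top):

```python
# Back-of-the-envelope estimate of base-model weight memory (weights only).
params = 70e9                      # Llama 2 70B
fp16_gb = params * 2 / 1e9         # 2 bytes per parameter in FP16  -> ~140 GB
nf4_gb = params * 0.5 / 1e9        # ~0.5 bytes per parameter in 4-bit -> ~35 GB
print(f"FP16: {fp16_gb:.0f} GB, NF4: {nf4_gb:.0f} GB")
```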
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# Step 1: Load model in 4-bit
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16
)
# Step 2: Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
# Step 3: Add LoRA adapters
lora_config = LoraConfig(
r=64,
lora_alpha=16,
target_modules="all-linear",
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 335M || all params: 70B || trainable%: 0.48%
# Step 4: Train
from trl import SFTTrainer
training_args = TrainingArguments(
output_dir="./qlora-70b",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
bf16=True,
optim="paged_adamw_8bit",
logging_steps=10,
save_strategy="epoch"
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer
)
trainer.train()

Train Llama 2 13B on RTX 4090 (24GB).
Step 1: Prepare dataset
from datasets import load_dataset
# Load instruction dataset
dataset = load_dataset("timdettmers/openassistant-guanaco")
# Format for instruction tuning
# Note: openassistant-guanaco already ships a single "text" column in the
# "### Human: ... ### Assistant: ..." format, so no extra mapping is needed here.
# For a dataset with separate prompt/response columns, map it into that style first, e.g.:
# def format_instruction(example):
#     return {"text": f"### Human: {example['prompt']}\n### Assistant: {example['response']}"}
# dataset = dataset.map(format_instruction)

Step 2: Configure quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16, # BF16 for stability
bnb_4bit_quant_type="nf4", # NormalFloat4 (recommended)
bnb_4bit_use_double_quant=True # Nested quantization
)

Step 3: Load and prepare model
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-13b-hf",
quantization_config=bnb_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
tokenizer.pad_token = tokenizer.eos_token
# Enable gradient checkpointing (further memory savings)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

Step 4: Configure LoRA
from peft import LoraConfig
lora_config = LoraConfig(
r=16, # LoRA rank (lower = less memory)
lora_alpha=32, # Scaling factor
target_modules="all-linear", # Apply to all linear layers
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

Step 5: Train
training_args = TrainingArguments(
output_dir="./qlora-13b-results",
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch = 16
warmup_steps=100,
num_train_epochs=1,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_strategy="steps",
save_steps=100,
eval_strategy="steps",
eval_steps=100,
optim="paged_adamw_8bit", # 8-bit optimizer
max_grad_norm=0.3,
max_steps=1000
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
tokenizer=tokenizer,
max_seq_length=512
)
trainer.train()

Memory usage: ~18GB on an RTX 4090 (24GB).
Train Llama 2 70B on 8×A100 (80GB each).
Step 1: Configure FSDP-compatible quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_storage=torch.bfloat16 # CRITICAL for FSDP!
)

Important: bnb_4bit_quant_storage=torch.bfloat16 ensures the 4-bit layers are stored in the same dtype as the regular layers, so FSDP can wrap and shard them identically.
Step 2: Launch with accelerate
Create fsdp_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_forward_prefetch: true
fsdp_sharding_strategy: 1 # FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
mixed_precision: bf16
num_processes: 8

Launch training:
accelerate launch --config_file fsdp_config.yaml train_qlora.py

train_qlora.py:
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
torch_dtype=torch.bfloat16
)
# Rest same as single-GPU workflow
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
trainer = SFTTrainer(...)
trainer.train()

Memory per GPU: ~40GB (70B model sharded across 8 GPUs).
Train Llama 3.1 405B on 8×H100 (80GB each).
Requirements:
- 8×H100 80GB GPUs
- 256GB+ system RAM
- FSDP + QLoRA
Configuration:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_storage=torch.bfloat16
)
lora_config = LoraConfig(
r=32, # Higher rank for 405B
lora_alpha=64,
target_modules="all-linear",
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM"
)
training_args = TrainingArguments(
per_device_train_batch_size=1, # Small batch
gradient_accumulation_steps=32, # Effective batch = 256
learning_rate=1e-4, # Lower LR for large model
bf16=True,
optim="paged_adamw_8bit",
gradient_checkpointing=True
)

Memory per GPU: ~70GB (405B in 4-bit, sharded across 8 GPUs).
The LoRA rank (r) controls adapter capacity:
| Model Size | Recommended r | Trainable Params | Use Case |
|---|---|---|---|
| 7B | 8-16 | ~4M | Simple tasks |
| 13B | 16-32 | ~8M | General fine-tuning |
| 70B | 32-64 | ~80M | Complex tasks |
| 405B | 64-128 | ~300M | Maximum capacity |
Trade-off: Higher r = more capacity but more memory and slower training
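A per-module way to reason about these counts (a minimal sketch; actual totals depend on the model's layer shapes and on how many modules you target):

```python
# Each adapted linear layer of shape (d_out, d_in) adds two small matrices:
# A with shape (r, d_in) and B with shape (d_out, r).
def lora_params_per_module(r: int, d_in: int, d_out: int) -> int:
    return r * (d_in + d_out)

# Example: one 4096x4096 attention projection at r=16 adds ~131K trainable params.
print(lora_params_per_module(16, 4096, 4096))  # 131072
```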
lora_alpha is the scaling factor for LoRA updates:
effective_learning_rate = learning_rate * (lora_alpha / r)

Recommended: lora_alpha = 2 × r
- r=16 → alpha=32
- r=64 → alpha=128
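Conceptually, lora_alpha / r scales the low-rank update added to the frozen layer's output. This toy forward pass is illustrative only (not PEFT's internal implementation):

```python
import torch

def lora_forward(x, W, A, B, lora_alpha, r):
    # Frozen base projection plus the low-rank update, scaled by alpha / r.
    return x @ W.T + (lora_alpha / r) * (x @ A.T @ B.T)

# Shapes: W is (d_out, d_in), A is (r, d_in), B is (d_out, r).
x = torch.randn(1, 4096)
W = torch.randn(4096, 4096)
A = torch.randn(8, 4096) * 0.01    # A is typically Gaussian-initialized
B = torch.zeros(4096, 8)           # B starts at zero, so the initial update is zero
print(lora_forward(x, W, A, B, lora_alpha=16, r=8).shape)  # torch.Size([1, 4096])
```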
Options for target_modules:
"all-linear": All linear layers (recommended for QLoRA)["q_proj", "v_proj"]: Only attention (minimal)["q_proj", "k_proj", "v_proj", "o_proj"]: All attention["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]: Attention + FFN
Trade-off: More modules = better performance but more memory
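For example, an attention-only configuration (a smaller footprint than "all-linear"; the module names assume a Llama-style architecture):

```python
from peft import LoraConfig

attention_only = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```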
| Model Size | Recommended LR |
|---|---|
| 7-13B | 2e-4 to 3e-4 |
| 70B | 1e-4 to 2e-4 |
| 405B | 5e-5 to 1e-4 |
Rule: Larger models need lower learning rates
effective_batch_size = per_device_batch_size × gradient_accumulation_steps × num_gpus

Recommended effective batch sizes:
- Instruction tuning: 64-128
- Continued pretraining: 256-512
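Worked example using the single-GPU 13B recipe from this guide:

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 16 -- raise accumulation (or GPU count) to reach 64-128
```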
| Dtype | Speed | Accuracy | Use Case |
|---|---|---|---|
| torch.float32 | Slow | Best | Debugging |
| torch.bfloat16 | Fast | Good | Recommended |
| torch.float16 | Fastest | Risky | May have precision issues |
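A minimal sketch of keeping the quantization compute dtype and the trainer precision flag consistent (BF16 shown, as recommended above; the output path is illustrative):

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls on dequantized weights run in BF16
)

training_args = TrainingArguments(
    output_dir="./out",
    bf16=True,                               # training loop also runs in BF16
)
```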
Save memory by recomputing activations:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

Memory savings: ~30-40% of activation memory. Cost: ~20% slower training.
Quantize the quantization constants:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True  # Enable nested quantization of the quantization constants
)

Memory savings: Additional ~2-3% reduction. Accuracy: Minimal impact.
For models that still don't fit:
model = AutoModelForCausalLM.from_pretrained(
"model-name",
quantization_config=bnb_config,
device_map="auto",
max_memory={0: "40GB", "cpu": "100GB"}
)

Trade-off: Much slower, but enables larger models.
Use paged memory for optimizer states:
training_args = TrainingArguments(
optim="paged_adamw_8bit" # Or paged_adamw_32bit
)

Benefit: Prevents OOM caused by optimizer states.
# Save only adapters (~20MB)
model.save_pretrained("./qlora-adapters")
tokenizer.save_pretrained("./qlora-adapters")

from peft import PeftModel
# Load base model in 4-bit
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# Load adapters
model = PeftModel.from_pretrained(base_model, "./qlora-adapters")
# Inference
inputs = tokenizer("Question here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)

# Merge LoRA into base weights
model = model.merge_and_unload()
# Save merged model
model.save_pretrained("./merged-model")

Note: The merged model loses 4-bit quantization (weights are back in FP16/BF16).
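A common workflow (a sketch, assuming you want full-precision merged weights) is to reload the base model in BF16 rather than 4-bit, attach the saved adapters, and then merge:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in BF16 (not 4-bit) so the adapters merge into full-precision weights.
base_fp = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base_fp, "./qlora-adapters").merge_and_unload()
merged.save_pretrained("./merged-model")
```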
Out of memory (OOM): try the following (combined in the sketch after this list):

- Reduce batch size: per_device_train_batch_size=1
- Increase gradient accumulation: gradient_accumulation_steps=16
- Lower LoRA rank: r=8 instead of 16
- Enable gradient checkpointing
- Use CPU offloading
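A sketch combining these memory-saving settings into one configuration (values are illustrative; every option appears elsewhere in this guide):

```python
from transformers import TrainingArguments
from peft import LoraConfig

low_memory_args = TrainingArguments(
    output_dir="./qlora-low-mem",
    per_device_train_batch_size=1,      # smallest per-device batch
    gradient_accumulation_steps=16,     # keep the effective batch size reasonable
    gradient_checkpointing=True,        # recompute activations to save memory
    optim="paged_adamw_8bit",           # paged 8-bit optimizer states
    bf16=True,
)

low_memory_lora = LoraConfig(
    r=8,                                # lower rank -> fewer trainable params
    lora_alpha=16,
    target_modules="all-linear",
    bias="none",
    task_type="CAUSAL_LM",
)
```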
Poor fine-tuned quality (underfitting):

- Increase LoRA rank: r=64 instead of 16
- Train longer: num_train_epochs=3 instead of 1
- Use more target modules: target_modules="all-linear"
- Check the learning rate (try 1e-4 to 3e-4)
Slow training:

- Disable gradient checkpointing (if memory allows)
- Increase batch size
- Use BF16: bf16=True
- Use paged optimizer
- Start small: Test on 7B before 70B
- Monitor loss: Should decrease steadily
- Use validation: Track eval loss to detect overfitting
- Save checkpoints: Every 100-500 steps
- Log hyperparameters: For reproducibility
- Test inference: Verify quality before full training
See full working example at examples/qlora_training.py in the repository.
- QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- bitsandbytes GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
- PEFT documentation: https://huggingface.co/docs/peft
- FSDP+QLoRA guide: https://huggingface.co/blog/fsdp-qlora