# Distributed LLM Training
Train a simple CNN on CIFAR-10:

```bash
python train_toy.py --epochs 2 --batch_size 64 --learning_rate 0.001
```

Single GPU:

```bash
python train_toy.py --mode single
```

DataParallel (splits batches across GPUs, but GPU:0 handles extra work):

```bash
python train_toy.py --mode dp
```

DDP (each GPU trains independently on different data shards and syncs gradients via NCCL):

```bash
torchrun --nproc_per_node=1 train_toy.py --mode ddp
```
- If you see "CUDA out of memory": lower `batch_size` or use gradient accumulation.
- If training freezes: make sure you're using `DistributedSampler` and barriers where needed.
- If you see an NCCL shutdown warning: call `dist.destroy_process_group()` at the end of training.
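The gradient-accumulation workaround mentioned above can be sketched in plain PyTorch. This is a minimal illustration, not code from `train_toy.py`: the model, data, and `accum_steps` value are made up.

```python
# Minimal gradient-accumulation sketch (illustrative model/data, not from train_toy.py).
import torch
import torch.nn as nn

def train_accumulated(model, batches, accum_steps, lr=0.1):
    """Run one pass over `batches`, stepping only every `accum_steps` micro-batches."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    opt.zero_grad()
    for i, (x, y) in enumerate(batches):
        # Scale each micro-batch loss so the summed gradient matches the
        # gradient of one large batch of accum_steps micro-batches.
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad()
    return model

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(4, 1)
    batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
    # 4 micro-batches of 8 behave like one batch of 32, at 1/4 the peak memory.
    train_accumulated(model, batches, accum_steps=4)
```

The division by `accum_steps` is what makes the accumulated gradient numerically equivalent to a single large-batch gradient (for equal-sized micro-batches and a mean-reduced loss).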
Benchmark setup:

- Hardware: 1× NVIDIA GPU
- Dataset: AG News (1000 samples)
- Model: `bert-base-uncased`
- Batch Size: 8
- Learning Rate: 5e-05
- Epochs: 1
| Metric | Single GPU | DataParallel | DDP |
|---|---|---|---|
| Epoch Time (s) | 26.14 | 25.86 | 28.20 |
| Throughput (samples/s) | 38.26 | 38.67 | 35.46 |
| Peak Memory (MB) | 2242 | 2242 | 3119 |
- DataParallel shows ~1% higher throughput than Single GPU mode, within run-to-run noise on a single device.
- DDP incurs distributed overhead even on one GPU, resulting in ~7% lower throughput than the baseline.
- Peak memory is identical (2242 MB) for the Single GPU and DataParallel modes.
- DDP uses ~39% more memory due to distributed process setup overhead.
- Plain single-GPU training remains the most efficient option when only one GPU is available.
For multi-GPU scenarios, DDP would show substantial improvements due to true parallel gradient synchronization.
- Gradient accumulation prevents OOM on small GPUs.
- Checkpoint/resume supports crash recovery.
- Auto-wrap policy automatically shards large Transformer blocks (FSDP prep). Use with: `--auto_wrap`
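An auto-wrap policy of this kind is typically built with PyTorch's FSDP wrap utilities. A sketch, assuming a stand-in `TransformerBlock` class (substitute your model's actual block type):

```python
# Sketch: building an FSDP auto-wrap policy at Transformer-block granularity.
# `TransformerBlock` is a stand-in, not a class from this repo.
import functools
import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class TransformerBlock(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

# Bind the block type into the policy; FSDP then shards each matching
# submodule as its own unit:  FSDP(model, auto_wrap_policy=auto_wrap_policy)
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)
```

Wrapping at block granularity keeps all-gathers scoped to one layer at a time instead of materializing the full model.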
This training script supports automatic checkpointing and resume logic.

- At the end of each epoch, rank 0 saves `checkpoint.pt` containing:
  - Current epoch number
  - Model weights
  - Optimizer state
- On restart, the script checks for `checkpoint.pt`:
  - If found, it loads the model/optimizer state and resumes from the next epoch.
  - If not found, training starts fresh.
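The save/resume logic described above can be sketched as follows. This is illustrative; the exact fields saved by `checkpoint_ddp.py` may differ slightly.

```python
# Sketch of epoch-level checkpoint save/resume (illustrative; not the exact
# code in checkpoint_ddp.py).
import os
import torch
import torch.nn as nn

CKPT = "checkpoint.pt"

def save_checkpoint(model, optimizer, epoch, path=CKPT):
    """Called by rank 0 at the end of each epoch."""
    torch.save({
        "epoch": epoch,                        # last completed epoch
        "model_state": model.state_dict(),
        "optim_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path=CKPT):
    """Return the epoch to start from (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0                               # fresh start
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optim_state"])
    return ckpt["epoch"] + 1                   # resume from the next epoch
```

Saving the optimizer state alongside the weights matters for optimizers with internal buffers (momentum, Adam moments); resuming without it silently resets those statistics.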
Example:

```bash
torchrun --nproc_per_node=1 checkpoint_ddp.py --mode ddp --epochs 3 --batch_size 16
```

## Script: train_fsdp_toy.py
This script demonstrates FSDP with a simple toy model:
- Model: Tiny MLP (10 → 64 → 2)
- Dataset: Random tensor data (64 samples)
- Wraps the model with `FullyShardedDataParallel`
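The core of such a script can be sketched like this, assuming the shapes listed above (a minimal sketch, not the actual `train_fsdp_toy.py`; the FSDP section only runs when launched under `torchrun`):

```python
# Sketch: tiny MLP wrapped in FSDP (assumed structure; see train_fsdp_toy.py
# for the real script). The distributed part only runs under torchrun.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.net(x)

if "RANK" in os.environ:  # environment variable set by torchrun
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    model = FSDP(TinyMLP())  # shards parameters across ranks (no real sharding with 1 process)
    # ... training loop over random tensor data would go here ...
    dist.destroy_process_group()
```

With `--nproc_per_node=1` there is only one shard, so this exercises the FSDP code path without any memory savings.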
Run:

```bash
torchrun --nproc_per_node=1 train_fsdp_toy.py --epochs 1 --batch_size 8
```

## Installation
Install required dependencies:
```bash
pip install -r requirements.txt
```

Or install individually:

```bash
pip install torch transformers datasets peft accelerate
```

## Script: train_fsdp_hf.py
This script demonstrates FSDP training with real language models using Hugging Face transformers and LoRA (Low-Rank Adaptation):
- Model: Llama-2-7b-hf (7 billion parameters)
- Fine-tuning: LoRA (Low-Rank Adaptation) for efficient parameter updates
- Distributed Training: FSDP (Fully Sharded Data Parallel) for memory efficiency
- Dataset: WikiText-2 small subset (200 samples)
- Optimization: AdamW optimizer with mixed precision training
- LoRA Configuration:
  - Rank: 8
  - Alpha: 32
  - Target modules: `["q_proj", "v_proj"]` (Llama-specific)
  - Task type: Causal Language Modeling
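To see what those numbers mean, here is the LoRA update rule written out in plain PyTorch rather than via `peft` (illustrative only; the script itself uses `peft`): for a frozen weight W, LoRA learns a rank-r delta scaled by alpha/r.

```python
# LoRA in plain PyTorch (illustrative; train_fsdp_hf.py uses peft instead).
# With rank r=8 and alpha=32, the effective update is W + (32/8) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => exact no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Only A and B (2 × r × d parameters per wrapped layer) are trained, which is why targeting just `q_proj` and `v_proj` keeps the trainable parameter count tiny relative to the 7B base model.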
- FSDP Setup:
  - Auto-wrap policy for `LlamaDecoderLayer`
  - TF32 support for Ampere GPUs
  - Proper device placement and memory management
- Training Features:
  - Mixed precision with `torch.cuda.amp.autocast`
  - Gradient scaling for stability
  - Memory monitoring and checkpointing
  - Distributed process group management
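The first two features combine roughly like this (a sketch with a made-up model and data; it falls back to CPU autocast when no GPU is present, in which case the scaler is disabled):

```python
# Sketch: one autocast + GradScaler training step (illustrative model/data).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # no-op on CPU
x, y = torch.randn(16, 8, device=device), torch.randn(16, 1, device=device)

opt.zero_grad()
with torch.autocast(device_type=device):      # low-precision forward where safe
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()                 # scale loss to avoid fp16 grad underflow
scaler.step(opt)                              # unscales grads, skips step if inf/nan
scaler.update()                               # adjusts the scale factor
```

Gradient scaling matters only for fp16; it protects small gradients from underflowing to zero before the optimizer sees them.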
Single GPU (NO_SHARD mode):
```bash
torchrun --nproc_per_node=1 train_fsdp_hf.py
```

Multi-GPU (FULL_SHARD mode):

```bash
torchrun --nproc_per_node=2 train_fsdp_hf.py
```

- Minimum: 4GB+ GPU memory (for smaller models)
- Recommended: 16GB+ GPU memory for Llama-2-7b
- Multi-GPU: Distributes model across available GPUs
- Training progress with loss values
- Peak memory usage statistics
- Model checkpoint saved as `fsdp_lora_checkpoint.pt`
- With 1 GPU, FSDP runs in NO_SHARD mode (no memory savings)
- True sharding benefits require `--nproc_per_node > 1`
- For testing on smaller GPUs, consider switching to TinyLlama or GPT-2 models
## Script: deepspeed=accelerate/train_ds_hf.py
This script demonstrates training with Hugging Face's Accelerate library and Trainer API:
- Model: DistilBERT-base-uncased for binary classification
- Fine-tuning: LoRA (Low-Rank Adaptation) with PEFT
- Training: Hugging Face Trainer with Accelerate
- Dataset: Synthetic movie review data (200 samples)
- Optimization: AdamW optimizer with gradient accumulation
```bash
accelerate launch "deepspeed=accelerate/train_ds_hf.py"
```

- Hardware: 1× NVIDIA GPU
- Model: DistilBERT-base-uncased
- Dataset: Synthetic movie reviews (200 samples)
- Batch Size: 1 (with 8 gradient accumulation steps)
- Learning Rate: 2e-4
- Epochs: 1
- LoRA Rank: 8, Alpha: 32
| Metric | Value |
|---|---|
| Training Time (s) | 10.01 |
| Throughput (samples/s) | 19.98 |
| Steps per Second | 2.50 |
| Final Loss | 0.461 |
| Gradient Norm | 1.51-1.62 |
| Peak Memory | GPU optimized |
Sample output:

```
Loading tokenizer and model...
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 7794.29 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
Training 1 epoch...
{'loss': 0.6157, 'grad_norm': 1.508832573890686, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.3932, 'grad_norm': 1.615255355834961, 'learning_rate': 4.8e-05, 'epoch': 0.8}
100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:10<00:00, 2.50it/s]
{'train_runtime': 10.0089, 'train_samples_per_second': 19.982, 'train_steps_per_second': 2.498, 'train_loss': 0.4607763671875, 'epoch': 1.0}
Training complete!
```
- `torch_dtype` deprecation: Use `dtype` instead (cosmetic warning)
- Uninitialized weights: Normal for the classification head when adapting a pre-trained model
- 401 Unauthorized: Hugging Face Hub authentication issue (safely ignored)
- Tokenizer deprecation: Future versions will use `processing_class` instead
- LoRA Configuration: Targets `["q_lin", "v_lin"]` modules for DistilBERT
- Synthetic Data: Avoids Hugging Face Hub authentication issues
- Mixed Precision: Disabled for stability (FP32 training)
- Gradient Accumulation: 8 steps for effective batch size of 8
- Windows Compatible: No DeepSpeed dependency
## Script: deepspeed-accelerate/train_ds_hf.py
This script demonstrates training with DeepSpeed ZeRO-1 optimization using Hugging Face's Accelerate library:
- Model: DistilBERT-base-uncased for binary classification
- Optimization: DeepSpeed ZeRO-1 (Zero Redundancy Optimizer Stage 1)
- Training: Hugging Face Trainer with Accelerate + DeepSpeed
- Dataset: Synthetic movie review data (200 samples)
- Configuration: ZeRO-1 with FP16 mixed precision
```bash
cd deepspeed-accelerate
accelerate launch train_ds_hf.py
```

- Hardware: 1× NVIDIA GPU
- Model: DistilBERT-base-uncased
- Dataset: Synthetic movie reviews (200 samples)
- Batch Size: 1 (with 8 gradient accumulation steps)
- Learning Rate: 2e-4
- Epochs: 1
| Metric | ZeRO-1 | ZeRO-2 | Standard Accelerate |
|---|---|---|---|
| Training Time (s) | 5.90 | 6.41 | 10.01 |
| Throughput (samples/s) | 33.90 | 31.20 | 19.98 |
| Steps per Second | 4.24 | 3.90 | 2.50 |
| Final Loss | 0.424 | 0.619 | 0.461 |
| Gradient Norm | 1.35-1.44 | 0.52-0.57 | 1.51-1.62 |
| Memory Optimization | Optimizer partitioning | Gradient + Optimizer partitioning | None |
ZeRO-1 run:

```
Loading tokenizer and model...
`torch_dtype` is deprecated! Use `dtype` instead!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 5581.99 examples/s]
Training 1 epoch...
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 8. Using DeepSpeed's value.
{'loss': 0.5851, 'grad_norm': 1.4390928745269775, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.3501, 'grad_norm': 1.352910041809082, 'learning_rate': 4.8e-05, 'epoch': 0.8}
{'train_runtime': 5.9004, 'train_samples_per_second': 33.896, 'train_steps_per_second': 4.237, 'train_loss': 0.42357421875, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00, 4.24it/s]
Training complete!
```
ZeRO-2 run:

```
Loading tokenizer and model...
`torch_dtype` is deprecated! Use `dtype` instead!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 8341.23 examples/s]
Training 1 epoch...
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 8. Using DeepSpeed's value.
{'loss': 0.659, 'grad_norm': 0.5713843703269958, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.5953, 'grad_norm': 0.5245606899261475, 'learning_rate': 4.8e-05, 'epoch': 0.8}
{'train_runtime': 6.4096, 'train_samples_per_second': 31.203, 'train_steps_per_second': 3.9, 'train_loss': 0.618642578125, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:06<00:00, 3.90it/s]
Training complete!
[rank0]:[W1024 10:39:03.109567454 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```
ZeRO-1:

- Faster Training: ~41% faster (5.90 s vs 10.01 s)
- Higher Throughput: ~70% improvement (33.90 vs 19.98 samples/s)
- Optimizer State Partitioning: Reduces memory usage for optimizer states

ZeRO-2:

- Faster Training: ~36% faster (6.41 s vs 10.01 s)
- Higher Throughput: ~56% improvement (31.20 vs 19.98 samples/s)
- Gradient + Optimizer Partitioning: More aggressive memory optimization
- FP16 Mixed Precision: Enabled for faster computation
- Gradient Accumulation: DeepSpeed config overrides Accelerate settings
- Memory Efficiency: Both stages significantly reduce memory usage
- Consistent Performance: Both ZeRO stages outperform standard training
ZeRO-1 config:

```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true },
  "zero_allow_untested_optimizer": true
}
```

ZeRO-2 config:

```json
{
  "zero_optimization": {
    "stage": 2,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true },
  "zero_allow_untested_optimizer": true
}
```

- ZeRO-1: Partitions optimizer states across GPUs
- ZeRO-2: Partitions both gradients and optimizer states
- ZeRO-2: More memory efficient but slightly slower due to additional communication overhead
- Both: Use FP16 mixed precision and gradient accumulation