# Distributed LLM Training
Train a simple CNN on CIFAR-10:

```bash
python train_toy.py --epochs 2 --batch_size 64 --learning_rate 0.001
```

Single GPU:

```bash
python train_toy.py --mode single
```

DataParallel (splits batches across GPUs, but GPU:0 handles extra work):

```bash
python train_toy.py --mode dp
```

DDP (each GPU trains independently on different data shards and syncs gradients via NCCL):

```bash
torchrun --nproc_per_node=1 train_toy.py --mode ddp
```
- If you see "CUDA out of memory": lower `batch_size` or use gradient accumulation.
- If training freezes: make sure you're using `DistributedSampler` and barriers where needed.
- If you see an NCCL shutdown warning: call `dist.destroy_process_group()` at the end of training.
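The gradient-accumulation workaround mentioned above can be sketched in plain PyTorch. This is a minimal illustration, not code from `train_toy.py`: the model, data, and `accum_steps` value are made up.

```python
# Minimal gradient-accumulation sketch (illustrative model/data, not from train_toy.py).
import torch
import torch.nn as nn

def train_accumulated(model, batches, accum_steps, lr=0.1):
    """Run one pass over `batches`, stepping only every `accum_steps` micro-batches."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    opt.zero_grad()
    for i, (x, y) in enumerate(batches):
        # Scale each micro-batch loss so the summed gradient matches the
        # gradient of one large batch of accum_steps micro-batches.
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            opt.step()
            opt.zero_grad()
    return model

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(4, 1)
    batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]
    # 4 micro-batches of 8 behave like one batch of 32, at 1/4 the peak memory.
    train_accumulated(model, batches, accum_steps=4)
```

The division by `accum_steps` is what makes the accumulated gradient numerically equivalent to a single large-batch gradient (for equal-sized micro-batches and a mean-reduced loss).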
Benchmark setup:

- Hardware: 1× NVIDIA GPU
- Dataset: AG News (1000 samples)
- Model: `bert-base-uncased`
- Batch Size: 8
- Learning Rate: 5e-05
- Epochs: 1
| Metric | Single GPU | DataParallel | DDP |
|---|---|---|---|
| Epoch Time (s) | 26.14 | 25.86 | 28.20 |
| Throughput (samples/s) | 38.26 | 38.67 | 35.46 |
| Peak Memory (MB) | 2242 | 2242 | 3119 |
- DataParallel shows ~1% higher throughput than Single GPU mode, within run-to-run noise on a single device.
- DDP incurs distributed overhead even on one GPU, resulting in ~7% lower throughput than the baseline.
- Peak memory is identical (2242 MB) for the Single GPU and DataParallel modes.
- DDP uses ~39% more memory due to distributed process setup overhead.
- Plain single-GPU training remains the most efficient option when only one GPU is available.
For multi-GPU scenarios, DDP would show substantial improvements due to true parallel gradient synchronization.
- Gradient accumulation prevents OOM on small GPUs.
- Checkpoint/resume supports crash recovery.
- Auto-wrap policy automatically shards large Transformer blocks (FSDP prep). Use with: `--auto_wrap`
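An auto-wrap policy of this kind is typically built with PyTorch's FSDP wrap utilities. A sketch, assuming a stand-in `TransformerBlock` class (substitute your model's actual block type):

```python
# Sketch: building an FSDP auto-wrap policy at Transformer-block granularity.
# `TransformerBlock` is a stand-in, not a class from this repo.
import functools
import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class TransformerBlock(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

# Bind the block type into the policy; FSDP then shards each matching
# submodule as its own unit:  FSDP(model, auto_wrap_policy=auto_wrap_policy)
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},
)
```

Wrapping at block granularity keeps all-gathers scoped to one layer at a time instead of materializing the full model.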
This training script supports automatic checkpointing and resume logic.

- At the end of each epoch, rank 0 saves `checkpoint.pt` containing:
  - Current epoch number
  - Model weights
  - Optimizer state
- On restart, the script checks for `checkpoint.pt`:
  - If found, it loads the model/optimizer state and resumes from the next epoch.
  - If not found, training starts fresh.
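The save/resume logic described above can be sketched as follows. This is illustrative; the exact fields saved by `checkpoint_ddp.py` may differ slightly.

```python
# Sketch of epoch-level checkpoint save/resume (illustrative; not the exact
# code in checkpoint_ddp.py).
import os
import torch
import torch.nn as nn

CKPT = "checkpoint.pt"

def save_checkpoint(model, optimizer, epoch, path=CKPT):
    """Called by rank 0 at the end of each epoch."""
    torch.save({
        "epoch": epoch,                        # last completed epoch
        "model_state": model.state_dict(),
        "optim_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path=CKPT):
    """Return the epoch to start from (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0                               # fresh start
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optim_state"])
    return ckpt["epoch"] + 1                   # resume from the next epoch
```

Saving the optimizer state alongside the weights matters for optimizers with internal buffers (momentum, Adam moments); resuming without it silently resets those statistics.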
Example:

```bash
torchrun --nproc_per_node=1 checkpoint_ddp.py --mode ddp --epochs 3 --batch_size 16
```

## Script: train_fsdp_toy.py
This script demonstrates FSDP with a simple toy model:
- Model: Tiny MLP (10 → 64 → 2)
- Dataset: Random tensor data (64 samples)
- Wraps the model with `FullyShardedDataParallel`
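The core of such a script can be sketched like this, assuming the shapes listed above (a minimal sketch, not the actual `train_fsdp_toy.py`; the FSDP section only runs when launched under `torchrun`):

```python
# Sketch: tiny MLP wrapped in FSDP (assumed structure; see train_fsdp_toy.py
# for the real script). The distributed part only runs under torchrun.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.net(x)

if "RANK" in os.environ:  # environment variable set by torchrun
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    model = FSDP(TinyMLP())  # shards parameters across ranks (no real sharding with 1 process)
    # ... training loop over random tensor data would go here ...
    dist.destroy_process_group()
```

With `--nproc_per_node=1` there is only one shard, so this exercises the FSDP code path without any memory savings.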
Run:

```bash
torchrun --nproc_per_node=1 train_fsdp_toy.py --epochs 1 --batch_size 8
```

## Installation
Install required dependencies:
```bash
pip install -r requirements.txt
```

Or install individually:

```bash
pip install torch transformers datasets peft accelerate
```

## Script: train_fsdp_hf.py
This script demonstrates FSDP training with real language models using Hugging Face transformers and LoRA (Low-Rank Adaptation):
- Model: Llama-2-7b-hf (7 billion parameters)
- Fine-tuning: LoRA (Low-Rank Adaptation) for efficient parameter updates
- Distributed Training: FSDP (Fully Sharded Data Parallel) for memory efficiency
- Dataset: WikiText-2 small subset (200 samples)
- Optimization: AdamW optimizer with mixed precision training
- LoRA Configuration:
  - Rank: 8
  - Alpha: 32
  - Target modules: `["q_proj", "v_proj"]` (Llama-specific)
  - Task type: Causal Language Modeling
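To see what those numbers mean, here is the LoRA update rule written out in plain PyTorch rather than via `peft` (illustrative only; the script itself uses `peft`): for a frozen weight W, LoRA learns a rank-r delta scaled by alpha/r.

```python
# LoRA in plain PyTorch (illustrative; train_fsdp_hf.py uses peft instead).
# With rank r=8 and alpha=32, the effective update is W + (32/8) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => exact no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Only A and B (2 × r × d parameters per wrapped layer) are trained, which is why targeting just `q_proj` and `v_proj` keeps the trainable parameter count tiny relative to the 7B base model.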
- FSDP Setup:
  - Auto-wrap policy for `LlamaDecoderLayer`
  - TF32 support for Ampere GPUs
  - Proper device placement and memory management
- Training Features:
  - Mixed precision with `torch.cuda.amp.autocast`
  - Gradient scaling for stability
  - Memory monitoring and checkpointing
  - Distributed process group management
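The first two features combine roughly like this (a sketch with a made-up model and data; it falls back to CPU autocast when no GPU is present, in which case the scaler is disabled):

```python
# Sketch: one autocast + GradScaler training step (illustrative model/data).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # no-op on CPU
x, y = torch.randn(16, 8, device=device), torch.randn(16, 1, device=device)

opt.zero_grad()
with torch.autocast(device_type=device):      # low-precision forward where safe
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()                 # scale loss to avoid fp16 grad underflow
scaler.step(opt)                              # unscales grads, skips step if inf/nan
scaler.update()                               # adjusts the scale factor
```

Gradient scaling matters only for fp16; it protects small gradients from underflowing to zero before the optimizer sees them.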
Single GPU (NO_SHARD mode):
```bash
torchrun --nproc_per_node=1 train_fsdp_hf.py
```

Multi-GPU (FULL_SHARD mode):

```bash
torchrun --nproc_per_node=2 train_fsdp_hf.py
```

- Minimum: 4GB+ GPU memory (for smaller models)
- Recommended: 16GB+ GPU memory for Llama-2-7b
- Multi-GPU: Distributes model across available GPUs
- Training progress with loss values
- Peak memory usage statistics
- Model checkpoint saved as `fsdp_lora_checkpoint.pt`
- With 1 GPU, FSDP runs in NO_SHARD mode (no memory savings)
- True sharding benefits require `--nproc_per_node > 1`
- For testing on smaller GPUs, consider switching to TinyLlama or GPT-2 models
## Script: deepspeed=accelerate/train_ds_hf.py
This script demonstrates training with Hugging Face's Accelerate library and Trainer API:
- Model: DistilBERT-base-uncased for binary classification
- Fine-tuning: LoRA (Low-Rank Adaptation) with PEFT
- Training: Hugging Face Trainer with Accelerate
- Dataset: Synthetic movie review data (200 samples)
- Optimization: AdamW optimizer with gradient accumulation
```bash
accelerate launch "deepspeed=accelerate/train_ds_hf.py"
```

- Hardware: 1× NVIDIA GPU
- Model: DistilBERT-base-uncased
- Dataset: Synthetic movie reviews (200 samples)
- Batch Size: 1 (with 8 gradient accumulation steps)
- Learning Rate: 2e-4
- Epochs: 1
- LoRA Rank: 8, Alpha: 32
| Metric | Value |
|---|---|
| Training Time (s) | 10.01 |
| Throughput (samples/s) | 19.98 |
| Steps per Second | 2.50 |
| Final Loss | 0.461 |
| Gradient Norm | 1.51-1.62 |
| Peak Memory | GPU optimized |
Sample output:

```
Loading tokenizer and model...
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 7794.29 examples/s]
The model is already on multiple devices. Skipping the move to device specified in `args`.
Training 1 epoch...
{'loss': 0.6157, 'grad_norm': 1.508832573890686, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.3932, 'grad_norm': 1.615255355834961, 'learning_rate': 4.8e-05, 'epoch': 0.8}
100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:10<00:00, 2.50it/s]
{'train_runtime': 10.0089, 'train_samples_per_second': 19.982, 'train_steps_per_second': 2.498, 'train_loss': 0.4607763671875, 'epoch': 1.0}
Training complete!
```
- `torch_dtype` deprecation: Use `dtype` instead (cosmetic warning)
- Uninitialized weights: Normal for the classification head when adapting a pre-trained model
- 401 Unauthorized: Hugging Face Hub authentication issue (safely ignored)
- Tokenizer deprecation: Future versions will use `processing_class` instead
- LoRA Configuration: Targets `["q_lin", "v_lin"]` modules for DistilBERT
- Synthetic Data: Avoids Hugging Face Hub authentication issues
- Mixed Precision: Disabled for stability (FP32 training)
- Gradient Accumulation: 8 steps for effective batch size of 8
- Windows Compatible: No DeepSpeed dependency
## Script: deepspeed-accelerate/train_ds_hf.py
This script demonstrates training with DeepSpeed ZeRO-1 optimization using Hugging Face's Accelerate library:
- Model: DistilBERT-base-uncased for binary classification
- Optimization: DeepSpeed ZeRO-1 (Zero Redundancy Optimizer Stage 1)
- Training: Hugging Face Trainer with Accelerate + DeepSpeed
- Dataset: Synthetic movie review data (200 samples)
- Configuration: ZeRO-1 with FP16 mixed precision
```bash
cd deepspeed-accelerate
accelerate launch train_ds_hf.py
```

- Hardware: 1× NVIDIA GPU
- Model: DistilBERT-base-uncased
- Dataset: Synthetic movie reviews (200 samples)
- Batch Size: 1 (with 8 gradient accumulation steps)
- Learning Rate: 2e-4
- Epochs: 1
| Metric | ZeRO-1 | ZeRO-2 | Standard Accelerate |
|---|---|---|---|
| Training Time (s) | 5.90 | 6.41 | 10.01 |
| Throughput (samples/s) | 33.90 | 31.20 | 19.98 |
| Steps per Second | 4.24 | 3.90 | 2.50 |
| Final Loss | 0.424 | 0.619 | 0.461 |
| Gradient Norm | 1.35-1.44 | 0.52-0.57 | 1.51-1.62 |
| Memory Optimization | Optimizer partitioning | Gradient + Optimizer partitioning | None |
ZeRO-1 run:

```
Loading tokenizer and model...
`torch_dtype` is deprecated! Use `dtype` instead!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 5581.99 examples/s]
Training 1 epoch...
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 8. Using DeepSpeed's value.
{'loss': 0.5851, 'grad_norm': 1.4390928745269775, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.3501, 'grad_norm': 1.352910041809082, 'learning_rate': 4.8e-05, 'epoch': 0.8}
{'train_runtime': 5.9004, 'train_samples_per_second': 33.896, 'train_steps_per_second': 4.237, 'train_loss': 0.42357421875, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00, 4.24it/s]
Training complete!
```
ZeRO-2 run:

```
Loading tokenizer and model...
`torch_dtype` is deprecated! Use `dtype` instead!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 8341.23 examples/s]
Training 1 epoch...
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 8. Using DeepSpeed's value.
{'loss': 0.659, 'grad_norm': 0.5713843703269958, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.5953, 'grad_norm': 0.5245606899261475, 'learning_rate': 4.8e-05, 'epoch': 0.8}
{'train_runtime': 6.4096, 'train_samples_per_second': 31.203, 'train_steps_per_second': 3.9, 'train_loss': 0.618642578125, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:06<00:00, 3.90it/s]
Training complete!
[rank0]:[W1024 10:39:03.109567454 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```
ZeRO-1:

- Faster Training: ~41% faster (5.90 s vs 10.01 s)
- Higher Throughput: ~70% improvement (33.90 vs 19.98 samples/s)
- Optimizer State Partitioning: Reduces memory usage for optimizer states

ZeRO-2:

- Faster Training: ~36% faster (6.41 s vs 10.01 s)
- Higher Throughput: ~56% improvement (31.20 vs 19.98 samples/s)
- Gradient + Optimizer Partitioning: More aggressive memory optimization
- FP16 Mixed Precision: Enabled for faster computation
- Gradient Accumulation: DeepSpeed config overrides Accelerate settings
- Memory Efficiency: Both stages significantly reduce memory usage
- Consistent Performance: Both ZeRO stages outperform standard training
ZeRO-1 config:

```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true },
  "zero_allow_untested_optimizer": true
}
```

ZeRO-2 config:

```json
{
  "zero_optimization": {
    "stage": 2,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true },
  "zero_allow_untested_optimizer": true
}
```

- ZeRO-1: Partitions optimizer states across GPUs
- ZeRO-2: Partitions both gradients and optimizer states
- ZeRO-2: More memory efficient but slightly slower due to additional communication overhead
- Both: Use FP16 mixed precision and gradient accumulation