Distributed LLM Training

Train a simple CNN on CIFAR-10:

python train_toy.py --epochs 2 --batch_size 64 --learning_rate 0.001

Single GPU Mode

python train_toy.py --mode single

DataParallel Mode

python train_toy.py --mode dp

DataParallel splits each batch across the available GPUs, but GPU:0 gathers the outputs and computes the loss, so it carries extra work.
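Wrapping a model for DataParallel is a one-liner; a minimal sketch (using a stand-in linear layer rather than the repo's CNN):

```python
import torch
import torch.nn as nn

# Stand-in toy model for illustration; train_toy.py uses a CNN
model = nn.Linear(10, 2)

# DataParallel replicates the module on every visible GPU and splits the
# input batch along dim 0; outputs are gathered back onto GPU:0, which is
# why GPU:0 carries the extra work. With no GPUs it simply runs the
# wrapped module unchanged.
model = nn.DataParallel(model)
```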

DistributedDataParallel Mode (WSL/Linux)

torchrun --nproc_per_node=1 train_toy.py --mode ddp

Each GPU trains independently on different data shards and syncs gradients via NCCL.

Troubleshooting DDP

  • If you see "CUDA out of memory": lower batch_size or use gradient accumulation.
  • If training freezes: make sure you're using DistributedSampler and barriers where needed.
  • If you see an NCCL warning about the process group at exit: add dist.destroy_process_group() at the end of training.
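The DDP setup, DistributedSampler, and cleanup from the bullets above fit together roughly like this (a minimal sketch with a stand-in model; torchrun supplies the rank environment variables):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")

    model = torch.nn.Linear(10, 2).to(device)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank] if use_cuda else None)

    data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    # DistributedSampler gives every rank a disjoint shard of the dataset;
    # without it, ranks train on identical data
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=8, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(1):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()  # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()  # avoids the NCCL shutdown warning
    return loss.item()

if __name__ == "__main__":
    main()
```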

DDP vs DataParallel Benchmarks

Hardware: 1× NVIDIA GPU
Dataset: AG News (1000 samples)
Model: bert-base-uncased
Batch Size: 8
Learning Rate: 5e-05
Epochs: 1

| Metric                 | Single GPU | DataParallel | DDP   |
|------------------------|------------|--------------|-------|
| Epoch Time (s)         | 26.14      | 25.86        | 28.20 |
| Throughput (samples/s) | 38.26      | 38.67        | 35.46 |
| Peak Memory (MB)       | 2242       | 2242         | 3119  |

Observations

  • DataParallel shows slight performance improvement over Single GPU mode with ~1% faster throughput.
  • DDP has distributed overhead on single GPU, resulting in ~7% slower throughput than baseline.
  • Memory usage is consistent between Single GPU and DataParallel modes.
  • DDP uses significantly more memory (~39% increase) due to distributed process setup overhead.
  • Single GPU training remains most efficient for single-GPU setups with minimal overhead.

For multi-GPU scenarios, DDP would show substantial improvements due to true parallel gradient synchronization.

Handling OOM and Auto-Wrap Policies

  • Gradient accumulation prevents OOM on small GPUs.
  • Checkpoint/resume supports crash recovery.
  • Auto-wrap policy automatically shards large Transformer blocks (FSDP prep).
  • Use with: --auto_wrap
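The gradient-accumulation idea from the first bullet can be sketched as follows (a toy model and random data stand in for the real training setup):

```python
import torch

# Toy model and random data for illustration only
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

opt.zero_grad()
for step in range(8):
    x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
    # Scale each micro-batch loss so the accumulated gradients average out
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # gradients add up in .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        opt.step()       # one optimizer update per accum_steps micro-batches
        opt.zero_grad()
```

Each micro-batch fits in memory on its own, while the optimizer sees the same averaged gradient a 4x larger batch would have produced.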

Checkpointing & Resume (DDP)

This training script supports automatic checkpointing and resume logic.

Saving

  • At the end of each epoch, rank 0 saves checkpoint.pt containing:
    • Current epoch number
    • Model weights
    • Optimizer state

Resuming

  • On restart, the script checks for checkpoint.pt.
  • If found, it loads model/optimizer state and resumes from the next epoch.
  • If not found, training starts fresh.

Example:

torchrun --nproc_per_node=1 checkpoint_ddp.py --mode ddp --epochs 3 --batch_size 16
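The save/resume logic described above can be sketched like this (a minimal sketch; the exact checkpoint keys in checkpoint_ddp.py may differ):

```python
import os
import torch

CKPT = "checkpoint.pt"  # filename used by the script, per the notes above

def save_checkpoint(model, optimizer, epoch, rank=0):
    # Only rank 0 writes, so ranks don't clobber each other's file
    if rank == 0:
        torch.save({
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        }, CKPT)

def load_checkpoint(model, optimizer):
    # Returns the epoch to resume from; 0 means start fresh
    if not os.path.exists(CKPT):
        return 0
    ckpt = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1
```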

FSDP Setup in This Repo

Script: train_fsdp_toy.py

This script demonstrates FSDP with a simple toy model:

  • Model: Tiny MLP (10 → 64 → 2)
  • Dataset: Random tensor data (64 samples)
  • Wraps model with FullyShardedDataParallel

Run:

torchrun --nproc_per_node=1 train_fsdp_toy.py --epochs 1 --batch_size 8

Installation

Install required dependencies:

pip install -r requirements.txt

Or install individually:

pip install torch transformers datasets peft accelerate

FSDP with Hugging Face Models & LoRA

Script: train_fsdp_hf.py

This script demonstrates FSDP training with real language models using Hugging Face transformers and LoRA (Low-Rank Adaptation):

Features

  • Model: Llama-2-7b-hf (7 billion parameters)
  • Fine-tuning: LoRA (Low-Rank Adaptation) for efficient parameter updates
  • Distributed Training: FSDP (Fully Sharded Data Parallel) for memory efficiency
  • Dataset: WikiText-2 small subset (200 samples)
  • Optimization: AdamW optimizer with mixed precision training

Key Components

  • LoRA Configuration:

    • Rank: 8
    • Alpha: 32
    • Target modules: ["q_proj", "v_proj"] (Llama-specific)
    • Task type: Causal Language Modeling
  • FSDP Setup:

    • Auto-wrap policy for LlamaDecoderLayer
    • TF32 support for Ampere GPUs
    • Proper device placement and memory management
  • Training Features:

    • Mixed precision with torch.cuda.amp.autocast
    • Gradient scaling for stability
    • Memory monitoring and checkpointing
    • Distributed process group management

Usage

Single GPU (NO_SHARD mode):

torchrun --nproc_per_node=1 train_fsdp_hf.py

Multi-GPU (FULL_SHARD mode):

torchrun --nproc_per_node=2 train_fsdp_hf.py

Memory Requirements

  • Minimum: 4GB+ GPU memory (for smaller models)
  • Recommended: 16GB+ GPU memory for Llama-2-7b
  • Multi-GPU: Distributes model across available GPUs

Output

  • Training progress with loss values
  • Peak memory usage statistics
  • Model checkpoint saved as fsdp_lora_checkpoint.pt

Notes

  • With 1 GPU, FSDP runs in NO_SHARD mode (no memory savings)
  • True sharding benefits require --nproc_per_node > 1
  • For testing on smaller GPUs, consider switching to TinyLlama or GPT-2 models

Accelerate Launch with Hugging Face Trainer

Script: deepspeed=accelerate/train_ds_hf.py

This script demonstrates training with Hugging Face's Accelerate library and Trainer API:

Features

  • Model: DistilBERT-base-uncased for binary classification
  • Fine-tuning: LoRA (Low-Rank Adaptation) with PEFT
  • Training: Hugging Face Trainer with Accelerate
  • Dataset: Synthetic movie review data (200 samples)
  • Optimization: AdamW optimizer with gradient accumulation

Usage

accelerate launch "deepspeed=accelerate/train_ds_hf.py"

Training Results

Hardware: 1× NVIDIA GPU
Model: DistilBERT-base-uncased
Dataset: Synthetic movie reviews (200 samples)
Batch Size: 1 (with 8 gradient accumulation steps)
Learning Rate: 2e-4
Epochs: 1
LoRA Rank: 8, Alpha: 32

| Metric                 | Value         |
|------------------------|---------------|
| Training Time (s)      | 10.01         |
| Throughput (samples/s) | 19.98         |
| Steps per Second       | 2.50          |
| Final Loss             | 0.461         |
| Gradient Norm          | 1.51-1.62     |
| Peak Memory            | GPU optimized |

Training Progress Log

Loading tokenizer and model...
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Map: 100%|██████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 7794.29 examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
Training 1 epoch...

{'loss': 0.6157, 'grad_norm': 1.508832573890686, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.3932, 'grad_norm': 1.615255355834961, 'learning_rate': 4.8e-05, 'epoch': 0.8}

100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:10<00:00, 2.50it/s]

{'train_runtime': 10.0089, 'train_samples_per_second': 19.982, 'train_steps_per_second': 2.498, 'train_loss': 0.4607763671875, 'epoch': 1.0}

Training complete!

Notes on Warnings

  • torch_dtype deprecation: Use dtype instead (cosmetic warning)
  • Uninitialized weights: Normal for classification head when adapting pre-trained model
  • 401 Unauthorized: Hugging Face Hub authentication issue (ignored safely)
  • Tokenizer deprecation: Future version will use processing_class instead

Key Features

  • LoRA Configuration: Targets ["q_lin", "v_lin"] modules for DistilBERT
  • Synthetic Data: Avoids Hugging Face Hub authentication issues
  • Mixed Precision: Disabled for stability (FP32 training)
  • Gradient Accumulation: 8 steps for effective batch size of 8
  • Windows Compatible: No DeepSpeed dependency

DeepSpeed Training with ZeRO-1

Script: deepspeed-accelerate/train_ds_hf.py

This script demonstrates training with DeepSpeed ZeRO-1 optimization using Hugging Face's Accelerate library:

Features

  • Model: DistilBERT-base-uncased for binary classification
  • Optimization: DeepSpeed ZeRO-1 (Zero Redundancy Optimizer Stage 1)
  • Training: Hugging Face Trainer with Accelerate + DeepSpeed
  • Dataset: Synthetic movie review data (200 samples)
  • Configuration: ZeRO-1 with FP16 mixed precision

Usage

cd deepspeed-accelerate
accelerate launch train_ds_hf.py

Training Results Comparison

Hardware: 1× NVIDIA GPU
Model: DistilBERT-base-uncased
Dataset: Synthetic movie reviews (200 samples)
Batch Size: 1 (with 8 gradient accumulation steps)
Learning Rate: 2e-4
Epochs: 1

| Metric                 | ZeRO-1                 | ZeRO-2                           | Standard Accelerate |
|------------------------|------------------------|----------------------------------|---------------------|
| Training Time (s)      | 5.90                   | 6.41                             | 10.01               |
| Throughput (samples/s) | 33.90                  | 31.20                            | 19.98               |
| Steps per Second       | 4.24                   | 3.90                             | 2.50                |
| Final Loss             | 0.424                  | 0.619                            | 0.461               |
| Gradient Norm          | 1.35-1.44              | 0.52-0.57                        | 1.51-1.62           |
| Memory Optimization    | Optimizer partitioning | Gradient + optimizer partitioning | None               |

Training Progress Log (ZeRO-1)

Loading tokenizer and model...
`torch_dtype` is deprecated! Use `dtype` instead!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 5581.99 examples/s]
Training 1 epoch...
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 8. Using DeepSpeed's value.
{'loss': 0.5851, 'grad_norm': 1.4390928745269775, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.3501, 'grad_norm': 1.352910041809082, 'learning_rate': 4.8e-05, 'epoch': 0.8}
{'train_runtime': 5.9004, 'train_samples_per_second': 33.896, 'train_steps_per_second': 4.237, 'train_loss': 0.42357421875, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00,  4.24it/s]
Training complete!

Training Progress Log (ZeRO-2)

Loading tokenizer and model...
`torch_dtype` is deprecated! Use `dtype` instead!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 8341.23 examples/s]
Training 1 epoch...
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 8. Using DeepSpeed's value.
{'loss': 0.659, 'grad_norm': 0.5713843703269958, 'learning_rate': 0.00012800000000000002, 'epoch': 0.4}
{'loss': 0.5953, 'grad_norm': 0.5245606899261475, 'learning_rate': 4.8e-05, 'epoch': 0.8}
{'train_runtime': 6.4096, 'train_samples_per_second': 31.203, 'train_steps_per_second': 3.9, 'train_loss': 0.618642578125, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [00:06<00:00,  3.90it/s]
Training complete!
[rank0]:[W1024 10:39:03.109567454 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

ZeRO Performance Benefits

ZeRO-1 vs Standard Accelerate:

  • Faster Training: ~41% faster (5.90s vs 10.01s)
  • Higher Throughput: 70% improvement (33.90 vs 19.98 samples/s)
  • Optimizer State Partitioning: Reduces memory usage for optimizer states

ZeRO-2 vs Standard Accelerate:

  • Faster Training: ~36% faster (6.41s vs 10.01s)
  • Higher Throughput: 56% improvement (31.20 vs 19.98 samples/s)
  • Gradient + Optimizer Partitioning: More aggressive memory optimization

Common Benefits:

  • FP16 Mixed Precision: Enabled for faster computation
  • Gradient Accumulation: DeepSpeed config overrides Accelerate settings
  • Memory Efficiency: Both stages significantly reduce memory usage
  • Consistent Performance: Both ZeRO stages outperform standard training

DeepSpeed Configurations

ZeRO-1 Configuration:

{
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": { "enabled": true },
    "zero_allow_untested_optimizer": true
}

ZeRO-2 Configuration:

{
    "zero_optimization": {
        "stage": 2,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": { "enabled": true },
    "zero_allow_untested_optimizer": true
}

Key Differences:

  • ZeRO-1: Partitions optimizer states across GPUs
  • ZeRO-2: Partitions both gradients and optimizer states
  • ZeRO-2: More memory efficient but slightly slower due to additional communication overhead
  • Both: Use FP16 mixed precision and gradient accumulation
