Differential Privacy experimentations, and implementation of DP-ZeRO from DP-ZeRO by Bu et al. (2023)
All the experimentations and stand-alone implementation for individual noise addition mechanisms are included in the experimentations\ sub-directory.
The DP-ZeRO Trainer aims to implement differential privacy with zero redundancy distributed training, based on the research paper DP-ZeRO.
This implementation addresses the fundamental challenge of combining Differential Privacy (DP) with Fully Sharded Data Parallel (FSDP) training for large language models, providing both memory efficiency and privacy guarantees.
- "Zero redundancy distributed learning with differential privacy" (Bu et al., 2023)
- AWS Labs fast-differential-privacy library
- "Differentially Private Optimization on Large Model at Small Cost" (ICML 2023)
- "Deep Learning with Differential Privacy" (Abadi et al., 2016)
- Book-keeping algorithm for efficient gradient computation
- Mixed ghost norm trick for memory optimization
- Layer-wise clipping for reduced memory overhead
- FSDP compatibility with custom DP implementation
┌────────────────────────────────────────────────────────────┐
│ DP-ZeRO Trainer │
├────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Single Process │ │ Distributed │ │
│ │ │ │ │ │
│ │ • Opacus Engine │ │ • FSDP Sharding │ │
│ │ • Per-sample │ │ • Custom DP │ │
│ │ gradients │ │ • Memory Opt. │ │
│ │ • Better ε calc │ │ • Scalable │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────┤
│ │ Smart Compatibility Layer │
│ │ • Auto-detects FSDP vs single process │
│ │ • Handles Opacus + FSDP incompatibility │
│ │ • Graceful fallbacks and error handling │
│ └─────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────┤
│ │ Core DP Implementation │
│ │ • Gradient clipping (per-sample or global) │
│ │ • Noise injection (calibrated) │
│ │ • Privacy accounting (RDP/simplified) │
│ │ • Mixed precision training (bf16, no loss scaling) │
│ └─────────────────────────────────────────────────────────┘
└────────────────────────────────────────────────────────────┘
@dataclass
class DPConfig:
clip_norm: float = 1.0 # Gradient clipping threshold
noise_multiplier: float = 1.0 # DP noise scale
target_delta: float = 1e-5 # Privacy parameter δ
target_epsilon: float = 8.0 # Privacy parameter ε
clipping_mode: str = "MixOpt" # Mixed optimization (book-keeping)
clipping_style: str = "layer" # Layer-wise clipping
clipping_fn: str = "automatic" # Automatic threshold selectionKey Features:
- MixOpt clipping: Implements the book-keeping algorithm from DP-ZeRO paper
- Layer-wise clipping: Reduces memory overhead compared to per-sample clipping
- Automatic thresholding: Eliminates manual hyperparameter tuning
@dataclass
class TrainerConfig:
global_batch_size: int = 64 # Effective batch size across all GPUs
micro_batch_size: int = 8 # Per-GPU batch size
lr: float = 3e-4 # Learning rate
max_steps: int = 500 # Training steps
log_every: int = 50 # Logging frequency
bf16: bool = True # Use bf16 (recommended for DP)
epochs: int = 3 # Training epochs
sample_size: int = 50000 # Dataset size (for privacy accounting)Key Features:
- Gradient accumulation: Automatically computed based on global vs micro batch sizes
- bf16 precision: No loss scaling (as per DP-ZeRO paper recommendations)
- Privacy-aware batching: Handles batch size constraints for DP
The SimpleDPEngine provides a fallback differential privacy implementation when Opacus is incompatible with FSDP.
def clip_and_accumulate_gradients(self):
"""
Implements the book-keeping algorithm from DP-ZeRO paper
"""
if isinstance(self.model, FSDP):
# FSDP: Use global gradient clipping + noise
total_norm = sqrt(sum(||grad||² for grad in gradients))
if total_norm > clip_norm:
scale_factor = clip_norm / total_norm
gradients *= scale_factor
# Add calibrated noise
for grad in gradients:
noise = Normal(0, noise_multiplier * clip_norm)
grad += noise
else:
# Per-sample gradient clipping (original DP-SGD)
per_sample_norms = [||grad_i|| for each sample i]
clip_coeffs = clamp(clip_norm / per_sample_norms, max=1.0)
clipped_grads = sum(grad_i * clip_coeffs[i] for i in batch)
noise = Normal(0, noise_multiplier * clip_norm)
final_grad = clipped_grads + noisePrivacy Accounting:
def compute_epsilon(self, delta):
"""Simplified RDP epsilon computation"""
q = batch_size / dataset_size # Sampling probability
steps = training_steps
# Simplified bound: ε ≈ q * steps / noise_multiplier²
return min(q * steps / (noise_multiplier ** 2), 50.0)Problem: Opacus requires per-sample gradient computation, but FSDP shards parameters across devices, breaking the per-sample gradient tracking hooks.
Solution:
- Auto-detection: Check if model is FSDP instance
- Smart routing: Use Opacus for single-process, custom DP for FSDP
- Graceful fallback: Handle all failure modes automatically
Problem: Poisson sampling (better privacy) is incompatible with gradient accumulation.
Solution:
needs_accumulation = global_batch_size > (micro_batch_size * world_size)
use_poisson = not needs_accumulation
# Automatically switch between:
# - Poisson sampling: Better privacy amplification
# - Uniform sampling: Compatible with gradient accumulationFrom DP-ZeRO Paper:
- No loss scaling for DP training (causes privacy leakage)
- bf16 throughout instead of fp16 (better numerical stability)
- Direct precision casting instead of autocast
# Traditional approach (problematic for DP):
with torch.amp.autocast():
loss = model(input)
scaled_loss = scaler.scale(loss) # Privacy leak!
# DP-ZeRO approach:
if bf16_enabled:
# Model handles precision internally, no external scaling
loss = model(input.to(torch.bfloat16))from dp_zero_trainer import DPZeROTrainer, DPConfig, TrainerConfig
from transformers import BertForSequenceClassification
from torch.utils.data import DataLoader
# Initialize model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Create dataloader
dataset = YourDataset()
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)
# Configure privacy
dp_config = DPConfig(
clip_norm=1.0,
noise_multiplier=1.1,
target_epsilon=8.0,
target_delta=1e-5
)
# Configure training
trainer_config = TrainerConfig(
global_batch_size=32,
micro_batch_size=8,
max_steps=1000,
sample_size=len(dataset)
)
# Create and run trainer
trainer = DPZeROTrainer(model, dataloader, dp_config, trainer_config)
trainer.train()# Uses Opacus PrivacyEngine (better privacy accounting)
python dp_zero_trainer.pyCharacteristics:
- Precise privacy accounting with RDP
- Per-sample gradient computation
- Memory intensive (requires smaller models or more GPU memory)
# Uses FSDP + Custom DP (memory efficient)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 dp_zero_trainer.pyCharacteristics:
- Memory efficient parameter sharding
- Scalable to multiple GPUs
- FSDP compatibility with custom DP implementation
| Component | Single-Process + Opacus | Distributed + FSDP + Custom DP |
|---|---|---|
| Per-sample gradients | Full computation | Avoided |
| Parameter sharding | No sharding | FSDP sharding |
| Memory scaling | O(batch_size × model_size) | O(model_size / num_gpus) |
| Suitable for | Small models, large memory | Large models, multiple GPUs |
- Single-process: ~2-3x slower than non-DP (due to per-sample gradients)
- Distributed: ~1.3-1.5x slower than non-DP (simplified DP implementation)
- Communication overhead: Minimal (FSDP handles efficiently)
- Opacus (single-process): Precise RDP accounting
- Custom (distributed): Simplified bounds (conservative estimates)
| Parameter | Lower Value | Higher Value | Recommendation |
|---|---|---|---|
clip_norm |
Less utility, more privacy | More utility, less privacy | 0.5-2.0 for NLP |
noise_multiplier |
Less privacy, more utility | More privacy, less utility | 0.8-1.5 typically |
target_epsilon |
Stronger privacy | Weaker privacy | 2-10 for practical use |
batch_size |
Less privacy amplification | More privacy amplification | Larger is better for DP |
# For memory-constrained environments
trainer_config = TrainerConfig(
global_batch_size=16, # Smaller effective batch
micro_batch_size=4, # Smaller per-GPU batch
bf16=True, # Mixed precision
)
dp_config = DPConfig(
clipping_style="layer", # More memory efficient than per-sample
clipping_mode="MixOpt", # Book-keeping algorithm
)- FSDP + Opacus incompatibility: Fundamental PyTorch/Opacus limitation
- Memory requirements: DP training inherently memory-intensive
- Simplified accounting: Custom DP uses conservative privacy bounds
- Model support: Some advanced architectures may need modifications
# For large models - use distributed training
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 script.py
# For memory issues - reduce batch size
TrainerConfig(global_batch_size=16, micro_batch_size=4)
# For better privacy - use parameter-efficient training
# (LoRA, adapters, bias-only fine-tuning)1. CUDA Out of Memory (Single-Process)
torch.OutOfMemoryError: CUDA out of memory
Solution: Use distributed training or smaller model
torchrun --nproc_per_node=2 dp_zero_trainer.py2. FSDP + Opacus Incompatibility
AttributeError: 'Tensor' object has no attribute '_forward_counter'
Solution: Automatically handled - trainer switches to custom DP
3. Sampling + Gradient Accumulation Conflict
ValueError: Poisson sampling is not compatible with grad accumulation
Solution: Automatically switches to uniform sampling