shreyashkar-ml/differential_privacy

Differential privacy experiments, and an implementation of DP-ZeRO from "Zero redundancy distributed learning with differential privacy" (Bu et al., 2023).

All experiments, along with stand-alone implementations of the individual noise-addition mechanisms, are included in the experimentations/ sub-directory.

DP-ZeRO Trainer

The DP-ZeRO Trainer implements differential privacy on top of zero-redundancy distributed training, following the DP-ZeRO paper.

This implementation addresses the fundamental challenge of combining Differential Privacy (DP) with Fully Sharded Data Parallel (FSDP) training for large language models, providing both memory efficiency and privacy guarantees.

Research Background

Key Papers & References

  • "Zero redundancy distributed learning with differential privacy" (Bu et al., 2023)
  • AWS Labs fast-differential-privacy library
  • "Differentially Private Optimization on Large Model at Small Cost" (ICML 2023)
  • "Deep Learning with Differential Privacy" (Abadi et al., 2016)

Core Innovations

  1. Book-keeping algorithm for efficient gradient computation
  2. Mixed ghost norm trick for memory optimization
  3. Layer-wise clipping for reduced memory overhead
  4. FSDP compatibility with custom DP implementation
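To make innovation 3 concrete, here is a minimal, hypothetical sketch of layer-wise clipping: each layer's gradient is clipped against an equal share of the global L2 budget, so no full per-sample gradient vector has to be materialised. This is an illustration, not the repository's implementation.

```python
import torch

def layerwise_clip(per_layer_grads, clip_norm):
    """Clip each layer's gradient to an equal share of the global L2 budget.

    With L layers each clipped to clip_norm / sqrt(L), the concatenated
    gradient is guaranteed to have norm at most clip_norm.
    """
    budget = clip_norm / len(per_layer_grads) ** 0.5
    clipped = []
    for g in per_layer_grads:
        scale = min(1.0, budget / (g.norm().item() + 1e-6))
        clipped.append(g * scale)
    return clipped
```

With two layers and `clip_norm=1.0`, the combined norm of the clipped gradients stays at or below 1.0, whatever the raw gradients were.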

Architecture Overview

┌────────────────────────────────────────────────────────────┐
│                    DP-ZeRO Trainer                         │
├────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐                  │
│  │ Single Process  │  │  Distributed    │                  │
│  │                 │  │                 │                  │
│  │ • Opacus Engine │  │ • FSDP Sharding │                  │
│  │ • Per-sample    │  │ • Custom DP     │                  │
│  │   gradients     │  │ • Memory Opt.   │                  │
│  │ • Better ε calc │  │ • Scalable      │                  │
│  └─────────────────┘  └─────────────────┘                  │
│           │                     │                          │
│           ▼                     ▼                          │
│  ┌─────────────────────────────────────────────────────────┤
│  │           Smart Compatibility Layer                     │
│  │  • Auto-detects FSDP vs single process                  │
│  │  • Handles Opacus + FSDP incompatibility                │
│  │  • Graceful fallbacks and error handling                │
│  └─────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────────┤
│  │              Core DP Implementation                     │
│  │  • Gradient clipping (per-sample or global)             │
│  │  • Noise injection (calibrated)                         │
│  │  • Privacy accounting (RDP/simplified)                  │
│  │  • Mixed precision training (bf16, no loss scaling)     │
│  └─────────────────────────────────────────────────────────┘
└────────────────────────────────────────────────────────────┘

Core Components

1. DPConfig - Privacy Configuration

from dataclasses import dataclass

@dataclass
class DPConfig:
    clip_norm: float = 1.0                    # Gradient clipping threshold
    noise_multiplier: float = 1.0             # DP noise scale
    target_delta: float = 1e-5                # Privacy parameter δ
    target_epsilon: float = 8.0               # Privacy parameter ε
    clipping_mode: str = "MixOpt"             # Mixed optimization (book-keeping)
    clipping_style: str = "layer"             # Layer-wise clipping
    clipping_fn: str = "automatic"            # Automatic threshold selection

Key Features:

  • MixOpt clipping: Implements the book-keeping algorithm from DP-ZeRO paper
  • Layer-wise clipping: Reduces memory overhead compared to per-sample clipping
  • Automatic thresholding: Eliminates manual hyperparameter tuning

2. TrainerConfig - Training Configuration

@dataclass
class TrainerConfig:
    global_batch_size: int = 64               # Effective batch size across all GPUs
    micro_batch_size: int = 8                 # Per-GPU batch size
    lr: float = 3e-4                          # Learning rate
    max_steps: int = 500                      # Training steps
    log_every: int = 50                       # Logging frequency
    bf16: bool = True                         # Use bf16 (recommended for DP)
    epochs: int = 3                           # Training epochs
    sample_size: int = 50000                  # Dataset size (for privacy accounting)

Key Features:

  • Gradient accumulation: Automatically computed based on global vs micro batch sizes
  • bf16 precision: No loss scaling (as per DP-ZeRO paper recommendations)
  • Privacy-aware batching: Handles batch size constraints for DP
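The accumulation count implied by the first bullet can be written out directly; a small sketch (the function name is illustrative, and `world_size` comes from the process group):

```python
def accumulation_steps(global_batch_size: int, micro_batch_size: int, world_size: int) -> int:
    """Number of micro-batches accumulated per optimizer step."""
    per_step = micro_batch_size * world_size
    assert global_batch_size % per_step == 0, "global batch must divide evenly"
    return global_batch_size // per_step
```

With the defaults above on 2 GPUs, `accumulation_steps(64, 8, 2)` returns 4.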

3. SimpleDPEngine - Custom DP Implementation

The SimpleDPEngine provides a fallback differential privacy implementation when Opacus is incompatible with FSDP.

Core Algorithm (Per-Sample Gradient Clipping):

def clip_and_accumulate_gradients(self):
    """
    Implements the book-keeping algorithm from the DP-ZeRO paper.
    """
    cfg = self.dp_config  # DPConfig instance (attribute name illustrative)
    std = cfg.noise_multiplier * cfg.clip_norm

    if isinstance(self.model, FSDP):
        # FSDP: global gradient clipping + noise on the sharded gradients
        grads = [p.grad for p in self.model.parameters() if p.grad is not None]
        total_norm = torch.sqrt(sum(g.norm() ** 2 for g in grads))
        if total_norm > cfg.clip_norm:
            scale = cfg.clip_norm / total_norm
            for g in grads:
                g.mul_(scale)

        # Add calibrated Gaussian noise to every gradient coordinate
        for g in grads:
            g.add_(torch.randn_like(g) * std)
    else:
        # Per-sample gradient clipping (original DP-SGD);
        # per-sample grads stacked on dim 0 (attribute name illustrative)
        per_sample_norms = self.per_sample_grads.flatten(1).norm(dim=1)
        clip_coeffs = (cfg.clip_norm / (per_sample_norms + 1e-6)).clamp(max=1.0)

        clipped_grads = torch.einsum("b,b...->...", clip_coeffs, self.per_sample_grads)
        final_grad = clipped_grads + torch.randn_like(clipped_grads) * std

Privacy Accounting:

def compute_epsilon(self, delta):
    """Simplified RDP-style epsilon estimate (conservative)."""
    q = self.batch_size / self.dataset_size   # Sampling probability
    steps = self.training_steps
    # Simplified bound: ε ≈ q * steps / noise_multiplier², capped at 50
    return min(q * steps / (self.noise_multiplier ** 2), 50.0)
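Plugging the default configuration values into this simplified bound gives a quick sanity check; a standalone version of the formula (illustrative, mirroring the method above):

```python
def simple_epsilon(batch_size, dataset_size, steps, noise_multiplier):
    q = batch_size / dataset_size               # sampling probability
    return min(q * steps / noise_multiplier ** 2, 50.0)

# Defaults: batch 64, 50,000 samples, 500 steps, noise multiplier 1.0
eps = simple_epsilon(64, 50_000, 500, 1.0)      # q = 0.00128, eps = 0.64
```

Note how quickly the cap matters: dropping the noise multiplier to 0.1 pushes the raw bound to 64, which the function clamps to 50.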

Key Technical Innovations

1. FSDP + Opacus Incompatibility Resolution

Problem: Opacus requires per-sample gradient computation, but FSDP shards parameters across devices, breaking the per-sample gradient tracking hooks.

Solution:

  • Auto-detection: Check if model is FSDP instance
  • Smart routing: Use Opacus for single-process, custom DP for FSDP
  • Graceful fallback: Handle all failure modes automatically
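The detection step reduces to an `isinstance` check; a hypothetical sketch of the routing (function name is illustrative):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def pick_dp_backend(model: nn.Module) -> str:
    """Route to Opacus for single-process models, custom DP under FSDP."""
    if isinstance(model, FSDP):
        return "custom_dp"   # Opacus hooks break under parameter sharding
    return "opacus"
```

A plain, unwrapped module therefore gets the Opacus path by default.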

2. Sampling Strategy Optimization

Problem: Poisson sampling (better privacy) is incompatible with gradient accumulation.

Solution:

needs_accumulation = global_batch_size > (micro_batch_size * world_size)
use_poisson = not needs_accumulation

# Automatically switch between:
# - Poisson sampling: Better privacy amplification
# - Uniform sampling: Compatible with gradient accumulation

3. Memory-Optimized Mixed Precision

From DP-ZeRO Paper:

  • No loss scaling for DP training (causes privacy leakage)
  • bf16 throughout instead of fp16 (better numerical stability)
  • Direct precision casting instead of autocast
# Traditional approach (problematic for DP):
scaler = torch.amp.GradScaler()
with torch.amp.autocast("cuda"):
    loss = model(input)
scaled_loss = scaler.scale(loss)  # Privacy leak!

# DP-ZeRO approach:
if bf16_enabled:
    # Model handles precision internally, no external scaling
    loss = model(input.to(torch.bfloat16))

Usage Examples

Basic Usage

from dp_zero_trainer import DPZeROTrainer, DPConfig, TrainerConfig
from transformers import BertForSequenceClassification
from torch.utils.data import DataLoader

# Initialize model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Create dataloader
dataset = YourDataset()
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

# Configure privacy
dp_config = DPConfig(
    clip_norm=1.0,
    noise_multiplier=1.1,
    target_epsilon=8.0,
    target_delta=1e-5
)

# Configure training
trainer_config = TrainerConfig(
    global_batch_size=32,
    micro_batch_size=8,
    max_steps=1000,
    sample_size=len(dataset)
)

# Create and run trainer
trainer = DPZeROTrainer(model, dataloader, dp_config, trainer_config)
trainer.train()

Single-Process Training (Smaller Models)

# Uses Opacus PrivacyEngine (better privacy accounting)
python dp_zero_trainer.py

Characteristics:

  • Precise privacy accounting with RDP
  • Per-sample gradient computation
  • Memory intensive (requires smaller models or more GPU memory)

Distributed Training (Large Models)

# Uses FSDP + Custom DP (memory efficient)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 dp_zero_trainer.py

Characteristics:

  • Memory efficient parameter sharding
  • Scalable to multiple GPUs
  • FSDP compatibility with custom DP implementation

Performance Characteristics

Memory Usage

| Component | Single-Process + Opacus | Distributed + FSDP + Custom DP |
|---|---|---|
| Per-sample gradients | Full computation | Avoided |
| Parameter sharding | No sharding | FSDP sharding |
| Memory scaling | O(batch_size × model_size) | O(model_size / num_gpus) |
| Suitable for | Small models, large memory | Large models, multiple GPUs |

Training Speed

  • Single-process: ~2-3x slower than non-DP (due to per-sample gradients)
  • Distributed: ~1.3-1.5x slower than non-DP (simplified DP implementation)
  • Communication overhead: Minimal (FSDP handles efficiently)

Privacy Accounting Accuracy

  • Opacus (single-process): Precise RDP accounting
  • Custom (distributed): Simplified bounds (conservative estimates)

Configuration Deep Dive

Privacy-Performance Trade-offs

| Parameter | Lower value | Higher value | Recommendation |
|---|---|---|---|
| clip_norm | Less utility, more privacy | More utility, less privacy | 0.5-2.0 for NLP |
| noise_multiplier | Less privacy, more utility | More privacy, less utility | 0.8-1.5 typically |
| target_epsilon | Stronger privacy | Weaker privacy | 2-10 for practical use |
| batch_size | Less privacy amplification | More privacy amplification | Larger is better for DP |
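Under the simplified bound used by the custom accountant (ε ≈ q·steps/σ²), the trade-off between `noise_multiplier` and `target_epsilon` can be inverted to pick the smallest noise multiplier meeting a target ε; a rough sketch, valid only for that simplified bound:

```python
def noise_for_target_epsilon(target_eps, batch_size, dataset_size, steps):
    """Invert eps = q * steps / sigma**2 for sigma (simplified bound only)."""
    q = batch_size / dataset_size
    return (q * steps / target_eps) ** 0.5
```

For the defaults (q = 0.00128, 500 steps) and ε = 8, this gives σ ≈ 0.283; a proper RDP accountant would give a different (usually larger) value.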

Memory Optimization Settings

# For memory-constrained environments
trainer_config = TrainerConfig(
    global_batch_size=16,      # Smaller effective batch
    micro_batch_size=4,        # Smaller per-GPU batch
    bf16=True,                 # Mixed precision
)

dp_config = DPConfig(
    clipping_style="layer",    # More memory efficient than per-sample
    clipping_mode="MixOpt",    # Book-keeping algorithm
)

Limitations & Trade-offs

Current Limitations

  1. FSDP + Opacus incompatibility: Fundamental PyTorch/Opacus limitation
  2. Memory requirements: DP training inherently memory-intensive
  3. Simplified accounting: Custom DP uses conservative privacy bounds
  4. Model support: Some advanced architectures may need modifications

Recommended Mitigations

# For large models - use distributed training
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 script.py

# For memory issues - reduce batch size
TrainerConfig(global_batch_size=16, micro_batch_size=4)

# For better privacy - use parameter-efficient training
# (LoRA, adapters, bias-only fine-tuning)
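Bias-only fine-tuning, one of the parameter-efficient options mentioned above, can be sketched in a few lines; the helper name is hypothetical:

```python
import torch.nn as nn

def freeze_all_but_biases(model: nn.Module) -> int:
    """Leave only bias terms trainable, shrinking the set of
    parameters that receive DP clipping and noise."""
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
        if param.requires_grad:
            trainable += param.numel()
    return trainable
```

Fewer trainable parameters mean a lower-dimensional gradient to clip and noise, which typically improves the privacy-utility trade-off.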

Troubleshooting

Common Issues

1. CUDA Out of Memory (Single-Process)

torch.OutOfMemoryError: CUDA out of memory

Solution: Use distributed training or smaller model

torchrun --nproc_per_node=2 dp_zero_trainer.py

2. FSDP + Opacus Incompatibility

AttributeError: 'Tensor' object has no attribute '_forward_counter'

Solution: Automatically handled - trainer switches to custom DP

3. Sampling + Gradient Accumulation Conflict

ValueError: Poisson sampling is not compatible with grad accumulation

Solution: Automatically switches to uniform sampling
