
Implement Native PyTorch AMP Support for Improved Training Efficiency #151

@HisStar

Description

Context

Mixed-precision training (FP16/BF16) offers significant speedups while reducing memory usage. The repository currently has a placeholder use_amp flag, but the implementation is incomplete.

Detailed Analysis

  1. The --use_amp flag exists in run.py but appears unused in experiment files
  2. Modern NVIDIA GPUs (Volta, Turing, Ampere architectures) provide substantial speedups with mixed precision
  3. Implementation should leverage PyTorch's native torch.cuda.amp module
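Regarding item 2, one quick way to gate AMP on hardware support is to check the device's compute capability (Volta is 7.0, Turing 7.5, Ampere 8.0). A minimal sketch; the helper name is illustrative and not part of the repo:

import torch

# Hypothetical helper: pick an AMP dtype based on the GPU's compute capability.
# Pre-Volta GPUs lack Tensor Cores and see little benefit from mixed precision.
def pick_amp_dtype():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:    # Ampere and newer: native BF16 support
        return torch.bfloat16
    if major >= 7:    # Volta/Turing: FP16 Tensor Cores
        return torch.float16
    return None       # Fall back to full precision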

Implementation Recommendation

from torch.cuda.amp import autocast, GradScaler

# Initialize the gradient scaler once at the beginning of training
scaler = GradScaler() if args.use_amp else None

# In the training loop
optimizer.zero_grad()

# Run the forward pass and loss under autocast; with enabled=False this is
# a no-op, so the same code path serves both full- and mixed-precision runs
with autocast(enabled=args.use_amp):
    outputs = model(batch_x, batch_y, batch_x_mark, batch_y_mark)
    loss = criterion(outputs, batch_y)

if args.use_amp:
    # Scale the loss to prevent FP16 gradient underflow, then step the
    # optimizer and update the scale factor
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
else:
    loss.backward()
    optimizer.step()
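
On Ampere-class GPUs, BF16 is worth considering as an alternative to FP16: it shares FP32's exponent range, so a GradScaler is unnecessary. A minimal sketch, assuming a hypothetical use_bf16 flag (not an existing flag in run.py):

import torch
from torch.cuda.amp import autocast

# BF16 autocast path; no loss scaling is needed because BF16 does not
# underflow the way FP16 can.
optimizer.zero_grad()
with autocast(dtype=torch.bfloat16, enabled=args.use_bf16):
    outputs = model(batch_x, batch_y, batch_x_mark, batch_y_mark)
    loss = criterion(outputs, batch_y)

loss.backward()
optimizer.step()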

Expected Outcomes

  • 1.5-3x training speedup, depending on model size and GPU (a quick way to measure this locally is sketched below)
  • Up to ~50% memory reduction, enabling larger models or batch sizes
  • Minimal to no impact on final model accuracy
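
These figures will vary by model and hardware; a rough way to check them is to time one epoch and record peak GPU memory with and without --use_amp. A sketch; train_one_epoch is a hypothetical wrapper around the loop above:

import time
import torch

# Time one epoch and record peak GPU memory so AMP and FP32 runs can be
# compared directly.
torch.cuda.reset_peak_memory_stats()
start = time.time()
train_one_epoch(model, train_loader, use_amp=True)
print(f"epoch time: {time.time() - start:.1f}s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")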

Respectfully submitted,
Quality Assurance Team
