
AWS Resource Request: DeepSpeed ZeRO-2 Architecture & Sharding Validation (1B Proxy) #466

@abi2024

Description

Context

We are blocked on local hardware.
The 1B parameter proxy model requires ~19GB VRAM for a standard ZeRO-2 training step (Weights + Gradients + Optimizer States).
Local workstations (12GB) and Kaggle T4s (16GB/No BF16) are insufficient.

We need a short-term AWS allocation to validate the training harness before scaling to larger models.

Resource Strategy:

Instance: g5.48xlarge (8x A10G, 192GB VRAM).

Storage: 500 GB gp3 EBS Volume (Persistent storage for checkpoints).

Pricing Strategy: Spot Instance

Estimated Cost:

  • Compute: ~$4.50/hr (Spot) × 10.5 Hours ≈ $47.25
  • Storage: 500GB @ $0.08/GB-month (prorated for 1 day) ≈ $1.33
  • Total Cap: $50.00
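The line items above can be sanity-checked with a quick calculation (rates are the estimates from this request; actual spot pricing fluctuates):

```python
# Budget check for the request above. All figures come from this issue;
# the spot rate is an estimate, not a quote.
SPOT_RATE_USD_PER_HR = 4.50       # g5.48xlarge spot estimate
HOURS = 10.5
GP3_USD_PER_GB_MONTH = 0.08
STORAGE_GB = 500
DAYS = 1

compute = SPOT_RATE_USD_PER_HR * HOURS                      # 47.25
storage = STORAGE_GB * GP3_USD_PER_GB_MONTH * (DAYS / 30)   # ~1.33
total = compute + storage                                   # ~48.58, under the $50 cap
print(f"compute=${compute:.2f} storage=${storage:.2f} total=${total:.2f}")
```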

Justification:

We are requesting this $50 allocation to validate DeepSpeed efficiency, which is projected to reduce our final training costs by 40-60%.

  1. Hardware Downscaling (Direct Savings):
    Without DeepSpeed: Standard Data Parallelism (DDP) requires replicating the 70B optimizer state (840GB) on every device, forcing us to use premium A100-80GB clusters ($32/hr).
    With DeepSpeed ZeRO-2: We shard memory across devices. This validation proves we can fit the training loop on cheaper A100-40GB or A10G nodes, effectively halving our hourly infrastructure cost.
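A back-of-envelope memory model makes the comparison concrete. The sketch below assumes bf16 weights and gradients (2 bytes/param each) and fp32 Adam state (12 bytes/param: master copy + momentum + variance); ZeRO-1 shards optimizer states and ZeRO-2 additionally shards gradients:

```python
# Rough per-GPU memory estimate, excluding activations.
# Assumptions: bf16 params/grads (2 B each), fp32 Adam state (12 B/param).
def per_gpu_gb(n_params: float, n_gpus: int, zero_stage: int) -> float:
    params = 2 * n_params        # bf16 weights (replicated under ZeRO-2)
    grads = 2 * n_params         # bf16 gradients
    optim = 12 * n_params        # fp32 Adam states (source of the 840 GB at 70B)
    if zero_stage >= 1:
        optim /= n_gpus          # ZeRO-1: shard optimizer states
    if zero_stage >= 2:
        grads /= n_gpus          # ZeRO-2: also shard gradients
    return (params + grads + optim) / 1e9

print(per_gpu_gb(70e9, 1, 0))    # DDP at 70B: 1120 GB/GPU (840 GB of it optimizer)
print(per_gpu_gb(70e9, 8, 2))    # ZeRO-2 at 70B on 8 GPUs: 262.5 GB/GPU
print(per_gpu_gb(1e9, 8, 2))     # ZeRO-2 at 1B on 8 GPUs: 3.75 GB/GPU
```

The 1B case lands at ~3.75 GB/GPU before activations, consistent with the 4-5 GB target in the validation objectives below.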

  2. Throughput Maximization (Time = Money):
    Inefficient sharding leads to GPU idle time (low MFU). By tuning gradient_accumulation_steps and micro_batch_size on this low-cost instance, we aim to push MFU from a typical ~30% to >45%.
    Impact: Raising MFU from 30% to 45% is a 1.5× throughput gain; since wall-clock time scales inversely with MFU, this cuts total training days by roughly a third, directly saving thousands in cloud credits.
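The arithmetic behind that claim, using the standard approximation that a dense transformer step costs about 6 FLOPs per parameter per token (function names here are ours, for illustration):

```python
# MFU = achieved FLOPs / peak FLOPs, using the common 6 * params * tokens
# approximation for dense transformer training.
def mfu(n_params: float, tokens_per_sec: float,
        peak_tflops_per_gpu: float, n_gpus: int) -> float:
    achieved = 6 * n_params * tokens_per_sec
    peak = peak_tflops_per_gpu * 1e12 * n_gpus
    return achieved / peak

# Wall-clock time scales inversely with MFU:
speedup = 0.45 / 0.30            # 30% -> 45% MFU is a 1.5x throughput gain
print(f"training time reduced to {1/speedup:.0%} of baseline")  # ~67%
```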

  3. Prevention of "Oversizing":

Without this benchmark, we would be forced to over-provision the final cluster to be safe. This run gives us the precise VRAM-per-parameter metrics needed to provision exactly what is needed, and no more.

Validation Objectives

We are running train_shim.py to prove the following technical hypotheses, which cannot be tested locally:

Success Metrics:

  1. DeepSpeed ZeRO-2 Sharding: Confirm that optimizer states are correctly sharded across 8 GPUs via NCCL, with per-GPU VRAM stabilizing at ~4-5GB for the 1B model (vs. 12GB+ unsharded).
  2. BFloat16 Stability on Ampere: Verify that torch.bfloat16 runs without immediate NaN/Inf divergence on native hardware (A10G). Local T4s do not support this datatype.
  3. Ghost Parameter Filtering: Our architecture (MoE proxy) creates empty parameter containers. Verify that our custom filter in train_shim.py prevents DeepSpeed from crashing on these zero-size tensors during the parameter grouping phase.
  4. Reversible Backend Compatibility: Verify that our custom autograd function (reversible_ops_midpoint.py) correctly reconstructs activations during the backward pass in a distributed setting without causing NCCL deadlocks.
  5. Cost-Saving Sharding: Confirm ZeRO-2 reduces the per-GPU memory footprint to its expected theoretical limit (<16GB for the 1.6B proxy), proving scalability to 70B.
  6. MFU Benchmark: Establish a "Tokens/Sec/Dollar" baseline to optimize the final cluster selection.
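The ghost-parameter filter can be sketched as follows. This is a hypothetical minimal version (the actual logic lives in train_shim.py, and the function name here is ours): drop zero-size parameter containers before DeepSpeed builds its optimizer parameter groups.

```python
import torch

# Hypothetical sketch of the "ghost parameter" filter: MoE proxy layers
# register empty placeholder Parameters, which must not reach DeepSpeed's
# parameter-grouping phase.
def trainable_params(model: torch.nn.Module) -> list:
    return [
        p for p in model.parameters()
        if p.requires_grad and p.numel() > 0   # skip zero-size placeholders
    ]

# Usage sketch: pass the filtered list instead of model.parameters(), e.g.
# model_engine, opt, _, _ = deepspeed.initialize(
#     model=model, model_parameters=trainable_params(model), config=ds_config)
```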

Execution Plan

Phase 1: Environment Setup (Hours 0-1)

Deploy train_shim.py and ds_config_zero2.json.

Install dependencies (Ninja, Triton, DeepSpeed).
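For reference, an illustrative ZeRO-2 configuration of the kind ds_config_zero2.json would contain (values here are assumptions for the 1B proxy run, not the actual file from the repo; DeepSpeed also accepts this as a Python dict):

```python
# Illustrative ZeRO-2 config for the 1B proxy run. Keys are standard
# DeepSpeed config options; the specific values are our assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},          # native on A10G (Ampere)
    "zero_optimization": {
        "stage": 2,                     # shard gradients + optimizer states
        "overlap_comm": True,           # overlap reduce with backward pass
        "contiguous_gradients": True,   # reduce fragmentation
    },
    "gradient_clipping": 1.0,
}
```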

Phase 2: Sharding Verification (Hours 1-4)

Run the 1B model with batch_size=32 to saturate the GPUs.

Metric: Log peak allocated memory per rank.

Metric: Monitor nvidia-smi for P2P communication spikes (validating NCCL).
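The per-rank peak-memory metric can be logged with a small helper like the one below (a sketch; helper names are ours, assuming one GPU per rank under torch.distributed):

```python
import torch

def bytes_to_gb(n: int) -> float:
    return n / 1e9

# Sketch of the Phase-2 metric: log each rank's peak allocated VRAM,
# then re-arm the counter for the next phase.
def log_peak_memory(rank: int) -> float:
    gb = bytes_to_gb(torch.cuda.max_memory_allocated())
    print(f"[rank {rank}] peak allocated: {gb:.2f} GB")
    torch.cuda.reset_peak_memory_stats()
    return gb
```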

Phase 3: Throughput Baselining (Hours 4-12)

Ramp up gradient_accumulation_steps to mimic 70B global batch sizes.

Metric: Record Tokens/Sec/GPU.

Metric: Record MFU (Model FLOPs Utilization).
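The Tokens/Sec numbers recorded here feed directly into the "Tokens/Sec/Dollar" baseline from the validation objectives. A minimal sketch of that conversion (the throughput figure below is illustrative, the spot rate is the estimate from this request):

```python
# "Tokens per dollar" baseline: throughput normalized by instance cost.
def tokens_per_dollar(tokens_per_sec: float, usd_per_hr: float) -> float:
    return tokens_per_sec * 3600 / usd_per_hr

# Illustrative: 20k tokens/sec on a $4.50/hr spot instance.
print(f"{tokens_per_dollar(20_000, 4.50):,.0f} tokens per dollar")  # 16,000,000
```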

Exit Criteria

[ ] Training runs for >1000 steps with no OOM.

[ ] Loss curve shows standard convergence (no immediate explosion).

[ ] Throughput metrics recorded for 1B model to extrapolate 70B compute requirements.

[ ] Instance terminated and cost report attached.
