Description
Context
We are blocked on local hardware.
The 1B parameter proxy model requires ~19GB VRAM for a standard ZeRO-2 training step (Weights + Gradients + Optimizer States).
Local workstations (12 GB) and Kaggle T4s (16 GB, no native BF16 support) are insufficient.
We need a short-term AWS allocation to validate the training harness before scaling to larger models.
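The ~19 GB figure can be reproduced with a back-of-the-envelope calculation. A sketch, assuming bf16 weights/gradients and fp32 Adam states; the gap between the 16 GB of model states and the observed ~19 GB is activations plus framework overhead:

```python
def zero2_step_memory_gb(n_params: float, n_gpus: int = 1) -> float:
    """Rough per-GPU model-state memory for one training step.

    Assumes bf16 weights and gradients (2 bytes each) and fp32 Adam
    states (master copy + momentum + variance, 12 bytes/param).
    Under ZeRO-2, gradients and optimizer states are partitioned
    across ranks; weights stay replicated on every rank.
    """
    weights = 2 * n_params              # bf16 weights, replicated
    grads = 2 * n_params / n_gpus       # bf16 gradients, partitioned (stage 2)
    optim = 12 * n_params / n_gpus      # fp32 Adam states, partitioned
    return (weights + grads + optim) / 1e9

print(zero2_step_memory_gb(1e9, 1))  # 16.0 GB of model states on one GPU
print(zero2_step_memory_gb(1e9, 8))  # 3.75 GB per rank when sharded across 8 GPUs
```

The 8-GPU figure is consistent with the ~4-5 GB per-rank target used as a success metric later in this issue.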
Resource Strategy:
Instance: g5.48xlarge (8x A10G, 192GB VRAM).
Storage: 500 GB gp3 EBS Volume (Persistent storage for checkpoints).
Pricing Strategy: Spot Instance
Estimated Cost:
- Compute: ~$4.50/hr (Spot) × 10.5 Hours ≈ $47.25
- Storage: 500GB @ $0.08/GB-month (prorated for 1 day) ≈ $1.33
- Total Cap: $50.00
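A quick arithmetic check that the estimate stays under the cap (rates copied from the figures above):

```python
spot_rate = 4.50       # $/hr, g5.48xlarge Spot estimate from this request
hours = 10.5
gp3_rate = 0.08        # $/GB-month for gp3 EBS
storage_gb = 500

compute = spot_rate * hours                 # $47.25
storage = storage_gb * gp3_rate * (1 / 30)  # one day, prorated: ~$1.33
total = compute + storage
print(f"compute=${compute:.2f} storage=${storage:.2f} total=${total:.2f}")
```

Total lands around $48.58, leaving a small buffer under the $50.00 cap.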
Justification:
We are requesting this $50 allocation to validate DeepSpeed efficiency, which is projected to reduce our final training costs by 40-60%.
- Hardware Downscaling (Direct Savings): Without DeepSpeed, standard Data Parallelism (DDP) replicates the 70B model's optimizer state (~840 GB) on every device, forcing us onto premium A100-80GB clusters (~$32/hr). With DeepSpeed ZeRO-2 we shard that memory across devices; this validation proves we can fit the training loop on cheaper A100-40GB or A10G nodes, roughly halving our hourly infrastructure cost.
- Throughput Maximization (Time = Money): Inefficient sharding leaves GPUs idle (low MFU). By tuning gradient_accumulation_steps and micro_batch_size on this low-cost instance, we aim to push MFU from the standard ~30% to >45%. Impact: wall-clock time scales inversely with MFU, so moving from 30% to 45% cuts total training days by roughly a third, directly saving thousands in cloud credits.
- Prevention of "Oversizing": Without this benchmark, we would be forced to over-provision the final cluster to be safe. This run gives us the precise VRAM-per-parameter metrics needed to provision exactly what is required, and no more.
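The MFU impact above can be sanity-checked with the standard ≈6·N·D FLOPs approximation for dense transformers. A sketch; the GPU count, token budget, and the 312 TFLOPS A100 bf16 peak are illustrative assumptions, not plan parameters:

```python
def training_days(n_params, n_tokens, peak_flops_per_gpu, n_gpus, mfu):
    """Wall-clock days for a run, using total FLOPs ~= 6 * params * tokens."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (peak_flops_per_gpu * n_gpus * mfu)
    return seconds / 86400

# Illustrative 70B run: 1T tokens on 512 A100s (312 TFLOPS bf16 peak)
base = training_days(70e9, 1e12, 312e12, 512, 0.30)
tuned = training_days(70e9, 1e12, 312e12, 512, 0.45)
print(f"{base:.0f} days at 30% MFU vs {tuned:.0f} days at 45% MFU")
```

Whatever the absolute numbers, the ratio is fixed: 30% → 45% MFU means the tuned run takes 2/3 of the baseline wall-clock time.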
Validation Objectives
We are running train_shim.py to prove six specific technical hypotheses that cannot be tested locally:
- DeepSpeed ZeRO-2 Sharding: Confirm that optimizer states are correctly sharded across 8 GPUs via NCCL. Success metric: per-GPU VRAM usage stabilizes at ~4-5 GB for the 1B model (vs. 12 GB+ unsharded).
- BFloat16 Stability on Ampere: Verify that torch.bfloat16 runs without immediate NaN/Inf divergence on native hardware (A10G). Local T4s do not support this datatype.
- Ghost Parameter Filtering: Our architecture (MoE proxy) creates empty parameter containers. We need to verify our custom filter in train_shim.py prevents DeepSpeed from crashing on these zero-size tensors during the parameter grouping phase.
- Reversible Backend Compatibility: Verify that our custom autograd function (reversible_ops_midpoint.py) correctly reconstructs activations during the backward pass in a distributed setting without causing NCCL deadlocks.
- Validate Cost-Saving Sharding: Confirm ZeRO-2 reduces per-GPU memory footprint to expected theoretical limits (<16GB for 1.6B proxy), proving scalability for 70B.
- Benchmark MFU: Establish a "Tokens/Sec/Dollar" baseline to optimize the final cluster selection.
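The ghost-parameter filter itself lives in train_shim.py, but the logic under test is simple: drop zero-size parameter containers before DeepSpeed builds its parameter groups. A minimal sketch of that logic; FakeParam is a stand-in for torch.nn.Parameter (the real filter walks model.named_parameters()):

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    """Stand-in for torch.nn.Parameter: just a name and an element count."""
    name: str
    numel: int
    requires_grad: bool = True

def filter_ghost_params(params):
    """Split parameters into (kept, ghosts).

    Empty containers created by the MoE proxy have numel == 0 and can
    crash DeepSpeed's parameter grouping, so they are dropped up front.
    """
    kept, ghosts = [], []
    for p in params:
        (kept if p.requires_grad and p.numel > 0 else ghosts).append(p)
    return kept, ghosts

params = [FakeParam("embed.weight", 1_000), FakeParam("moe.empty_slot", 0)]
kept, ghosts = filter_ghost_params(params)
print([p.name for p in kept], [p.name for p in ghosts])
# kept: ['embed.weight']  ghosts: ['moe.empty_slot']
```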
Execution Plan
Phase 1: Environment Setup (Hours 0-1)
Deploy train_shim.py and ds_config_zero2.json.
Install dependencies (Ninja, Triton, DeepSpeed).
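For reviewers' reference, a plausible shape for ds_config_zero2.json. This is a sketch using DeepSpeed's documented config keys, not the actual file; the batch and clipping values are placeholders:

```python
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},          # native BF16 on A10G (Ampere)
    "zero_optimization": {
        "stage": 2,                      # shard optimizer states + gradients
        "overlap_comm": True,            # overlap reduction with backward pass
        "contiguous_gradients": True,    # reduce fragmentation
    },
    "gradient_clipping": 1.0,
}
print(json.dumps(ds_config, indent=2))
```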
Phase 2: Sharding Verification (Hours 1-4)
Run the 1B model with batch_size=32 to saturate the GPUs.
Metric: Log peak allocated memory per rank.
Metric: Monitor nvidia-smi for P2P communication spikes (validating NCCL).
Phase 3: Throughput Baselining (Hours 4-10.5)
Ramp up gradient_accumulation_steps to mimic 70B global batch sizes.
Metric: Record Tokens/Sec/GPU.
Metric: Record MFU (Model FLOPs Utilization).
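The "Tokens/Sec/Dollar" baseline mentioned in the objectives reduces to one formula. A sketch with hypothetical throughput and pricing numbers (the real values come out of this run):

```python
def tokens_per_dollar(tokens_per_sec_per_gpu, n_gpus, hourly_cost):
    """Throughput-per-cost baseline used to compare cluster options."""
    cluster_tps = tokens_per_sec_per_gpu * n_gpus
    return cluster_tps * 3600 / hourly_cost

# Hypothetical comparison: 8x A10G spot vs 8x A100 on-demand
a10g = tokens_per_dollar(2_500, 8, 4.50)
a100 = tokens_per_dollar(9_000, 8, 32.00)
print(f"A10G spot: {a10g:,.0f} tok/$  |  A100: {a100:,.0f} tok/$")
```

With these placeholder numbers the A10G cluster wins on tokens per dollar despite its lower raw throughput, which is exactly the trade-off this benchmark quantifies.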
Exit Criteria
[ ] Training runs for >1000 steps with no OOM.
[ ] Loss curve shows standard convergence (no immediate explosion).
[ ] Throughput metrics recorded for 1B model to extrapolate 70B compute requirements.
[ ] Instance terminated and cost report attached.