Description
Context
We are blocked on local hardware.
The 1B parameter proxy model requires ~19GB VRAM for a standard ZeRO-2 training step (Weights + Gradients + Optimizer States).
Local workstations (12 GB) and Kaggle T4s (16 GB, no native BF16 support) are insufficient.
We need a short-term AWS allocation to validate the training harness before scaling to larger models.
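The ~19 GB figure can be reproduced with a back-of-the-envelope calculation. A sketch, assuming bf16 weights/gradients and fp32 Adam states; the gap between the 16 GB of model states and the observed ~19 GB is activations plus framework overhead:

```python
def zero2_step_memory_gb(n_params: float, n_gpus: int = 1) -> float:
    """Rough per-GPU model-state memory for one training step.

    Assumes bf16 weights and gradients (2 bytes each) and fp32 Adam
    states (master copy + momentum + variance, 12 bytes/param).
    Under ZeRO-2, gradients and optimizer states are partitioned
    across ranks; weights stay replicated on every rank.
    """
    weights = 2 * n_params              # bf16 weights, replicated
    grads = 2 * n_params / n_gpus       # bf16 gradients, partitioned (stage 2)
    optim = 12 * n_params / n_gpus      # fp32 Adam states, partitioned
    return (weights + grads + optim) / 1e9

print(zero2_step_memory_gb(1e9, 1))  # 16.0 GB of model states on one GPU
print(zero2_step_memory_gb(1e9, 8))  # 3.75 GB per rank when sharded across 8 GPUs
```

The 8-GPU figure is consistent with the ~4-5 GB per-rank target used as a success metric later in this issue.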
Resource Strategy:
Instance: g5.48xlarge (8x A10G, 192GB VRAM).
Storage: 500 GB gp3 EBS Volume (Persistent storage for checkpoints).
Pricing Strategy: Spot Instance
Estimated Cost:
- Compute: ~$4.50/hr (Spot) × 10.5 Hours ≈ $47.25
- Storage: 500GB @ $0.08/GB-month (prorated for 1 day) ≈ $1.33
- Total Cap: $50.00
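A quick arithmetic check that the estimate stays under the cap (rates copied from the figures above):

```python
spot_rate = 4.50       # $/hr, g5.48xlarge Spot estimate from this request
hours = 10.5
gp3_rate = 0.08        # $/GB-month for gp3 EBS
storage_gb = 500

compute = spot_rate * hours                 # $47.25
storage = storage_gb * gp3_rate * (1 / 30)  # one day, prorated: ~$1.33
total = compute + storage
print(f"compute=${compute:.2f} storage=${storage:.2f} total=${total:.2f}")
```

Total lands around $48.58, leaving a small buffer under the $50.00 cap.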
Justification:
We are requesting this $50 allocation to validate DeepSpeed efficiency, which is projected to reduce our final training costs by 40-60%.
- Hardware Downscaling (Direct Savings): Without DeepSpeed, standard Data Parallelism (DDP) replicates the 70B model's optimizer state (~840 GB) on every device, forcing us onto premium A100-80GB clusters (~$32/hr). With DeepSpeed ZeRO-2 we shard that memory across devices; this validation proves we can fit the training loop on cheaper A100-40GB or A10G nodes, roughly halving our hourly infrastructure cost.
- Throughput Maximization (Time = Money): Inefficient sharding leaves GPUs idle (low MFU). By tuning gradient_accumulation_steps and micro_batch_size on this low-cost instance, we aim to push MFU from the standard ~30% to >45%. Impact: wall-clock time scales inversely with MFU, so moving from 30% to 45% cuts total training days by roughly a third, directly saving thousands in cloud credits.
- Prevention of "Oversizing": Without this benchmark, we would be forced to over-provision the final cluster to be safe. This run gives us the precise VRAM-per-parameter metrics needed to provision exactly what is required, and no more.
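The MFU impact above can be sanity-checked with the standard ≈6·N·D FLOPs approximation for dense transformers. A sketch; the GPU count, token budget, and the 312 TFLOPS A100 bf16 peak are illustrative assumptions, not plan parameters:

```python
def training_days(n_params, n_tokens, peak_flops_per_gpu, n_gpus, mfu):
    """Wall-clock days for a run, using total FLOPs ~= 6 * params * tokens."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (peak_flops_per_gpu * n_gpus * mfu)
    return seconds / 86400

# Illustrative 70B run: 1T tokens on 512 A100s (312 TFLOPS bf16 peak)
base = training_days(70e9, 1e12, 312e12, 512, 0.30)
tuned = training_days(70e9, 1e12, 312e12, 512, 0.45)
print(f"{base:.0f} days at 30% MFU vs {tuned:.0f} days at 45% MFU")
```

Whatever the absolute numbers, the ratio is fixed: 30% → 45% MFU means the tuned run takes 2/3 of the baseline wall-clock time.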
Validation Objectives
We are running train_shim.py to prove six specific technical hypotheses that cannot be tested locally:
- DeepSpeed ZeRO-2 Sharding: Confirm that optimizer states are correctly sharded across 8 GPUs via NCCL. Success metric: per-GPU VRAM usage stabilizes at ~4-5 GB for the 1B model (vs. 12 GB+ unsharded).
- BFloat16 Stability on Ampere: Verify that torch.bfloat16 runs without immediate NaN/Inf divergence on native hardware (A10G). Local T4s do not support this datatype.
- Ghost Parameter Filtering: Our architecture (MoE proxy) creates empty parameter containers. We need to verify our custom filter in train_shim.py prevents DeepSpeed from crashing on these zero-size tensors during the parameter grouping phase.
- Reversible Backend Compatibility: Verify that our custom autograd function (reversible_ops_midpoint.py) correctly reconstructs activations during the backward pass in a distributed setting without causing NCCL deadlocks.
- Validate Cost-Saving Sharding: Confirm ZeRO-2 reduces per-GPU memory footprint to expected theoretical limits (<16GB for 1.6B proxy), proving scalability for 70B.
- Benchmark MFU: Establish a "Tokens/Sec/Dollar" baseline to optimize the final cluster selection.
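The ghost-parameter filter itself lives in train_shim.py, but the logic under test is simple: drop zero-size parameter containers before DeepSpeed builds its parameter groups. A minimal sketch of that logic; FakeParam is a stand-in for torch.nn.Parameter (the real filter walks model.named_parameters()):

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    """Stand-in for torch.nn.Parameter: just a name and an element count."""
    name: str
    numel: int
    requires_grad: bool = True

def filter_ghost_params(params):
    """Split parameters into (kept, ghosts).

    Empty containers created by the MoE proxy have numel == 0 and can
    crash DeepSpeed's parameter grouping, so they are dropped up front.
    """
    kept, ghosts = [], []
    for p in params:
        (kept if p.requires_grad and p.numel > 0 else ghosts).append(p)
    return kept, ghosts

params = [FakeParam("embed.weight", 1_000), FakeParam("moe.empty_slot", 0)]
kept, ghosts = filter_ghost_params(params)
print([p.name for p in kept], [p.name for p in ghosts])
# kept: ['embed.weight']  ghosts: ['moe.empty_slot']
```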
Execution Plan
Phase 1: Environment Setup (Hours 0-1)
Deploy train_shim.py and ds_config_zero2.json.
Install dependencies (Ninja, Triton, DeepSpeed).
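For reviewers' reference, a plausible shape for ds_config_zero2.json. This is a sketch using DeepSpeed's documented config keys, not the actual file; the batch and clipping values are placeholders:

```python
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},          # native BF16 on A10G (Ampere)
    "zero_optimization": {
        "stage": 2,                      # shard optimizer states + gradients
        "overlap_comm": True,            # overlap reduction with backward pass
        "contiguous_gradients": True,    # reduce fragmentation
    },
    "gradient_clipping": 1.0,
}
print(json.dumps(ds_config, indent=2))
```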
Phase 2: Sharding Verification (Hours 1-4)
Run the 1B model with batch_size=32 to saturate the GPUs.
Metric: Log peak allocated memory per rank.
Metric: Monitor nvidia-smi for P2P communication spikes (validating NCCL).
Phase 3: Throughput Baselining (Hours 4-10.5)
Ramp up gradient_accumulation_steps to mimic 70B global batch sizes.
Metric: Record Tokens/Sec/GPU.
Metric: Record MFU (Model FLOPs Utilization).
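The "Tokens/Sec/Dollar" baseline mentioned in the objectives reduces to one formula. A sketch with hypothetical throughput and pricing numbers (the real values come out of this run):

```python
def tokens_per_dollar(tokens_per_sec_per_gpu, n_gpus, hourly_cost):
    """Throughput-per-cost baseline used to compare cluster options."""
    cluster_tps = tokens_per_sec_per_gpu * n_gpus
    return cluster_tps * 3600 / hourly_cost

# Hypothetical comparison: 8x A10G spot vs 8x A100 on-demand
a10g = tokens_per_dollar(2_500, 8, 4.50)
a100 = tokens_per_dollar(9_000, 8, 32.00)
print(f"A10G spot: {a10g:,.0f} tok/$  |  A100: {a100:,.0f} tok/$")
```

With these placeholder numbers the A10G cluster wins on tokens per dollar despite its lower raw throughput, which is exactly the trade-off this benchmark quantifies.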
Exit Criteria
[ ] Training runs for >1000 steps with no OOM.
[ ] Loss curve shows standard convergence (no immediate explosion).
[ ] Throughput metrics recorded for 1B model to extrapolate 70B compute requirements.
[ ] Instance terminated and cost report attached.