Commit ad754e2

serithemage and claude committed
fix: replace symlinks with actual files in sagemaker-spot-training references
Symlinks pointed to absolute local paths, breaking CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7e25e29 commit ad754e2

3 files changed

Lines changed: 404 additions & 3 deletions

plugins/development/skills/sagemaker-spot-training/references/gpu-cost-analysis.md

Lines changed: 0 additions & 1 deletion
This file was deleted.
Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
# GPU Instance Cost Analysis: P5 vs P6 for Autoresearch

> Comparing H100 (P5), B200 (P6-B200), and B300 (P6-B300) for single-GPU ML experiment workloads on SageMaker Spot Training.

---

## 1. GPU Performance Specifications

| GPU | Architecture | BF16 TFLOPS | VRAM | Memory BW | vs H100 |
|-----|-------------|-------------|------|-----------|---------|
| **H100** | Hopper | 990 | 80 GB | 3,350 GB/s | 1.0x |
| **B200** | Blackwell | 2,250 | 180 GB | 8,000 GB/s | **2.27x** |
| **B300** | Blackwell Ultra | ~3,375 | 288 GB | 8,000 GB/s | **~3.4x** |

Sources: [NVIDIA Data Center GPU Specs](https://intuitionlabs.ai/articles/nvidia-data-center-gpu-specs), [B200 vs H100](https://www.civo.com/blog/comparing-nvidia-b200-and-h100), [B300 vs B200](https://verda.com/blog/nvidia-b300-vs-b200-complete-gpu-comparison-to-date)

## 2. SageMaker Instance Configuration & Pricing

### Available Instances

| Instance | GPUs | GPU Type | Total VRAM |
|----------|------|----------|-----------|
| **ml.p5.4xlarge** | **1** | H100 | 80 GB |
| ml.p5.48xlarge | 8 | H100 | 640 GB |
| ml.p6-b200.48xlarge | 8 | B200 | 1,440 GB |
| ml.p6-b300.48xlarge | 8 | B300 | 2,100 GB |

**Critical note:** P6 instances are only available in 48xlarge (8 GPU) size. There is no single-GPU P6 option.

### Pricing (us-west-2, estimated)

| Instance | On-Demand/hr | Spot (~65% off) | Spot per GPU/hr |
|----------|-------------|-----------------|-----------------|
| **ml.p5.4xlarge** | $6.88 | ~$2.40 | **$2.40** |
| ml.p6-b200.48xlarge | $113.93 | ~$39.88 | $4.98 |
| ml.p6-b300.48xlarge | $142.42 | ~$49.85 | $6.23 |

Sources: [p5.4xlarge pricing](https://instances.vantage.sh/aws/ec2/p5.4xlarge), [p6-b200 pricing](https://instances.vantage.sh/aws/ec2/p6-b200.48xlarge), [EC2 Capacity Blocks Pricing](https://aws.amazon.com/ec2/capacityblocks/pricing/)
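
The Spot columns follow from the on-demand prices by simple arithmetic. A minimal sketch, assuming the flat ~65% discount used in the table (real Spot prices fluctuate); `spot_estimate` is a hypothetical helper, not part of any AWS SDK:

```python
def spot_estimate(on_demand_per_hr: float, gpus: int, discount: float = 0.65) -> tuple[float, float]:
    """Return (Spot $/hr for the whole instance, Spot $/hr per GPU)."""
    spot = on_demand_per_hr * (1.0 - discount)
    return spot, spot / gpus


# instance name -> (on-demand $/hr, GPU count), copied from the tables above
ON_DEMAND = {
    "ml.p5.4xlarge": (6.88, 1),
    "ml.p6-b200.48xlarge": (113.93, 8),
    "ml.p6-b300.48xlarge": (142.42, 8),
}

for name, (price, gpus) in ON_DEMAND.items():
    spot, per_gpu = spot_estimate(price, gpus)
    print(f"{name}: spot ~${spot:.2f}/hr, ~${per_gpu:.2f} per GPU/hr")
```

With these inputs the P5 figure comes out at about $2.41 rather than the table's rounded ~$2.40; the P6 figures match the table.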

## 3. The Core Problem: P6 Has No Single-GPU Option

Autoresearch is a **single-GPU workload**. Each experiment runs `train.py` on one GPU for 5 minutes. When using a P6 instance (8 GPUs), **7 out of 8 GPUs sit completely idle**, making P6 dramatically cost-inefficient unless the pipeline is redesigned.

```
P5.4xlarge (1 GPU):
[████ USED ████] → 100% utilization

P6-b200.48xlarge (8 GPUs, naive):
[████ USED ████] → GPU 1: active
[░░░░ IDLE ░░░░] → GPU 2: wasted
[░░░░ IDLE ░░░░] → GPU 3: wasted
[░░░░ IDLE ░░░░] → GPU 4: wasted
[░░░░ IDLE ░░░░] → GPU 5: wasted
[░░░░ IDLE ░░░░] → GPU 6: wasted
[░░░░ IDLE ░░░░] → GPU 7: wasted
[░░░░ IDLE ░░░░] → GPU 8: wasted
→ 12.5% utilization, 8x cost
```

## 4. Cost Scenarios for 100 Experiments

### Scenario A: 1 GPU = 1 Experiment (Current Pipeline Design)

Each SageMaker Training Job runs one experiment on one GPU.

| Instance | Training Time | Billable Time | Cost/Experiment | 100 Experiments |
|----------|-------------|---------------|-----------------|-----------------|
| **ml.p5.4xlarge** | 5 min | ~8 min | **$0.32** | **$32** |
| ml.p6-b200.48xlarge | 2.2 min (2.27x faster) | ~5.2 min | $3.45 (8 GPU billed) | $345 |
| ml.p6-b300.48xlarge | 1.5 min (3.4x faster) | ~4.5 min | $3.74 (8 GPU billed) | $374 |

**Result:** P6 is **10x more expensive** than P5 due to 7 idle GPUs.
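
The per-experiment costs in Scenario A are the instance's hourly Spot rate multiplied by the billable time. A rough sketch using the rates and billable minutes from the tables above; `job_cost` is a hypothetical helper introduced only for illustration:

```python
def job_cost(spot_per_hr: float, billable_min: float) -> float:
    """Cost of one training job billed at the instance's hourly Spot rate."""
    return spot_per_hr * billable_min / 60.0


# instance name -> (Spot $/hr for the whole instance, billable minutes)
JOBS = {
    "ml.p5.4xlarge": (2.40, 8.0),
    "ml.p6-b200.48xlarge": (39.88, 5.2),
    "ml.p6-b300.48xlarge": (49.85, 4.5),
}

baseline = job_cost(*JOBS["ml.p5.4xlarge"])
for name, (rate, minutes) in JOBS.items():
    cost = job_cost(rate, minutes)
    print(f"{name}: ${cost:.2f}/experiment, {cost / baseline:.1f}x the P5 cost")
```

Because the whole 8-GPU instance is billed even when one GPU is busy, the P6 ratios land between 10x and 12x, in line with the rounded 10x figure.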

### Scenario B: 8 GPU = 8 Experiments Simultaneously (P6 Optimized)

Modified pipeline: launch 8 experiments per P6 instance, one per GPU.

| Instance | Experiments/Job | Cost/Experiment | 100 Experiments | Wall Clock |
|----------|----------------|-----------------|-----------------|------------|
| **ml.p5.4xlarge** x10 parallel | 1 | $0.32 | **$32** | ~100 min |
| ml.p6-b200.48xlarge x2 parallel | 8 | $0.43 | $43 | ~40 min |
| ml.p6-b300.48xlarge x2 parallel | 8 | $0.47 | $47 | ~30 min |

**Result:** P6 is 1.3-1.5x more expensive but **2.5-3.3x faster** in wall clock time.
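
Scenario B divides each job's cost across the experiments it hosts and runs jobs in rounds. A simplified sketch, assuming 8 concurrent experiments take the same billable time as one and ignoring Spot wait time; `batch_plan` is a hypothetical helper:

```python
import math


def batch_plan(n_experiments: int, instances: int, per_job: int,
               job_cost: float, billable_min: float) -> tuple[float, float]:
    """Return (cost per experiment, wall-clock minutes) for a batch of experiments."""
    # Each round runs `instances` jobs in parallel, each hosting `per_job` experiments.
    rounds = math.ceil(n_experiments / (instances * per_job))
    return job_cost / per_job, rounds * billable_min


p5 = batch_plan(100, instances=10, per_job=1, job_cost=0.32, billable_min=8.0)
b200 = batch_plan(100, instances=2, per_job=8, job_cost=3.45, billable_min=5.2)
b300 = batch_plan(100, instances=2, per_job=8, job_cost=3.74, billable_min=4.5)

for name, (cost, minutes) in {"P5": p5, "P6-B200": b200, "P6-B300": b300}.items():
    print(f"{name}: ${cost:.2f}/experiment, ~{minutes:.0f} min wall clock")
```

This simple model gives 80, 36.4 and 31.5 minutes, so it should be read as a lower bound; the table's rounder wall-clock values may include Spot wait or scheduling gaps that the sketch leaves out.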

### Scenario C: Performance-Adjusted Cost (Tokens per Dollar)

Since B200/B300 process more tokens in the same 5-minute budget, we compare cost per billion tokens processed:

| GPU | Tokens in 5 min | Spot Cost (5 min) | Cost per B Tokens |
|-----|----------------|-------------------|-------------------|
| H100 (p5.4xl, 1 GPU) | ~500M | $0.20 | **$0.40/B** |
| B200 (p6-b200, 1 of 8 GPUs) | ~1,135M | $3.32 | $2.93/B (7.3x worse) |
| B200 (p6-b200, all 8 GPUs) | ~9,080M | $3.32 | **$0.37/B** (best) |

**Result:** P6 is **most cost-efficient per token** only when all 8 GPUs are fully utilized.
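
The cost-per-token column can be reproduced from the Spot rate, the 5-minute budget, and the token counts in the table. A small sketch under those inputs; `cost_per_billion_tokens` is a hypothetical helper:

```python
def cost_per_billion_tokens(spot_per_hr: float, minutes: float, tokens: float) -> float:
    """Spot cost of the run divided by tokens processed, in $ per billion tokens."""
    run_cost = spot_per_hr * minutes / 60.0
    return run_cost / (tokens / 1e9)


# (label, Spot $/hr for the instance, tokens processed in 5 min) from the tables above
CASES = [
    ("H100, 1 GPU", 2.40, 500e6),
    ("B200, 1 of 8 GPUs", 39.88, 1_135e6),
    ("B200, all 8 GPUs", 39.88, 9_080e6),
]

for label, rate, tokens in CASES:
    print(f"{label}: ${cost_per_billion_tokens(rate, 5, tokens):.2f} per B tokens")
```

The instance rate is the same in both B200 rows; only the number of tokens produced for that money changes, which is why utilization decides the outcome.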

## 5. Summary & Recommendation

### Decision Matrix

| Priority | Best Choice | Reason |
|----------|------------|--------|
| **Cost efficiency** | **ml.p5.4xlarge** | Single GPU = zero waste, lowest $/experiment |
| **Time efficiency** | ml.p6-b200.48xlarge | 8 parallel experiments per instance, 2.5x faster |
| **Maximum throughput** | ml.p6-b300.48xlarge | 8 parallel + 3.4x per-GPU speedup |
| **Cost + Performance** | **ml.p5.4xlarge** | Best balance for autoresearch workload |

### Cost Summary Table

| | P5 (H100x1) | P6-B200 (naive) | P6-B200 (8-parallel) | P6-B300 (8-parallel) |
|---|---|---|---|---|
| 100 experiments cost | **$32** | $345 | $43 | $47 |
| Wall clock time | ~100 min | ~50 min | **~40 min** | **~30 min** |
| Cost per experiment | **$0.32** | $3.45 | $0.43 | $0.47 |
| GPU utilization | 100% | 12.5% | 100% | 100% |

### Recommendation

**Use ml.p5.4xlarge (H100 single GPU)** as the default for autoresearch:

1. **10x cheaper** than naive P6 usage
2. **Identical hardware** to the original autoresearch setup → fair comparison
3. **Simple pipeline**: one GPU per job, no multi-GPU orchestration needed
4. **Sufficient VRAM** (80 GB) for the 50M parameter model (~45 GB peak)

Consider P6 only if:
- Wall clock time is the top priority (board demo, deadline)
- Pipeline is modified to run 8 experiments per instance
- Budget is not a primary concern

---

## Appendix: P5 Spot Availability by Region

| Region | ml.p5.4xlarge Spot | ml.p5.48xlarge Spot |
|--------|-------------------|-------------------|
| us-west-2 (Oregon) | Quota exists (request needed) | Quota exists |
| ap-northeast-1 (Tokyo) | Quota exists (request needed) | Quota exists |
| eu-west-2 (London) | Available (On-Demand & Spot) | Available |
| ap-south-1 (Mumbai) | Available (On-Demand & Spot) | Available |
| ap-southeast-3 (Jakarta) | Available (On-Demand & Spot) | Available |
| sa-east-1 (Sao Paulo) | Available (On-Demand & Spot) | Available |

Source: [AWS P5 Instance Announcement (Aug 2025)](https://aws.amazon.com/about-aws/whats-new/2025/08/p5-instance-nvidia-h100-gpu-sagemaker-training-processing-jobs/)

---

*Analysis date: 2026-03-28*
*Prices are estimates based on publicly available data and may vary.*

plugins/development/skills/sagemaker-spot-training/references/insights.md

Lines changed: 0 additions & 1 deletion
This file was deleted.
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Serverless Autoresearch: Key Insights

> Lessons learned from running autonomous ML experiments on SageMaker Spot Training.

## 1. Spot Capacity Varies Dramatically by Region

**Discovery:** The same instance type can have Spot placement score 1 (near-impossible) in one region and 9 (instant) in another.

| Region | g7e Score | Result |
|--------|----------|--------|
| us-west-2 | 1-2 | Stuck "Starting" 30+ min |
| us-east-1 | 9 | Allocated in ~2 min |

**Rule:** Always run `aws ec2 get-spot-placement-scores` before choosing a region. See [Spot Capacity Guide](spot-capacity-guide.md).
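
The same check can be scripted. A minimal sketch, assuming boto3 is installed and credentials allow `ec2:GetSpotPlacementScores`; `best_region` and `fetch_scores` are hypothetical helpers, and the sample scores simply mirror the table above:

```python
def best_region(scores: list[dict]) -> tuple[str, int]:
    """Pick the entry with the highest Spot placement score (1 = unlikely, 10 = very likely)."""
    top = max(scores, key=lambda s: s["Score"])
    return top["Region"], top["Score"]


def fetch_scores(instance_type: str, regions: list[str], home_region: str = "us-east-1") -> list[dict]:
    """Query EC2 for per-region placement scores (needs AWS credentials)."""
    import boto3  # imported lazily so the pure helper above works without boto3

    ec2 = boto3.client("ec2", region_name=home_region)
    resp = ec2.get_spot_placement_scores(
        InstanceTypes=[instance_type],
        TargetCapacity=1,
        SingleAvailabilityZone=False,
        RegionNames=regions,
    )
    return resp["SpotPlacementScores"]


# Offline example with scores shaped like the API response
sample = [{"Region": "us-west-2", "Score": 2}, {"Region": "us-east-1", "Score": 9}]
print(best_region(sample))
```

In a real run, `fetch_scores("g7e.2xlarge", ["us-west-2", "us-east-1"])` would replace `sample`; the instance type name there is an assumption based on the sizes mentioned in this document.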

## 2. Larger Instances Can Be Cheaper on Spot

**Discovery:** g7e.8xlarge ($0.93/hr) was cheaper than g7e.2xlarge ($0.94-$1.82/hr) in us-west-2 because larger instances have less Spot demand.

**Rule:** Check Spot price history for all sizes; don't assume smaller = cheaper.
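
One way to apply this rule is to compare the lowest recent Spot price per size. A hedged sketch using the EC2 `describe_spot_price_history` call as a proxy (SageMaker managed Spot billing may differ from raw EC2 Spot prices); `cheapest_by_type` and `fetch_history` are hypothetical helpers, and only the first page of results is read:

```python
from datetime import datetime, timedelta, timezone


def cheapest_by_type(history: list[dict]) -> dict[str, float]:
    """Lowest observed Spot price per instance type (API prices arrive as strings)."""
    best: dict[str, float] = {}
    for entry in history:
        price = float(entry["SpotPrice"])
        itype = entry["InstanceType"]
        if itype not in best or price < best[itype]:
            best[itype] = price
    return best


def fetch_history(instance_types: list[str], region: str, hours: int = 24) -> list[dict]:
    """Fetch recent Spot prices from EC2 (needs AWS credentials; first page only)."""
    import boto3  # lazy import keeps the pure helper usable offline

    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_spot_price_history(
        InstanceTypes=instance_types,
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=hours),
    )
    return resp["SpotPriceHistory"]


# Offline example using the prices quoted in the discovery above
sample = [
    {"InstanceType": "g7e.2xlarge", "SpotPrice": "0.94"},
    {"InstanceType": "g7e.2xlarge", "SpotPrice": "1.82"},
    {"InstanceType": "g7e.8xlarge", "SpotPrice": "0.93"},
]
prices = cheapest_by_type(sample)
print(min(prices, key=prices.get), prices)
```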

## 3. DEVICE_BATCH_SIZE ≠ Token Throughput

**Discovery:** Doubling DEVICE_BATCH_SIZE from 64 to 128 with the same TOTAL_BATCH_SIZE **worsened** val_bpb (1.065 → 1.081).

**Why:** With TOTAL_BATCH_SIZE fixed at 2^19, larger DEVICE_BATCH_SIZE reduces gradient accumulation steps (4 → 2) without increasing total tokens processed. It just uses more VRAM for the same work.

**Rule:** To increase throughput, increase TOTAL_BATCH_SIZE (more tokens per optimizer step), not just DEVICE_BATCH_SIZE.
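
The relationship between the two settings can be made concrete. A small sketch, assuming a sequence length of 2048 tokens (not stated above; chosen because it reproduces the 4 → 2 accumulation steps) and a single GPU; `grad_accum_steps` is a hypothetical helper:

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, num_gpus: int = 1) -> int:
    """Micro-batches accumulated per optimizer step for a fixed token budget."""
    tokens_per_micro_batch = device_batch_size * seq_len * num_gpus
    if total_batch_tokens % tokens_per_micro_batch != 0:
        raise ValueError("TOTAL_BATCH_SIZE must be a multiple of the micro-batch token count")
    return total_batch_tokens // tokens_per_micro_batch


TOTAL_BATCH_SIZE = 2**19  # tokens per optimizer step, as in the text
SEQ_LEN = 2048            # assumption, see lead-in

for device_batch_size in (64, 128):
    steps = grad_accum_steps(TOTAL_BATCH_SIZE, device_batch_size, SEQ_LEN)
    print(f"DEVICE_BATCH_SIZE={device_batch_size}: {steps} accumulation steps, "
          f"{TOTAL_BATCH_SIZE} tokens per optimizer step")
```

Either way the optimizer sees the same 2^19 tokens per step; only the way those tokens are split into micro-batches changes.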

## 4. Flash Attention 3 is GPU-Architecture Specific

**Discovery:** FA3 pre-compiled kernels only support Hopper (sm_90) and Ampere (sm_80/86). Ada Lovelace (sm_89, L40S) is **not supported**, causing runtime CUDA errors.

**Solution:** Explicit compute capability check + PyTorch SDPA fallback. FA2 has community wheels for sm_89.

**Impact:** SDPA gives ~20% MFU vs ~40% with FA3, roughly half the model FLOPs utilization.
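
The capability check described in the solution might look like the following. This is a sketch, not the pipeline's actual code: the supported set encodes the discovery above, and `pick_attention_backend` and `detect_backend` are hypothetical names:

```python
# Compute capabilities the pre-compiled FA3 kernels are reported to support (see above)
FA3_SUPPORTED = {(9, 0), (8, 0), (8, 6)}


def pick_attention_backend(capability: tuple[int, int]) -> str:
    """Return 'fa3' when the GPU is in the supported set, otherwise fall back to 'sdpa'."""
    return "fa3" if capability in FA3_SUPPORTED else "sdpa"


def detect_backend() -> str:
    """Inspect the current CUDA device; requires PyTorch."""
    import torch  # lazy import so the selection logic can be tested without a GPU

    if not torch.cuda.is_available():
        return "sdpa"
    return pick_attention_backend(torch.cuda.get_device_capability())


for name, cap in {"H100": (9, 0), "L40S": (8, 9)}.items():
    print(name, "->", pick_attention_backend(cap))
```

When the result is `"sdpa"`, the model would call `torch.nn.functional.scaled_dot_product_attention` instead of the FA3 kernel.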

## 5. SageMaker Startup Overhead is Significant

**Discovery:** Each SageMaker Training Job has ~3 min startup overhead (instance allocation + container pull + data download + pip install). For 5-min training jobs, this is **60% overhead** on top of training time (about 38% of the ~8 min total).

**Optimization paths:**
- **Scale up:** Use multi-GPU instance, run N experiments on 1 job (amortize startup)
- **Pre-install deps:** Bake packages into Docker image instead of pip install at runtime
- **Warm pools:** SageMaker warm pools keep instances alive between jobs (but this costs money)
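
The "scale up" option amortizes the fixed startup cost over several experiments. A back-of-the-envelope sketch, assuming the N experiments run concurrently (one per GPU) so the job still takes startup plus one training run; `startup_overhead` is a hypothetical helper:

```python
def startup_overhead(startup_min: float, train_min: float, experiments_per_job: int = 1) -> float:
    """Startup minutes attributable to each experiment, as a fraction of its training time."""
    return (startup_min / experiments_per_job) / train_min


for n in (1, 8):
    share = startup_overhead(startup_min=3.0, train_min=5.0, experiments_per_job=n)
    print(f"{n} experiment(s) per job: {share:.1%} startup overhead per experiment")
```

With one experiment per job this reproduces the 60% figure; sharing the job among 8 experiments cuts the per-experiment share to under 10%.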

## 6. Quota Management is a First-Class Concern

**Discovery:** GPU Spot quotas default to 0 for new instance types. g7e auto-approved within minutes; p5/p6 require manual review (CASE_OPENED, days).

**Rule:** Request quotas in multiple regions upfront. g7e tends to auto-approve; p5+ needs lead time.

## 7. SageMaker Profiler Doesn't Support All Instance Types

**Discovery:** `ml.g7e` instances throw `ValidationException: Profiler is currently not supported` at job creation.

**Fix:** Set `disable_profiler=True` in the PyTorch Estimator.
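
A sketch of where the flag goes, assuming the SageMaker Python SDK v2 `PyTorch` estimator. The role ARN, framework and Python versions, and time limits below are placeholders, and `estimator_kwargs` and `build_estimator` are hypothetical helpers:

```python
def estimator_kwargs(role_arn: str, instance_type: str, entry_point: str = "train.py") -> dict:
    """Keyword arguments for a single-instance Spot training job with the profiler off."""
    return dict(
        entry_point=entry_point,
        role=role_arn,
        instance_type=instance_type,
        instance_count=1,
        framework_version="2.5.1",  # placeholder: use a version your region's DLC offers
        py_version="py311",         # placeholder, must match the framework version
        use_spot_instances=True,
        max_run=900,                # seconds of training allowed
        max_wait=3600,              # seconds including Spot wait; must be >= max_run
        disable_profiler=True,      # avoids the ValidationException described above
    )


def build_estimator(**kwargs):
    """Create the estimator; requires the `sagemaker` package and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # lazy import keeps the helper above testable

    return PyTorch(**kwargs)


kwargs = estimator_kwargs("arn:aws:iam::123456789012:role/ExampleRole", "ml.g7e.2xlarge")
print(kwargs["disable_profiler"], kwargs["use_spot_instances"])
```

Calling `build_estimator(**kwargs).fit()` would then submit the job; that step is left out here because it needs a real role and account.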

## 8. The Parallel Evolution Approach Works

**Validated:** The pipeline successfully generates candidates, submits parallel Spot jobs, collects results, and selects the best, all autonomously.

**Cost efficiency:** 4 parallel experiments for $0.066 total, results in ~10 min wall clock (excluding Spot wait time in us-west-2).

## 9. PyArrow Version Matters

**Discovery:** The SageMaker Deep Learning Container (DLC) has pyarrow 23.x, but the local environment may have an older version, causing `Repetition level histogram size mismatch` when reading parquet files.

**Fix:** Ensure `pyarrow>=21.0.0` in requirements-train.txt.

## 10. config.yaml Should Never Be in Git

**Discovery:** config.yaml contains AWS role ARN, profile, and region, all of which are environment-specific and potentially sensitive.

**Rule:** Add config.yaml to .gitignore and provide config.yaml.example as a template.

## 11. Spot GPUs Are Valid Proxies for Large-Scale Training

**Discovery:** Research confirms that hyperparameter optimization on cheaper GPUs (L40S) transfers well to expensive GPUs (H100) for production training.

**What transfers:**
- Optimizer choices (Muon vs AdamW): relative rankings hold across hardware
- Architecture decisions (depth, width, attention patterns): hardware-independent
- LR schedule shapes (cosine, warmup ratios): direction transfers, absolute values need adjustment
- Relative hyperparameter rankings: "A is better than B" conclusions are portable

**What doesn't transfer:**
- Absolute val_bpb values: depend on GPU throughput
- Optimal batch sizes: depend on VRAM (48GB vs 80GB)
- Memory-dependent optimizations: FA3 (Hopper only), FP8, etc.
- Absolute learning rate values: need per-scale tuning without muP

**Rule:** Use Spot for Phase 1 (hypothesis validation at $0.04/experiment), then apply winning architecture/optimizer choices to Phase 2 (full-scale training on H100). Use muP for direct LR transfer across scales.

**References:**
- [MLPerf BERT HPC Optimization (arXiv 2402.02447)](https://arxiv.org/pdf/2402.02447)
- [Improving HPO with Checkpointed Weights (NVIDIA 2024)](https://research.nvidia.com/publication/2024-06_improving-hyperparameter-optimization-checkpointed-model-weights)
- [muP Scaling (arXiv 2410.22854)](https://arxiv.org/html/2410.22854v3)

## 12. DEVICE_BATCH_SIZE ≠ More Training

**Discovery (Experiment #002):** Doubling DEVICE_BATCH_SIZE from 64 to 128 while keeping TOTAL_BATCH_SIZE=2^19 **worsened** val_bpb (1.065 → 1.081). It only reduced gradient accumulation steps (4 → 2) without increasing total tokens.

**Rule:** To increase throughput, increase TOTAL_BATCH_SIZE. DEVICE_BATCH_SIZE only affects VRAM usage and gradient accumulation granularity.

plugins/development/skills/sagemaker-spot-training/references/spot-capacity-guide.md

Lines changed: 0 additions & 1 deletion
This file was deleted.
