roboco-io
diff --git a/‎plugins/development/skills/sagemaker-spot-training/references/gpu-cost-analysis.md‎
Lines changed: 0 additions & 1 deletion b/‎plugins/development/skills/sagemaker-spot-training/references/gpu-cost-analysis.md‎
diff --git a/‎plugins/development/skills/sagemaker-spot-training/references/gpu-cost-analysis.md‎
Lines changed: 150 additions & 0 deletions b/‎plugins/development/skills/sagemaker-spot-training/references/gpu-cost-analysis.md‎
diff --git a/‎plugins/development/skills/sagemaker-spot-training/references/insights.md‎
Lines changed: 0 additions & 1 deletion b/‎plugins/development/skills/sagemaker-spot-training/references/insights.md‎
diff --git a/‎plugins/development/skills/sagemaker-spot-training/references/insights.md‎
Lines changed: 104 additions & 0 deletions b/‎plugins/development/skills/sagemaker-spot-training/references/insights.md‎
diff --git a/‎plugins/development/skills/sagemaker-spot-training/references/spot-capacity-guide.md‎
Lines changed: 0 additions & 1 deletion b/‎plugins/development/skills/sagemaker-spot-training/references/spot-capacity-guide.md‎
@@ -0,0 +1,150 @@
+# GPU Instance Cost Analysis: P5 vs P6 for Autoresearch
+
+> Comparing H100 (P5), B200 (P6-B200), and B300 (P6-B300) for single-GPU ML experiment workloads on SageMaker Spot Training.
+
+---
+
+## 1. GPU Performance Specifications
+
+| GPU | Architecture | BF16 TFLOPS (dense) | VRAM | Memory BW | BF16 vs H100 |
+|-----|-------------|---------------------|------|-----------|--------------|
+| **H100** | Hopper | 990 | 80 GB | 3,350 GB/s | 1.0x |
+| **B200** | Blackwell | 2,250 | 180 GB | 8,000 GB/s | **2.27x** |
+| **B300** | Blackwell Ultra | ~3,375 | 288 GB | 8,000 GB/s | **~3.4x** |
+
+Sources: [NVIDIA Data Center GPU Specs](https://intuitionlabs.ai/articles/nvidia-data-center-gpu-specs), [B200 vs H100](https://www.civo.com/blog/comparing-nvidia-b200-and-h100), [B300 vs B200](https://verda.com/blog/nvidia-b300-vs-b200-complete-gpu-comparison-to-date)
+
+## 2. SageMaker Instance Configuration & Pricing
+
+### Available Instances
+
+| Instance | GPUs | GPU Type | Total VRAM |
+|----------|------|----------|-----------|
+| **ml.p5.4xlarge** | **1** | H100 | 80 GB |
+| ml.p5.48xlarge | 8 | H100 | 640 GB |
+| ml.p6-b200.48xlarge | 8 | B200 | 1,440 GB |
+| ml.p6-b300.48xlarge | 8 | B300 | 2,100 GB |
+
+**Critical note:** P6 instances are only available in 48xlarge (8 GPU) size. There is no single-GPU P6 option.
+
+### Pricing (us-west-2, estimated)
+
+| Instance | On-Demand/hr | Spot/hr (~65% off) | Spot per GPU/hr |
+|----------|-------------|-----------------|-----------------|
+| **ml.p5.4xlarge** | $6.88 | ~$2.40 | **$2.40** |
+| ml.p6-b200.48xlarge | $113.93 | ~$39.88 | $4.98 |
+| ml.p6-b300.48xlarge | $142.42 | ~$49.85 | $6.23 |
+
+Sources: [p5.4xlarge pricing](https://instances.vantage.sh/aws/ec2/p5.4xlarge), [p6-b200 pricing](https://instances.vantage.sh/aws/ec2/p6-b200.48xlarge), [EC2 Capacity Blocks Pricing](https://aws.amazon.com/ec2/capacityblocks/pricing/)
+
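The Spot columns above follow from simple arithmetic on the on-demand price. A minimal sketch, assuming a flat 65% discount (real Spot prices fluctuate by region and time) and using the on-demand figures from the table; the helper name `spot_prices` is illustrative only:

```python
# On-demand hourly prices and GPU counts copied from the pricing table above.
ON_DEMAND_PER_HOUR = {
    "ml.p5.4xlarge": (6.88, 1),
    "ml.p6-b200.48xlarge": (113.93, 8),
    "ml.p6-b300.48xlarge": (142.42, 8),
}

SPOT_DISCOUNT = 0.65  # assumption: flat discount, as in the table header


def spot_prices(on_demand: float, gpus: int, discount: float = SPOT_DISCOUNT):
    """Return (Spot price per instance-hour, Spot price per GPU-hour)."""
    spot = on_demand * (1.0 - discount)
    return spot, spot / gpus


for name, (price, gpus) in ON_DEMAND_PER_HOUR.items():
    spot, per_gpu = spot_prices(price, gpus)
    print(f"{name}: spot ~${spot:.2f}/hr, ~${per_gpu:.2f} per GPU-hr")
```

Dividing by the GPU count only reflects real cost when every GPU on the instance is doing useful work, which is the issue the next section examines.
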
+## 3. The Core Problem: P6 Has No Single-GPU Option
+
+Autoresearch is a **single-GPU workload**. Each experiment runs `train.py` on one GPU for 5 minutes. When using a P6 instance (8 GPUs), **7 out of 8 GPUs sit completely idle**, making P6 dramatically cost-inefficient unless the pipeline is redesigned.
+
+```
+P5.4xlarge (1 GPU):
+  [████ USED ████]                          → 100% utilization
+
+P6-b200.48xlarge (8 GPUs, naive):
+  [████ USED ████]                          → GPU 1: active
+  [░░░░ IDLE ░░░░]                          → GPU 2: wasted
+  [░░░░ IDLE ░░░░]                          → GPU 3: wasted
+  [░░░░ IDLE ░░░░]                          → GPU 4: wasted
+  [░░░░ IDLE ░░░░]                          → GPU 5: wasted
+  [░░░░ IDLE ░░░░]                          → GPU 6: wasted
+  [░░░░ IDLE ░░░░]                          → GPU 7: wasted
+  [░░░░ IDLE ░░░░]                          → GPU 8: wasted
+                                            → 12.5% utilization, 8x cost
+```
+
+## 4. Cost Scenarios for 100 Experiments
+
+### Scenario A: 1 GPU = 1 Experiment (Current Pipeline Design)
+
+Each SageMaker Training Job runs one experiment on one GPU.
+
+| Instance | Training Time | Billable Time | Cost/Experiment | 100 Experiments |
+|----------|-------------|---------------|-----------------|-----------------|
+| **ml.p5.4xlarge** | 5 min | ~8 min | **$0.32** | **$32** |
+| ml.p6-b200.48xlarge | 2.2 min (2.27x faster) | ~5.2 min | $3.45 (8 GPUs billed) | $345 |
+| ml.p6-b300.48xlarge | 1.5 min (3.4x faster) | ~4.5 min | $3.74 (8 GPUs billed) | $374 |
+
+**Result:** P6 is roughly **11x more expensive** than P5 ($345-$374 vs. $32 per 100 experiments) because 7 of the 8 billed GPUs sit idle.
+
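The Scenario A figures can be approximated with the sketch below. It assumes a fixed 3-minute startup overhead per job (inferred from the gap between the Training Time and Billable Time columns) and that training time shrinks in proportion to the BF16 speedup; the function name is illustrative.

```python
def cost_per_job(spot_per_hour: float, train_min: float, overhead_min: float = 3.0) -> float:
    """Billed cost of one job: (training + startup overhead) at the hourly Spot rate."""
    billable_min = train_min + overhead_min
    return spot_per_hour * billable_min / 60.0


BASE_TRAIN_MIN = 5.0  # H100 training time per experiment, from the text

# (Spot $/hr for the whole instance, assumed speedup over H100), from the tables above
scenarios = {
    "ml.p5.4xlarge": (2.40, 1.0),
    "ml.p6-b200.48xlarge": (39.88, 2.27),
    "ml.p6-b300.48xlarge": (49.85, 3.4),
}

for name, (rate, speedup) in scenarios.items():
    per_experiment = cost_per_job(rate, BASE_TRAIN_MIN / speedup)
    print(f"{name}: ${per_experiment:.2f}/experiment, ${per_experiment * 100:.0f} per 100")
```

Results may differ from the table by a cent or so because the table rounds the training time before multiplying.
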
+### Scenario B: 8 GPUs = 8 Experiments Simultaneously (P6 Optimized)
+
+Modified pipeline: launch 8 experiments per P6 instance, one per GPU.
+
+| Instance | Experiments/Job | Cost/Experiment | 100 Experiments | Wall Clock |
+|----------|----------------|-----------------|-----------------|------------|
+| **ml.p5.4xlarge** x10 parallel | 1 | $0.32 | **$32** | ~100 min |
+| ml.p6-b200.48xlarge x2 parallel | 8 | $0.43 | $43 | ~40 min |
+| ml.p6-b300.48xlarge x2 parallel | 8 | $0.47 | $47 | ~30 min |
+
+**Result:** P6 is 1.3-1.5x more expensive but **2.5-3.3x faster** in wall clock time.
+
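A rough sketch of how the packed P6 numbers in Scenario B can be derived. It assumes every job keeps all 8 GPUs busy (including the last, partially filled round) and reuses the billable times from Scenario A; `packed_run` is a hypothetical helper, not part of the pipeline.

```python
import math


def packed_run(n_experiments: int, gpus_per_instance: int, instances: int,
               spot_per_hour: float, job_min: float):
    """Return (cost per experiment, wall-clock minutes) with one experiment per GPU."""
    experiments_per_round = gpus_per_instance * instances
    rounds = math.ceil(n_experiments / experiments_per_round)
    cost_per_experiment = spot_per_hour * job_min / 60.0 / gpus_per_instance
    return cost_per_experiment, rounds * job_min


# ml.p6-b200.48xlarge x2: $39.88/hr Spot, ~5.2 billable minutes per job
b200_cost, b200_wall = packed_run(100, 8, 2, 39.88, 5.2)
# ml.p6-b300.48xlarge x2: $49.85/hr Spot, ~4.5 billable minutes per job
b300_cost, b300_wall = packed_run(100, 8, 2, 49.85, 4.5)

print(f"B200: ${b200_cost:.2f}/experiment, ~{b200_wall:.0f} min wall clock")
print(f"B300: ${b300_cost:.2f}/experiment, ~{b300_wall:.0f} min wall clock")
```

The wall-clock estimate ignores Spot wait time, so it is a lower bound on what a real run would take.
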
+### Scenario C: Performance-Adjusted Cost (Tokens per Dollar)
+
+Since B200/B300 process more tokens in the same 5-minute budget, we compare cost per billion tokens processed:
+
+| GPU | Tokens in 5 min | Spot Cost (5 min) | Cost per B Tokens |
+|-----|----------------|-------------------|-------------------|
+| H100 (p5.4xl, 1 GPU) | ~500M | $0.20 | **$0.40/B** |
+| B200 (p6-b200, 1 of 8 GPUs) | ~1,135M | $3.32 | $2.93/B (7.3x worse) |
+| B200 (p6-b200, all 8 GPUs) | ~9,080M | $3.32 | **$0.37/B** (best) |
+
+**Result:** P6 is **most cost-efficient per token** only when all 8 GPUs are fully utilized.
+
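The cost-per-token comparison reduces to one division. A minimal sketch, assuming the ~500M tokens per 5 minutes on H100 and the 2.27x B200 speedup used in the table; the function name is illustrative.

```python
def cost_per_billion_tokens(spot_per_hour: float, minutes: float,
                            tokens_per_gpu: float, gpus_used: int) -> float:
    """Instance cost for the time window divided by billions of tokens processed."""
    cost = spot_per_hour * minutes / 60.0
    tokens_in_billions = tokens_per_gpu * gpus_used / 1e9
    return cost / tokens_in_billions


H100_TOKENS = 500e6               # tokens in 5 minutes on one H100 (table estimate)
B200_TOKENS = H100_TOKENS * 2.27  # scaled by the assumed BF16 speedup

h100 = cost_per_billion_tokens(2.40, 5, H100_TOKENS, gpus_used=1)
b200_one_gpu = cost_per_billion_tokens(39.88, 5, B200_TOKENS, gpus_used=1)
b200_all_gpus = cost_per_billion_tokens(39.88, 5, B200_TOKENS, gpus_used=8)

print(f"H100: ${h100:.2f}/B, B200 (1 of 8): ${b200_one_gpu:.2f}/B, "
      f"B200 (all 8): ${b200_all_gpus:.2f}/B")
```

The instance price is the same whether one or eight GPUs are busy; only the token count in the denominator changes.
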
+## 5. Summary & Recommendation
+
+### Decision Matrix
+
+| Priority | Best Choice | Reason |
+|----------|------------|--------|
+| **Cost efficiency** | **ml.p5.4xlarge** | Single GPU = zero waste, lowest $/experiment |
+| **Time efficiency** | ml.p6-b200.48xlarge | 8 parallel experiments per instance, 2.5x faster |
+| **Maximum throughput** | ml.p6-b300.48xlarge | 8 parallel + 3.4x per-GPU speedup |
+| **Cost + Performance** | **ml.p5.4xlarge** | Best balance for autoresearch workload |
+
+### Cost Summary Table
+
+| | P5 (H100x1) | P6-B200 (naive) | P6-B200 (8-parallel) | P6-B300 (8-parallel) |
+|---|---|---|---|---|
+| 100 experiments cost | **$32** | $345 | $43 | $47 |
+| Wall clock time | ~100 min | ~50 min | **~40 min** | **~30 min** |
+| Cost per experiment | **$0.32** | $3.45 | $0.43 | $0.47 |
+| GPU utilization | 100% | 12.5% | 100% | 100% |
+
+### Recommendation
+
+**Use ml.p5.4xlarge (H100 single GPU)** as the default for autoresearch:
+
+1. **Roughly 11x cheaper** than naive P6 usage ($32 vs. $345 per 100 experiments)
+2. **Identical hardware** to the original autoresearch setup → fair comparison
+3. **Simple pipeline** — one GPU per job, no multi-GPU orchestration needed
+4. **Sufficient VRAM** (80 GB) for the 50M parameter model (~45 GB peak)
+
+Consider P6 only if:
+- Wall clock time is the top priority (board demo, deadline)
+- Pipeline is modified to run 8 experiments per instance
+- Budget is not a primary concern
+
+---
+
+## Appendix: P5 Spot Availability by Region
+
+| Region | ml.p5.4xlarge Spot | ml.p5.48xlarge Spot |
+|--------|-------------------|-------------------|
+| us-west-2 (Oregon) | Quota exists (request needed) | Quota exists |
+| ap-northeast-1 (Tokyo) | Quota exists (request needed) | Quota exists |
+| eu-west-2 (London) | Available (On-Demand & Spot) | Available |
+| ap-south-1 (Mumbai) | Available (On-Demand & Spot) | Available |
+| ap-southeast-3 (Jakarta) | Available (On-Demand & Spot) | Available |
+| sa-east-1 (Sao Paulo) | Available (On-Demand & Spot) | Available |
+
+Source: [AWS P5 Instance Announcement (Aug 2025)](https://aws.amazon.com/about-aws/whats-new/2025/08/p5-instance-nvidia-h100-gpu-sagemaker-training-processing-jobs/)
+
+---
+
+*Analysis date: 2026-03-28*
+*Prices are estimates based on publicly available data and may vary.*
@@ -0,0 +1,104 @@
+# Serverless Autoresearch — Key Insights
+
+> Lessons learned from running autonomous ML experiments on SageMaker Spot Training.
+
+## 1. Spot Capacity Varies Dramatically by Region
+
+**Discovery:** The same instance type can have Spot placement score 1 (near-impossible) in one region and 9 (instant) in another.
+
+| Region | g7e Score | Result |
+|--------|----------|--------|
+| us-west-2 | 1-2 | Stuck "Starting" 30+ min |
+| us-east-1 | 9 | Allocated in ~2 min |
+
+**Rule:** Always run `aws ec2 get-spot-placement-scores` before choosing a region. See [Spot Capacity Guide](spot-capacity-guide.md).
+
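The same check can be scripted. The sketch below ranks regions from a placement-score response; `fetch_scores` wraps the boto3 EC2 call `get_spot_placement_scores` and needs AWS credentials, while `best_regions` is a hypothetical helper that works on any response of that shape. The minimum score of 7 is an arbitrary illustrative threshold, and the API takes EC2 instance type names (without the `ml.` prefix).

```python
def best_regions(response: dict, min_score: int = 7) -> list[tuple[str, int]]:
    """Rank regions by Spot placement score, keeping those at or above min_score."""
    scores = [(s["Region"], s["Score"]) for s in response.get("SpotPlacementScores", [])]
    usable = [s for s in scores if s[1] >= min_score]
    return sorted(usable, key=lambda s: s[1], reverse=True)


def fetch_scores(instance_types: list[str], regions: list[str], target_capacity: int = 1) -> dict:
    """Query EC2 for placement scores (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so best_regions works without AWS access

    ec2 = boto3.client("ec2", region_name=regions[0])
    return ec2.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        RegionNames=regions,
    )


# Sample response shaped like the API output, with scores from the table above.
sample = {
    "SpotPlacementScores": [
        {"Region": "us-west-2", "Score": 2},
        {"Region": "us-east-1", "Score": 9},
    ]
}
ranked = best_regions(sample)
print(ranked)
```
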
+## 2. Larger Instances Can Be Cheaper on Spot
+
+**Discovery:** g7e.8xlarge ($0.93/hr) was cheaper than g7e.2xlarge ($0.94-$1.82/hr) in us-west-2 because larger instances have less Spot demand.
+
+**Rule:** Check Spot price history for all sizes — don't assume smaller = cheaper.
+
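One way to compare sizes is to reduce Spot price history records to the lowest observed price per instance type. This sketch assumes records shaped like the `SpotPriceHistory` entries returned by the EC2 `describe_spot_price_history` API (prices arrive as strings); the sample values are the ones quoted above, and the helper name is illustrative.

```python
def cheapest_by_type(history: list[dict]) -> dict[str, float]:
    """Lowest observed Spot price per instance type."""
    lowest: dict[str, float] = {}
    for record in history:
        price = float(record["SpotPrice"])  # the API returns prices as strings
        instance_type = record["InstanceType"]
        lowest[instance_type] = min(price, lowest.get(instance_type, price))
    return lowest


# Sample records using the us-west-2 prices mentioned in the text.
history = [
    {"InstanceType": "g7e.2xlarge", "SpotPrice": "0.94"},
    {"InstanceType": "g7e.2xlarge", "SpotPrice": "1.82"},
    {"InstanceType": "g7e.8xlarge", "SpotPrice": "0.93"},
]

prices = cheapest_by_type(history)
cheapest = min(prices, key=prices.get)
print(prices, "->", cheapest)
```
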
+## 3. DEVICE_BATCH_SIZE ≠ Token Throughput
+
+**Discovery:** Doubling DEVICE_BATCH_SIZE from 64 to 128 with the same TOTAL_BATCH_SIZE **worsened** val_bpb (1.065 → 1.081).
+
+**Why:** With TOTAL_BATCH_SIZE fixed at 2^19, larger DEVICE_BATCH_SIZE reduces gradient accumulation steps (4 → 2) without increasing total tokens processed. It just uses more VRAM for the same work.
+
+**Rule:** To increase throughput, increase TOTAL_BATCH_SIZE (more tokens per optimizer step), not just DEVICE_BATCH_SIZE.
+
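The relationship between the two batch-size settings can be checked with a few lines of arithmetic. This is a sketch under one assumption not stated in the text: a sequence length of 2048 tokens, chosen because it reproduces the 4 and 2 accumulation steps reported above on a single GPU.

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, world_size: int = 1) -> int:
    """Micro-steps accumulated before each optimizer step."""
    tokens_per_micro_step = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("TOTAL_BATCH_SIZE must be a multiple of the per-step token count")
    return total_batch_tokens // tokens_per_micro_step


TOTAL_BATCH_SIZE = 2**19  # tokens per optimizer step, from the text
SEQ_LEN = 2048            # assumption: reproduces the 4 -> 2 step counts

for device_batch_size in (64, 128):
    steps = grad_accum_steps(TOTAL_BATCH_SIZE, device_batch_size, SEQ_LEN)
    print(f"DEVICE_BATCH_SIZE={device_batch_size}: {steps} accumulation steps, "
          f"{TOTAL_BATCH_SIZE} tokens per optimizer step")
```

In both cases the optimizer still sees 2^19 tokens per step; only the number of micro-steps changes.
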
+## 4. Flash Attention 3 is GPU-Architecture Specific
+
+**Discovery:** FA3 pre-compiled kernels only support Hopper (sm_90) and Ampere (sm_80/86). Ada Lovelace (sm_89, L40S) is **not supported**, causing runtime CUDA errors.
+
+**Solution:** Explicit compute capability check + PyTorch SDPA fallback. FA2 has community wheels for sm_89.
+
+**Impact:** SDPA reaches ~20% MFU versus ~40% with FA3, roughly halving model FLOPs utilization.
+
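The capability check and fallback could look like the sketch below. The supported-architecture set mirrors the note above; `pick_attention_backend` and the `fa3_fn` parameter are hypothetical names, and the caller is assumed to supply an FA3 callable and handle any tensor layout differences between FA3 and SDPA.

```python
# Compute capabilities covered by the pre-compiled FA3 kernels, per the note above.
FA3_CAPABILITIES = {(8, 0), (8, 6), (9, 0)}  # sm_80, sm_86, sm_90


def pick_attention_backend(capability: tuple[int, int]) -> str:
    """Return "fa3" for supported GPUs and "sdpa" otherwise (e.g. sm_89 / L40S)."""
    return "fa3" if capability in FA3_CAPABILITIES else "sdpa"


def attention(q, k, v, fa3_fn=None, causal: bool = True):
    """Dispatch to FA3 when supported, else fall back to PyTorch SDPA.

    fa3_fn is a caller-supplied FA3 callable (hypothetical signature: q, k, v, causal).
    """
    import torch
    import torch.nn.functional as F

    backend = pick_attention_backend(torch.cuda.get_device_capability())
    if backend == "fa3" and fa3_fn is not None:
        return fa3_fn(q, k, v, causal)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


print(pick_attention_backend((8, 9)))  # Ada Lovelace falls back to SDPA
```

Checking the capability explicitly avoids discovering the incompatibility as a CUDA error minutes into a billed job.
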
+## 5. SageMaker Startup Overhead is Significant
+
+**Discovery:** Each SageMaker Training Job has ~3 min of startup overhead (instance allocation + container pull + data download + pip install). For a 5-min training job, that adds **60%** on top of the training time.
+
+**Optimization paths:**
+- **Scale up:** Use multi-GPU instance, run N experiments on 1 job (amortize startup)
+- **Pre-install deps:** Bake packages into Docker image instead of pip install at runtime
+- **Warm pools:** SageMaker warm pools keep instances alive between jobs (but the retained instances are billed while they wait)
+
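To see how much the "scale up" path helps, the startup cost can be spread over the experiments that share a job. A minimal sketch, assuming the ~3 min startup and 5 min training time above and that the N experiments run in parallel within one job; the function name is illustrative.

```python
def startup_overhead_ratio(train_min: float, startup_min: float = 3.0,
                           experiments_per_job: int = 1) -> float:
    """Startup minutes attributed to each experiment, relative to its training time."""
    if experiments_per_job < 1:
        raise ValueError("experiments_per_job must be at least 1")
    return startup_min / experiments_per_job / train_min


for n in (1, 4, 8):
    ratio = startup_overhead_ratio(train_min=5.0, experiments_per_job=n)
    print(f"{n} experiment(s) per job: {ratio:.1%} startup overhead per experiment")
```
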
+## 6. Quota Management is a First-Class Concern
+
+**Discovery:** GPU Spot quotas default to 0 for new instance types. g7e requests were auto-approved within minutes; p5/p6 requests go to manual review (status `CASE_OPENED`) and can take days.
+
+**Rule:** Request quotas in multiple regions upfront. g7e tends to auto-approve; p5+ needs lead time.
+
+## 7. SageMaker Profiler Doesn't Support All Instance Types
+
+**Discovery:** `ml.g7e` instances throw `ValidationException: Profiler is currently not supported` at job creation.
+
+**Fix:** Set `disable_profiler=True` in the PyTorch Estimator.
+
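A sketch of where the flag goes, assuming the SageMaker Python SDK's `PyTorch` estimator. The role ARN, framework version, and Python version shown in the usage are placeholders, and `spot_estimator_kwargs` is a hypothetical helper rather than the pipeline's actual code.

```python
def spot_estimator_kwargs(role_arn: str, instance_type: str, framework_version: str,
                          py_version: str, max_run_s: int = 900,
                          max_wait_s: int = 3600) -> dict:
    """Keyword arguments for a Spot training job with the profiler turned off."""
    if max_wait_s < max_run_s:
        raise ValueError("max_wait must be >= max_run for managed Spot training")
    return {
        "entry_point": "train.py",
        "role": role_arn,
        "instance_type": instance_type,
        "instance_count": 1,
        "framework_version": framework_version,
        "py_version": py_version,
        "use_spot_instances": True,
        "max_run": max_run_s,
        "max_wait": max_wait_s,
        "disable_profiler": True,  # avoids the ValidationException on ml.g7e
    }


def build_estimator(**kwargs):
    """Create the estimator (requires the sagemaker package and AWS credentials)."""
    from sagemaker.pytorch import PyTorch

    return PyTorch(**kwargs)


kwargs = spot_estimator_kwargs(
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",  # placeholder ARN
    instance_type="ml.g7e.2xlarge",
    framework_version="2.5",  # placeholder; use a version your region supports
    py_version="py311",       # placeholder
)
print(kwargs["disable_profiler"])
```

Keeping the argument construction separate from `build_estimator` makes the configuration easy to inspect without touching AWS.
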
+## 8. The Parallel Evolution Approach Works
+
+**Validated:** The pipeline successfully generates candidates, submits parallel Spot jobs, collects results, and selects the best — all autonomously.
+
+**Cost efficiency:** 4 parallel experiments for $0.066 total, results in ~10 min wall clock (excluding Spot wait time in us-west-2).
+
+## 9. PyArrow Version Matters
+
+**Discovery:** The SageMaker DLC has pyarrow 23.x, but the local environment may have an older version causing `Repetition level histogram size mismatch` when reading parquet files.
+
+**Fix:** Ensure `pyarrow>=21.0.0` in requirements-train.txt.
+
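A local pre-flight check can catch the mismatch before any parquet file is read. This is a sketch using only the standard library; the 21.0.0 floor comes from the fix above, and both function names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed: str, minimum: str = "21.0.0") -> bool:
    """Compare the leading numeric components of two version strings."""
    def parse(text: str) -> tuple[int, ...]:
        parts = [int(p) for p in text.split(".")[:3] if p.isdigit()]
        parts += [0] * (3 - len(parts))  # pad "21.0" to (21, 0, 0)
        return tuple(parts)

    return parse(installed) >= parse(minimum)


def check_pyarrow(minimum: str = "21.0.0") -> str:
    """Raise if pyarrow is missing or older than the minimum; return the version."""
    try:
        installed = version("pyarrow")
    except PackageNotFoundError as exc:
        raise RuntimeError("pyarrow is not installed") from exc
    if not meets_minimum(installed, minimum):
        raise RuntimeError(f"pyarrow {installed} is older than {minimum}")
    return installed


print(meets_minimum("23.0.1"), meets_minimum("17.0.0"))
```

Calling `check_pyarrow()` at the start of the data-preparation script fails fast with a readable message instead of a histogram mismatch error.
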
+## 10. config.yaml Should Never Be in Git
+
+**Discovery:** config.yaml contains AWS role ARN, profile, and region — environment-specific and potentially sensitive.
+
+**Rule:** Add config.yaml to .gitignore and provide config.yaml.example as a template.
+
+## 11. Spot GPUs Are Valid Proxies for Large-Scale Training
+
+**Discovery:** The research referenced below indicates that hyperparameter optimization on cheaper GPUs (L40S) transfers well to expensive GPUs (H100) for production training.
+
+**What transfers:**
+- Optimizer choices (Muon vs AdamW) — relative rankings hold across hardware
+- Architecture decisions (depth, width, attention patterns) — hardware-independent
+- LR schedule shapes (cosine, warmup ratios) — direction transfers, absolute values need adjustment
+- Relative hyperparameter rankings — "A is better than B" conclusions are portable
+
+**What doesn't transfer:**
+- Absolute val_bpb values — depend on GPU throughput
+- Optimal batch sizes — depend on VRAM (48GB vs 80GB)
+- Memory-dependent optimizations — FA3 (Hopper only), FP8, etc.
+- Absolute learning rate values — need per-scale tuning without muP
+
+**Rule:** Use Spot for Phase 1 (hypothesis validation at $0.04/experiment), then apply winning architecture/optimizer choices to Phase 2 (full-scale training on H100). Use muP for direct LR transfer across scales.
+
+**References:**
+- [MLPerf BERT HPC Optimization (arXiv 2402.02447)](https://arxiv.org/pdf/2402.02447)
+- [Improving HPO with Checkpointed Weights (NVIDIA 2024)](https://research.nvidia.com/publication/2024-06_improving-hyperparameter-optimization-checkpointed-model-weights)
+- [muP Scaling (arXiv 2410.22854)](https://arxiv.org/html/2410.22854v3)
+
+## 12. DEVICE_BATCH_SIZE ≠ More Training
+
+**Discovery (Experiment #002):** Doubling DEVICE_BATCH_SIZE from 64 to 128 while keeping TOTAL_BATCH_SIZE=2^19 **worsened** val_bpb (1.065 → 1.081). It only reduced gradient accumulation steps (4 → 2) without increasing total tokens.
+
+**Rule:** To increase throughput, increase TOTAL_BATCH_SIZE. DEVICE_BATCH_SIZE only affects VRAM usage and gradient accumulation granularity.