| 1 | +# GPU Instance Cost Analysis: P5 vs P6 for Autoresearch |
| 2 | + |
| 3 | +> Comparing H100 (P5), B200 (P6-B200), and B300 (P6-B300) for single-GPU ML experiment workloads on SageMaker Spot Training. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## 1. GPU Performance Specifications |
| 8 | + |
| 9 | +| GPU | Architecture | BF16 TFLOPS | VRAM | Memory BW | vs H100 | |
| 10 | +|-----|-------------|-------------|------|-----------|---------| |
| 11 | +| **H100** | Hopper | 990 | 80 GB | 3,350 GB/s | 1.0x | |
| 12 | +| **B200** | Blackwell | 2,250 | 180 GB | 8,000 GB/s | **2.27x** | |
| 13 | +| **B300** | Blackwell Ultra | ~3,375 | 288 GB | 8,000 GB/s | **~3.4x** | |
| 14 | + |
| 15 | +Sources: [NVIDIA Data Center GPU Specs](https://intuitionlabs.ai/articles/nvidia-data-center-gpu-specs), [B200 vs H100](https://www.civo.com/blog/comparing-nvidia-b200-and-h100), [B300 vs B200](https://verda.com/blog/nvidia-b300-vs-b200-complete-gpu-comparison-to-date) |
| 16 | + |
| 17 | +## 2. SageMaker Instance Configuration & Pricing |
| 18 | + |
| 19 | +### Available Instances |
| 20 | + |
| 21 | +| Instance | GPUs | GPU Type | Total VRAM | |
| 22 | +|----------|------|----------|-----------| |
| 23 | +| **ml.p5.4xlarge** | **1** | H100 | 80 GB | |
| 24 | +| ml.p5.48xlarge | 8 | H100 | 640 GB | |
| 25 | +| ml.p6-b200.48xlarge | 8 | B200 | 1,440 GB | |
| 26 | +| ml.p6-b300.48xlarge | 8 | B300 | 2,100 GB | |
| 27 | + |
| 28 | +**Critical note:** P6 instances are only available in 48xlarge (8 GPU) size. There is no single-GPU P6 option. |
| 29 | + |
| 30 | +### Pricing (us-west-2, estimated) |
| 31 | + |
| 32 | +| Instance | On-Demand/hr | Spot/hr (~65% off) | Spot per GPU/hr | |
| 33 | +|----------|-------------|-----------------|-----------------| |
| 34 | +| **ml.p5.4xlarge** | $6.88 | ~$2.40 | **$2.40** | |
| 35 | +| ml.p6-b200.48xlarge | $113.93 | ~$39.88 | $4.98 | |
| 36 | +| ml.p6-b300.48xlarge | $142.42 | ~$49.85 | $6.23 | |
| 37 | + |
| 38 | +Sources: [p5.4xlarge pricing](https://instances.vantage.sh/aws/ec2/p5.4xlarge), [p6-b200 pricing](https://instances.vantage.sh/aws/ec2/p6-b200.48xlarge), [EC2 Capacity Blocks Pricing](https://aws.amazon.com/ec2/capacityblocks/pricing/) |
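
The spot columns in the pricing table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes the flat 65% discount stated in the table header, whereas real spot prices fluctuate, and the function and variable names are hypothetical.

```python
# On-demand prices and GPU counts copied from the tables above.
# SPOT_DISCOUNT is the document's own estimate, not a quoted AWS rate.
ON_DEMAND_PER_HOUR = {
    "ml.p5.4xlarge": (6.88, 1),
    "ml.p6-b200.48xlarge": (113.93, 8),
    "ml.p6-b300.48xlarge": (142.42, 8),
}
SPOT_DISCOUNT = 0.65


def spot_prices(on_demand, gpus, discount=SPOT_DISCOUNT):
    """Return (spot price per instance-hour, spot price per GPU-hour)."""
    spot = on_demand * (1.0 - discount)
    return spot, spot / gpus


for name, (price, gpus) in ON_DEMAND_PER_HOUR.items():
    spot, per_gpu = spot_prices(price, gpus)
    print(f"{name}: spot ${spot:.2f}/hr, ${per_gpu:.2f} per GPU-hour")
```

The P5 figure comes out a cent above the table's `~$2.40`, which is a rounded value.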
| 39 | + |
| 40 | +## 3. The Core Problem: P6 Has No Single-GPU Option |
| 41 | + |
| 42 | +Autoresearch is a **single-GPU workload**. Each experiment runs `train.py` on one GPU for 5 minutes. When using a P6 instance (8 GPUs), **7 out of 8 GPUs sit completely idle**, making P6 dramatically cost-inefficient unless the pipeline is redesigned. |
| 43 | + |
| 44 | +``` |
| 45 | +P5.4xlarge (1 GPU): |
| 46 | + [████ USED ████] → 100% utilization |
| 47 | + |
| 48 | +P6-b200.48xlarge (8 GPUs, naive): |
| 49 | + [████ USED ████] → GPU 1: active |
| 50 | + [░░░░ IDLE ░░░░] → GPU 2: wasted |
| 51 | + [░░░░ IDLE ░░░░] → GPU 3: wasted |
| 52 | + [░░░░ IDLE ░░░░] → GPU 4: wasted |
| 53 | + [░░░░ IDLE ░░░░] → GPU 5: wasted |
| 54 | + [░░░░ IDLE ░░░░] → GPU 6: wasted |
| 55 | + [░░░░ IDLE ░░░░] → GPU 7: wasted |
| 56 | + [░░░░ IDLE ░░░░] → GPU 8: wasted |
| 57 | + → 12.5% utilization, billed for all 8 GPUs |
| 58 | +``` |
| 59 | + |
| 60 | +## 4. Cost Scenarios for 100 Experiments |
| 61 | + |
| 62 | +### Scenario A: 1 GPU = 1 Experiment (Current Pipeline Design) |
| 63 | + |
| 64 | +Each SageMaker Training Job runs one experiment on one GPU. Billable time in the table is training time plus about 3 minutes of per-job overhead. |
| 65 | + |
| 66 | +| Instance | Training Time | Billable Time | Cost/Experiment | 100 Experiments | |
| 67 | +|----------|-------------|---------------|-----------------|-----------------| |
| 68 | +| **ml.p5.4xlarge** | 5 min | ~8 min | **$0.32** | **$32** | |
| 69 | +| ml.p6-b200.48xlarge | 2.2 min (2.27x faster) | ~5.2 min | $3.45 (8 GPUs billed) | $345 | |
| 70 | +| ml.p6-b300.48xlarge | 1.5 min (3.4x faster) | ~4.5 min | $3.74 (8 GPUs billed) | $374 | |
| 71 | + |
| 72 | +**Result:** With 7 of 8 GPUs idle, P6 is roughly **11-12x more expensive** than P5 ($345-$374 vs $32 per 100 experiments). |
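
The Scenario A figures follow from a simple model: billable minutes are the 5-minute H100 run divided by the speedup, plus a fixed overhead. The sketch below assumes a 3-minute overhead, which is inferred from the gap between the table's training and billable columns rather than measured; the function name is hypothetical.

```python
def cost_per_experiment(spot_per_hour, baseline_minutes, speedup, overhead_minutes=3.0):
    """Return (billable minutes, cost in dollars) for one single-GPU experiment.

    The whole instance is billed, so spot_per_hour is the instance price
    even when only one of its GPUs is doing work.
    """
    billable = baseline_minutes / speedup + overhead_minutes
    return billable, spot_per_hour * billable / 60.0


# Spot prices and speedups taken from the tables above.
for name, spot, speedup in [
    ("ml.p5.4xlarge", 2.40, 1.0),
    ("ml.p6-b200.48xlarge", 39.88, 2.27),
    ("ml.p6-b300.48xlarge", 49.85, 3.4),
]:
    minutes, cost = cost_per_experiment(spot, 5.0, speedup)
    print(f"{name}: {minutes:.1f} min billable, ${cost:.2f} per experiment")
```

The results land within a few cents of the table, which rounds the training times before multiplying.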
| 73 | + |
| 74 | +### Scenario B: 8 GPUs = 8 Experiments Simultaneously (P6 Optimized) |
| 75 | + |
| 76 | +Modified pipeline: launch 8 experiments per P6 instance, one per GPU. |
| 77 | + |
| 78 | +| Instance | Experiments/Job | Cost/Experiment | 100 Experiments | Wall Clock | |
| 79 | +|----------|----------------|-----------------|-----------------|------------| |
| 80 | +| **ml.p5.4xlarge** x10 parallel | 1 | $0.32 | **$32** | ~100 min | |
| 81 | +| ml.p6-b200.48xlarge x2 parallel | 8 | $0.43 | $43 | ~40 min | |
| 82 | +| ml.p6-b300.48xlarge x2 parallel | 8 | $0.47 | $47 | ~30 min | |
| 83 | + |
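| 84 | +**Result:** P6 is 1.3-1.5x more expensive but **2.5-3.3x faster** in wall clock time. These totals assume perfectly packed jobs: 100 experiments need 13 eight-GPU jobs (the last one half empty), so the billed totals would be closer to $45 and $49. |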
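
To see how packing affects the P6 totals, the sketch below counts whole jobs instead of multiplying the per-experiment cost by 100. It is a simplified model: it ignores spot interruptions and queueing, and the function name and parameters are hypothetical.

```python
import math


def batch_plan(num_experiments, gpus_per_job, parallel_jobs, billable_minutes, spot_per_hour):
    """Return (job count, total cost in dollars, wall clock minutes).

    Jobs are billed whole, so a partially filled final job costs as much
    as a full one.
    """
    jobs = math.ceil(num_experiments / gpus_per_job)
    rounds = math.ceil(jobs / parallel_jobs)
    total_cost = jobs * spot_per_hour * billable_minutes / 60.0
    wall_clock = rounds * billable_minutes
    return jobs, total_cost, wall_clock


# Billable minutes and spot prices taken from the tables above.
for name, minutes, spot in [
    ("ml.p6-b200.48xlarge", 5.2, 39.88),
    ("ml.p6-b300.48xlarge", 4.5, 49.85),
]:
    jobs, cost, wall = batch_plan(100, 8, 2, minutes, spot)
    print(f"{name}: {jobs} jobs, ${cost:.2f} total, {wall:.1f} min wall clock")
```

Under these assumptions 100 experiments need 13 jobs, so the last job runs with four idle GPUs.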
| 85 | + |
| 86 | +### Scenario C: Performance-Adjusted Cost (Tokens per Dollar) |
| 87 | + |
| 88 | +Since B200/B300 process more tokens in the same 5-minute budget, we compare cost per billion tokens processed: |
| 89 | + |
| 90 | +| GPU | Tokens in 5 min | Spot Cost (5 min) | Cost per B Tokens | |
| 91 | +|-----|----------------|-------------------|-------------------| |
| 92 | +| H100 (p5.4xl, 1 GPU) | ~500M | $0.20 | **$0.40/B** | |
| 93 | +| B200 (p6-b200, 1 of 8 GPUs) | ~1,135M | $3.32 | $2.93/B (7.3x worse) | |
| 94 | +| B200 (p6-b200, all 8 GPUs) | ~9,080M | $3.32 | **$0.37/B** (best) | |
| 95 | + |
| 96 | +**Result:** P6-B200 is the **most cost-efficient per token** ($0.37/B vs $0.40/B for P5, roughly 8% cheaper), but only when all 8 GPUs are fully utilized. |
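
The per-token comparison reduces to one division: the instance cost for the 5-minute window over the tokens processed by the GPUs that are actually busy. A minimal sketch, using the token counts and spot prices from the table (themselves estimates) and a hypothetical function name:

```python
def cost_per_billion_tokens(spot_per_hour, minutes, tokens_per_gpu, active_gpus):
    """Instance cost for the time window divided by billions of tokens processed."""
    cost = spot_per_hour * minutes / 60.0
    total_tokens = tokens_per_gpu * active_gpus
    return cost / (total_tokens / 1e9)


h100 = cost_per_billion_tokens(2.40, 5, 500e6, 1)
b200_one_gpu = cost_per_billion_tokens(39.88, 5, 1135e6, 1)
b200_all_gpus = cost_per_billion_tokens(39.88, 5, 1135e6, 8)

print(f"H100:          ${h100:.2f}/B")
print(f"B200, 1 of 8:  ${b200_one_gpu:.2f}/B")
print(f"B200, 8 of 8:  ${b200_all_gpus:.2f}/B")
```

Because the instance price is fixed, the per-token cost on P6 scales inversely with the number of active GPUs.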
| 97 | + |
| 98 | +## 5. Summary & Recommendation |
| 99 | + |
| 100 | +### Decision Matrix |
| 101 | + |
| 102 | +| Priority | Best Choice | Reason | |
| 103 | +|----------|------------|--------| |
| 104 | +| **Cost efficiency** | **ml.p5.4xlarge** | Single GPU = zero waste, lowest $/experiment | |
| 105 | +| **Time efficiency** | ml.p6-b200.48xlarge | 8 parallel experiments per instance, 2.5x faster | |
| 106 | +| **Maximum throughput** | ml.p6-b300.48xlarge | 8 parallel + 3.4x per-GPU speedup | |
| 107 | +| **Cost + Performance** | **ml.p5.4xlarge** | Best balance for autoresearch workload | |
| 108 | + |
| 109 | +### Cost Summary Table |
| 110 | + |
| 111 | +| | P5 (H100x1) | P6-B200 (naive) | P6-B200 (8-parallel) | P6-B300 (8-parallel) | |
| 112 | +|---|---|---|---|---| |
| 113 | +| 100 experiments cost | **$32** | $345 | $43 | $47 | |
| 114 | +| Wall clock time | ~100 min | ~50 min | **~40 min** | **~30 min** | |
| 115 | +| Cost per experiment | **$0.32** | $3.45 | $0.43 | $0.47 | |
| 116 | +| GPU utilization | 100% | 12.5% | 100% | 100% | |
| 117 | + |
| 118 | +### Recommendation |
| 119 | + |
| 120 | +**Use ml.p5.4xlarge (H100 single GPU)** as the default for autoresearch: |
| 121 | + |
| 122 | +1. **About 11x cheaper** than naive P6-B200 usage ($32 vs $345 per 100 experiments) |
| 123 | +2. **Identical hardware** to the original autoresearch setup → fair comparison |
| 124 | +3. **Simple pipeline** — one GPU per job, no multi-GPU orchestration needed |
| 125 | +4. **Sufficient VRAM** (80 GB) for the 50M-parameter model (~45 GB peak) |
| 126 | + |
| 127 | +Consider P6 only if: |
| 128 | +- Wall clock time is the top priority (board demo, deadline) |
| 129 | +- Pipeline is modified to run 8 experiments per instance |
| 130 | +- Budget is not a primary concern |
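
The pipeline change listed above (8 experiments per instance) could be sketched as a small launcher that starts one process per GPU and pins each to its device through the `CUDA_VISIBLE_DEVICES` environment variable. This is an assumption about how the modification might look, not the project's actual entry point: `launch_per_gpu` is a hypothetical helper, and the arguments that `train.py` expects are not given in this document.

```python
import os
import subprocess
import sys


def launch_per_gpu(command, num_gpus=8):
    """Start one copy of `command` per GPU and return the exit codes in GPU order.

    Each child sees only its own device because CUDA_VISIBLE_DEVICES is set
    to a single index in that child's environment.
    """
    procs = []
    for gpu in range(num_gpus):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(command, env=env))
    # Wait for every experiment so the job is not torn down early.
    return [proc.wait() for proc in procs]


# Hypothetical usage inside the training container:
# exit_codes = launch_per_gpu([sys.executable, "train.py"])
```

A real pipeline would also need per-experiment configuration and output paths so that the eight runs do not overwrite each other's results.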
| 131 | + |
| 132 | +--- |
| 133 | + |
| 134 | +## Appendix: P5 Spot Availability by Region |
| 135 | + |
| 136 | +| Region | ml.p5.4xlarge Spot | ml.p5.48xlarge Spot | |
| 137 | +|--------|-------------------|-------------------| |
| 138 | +| us-west-2 (Oregon) | Quota exists (request needed) | Quota exists | |
| 139 | +| ap-northeast-1 (Tokyo) | Quota exists (request needed) | Quota exists | |
| 140 | +| eu-west-2 (London) | Available (On-Demand & Spot) | Available | |
| 141 | +| ap-south-1 (Mumbai) | Available (On-Demand & Spot) | Available | |
| 142 | +| ap-southeast-3 (Jakarta) | Available (On-Demand & Spot) | Available | |
| 143 | +| sa-east-1 (Sao Paulo) | Available (On-Demand & Spot) | Available | |
| 144 | + |
| 145 | +Source: [AWS P5 Instance Announcement (Aug 2025)](https://aws.amazon.com/about-aws/whats-new/2025/08/p5-instance-nvidia-h100-gpu-sagemaker-training-processing-jobs/) |
| 146 | + |
| 147 | +--- |
| 148 | + |
| 149 | +*Analysis date: 2026-03-28* |
| 150 | +*Prices are estimates based on publicly available data and may vary.* |