A practical, first-order model (math + Python) to tell if you’re compute-, memory-, or network-bound—and how to pick the cheapest TB/s that hits your tokens/sec target.
Notebook: `Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb` (in this repo).
Frontier-scale transformer training often hits the memory wall: step time is limited by how fast bytes move through HBM/GDDR, not by peak TFLOPs. This project provides a compact model—both in math and code—to:
- Diagnose whether a run is compute, memory, or network bound
- Estimate tokens/sec per GPU, GPUs needed for a target throughput, and cluster TB/s
- Compare hardware using $/TB/s/hour (cost per memory bandwidth), which often tracks throughput/$ better than TFLOPs/$ for large LLM training
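As a quick illustration of the metric (hypothetical price and bandwidth, not a quote for any specific GPU):

```python
# Cost per memory bandwidth for a hypothetical GPU rented at $3.50/hr with
# 3.35 TB/s of HBM. Numbers are illustrative assumptions, not current pricing.
price_per_gpu_hr = 3.50   # $/GPU-hour
hbm_tbps = 3.35           # TB/s of HBM bandwidth
print(f"${price_per_gpu_hr / hbm_tbps:.2f} per TB/s per hour")  # ~$1.04
```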
- 📓 Notebook with the derivation + reference implementation
- 🧮 Equations for FLOPs/token, bytes/token (optimizer + activations), arithmetic intensity, and network-bound checks (first-order forms sketched after this list)
- 🧰 Tunable knobs for FlashAttention, activation checkpointing, optimizer precision, global tokens/step, etc.
- 🧪 Example catalog entries for common GPUs (editable to your pricing/specs)
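For orientation, the first-order quantities take roughly the following shapes. The exact forms and coefficients below are assumptions consistent with the knobs above; the notebook carries the actual derivation.

$$
\begin{aligned}
\text{FLOPs/token} &\approx \kappa \, N \, \gamma, \qquad \kappa \approx 6 \text{ for training} \\
\text{HBM bytes/token} &\approx \frac{\alpha_{\text{opt}} N}{B_g} \;+\; c_{\text{act}} \, L \, d_{\text{model}} \, b \\
I &= \frac{\text{FLOPs/token}}{\text{HBM bytes/token}}, \qquad
\text{machine balance} = \frac{\text{peak FLOP/s}}{\text{peak HBM bytes/s}}
\end{aligned}
$$

Here $b$ is bytes per activation element. Following the roofline model, a configuration is memory-bound when its arithmetic intensity $I$ falls below the machine balance.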
```bash
# 1) Clone
git clone https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth
cd Sizing-AI-Training-by-Cost-per-Memory-Bandwidth

# 2) (Recommended) Create an environment
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate

# 3) Install minimal deps for running the notebook
python -m pip install --upgrade pip jupyterlab

# 4) Launch and open the notebook
jupyter lab
```

The notebook uses only the standard library (`dataclasses`, `math`). If you add plots, install `matplotlib` too.
1. Fill in your run (example values sketched after this list)
   - Model size $N$, layers $L$, hidden size $d_{\text{model}}$
   - Global tokens per step $B_g$ (global batch × sequence length)
   - Optimizer traffic $\alpha_{\text{opt}}$ (e.g., Adam bf16 ≈ 16–20 B/param/step)
   - Activation traffic coefficient $c_{\text{act}}$ (lower with FlashAttention/fused kernels)
   - Recompute multiplier $\gamma$ (1.1–1.4 with activation checkpointing)
2. Set hardware entries: usable TFLOPs (bf16/fp16), HBM TB/s, NIC Gb/s, and your $/GPU-hr.
3. Ask the two key questions:
   - What's the bottleneck? (`compute`, `memory`, or `network`)
   - Among configs that aren't network-bound, which gives the lowest $/TB/s·hr while meeting your tokens/sec target?
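For step 1, a filled-in run might look like the following; every value here is an illustrative assumption, not a recommendation:

```python
# Illustrative knob values for a hypothetical 70B-parameter run.
N         = 70e9        # model parameters
L_layers  = 80          # transformer layers
d_model   = 8192        # hidden size
B_g       = 4_000_000   # global tokens per step (global batch x sequence length)
alpha_opt = 16.0        # optimizer bytes/param/step (Adam with bf16 states)
c_act     = 6.0         # activation traffic coefficient (lower with FlashAttention)
gamma     = 1.2         # recompute multiplier (activation checkpointing on)
# These map onto Model(...) and TrainingCfg(...) in the skeleton below.
```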
```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Hardware:
    name: str
    peak_flops_tflops: float
    hbm_tbps: float
    nic_gbps: float
    price_per_gpu_hr: float
    utilization: float = 0.75


@dataclass
class Model:
    n_params: float
    layers: int
    d_model: int
    bytes_per_elem: int = 2


@dataclass
class TrainingCfg:
    k_flops_per_token: float = 6.0
    recompute_mult: float = 1.0
    alpha_opt_bytes_per_param: float = 16.0
    c_act: float = 6.0
    global_tokens_per_step: int = 512_000
    bytes_per_grad_elem: int = 2


# ...functions for per_token_flops, per_token_hbm_bytes, per_token_net_bytes...

def tokens_per_sec_per_gpu(hw, model, train, dp_world_size=1):
    # returns r_gpu, r_comp, r_mem, r_net, bound, intensity, machine_balance
    ...

def plan_cluster(hw, model, train, tokens_per_sec_target, dp_world_size=1):
    # returns per-GPU rate, GPUs needed, $/hr, cluster HBM TB/s, $/TB/s·hr
    ...
```
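The function bodies live in the notebook. As a rough illustration of the diagnosis logic, here is a minimal self-contained sketch; the function `diagnose_bound`, its parameter defaults, and the example numbers are assumptions for demonstration, not the notebook's implementation.

```python
# Minimal sketch of roofline-style bound diagnosis using plain numbers.
# Formulas are illustrative first-order estimates (assumptions, not the
# notebook's exact code).

def diagnose_bound(
    n_params: float,                   # model parameters
    peak_tflops: float,                # usable bf16/fp16 TFLOPs per GPU
    hbm_tbps: float,                   # HBM bandwidth per GPU (TB/s)
    nic_gbps: float,                   # NIC bandwidth per GPU (Gb/s)
    global_tokens_per_step: float,     # B_g
    utilization: float = 0.75,
    k_flops_per_token: float = 6.0,    # ~6 FLOPs/param/token for training
    recompute_mult: float = 1.0,       # >1 with activation checkpointing
    alpha_opt: float = 16.0,           # optimizer bytes/param/step
    act_bytes_per_token: float = 0.0,  # activation HBM traffic per token
    grad_bytes_per_param: float = 2.0, # bf16 gradients
    dp_world_size: int = 1,
):
    """Return (tokens/sec per GPU, bounding resource)."""
    # Compute roof: usable FLOP/s divided by FLOPs needed per token.
    flops_per_token = k_flops_per_token * n_params * recompute_mult
    r_comp = peak_tflops * 1e12 * utilization / flops_per_token

    # Memory roof: HBM bytes/s divided by HBM bytes per token
    # (optimizer traffic is amortized over the global tokens per step).
    hbm_bytes_per_token = alpha_opt * n_params / global_tokens_per_step + act_bytes_per_token
    r_mem = hbm_tbps * 1e12 / hbm_bytes_per_token

    # Network roof: gradient all-reduce traffic per step over the NIC,
    # amortized over this GPU's share of the step's tokens.
    if dp_world_size > 1:
        net_bytes_per_step = 2 * grad_bytes_per_param * n_params  # ring all-reduce ~2x
        tokens_per_gpu_per_step = global_tokens_per_step / dp_world_size
        r_net = (nic_gbps * 1e9 / 8) / (net_bytes_per_step / tokens_per_gpu_per_step)
    else:
        r_net = float("inf")

    return min((r_comp, "compute"), (r_mem, "memory"), (r_net, "network"))


# Example: hypothetical 70B model on a 990-TFLOP, 3.35-TB/s, 400-Gb/s GPU, 64-way DP.
rate, bound = diagnose_bound(
    n_params=70e9, peak_tflops=990, hbm_tbps=3.35, nic_gbps=400,
    global_tokens_per_step=4_000_000, dp_world_size=64,
)
print(f"~{rate:,.0f} tokens/s per GPU, {bound}-bound")
```

The notebook's `tokens_per_sec_per_gpu` additionally reports arithmetic intensity and machine balance; the sketch above keeps only the three roofline-style rates.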
bound == "memory"→ You’re memory-bandwidth bound.- Reduce bytes/token: FlashAttention, fused kernels, 8-bit optimizers, bigger
$B_g$ (if stable). - Prefer hardware with better $/TB/s·hr (e.g., higher HBM BW per $).
- Reduce bytes/token: FlashAttention, fused kernels, 8-bit optimizers, bigger
-
bound == "network"→ All-reduce is the choke point.- Increase
$B_g$ , reduce pure DP (add TP/PP/ZeRO), overlap comms, or raise effective NIC BW (EFA/IB).
- Increase
-
bound == "compute"→ Great! Improve utilization and ensure you’re not secretly I/O-constrained.
- Compare H100 vs H200 vs L4 for a 70B model at a target of 200k tokens/sec.
- Flip to inference by setting $\kappa \approx 2$, $\alpha_{\text{opt}} = 0$, and modeling KV-cache bytes/token instead of activations (sketched just after this list).
- Test the effect of global tokens/step on the network bound (watch `r_net`).
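A sketch of that inference flip, reusing `Model` and `TrainingCfg` from the skeleton above; the 70B shape, the forward-pass coefficient $\kappa = 2$, and the KV-cache formula (two cached tensors per layer per token, ignoring GQA/MQA) are stated assumptions:

```python
# Hedged sketch of the inference flip. Shapes and values are illustrative.
dense_70b = Model(n_params=70e9, layers=80, d_model=8192)   # rough 70B shape

infer_cfg = TrainingCfg(
    k_flops_per_token=2.0,          # kappa ~ 2 FLOPs/param/token for a forward pass
    alpha_opt_bytes_per_param=0.0,  # no optimizer-state traffic at inference
    recompute_mult=1.0,
)

# First-order KV-cache growth: K and V tensors per layer per token
# (ignores GQA/MQA, which would shrink this).
kv_bytes_per_token = 2 * dense_70b.layers * dense_70b.d_model * dense_70b.bytes_per_elem
print(f"KV cache grows ~{kv_bytes_per_token / 1e6:.2f} MB per token of context")  # ~2.62 MB
```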
- Helper CLI: `python plan.py --model 70b --target-tps 2e5 --hw h100,h200`
- Plotting helpers (roofline view; $/TB/s vs design points)
- Inference variant (KV cache), MoE variant (active params), long-context attention presets
- Optional YAML config for reproducible comparisons
PRs and issues welcome! Ideas:
- Add measured bandwidth/utilization from your cluster
- Additional hardware profiles and real $/TB/s·hr snapshots
- Verified presets for FlashAttention, 8-bit optimizers, ZeRO, etc.
- Sizing_AI_Training_by_Cost_per_Memory_Bandwidth.ipynb — Main notebook with model and code.
- The KV Cache: What It Is, Why It Matters, and How to Size It for Modern LLMs — Deep dive notebook on KV cache sizing and implications for LLM inference.
- Roofline model (compute vs memory bound) — Williams et al., CACM (2009)
- FlashAttention (I/O-aware attention) — Dao et al., arXiv:2205.14135
- Megatron-LM scaling & comms patterns — Shoeybi et al., arXiv:1909.08053
- ZeRO optimizer sharding — Rajbhandari et al., SC’20 / arXiv:1910.02054
- 8-bit optimizers — Dettmers et al., arXiv:2110.02861
- NCCL collectives, EFA/libfabric plugin — NVIDIA & AWS docs
(See the blog post for a longer, linked bibliography.)
Specify a license for reuse (e.g., MIT or Apache-2.0). If you add a LICENSE file, link it here.
If this helped your team ship or save money, feel free to cite the repo/blog post or drop a star ⭐.
```bibtex
@misc{cost_per_memory_bandwidth,
  title  = {Sizing AI Training by Cost per Memory Bandwidth},
  author = {Hodge, John},
  year   = {2025},
  url    = {https://github.com/jman4162/Sizing-AI-Training-by-Cost-per-Memory-Bandwidth}
}
```