| 1 | +--- |
| 2 | +date: 2026-01-07 |
| 3 | +title: Why Your "Senior ML Engineer" Can't Deploy a 70B Model |
| 4 | +authors: |
| 5 | + - bet0x |
| 6 | +description: > |
| 7 | + Small models (≤30B) and large models (100B+) require fundamentally different infrastructure skills. Small models are an inference optimization problem—make one GPU go fast. Large models are a distributed systems problem—coordinate a cluster, manage memory as the primary constraint, and plan for multi-minute failure recovery. |
| 8 | +categories: |
| 9 | + - AI |
| 10 | + - GPU |
| 11 | + - Infrastructure |
| 12 | + - Machine Learning |
| 13 | + - Distributed Systems |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +# Why Your "Senior ML Engineer" Can't Deploy a 70B Model |
| 18 | + |
| 19 | +> **TL;DR:** Small models (≤30B) and large models (100B+) require fundamentally different infrastructure skills. Small models are an inference optimization problem—make one GPU go fast. Large models are a distributed systems problem—coordinate a cluster, manage memory as the primary constraint, and plan for multi-minute failure recovery. The threshold is around 70B parameters. Most ML engineers are trained for the first problem, not the second. |
| 20 | + |
| 21 | +Here's something companies learn after burning through six figures in cloud credits: the skills for small models and large models are completely different. And most of your existing infra people can't do both. |
| 22 | + |
| 23 | +Once you cross ~70B parameters, your job description flips. You're not doing inference optimization anymore. You're doing distributed resource management. Also known as: the nightmare. |
| 24 | + |
| 25 | +<!-- more --> |
| 26 | + |
| 27 | +I've watched teams of excellent ML engineers—people who can write CUDA kernels in their sleep—completely fall apart when they try to scale past the threshold. Not because they're bad. Because the game changed and nobody told them. |
| 28 | + |
| 29 | +## At a Glance |
| 30 | + |
| 31 | +| Dimension | Small Models (≤30B) | Large Models (100B+) | |
| 32 | +|-----------|---------------------|----------------------| |
| 33 | +| **Hardware** | 1-2 GPUs | 8+ GPUs, sharded | |
| 34 | +| **Weights (FP16)** | ~14-60GB | 200-800GB+ | |
| 35 | +| **KV Cache** | Negligible (~2-4GB) | Dominates (100s of GB per request at long context) | |
| 36 | +| **GPU Utilization** | 85-95% achievable | 60-70% is "good" | |
| 37 | +| **Primary Constraint** | Compute (FLOPs) | Memory bandwidth + interconnect | |
| 38 | +| **Batching Strategy** | Aggressive = better throughput | Under-batch for stable p99 | |
| 39 | +| **Failure Recovery** | Seconds (restart pod) | Minutes (rehydrate weights + KV across cluster) | |
| 40 | +| **Parallelism** | Optional (data parallel) | Mandatory (tensor + pipeline, plus expert for MoE) | |
| 41 | +| **Hardware Abstraction** | Portable containers | Co-design with specific topology | |
| 42 | +| **Core Skills** | CUDA, kernel optimization | Distributed systems, scheduling | |
| 43 | +| **Cost Model** | ~Linear with tokens | Fixed costs dominate, memory-time matters | |
| 44 | + |
| 45 | +## The Memory Regime Shift (This Is The Big One) |
| 46 | + |
| 47 | +Let's talk numbers. |
| 48 | + |
| 49 | +**Small models (≤30B):** Your weights top out around 60GB in FP16. That fits comfortably in 1-2 GPUs. KV cache? Negligible. Replication is cheap. Life is good. |
| 50 | + |
| 51 | +**Large models (100B-400B+):** Weights need to be sharded across 8+ H100s. And here's where it gets fun—KV cache *dominates* at long contexts. |
| 52 | + |
| 53 | +Do the math on a 405B model at 128k context: you're looking at **~400-500GB of KV cache per request** in FP16, though the exact figure swings widely with the attention layout. One long session can pin memory on the same order as the weights themselves. |
| 54 | + |
| 55 | +```mermaid |
| 56 | +%%{init: {'theme': 'dark'}}%% |
| 57 | +graph LR |
| 58 | + subgraph "Small Model (24B)" |
| 59 | + SW[Weights ~48GB] |
| 60 | + SK[KV Cache ~2-4GB] |
| 61 | + end |
| 62 | + |
| 63 | + subgraph "Large Model (405B @ 128k ctx)" |
| 64 | + LW[Weights ~810GB] |
| 65 | + LK[KV Cache ~400-500GB] |
| 66 | + end |
| 67 | + |
| 68 | + style SW fill:#4a9eff,stroke:#333 |
| 69 | + style SK fill:#2d5a87,stroke:#333 |
| 70 | + style LW fill:#ff6b6b,stroke:#333 |
| 71 | + style LK fill:#ff4757,stroke:#333 |
| 72 | +``` |
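
A rough way to sanity-check these figures: weights take parameters × bytes per parameter, and KV cache takes 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The sketch below is a back-of-the-envelope calculator, not a measurement. The layer count, head counts, and head dimension are illustrative assumptions for a 405B-class dense transformer, not values taken from this post.

```python
def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache for ONE request: keys + values, every layer, every token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Assumed, illustrative architecture for a 405B-class dense model.
N_LAYERS, HEAD_DIM, CONTEXT = 126, 128, 128 * 1024

weights = weight_bytes(405_000_000_000)
kv_full_mha = kv_cache_bytes(N_LAYERS, 128, HEAD_DIM, CONTEXT)  # 128 KV heads
kv_gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, CONTEXT)         # 8 KV heads

for label, size in [("weights", weights),
                    ("KV, 128 KV heads", kv_full_mha),
                    ("KV, 8 KV heads", kv_gqa)]:
    print(f"{label:>18}: {size / 1e9:8.1f} GB")
```

With these assumed numbers, the full multi-head layout lands above the weight footprint, while the grouped-query layout comes in under a tenth of it. Any single per-request figure is a ballpark until you plug in the real model config.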
| 73 | + |
| 74 | +Suddenly you're dealing with dynamic allocation, fragmentation, eviction policies. You're basically running a distributed database that happens to do matrix multiplication. |
| 75 | + |
| 76 | +## KV Cache Turns Inference Stateful |
| 77 | + |
| 78 | +For small models, KV is ephemeral—a few GB total. Stateless batching works great. |
| 79 | + |
| 80 | +For large models at 128k tokens? Hundreds of GB pinned *per request*. And this creates problems that'll make you question your career choices: |
| 81 | + |
| 82 | +- **Head-of-line blocking:** one chunky request starves the cluster |
| 83 | +- **Fairness explodes:** power users with long contexts crush everyone else |
| 84 | +- **"One bad request" syndrome:** a single user can tank your entire system |
| 85 | + |
| 86 | +```mermaid |
| 87 | +%%{init: {'theme': 'dark'}}%% |
| 88 | +flowchart TD |
| 89 | + subgraph "Small Model World" |
| 90 | + R1[Request] --> B1[Batch Together] |
| 91 | + R2[Request] --> B1 |
| 92 | + R3[Request] --> B1 |
| 93 | + B1 --> GPU1[GPU: Process All] |
| 94 | + GPU1 --> Done1[Done ✓] |
| 95 | + end |
| 96 | + |
| 97 | + subgraph "Large Model Reality" |
| 98 | + LR1[Short Request 4k] --> Q[Queue] |
| 99 | + LR2[Short Request 4k] --> Q |
| 100 | + LR3[Long Request 128k] --> Q |
| 101 | + Q --> |"128k consumes memory"| GPU2[GPU: Blocked] |
| 102 | + GPU2 --> |"Others wait..."| Sad[Latency Spike] |
| 103 | + end |
| 104 | +``` |
| 105 | + |
| 106 | +The modern fix for the fragmentation part is **PagedAttention** (what vLLM uses), which allocates KV cache in fixed-size blocks instead of one contiguous region per request. It reduces fragmentation by 20-50%. But here's the kicker: memory bandwidth, not FLOPs, becomes the bottleneck. Your expensive compute sits idle waiting on HBM. You paid for 1000 TFLOPS and you're getting maybe 300 because data can't move fast enough. |
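
To make the paging idea concrete, here is a toy block allocator. It is a minimal sketch of the general technique, not vLLM's implementation. The class name `PagedKVAllocator`, its methods, and the 16-token block size are all hypothetical choices for illustration. Each sequence gets a block table that grows one fixed-size block at a time, so the only wasted slots are the unused tail of each sequence's last block.

```python
class PagedKVAllocator:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_tokens(self, seq_id: str, n_tokens: int) -> bool:
        """Grow a sequence; return False (and change nothing) if out of blocks."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        blocks_needed = -(-new_len // self.block_size) - len(table)  # ceil div
        if blocks_needed > len(self.free_blocks):
            return False
        table.extend(self.free_blocks.pop() for _ in range(blocks_needed))
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len
        return True

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated-but-unused token slots (tail of each last block)."""
        return sum(len(table) * self.block_size - self.seq_lens[seq_id]
                   for seq_id, table in self.block_tables.items())


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
ok_a = alloc.append_tokens("a", 20)        # takes 2 blocks
rejected_b = alloc.append_tokens("b", 100)  # would need 7, only 6 free
ok_b = alloc.append_tokens("b", 90)        # takes the remaining 6
print(ok_a, rejected_b, ok_b, alloc.wasted_slots())
```

Note what the sketch does not solve: a request that needs more blocks than are free still has to wait or be evicted. Paging fixes fragmentation, not capacity.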
| 107 | + |
| 108 | +## Parallelism: Where Utilization Goes to Die |
| 109 | + |
| 110 | +**Small models:** 85-95% GPU utilization is within reach. Crank up batch size, done. |
| 111 | + |
| 112 | +**Large models:** You need tensor parallelism + pipeline parallelism + (if MoE) expert parallelism. It's mandatory, not optional. And stragglers in any pipeline stage kill your tail latency. |
| 113 | + |
| 114 | +The reality, going by reports from vLLM and TensorRT-LLM deployments: **60-70% utilization is considered "good"** at scale. You're trading efficiency just to make the thing physically fit. |
| 115 | + |
| 116 | +And before you say "just add more NVLink"—on H100 SXM, interconnect bandwidth caps at 900 GB/s per GPU (4th gen NVLink). Blackwell doubles that to 1.8 TB/s. But the fundamental constraint remains: you're moving tensors between 8 GPUs every forward pass, and interconnect saturation is a real ceiling regardless of generation. |
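
A back-of-the-envelope estimate shows why the interconnect matters. The sketch below rests on several assumptions that are not from this post: Megatron-style tensor parallelism with two all-reduces per layer in the forward pass, a ring all-reduce in which each GPU sends `2 * (N - 1) / N` times the message size, and a model with 126 layers and a hidden size of 16384. It also treats the quoted 900 GB/s peak as fully usable, which is optimistic.

```python
def ring_allreduce_bytes_per_gpu(message_bytes: int, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce: 2 * (N - 1) / N * message."""
    return 2 * (n_gpus - 1) / n_gpus * message_bytes


def tp_comm_seconds(n_layers: int, hidden_size: int, batched_tokens: int,
                    n_gpus: int, link_bytes_per_s: float,
                    bytes_per_elem: int = 2,
                    allreduces_per_layer: int = 2) -> float:
    """Lower bound on time spent purely moving activations in one forward pass."""
    message = batched_tokens * hidden_size * bytes_per_elem
    per_gpu = ring_allreduce_bytes_per_gpu(message, n_gpus)
    return n_layers * allreduces_per_layer * per_gpu / link_bytes_per_s


# Assumed shapes: 126 layers, hidden size 16384, a 16384-token prefill batch.
comm_s = tp_comm_seconds(n_layers=126, hidden_size=16384,
                         batched_tokens=16384, n_gpus=8,
                         link_bytes_per_s=900e9)
print(f"communication floor per forward pass: {comm_s * 1000:.0f} ms")
```

Real collectives overlap with compute and may use other algorithms than a ring, so read the result as a rough floor on traffic rather than a prediction.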
| 117 | + |
| 118 | +```mermaid |
| 119 | +%%{init: {'theme': 'dark'}}%% |
| 120 | +pie title GPU Utilization Reality |
| 121 | + "Actual Compute" : 65 |
| 122 | + "Waiting on Memory" : 20 |
| 123 | + "Sync Overhead" : 10 |
| 124 | + "Idle (Stragglers)" : 5 |
| 125 | +``` |
| 126 | + |
| 127 | +## Failure Modes Flip |
| 128 | + |
| 129 | +**Small models:** Something breaks? Restart the pod. About 15 seconds later you're back. |
| 130 | + |
| 131 | +**Large models:** Rehydration takes *minutes*. You need to reload weights across the cluster, rebuild KV state. A partial node failure strands gigabytes of cached state somewhere in limbo. |
| 132 | + |
| 133 | +You need actual fault isolation strategies. Graceful degradation paths. This isn't "restart and pray" territory anymore. |
| 134 | + |
| 135 | +```mermaid |
| 136 | +%%{init: {'theme': 'dark'}}%% |
| 137 | +sequenceDiagram |
| 138 | + participant S as Small Model |
| 139 | + participant L as Large Model |
| 140 | + |
| 141 | + Note over S: Node fails |
| 142 | + S->>S: Restart (5 sec) |
| 143 | + S->>S: Reload weights (10 sec) |
| 144 | + S->>S: Back online ✓ |
| 145 | + |
| 146 | + Note over L: Node fails |
| 147 | + L->>L: Detect failure (30 sec) |
| 148 | + L->>L: Redistribute shards (2 min) |
| 149 | + L->>L: Rebuild KV state (1 min) |
| 150 | + L->>L: Rebalance cluster (1 min) |
| 151 | + L->>L: Maybe back online? |
| 152 | +``` |
| 153 | + |
| 154 | +## The Batching Paradox |
| 155 | + |
| 156 | +Every ML engineer's instinct: batch aggressively → more tokens/sec → better utilization → profit. |
| 157 | + |
| 158 | +Large models flip this on its head. Aggressive batching spikes memory consumption. Tail latency explodes. You start violating SLAs because one batch got too ambitious. |
| 159 | + |
| 160 | +The counterintuitive move: **intentionally under-batch** for predictable p99 latency. You're leaving throughput on the table to not blow up randomly. |
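
One way to picture intentional under-batching is as admission control on worst-case KV memory. The function below is a hypothetical sketch, not any serving framework's scheduler. `admit_requests`, the `headroom` parameter, and the 500 KB-per-token figure are all assumptions for illustration. It reserves memory for each request's maximum length and refuses to fill past a fraction of the budget.

```python
def admit_requests(requests, kv_budget_bytes, bytes_per_token, headroom=0.8):
    """Greedy admission: reserve worst-case KV per request, stop short of full.

    requests: iterable of (request_id, max_total_tokens) in arrival order.
    Returns (admitted_ids, deferred_ids).
    """
    limit = kv_budget_bytes * headroom  # deliberately leave memory unused
    admitted, deferred, used = [], [], 0
    for request_id, max_tokens in requests:
        need = max_tokens * bytes_per_token
        if used + need <= limit:
            admitted.append(request_id)
            used += need
        else:
            deferred.append(request_id)
    return admitted, deferred


queue = [("a", 4_000), ("b", 128_000), ("c", 4_000)]
admitted, deferred = admit_requests(queue,
                                    kv_budget_bytes=40_000_000_000,
                                    bytes_per_token=500_000)
print("admitted:", admitted, "deferred:", deferred)
```

Letting short requests pass a deferred long one keeps p99 stable for most users, but it can starve long requests. A real scheduler would need aging or a reserved lane for them.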
| 161 | + |
| 162 | +## Hardware Coupling: No More Abstractions |
| 163 | + |
| 164 | +**Small models:** Write your code, containerize, ship it anywhere. Kubernetes doesn't care. |
| 165 | + |
| 166 | +**Large models:** Co-design with hardware becomes mandatory. HBM capacity, NVLink topology, NUMA boundaries—these aren't nice-to-knows, they're architectural constraints. |
| 167 | + |
| 168 | +Your inference code is married to your iron whether you like it or not. |
| 169 | + |
| 170 | +## The Economics Nobody Wants to Talk About |
| 171 | + |
| 172 | +**Small models:** Cost scales roughly with tokens. Predictable. |
| 173 | + |
| 174 | +**Large models:** Fixed costs dominate. That cluster of H100s burns money whether you're at 100% or 10% utilization. Memory-time becomes the real unit—how much VRAM is pinned, for how long. |
| 175 | + |
| 176 | +Long context destroys your margins. A user running 128k-context requests consumes 10-100x the resources of someone at 4k context. But they're probably on the same pricing tier. This is why every inference provider is moving to reserved tiers and context-length pricing. The economics force it. |
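
Memory-time is easy to put a number on: bytes pinned multiplied by seconds pinned. The sketch below is a worked example under made-up inputs. The 500 KB of KV cache per token and the 10 s and 30 s durations are assumptions chosen for illustration, not measurements.

```python
def memory_time_gb_s(context_tokens: int, kv_bytes_per_token: int,
                     seconds_pinned: int) -> float:
    """GB-seconds of KV memory one request holds for its lifetime."""
    return context_tokens * kv_bytes_per_token * seconds_pinned / 1e9


KV_BYTES_PER_TOKEN = 500_000  # assumed; depends on the model's KV layout

short_req = memory_time_gb_s(4 * 1024, KV_BYTES_PER_TOKEN, seconds_pinned=10)
long_req = memory_time_gb_s(128 * 1024, KV_BYTES_PER_TOKEN, seconds_pinned=30)
ratio = long_req / short_req

print(f"short: {short_req:.2f} GB-s, long: {long_req:.2f} GB-s, "
      f"ratio: {ratio:.0f}x")
```

The context length alone accounts for a 32x gap. The rest comes from the long request holding its memory longer, which is why per-token pricing undercounts it.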
| 177 | + |
| 178 | +```mermaid |
| 179 | +%%{init: {'theme': 'dark'}}%% |
| 180 | +xychart-beta |
| 181 | + title "Cost Structure Comparison" |
| 182 | + x-axis ["Tokens", "Memory", "Fixed/Idle", "Long Context"] |
| 183 | + y-axis "% of Total Cost" 0 --> 100 |
| 184 | + bar [80, 10, 5, 5] |
| 185 | + bar [20, 25, 35, 20] |
| 186 | +``` |
| 187 | + |
| 188 | +*First bar: small models. Second bar: large models. Notice how "tokens processed" stops being the main cost driver.* |
| 189 | + |
| 190 | +## What "Comfortable" Actually Looks Like |
| 191 | + |
| 192 | +Here's my setup for a 24B model on 2x H100s. This is still the good life: |
| 193 | + |
| 194 | +```bash |
| 195 | +# The sweet spot - still manageable territory |
| 196 | +CMD="$CMD --gpu-memory-utilization 0.95" |
| 197 | +CMD="$CMD --max-num-seqs 512" # Can batch aggressively |
| 198 | +CMD="$CMD --max-model-len 32768" # 32k context, no sweat |
| 199 | +CMD="$CMD --kv-cache-dtype fp8" # Compression helps |
| 200 | +CMD="$CMD --tensor-parallel-size 2" # Just 2 GPUs, clean split |
| 201 | +CMD="$CMD --attention-backend flash_attn" |
| 202 | +CMD="$CMD --max-num-batched-tokens 16384" # Healthy batch size |
| 203 | +``` |
| 204 | + |
| 205 | +Two GPUs. Tensor parallelism across them. 95% memory utilization. I can tune for throughput with `--max-num-batched-tokens 16384` or drop to 8192 if I need lower TTFT. Prefix caching works. Async scheduling works. Everything is still *tractable*. |
| 206 | + |
| 207 | +```mermaid |
| 208 | +%%{init: {'theme': 'dark'}}%% |
| 209 | +flowchart LR |
| 210 | + subgraph "My 2x H100 Setup" |
| 211 | + subgraph GPU0["H100 #0"] |
| 212 | + W0[Weights Shard 0] |
| 213 | + KV0[KV Cache] |
| 214 | + end |
| 215 | + subgraph GPU1["H100 #1"] |
| 216 | + W1[Weights Shard 1] |
| 217 | + KV1[KV Cache] |
| 218 | + end |
| 219 | + GPU0 <-->|NVLink 4th gen| GPU1 |
| 220 | + end |
| 221 | + |
| 222 | + Request[Incoming Request] --> GPU0 |
| 223 | + GPU1 --> Response[Output Tokens] |
| 224 | + |
| 225 | + style GPU0 fill:#1a1a2e,stroke:#4a9eff |
| 226 | + style GPU1 fill:#1a1a2e,stroke:#4a9eff |
| 227 | +``` |
| 228 | + |
| 229 | +Scale this to 8 GPUs with pipeline parallelism and expert routing? Now you're juggling synchronization points, straggler mitigation, memory pressure from 4x the KV cache, and failure domains that span the whole cluster. |
| 230 | + |
| 231 | +## The Skill Gap Is Real |
| 232 | + |
| 233 | +Here's the uncomfortable truth that nobody wants to say out loud: |
| 234 | + |
| 235 | +**For small models:** You need ML engineers with CUDA expertise. Kernel optimization, quantization tricks, squeezing everything out of a single GPU. |
| 236 | + |
| 237 | +**For large models:** You need distributed systems engineers + compiler people. Scheduling, resource management, fault tolerance, consistency semantics. The overlap is smaller than you'd think. |
| 238 | + |
| 239 | +I've seen brilliant CUDA hackers completely lost when dealing with distributed state management. And I've seen infrastructure engineers who can build bulletproof distributed systems make rookie numerical computing mistakes. |
| 240 | + |
| 241 | +## When NOT to Scale |
| 242 | + |
| 243 | +Before you chase the biggest model you can afford, consider this: a well-tuned smaller model often beats a poorly-deployed large one. |
| 244 | + |
| 245 | +**Stick with smaller models when:** |
| 246 | + |
| 247 | +- **Your task is narrow and well-defined.** A fine-tuned 7B model for customer support classification will often outperform a generic 70B model that wasn't trained for your domain. On focused tasks, smaller models with task-specific training frequently beat larger general-purpose ones. |
| 248 | + |
| 249 | +- **Latency matters more than capability.** A 7B model can hit sub-100ms TTFT easily. A 70B model on 4 GPUs? You're looking at 200-500ms minimum, and that's before network overhead. For real-time applications, smaller is faster. |
| 250 | + |
| 251 | +- **You don't have the ops maturity.** Running a distributed inference cluster requires monitoring, alerting, graceful degradation, and on-call rotations that can handle multi-minute recovery scenarios. If your team isn't ready for that, you'll have worse uptime than a simpler deployment. |
| 252 | + |
| 253 | +- **Your traffic is unpredictable.** Small models scale horizontally with simple replication. Large models require careful capacity planning because you can't just "spin up another pod" when each replica needs 8 GPUs. |
| 254 | + |
| 255 | +- **Cost predictability matters.** With small models, you can estimate costs from traffic. With large models, you're paying for idle memory, long-context users subsidize short-context users, and your bill becomes harder to attribute. |
| 256 | + |
| 257 | +The industry has a bias toward "bigger is better." But the engineering reality is that complexity has costs, and those costs compound. Sometimes the right answer is a 13B model that actually works reliably. |
| 258 | + |
| 259 | +## The Takeaway |
| 260 | + |
| 261 | +The infrastructure cliff is real. The sooner you accept that crossing ~70B parameters is a qualitative shift—not just "bigger numbers"—the less money you'll burn figuring it out. |
| 262 | + |
| 263 | +Whether you're deploying a 7B model for a specific use case or scaling to hundreds of billions of parameters in a private environment, the path forward requires understanding these tradeoffs deeply. Not every team has the time or expertise to navigate this alone. |
| 264 | + |
| 265 | +If you're figuring out where to start with AI infrastructure—or you've already hit the wall and need help scaling—we work on exactly these problems at Rackspace. Check out our [Private AI solutions](https://www.rackspace.com/cloud/private/ai) to see how we can help. |
| 266 | + |
| 267 | +--- |
| 268 | + |
| 269 | +*This article is partially based on a LinkedIn comment of unknown origin that was emailed to me and inspired this piece.* |
| 270 | + |