1. **Memory bandwidth matters for sparse operations**: The P100 (not available on Modal) outperforms the T4 by ~2x on Kaggle due to its HBM2 memory (~732 GB/s) vs GDDR6 (~320 GB/s).
2. **Significant overhead at low epochs**: With only 200 epochs, much of the runtime is fixed overhead:
   - Git clone and `uv sync` (~2-3 min)
   - HuggingFace data download (~1 min)
   - Loading the Microsimulation and building the sparse matrix (~3-4 min, CPU-bound)
3. **GPU choice depends on epoch count**:
   - **< 500 epochs**: use the T4 (cheapest; overhead dominates)
   - **500-2000 epochs**: the A100-40GB may break even
   - **> 2000 epochs**: the A100 is likely more cost-effective, as training dominates
4. **Available Modal GPUs** (memory bandwidth and price):
   - T4: 320 GB/s, $0.000164/sec
   - L4: 300 GB/s, $0.000222/sec
   - A10: 600 GB/s, $0.000306/sec
   - L40S: 864 GB/s, $0.000542/sec
   - A100-40GB: 1,555 GB/s, $0.000583/sec
   - A100-80GB: 2,039 GB/s, $0.000694/sec
   - H100: 3,350 GB/s, $0.001097/sec
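The epoch-count guidance above can be sketched as a small cost model. The per-second prices come from the GPU table; `OVERHEAD_SEC` and the per-epoch timings are hypothetical assumptions for illustration, not measurements:

```python
# Illustrative cost model for the break-even reasoning above.
# Prices ($/sec) come from the GPU table; the fixed overhead and
# per-epoch timings are assumed values, not benchmarks.

OVERHEAD_SEC = 7 * 60  # ~2-3 min setup + ~1 min download + ~3-4 min matrix build

# gpu -> (price_per_sec, assumed_sec_per_epoch)
GPUS = {
    "T4": (0.000164, 2.0),         # assumed epoch time
    "A100-40GB": (0.000583, 0.3),  # assumed ~6-7x faster per epoch
}

def run_cost(gpu: str, epochs: int) -> float:
    """Total cost of one run: fixed overhead plus training time."""
    price, sec_per_epoch = GPUS[gpu]
    return price * (OVERHEAD_SEC + epochs * sec_per_epoch)

for epochs in (200, 500, 2000, 5000):
    cheapest = min(GPUS, key=lambda g: run_cost(g, epochs))
    print(f"{epochs:>5} epochs -> {cheapest} (${run_cost(cheapest, epochs):.3f})")
```

With these assumed timings the T4 wins at low epoch counts and the A100-40GB overtakes it somewhere past ~1,000 epochs, consistent with the ranges listed above.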
### Output
Weights are saved locally to `calibration_weights.npy` (configurable via the `--output` flag).
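A downstream consumer can load the saved weights back with `np.load`. This round-trip sketch uses a random stand-in array rather than real calibration weights, and writes to a temp directory instead of the default output path:

```python
import os
import tempfile

import numpy as np

# Round-trip sketch: arrays written with np.save load back losslessly.
w = np.random.default_rng(0).random(1000)  # stand-in for the real weight vector
path = os.path.join(tempfile.mkdtemp(), "calibration_weights.npy")
np.save(path, w)         # what the training run does at its --output path
weights = np.load(path)  # how a consumer reads the weights back
assert np.array_equal(w, weights)
print(weights.shape, weights.dtype)  # → (1000,) float64
```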