This guide is for academic institutions and research labs ready to establish their first cohesive machine learning research platform. We've designed this guide for IT directors, principal investigators, and system architects tasked with building a unified environment where every researcher—from the first-year student to the seasoned postdoc—can train models efficiently. Whether you are starting with a single "under-the-desk" GPU server or scaling to a university-wide cluster, this guide provides the technical blueprints and strategic philosophy needed to turn raw hardware into a world-class discovery engine.
Rather than attempting to list every possibility, this guide offers the most common "tried and tested" configurations, with a bias toward modern, simple tooling that is open source and easy to maintain.
Note: This is a "living book" written on GitHub, and we welcome recommendations from industry and academic experts. Found a mistake? Open an Issue or submit a Pull Request.
- Philosophy & Components
- Stanford HAI and Sherlock Ecosystems (SLURM Basics)
- MIT CSAIL (TIG Cluster Share)
- SLURM on NERSC
- MIT ORCD: DLCI and Commercial Resources
- SkyPilot at Shopify
- AI / ML at Uber
- 3 Principles for Building an ML Platform That Will Sustain Hypergrowth
- How did we build an AI/ML Platform at DoorDash from the ground up - Sudhir Tonse
- Griffin, ML Platform at Instacart
- Product Lessons from ML Home: Spotify’s One-Stop Shop for Machine Learning
- Monzo's Machine Learning Stack
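
Several of the resources above (Stanford Sherlock, NERSC, MIT's clusters) center on SLURM-managed batch scheduling. As a taste of that workflow, here is a minimal sketch of a SLURM batch script; the partition name, resource amounts, and `train.py` entry point are placeholders that will differ at every site.

```shell
#!/bin/bash
#SBATCH --job-name=train-demo     # name shown in squeue
#SBATCH --partition=gpu           # site-specific partition name (an assumption)
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --cpus-per-task=8         # CPU cores for data loading
#SBATCH --mem=32G                 # host memory
#SBATCH --time=04:00:00           # wall-clock limit
#SBATCH --output=%x-%j.out        # log file: <job-name>-<job-id>.out

# The body runs on the allocated node once the scheduler grants resources.
# "train.py" stands in for your own training entry point.
echo "Job ${SLURM_JOB_ID:-local} running on $(hostname)"
python train.py
```

Saved as `train.sh`, this would typically be submitted with `sbatch train.sh` and monitored with `squeue --me`; the cited SLURM guides cover the full option set.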
Have you built an ML cluster at your university or company? If the docs are public, please share them here.