This guide is for academic institutions and research labs ready to establish their first cohesive machine learning research platform. We've designed this guide for IT directors, principal investigators, and system architects tasked with building a unified environment where every researcher—from the first-year student to the seasoned postdoc—can train models efficiently. Whether you are starting with a single "under-the-desk" GPU server or scaling to a university-wide cluster, this guide provides the technical blueprints and strategic philosophy needed to turn raw hardware into a world-class discovery engine.
Rather than attempting to list every possibility, this guide offers the most common "tried and tested" configurations, with a bias toward modern, simple tooling that is open source and easy to maintain.
Note: This is a "living book" written on GitHub, and we welcome recommendations from industry and academic experts. Found a mistake? Open an Issue or submit a Pull Request.
- Philosophy & Components
- Stanford HAI and Sherlock Ecosystems (SLURM Basics)
- MIT CSAIL (TIG Cluster Share)
- SLURM on NERSC
- MIT ORCD: DLCI and Commercial Resources
- SkyPilot at Shopify
- AI / ML at Uber
- 3 Principles for Building an ML Platform That Will Sustain Hypergrowth
- How did we build an AI/ML Platform at DoorDash from the ground up - Sudhir Tonse
- Griffin, ML Platform at Instacart
- Product Lessons from ML Home: Spotify’s One-Stop Shop for Machine Learning
- Monzo's Machine Learning Stack
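
Several of the resources above (Stanford Sherlock, NERSC, MIT's clusters) center on SLURM-managed batch scheduling. As a taste of that workflow, here is a minimal sketch of a SLURM batch script; the partition name, resource amounts, and `train.py` entry point are placeholders that will differ at every site.

```shell
#!/bin/bash
#SBATCH --job-name=train-demo     # name shown in squeue
#SBATCH --partition=gpu           # site-specific partition name (an assumption)
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --cpus-per-task=8         # CPU cores for data loading
#SBATCH --mem=32G                 # host memory
#SBATCH --time=04:00:00           # wall-clock limit
#SBATCH --output=%x-%j.out        # log file: <job-name>-<job-id>.out

# The body runs on the allocated node once the scheduler grants resources.
# "train.py" stands in for your own training entry point.
echo "Job ${SLURM_JOB_ID:-local} running on $(hostname)"
python train.py
```

Saved as `train.sh`, this would typically be submitted with `sbatch train.sh` and monitored with `squeue --me`; the cited SLURM guides cover the full option set.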
Have you built an ML cluster at your university or company? If the docs are public, please share them here.