The Definitive Guide to Building a Machine Learning Research Platform 🚀

This guide is for academic institutions and research labs ready to establish their first cohesive machine learning research platform. It is written for IT directors, principal investigators, and system architects tasked with building a unified environment where every researcher, from the first-year student to the seasoned postdoc, can train models efficiently. Whether you are starting with a single "under-the-desk" GPU server or scaling to a university-wide cluster, it provides the technical blueprints and strategic philosophy needed to turn raw hardware into a world-class discovery engine.

This guide does not attempt to list every possibility. Instead, it offers the most common "tried and tested" configurations, with a bias toward modern, simple tooling that is open source and easy to maintain.

Note: This is a "living book" written on GitHub. We welcome contributions and recommendations from industry and academic experts. Found a mistake? Open an Issue or submit a Pull Request.

Table of Contents

Background

Configurations

Configuration Documentation
The Single User AI Workstation (1 Node, 1 User)
Overview and Recommendations
• Step-by-Step OS Installation:
  • Set Up Ubuntu 22.04 Server with CUDA Support for a Single User ML Workstation
  • Set Up Ubuntu with AMD ROCm Support for a Single User ML Workstation
  • Which Mac to Buy for Machine Learning
The "Under-the-Desk" Server (1 Node, Multiuser)
Overview and Recommendations
Step-by-Step Install Instructions
The "Closet Cluster" (2–5 Nodes)
Overview and Recommendations
Option 1: Step-by-Step Install Instructions – Building a Small SkyPilot + k3s Cluster (recommended)
Option 2: Build a k3s Cluster using SkyPilot + Rancher
The "Mac Silicon Cluster"
Overview and Recommendations
The "University Cluster" (10–100 Nodes)
Overview and Recommendations
Option 1: Step-by-Step Install Instructions – Building a Large Kubernetes Cluster with Rancher + SkyPilot
Option 2: Step-by-Step Install Instructions – Building a Large Slurm + Transformer Lab Cluster
Single Cloud or Hybrid Cloud Cluster
Overview and Recommendations

How They Built It: Real-World ML Clusters

Academic Clusters

Startup/Corporate Clusters

Have you built an ML cluster at your university or company? Are the docs public? Please share here.