AMD-AGI/maxtext-slurm

MaxText on Slurm

Toolkit for launching and observing MaxText — a JAX-based LLM training framework — on Slurm-managed AMD GPU clusters. One command runs distributed training with model selection, container setup, and multi-node coordination handled automatically. A carefully designed observability stack (via RAY=1) records GPU state, host health, network diagnostics, and training metrics into a single time-series database (TSDB) — essential for monitoring long-running production training, diagnosing incidents, and day-to-day tuning and validation.

An extensible AI skill framework equips AI assistants with out-of-the-box domain expertise across the full training lifecycle — from profiling and performance analysis to production incident diagnosis — with more skills added over time.

This is a reference implementation of a layered launch architecture where each tier — orchestration, container, training — can be adapted independently without cascading changes, including swapping the GPU vendor. See Architecture for the full training flow and Extensibility for what each swap involves.

For Kubernetes job submission, see Kubernetes job submission.

Quickstart

Clone the repo onto a Slurm-managed GPU cluster and run:

submit.sh 70b -N 1              # Train llama2-70b on 1 node
run_local.sh 70b -- steps=10    # Run locally (no Slurm)
RAY=1 submit.sh 70b -N 1        # With full-stack observability

70b, mixtral, grok, etc. are short names that resolve to config files in configs/. To train a model not already listed, add a config.
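As an illustration of how a short name could map to a config file, here is a minimal sketch. The helper name and the exact filenames are assumptions for demonstration; the real resolution logic lives in submit.sh and the actual config names are defined by the files in configs/.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of short-name -> config resolution; the real
# mapping is defined by submit.sh and the files under configs/.
resolve_model_config() {
  local alias="$1"
  case "$alias" in
    70b)     echo "configs/llama2-70b.yml" ;;   # assumed filename
    mixtral) echo "configs/mixtral-8x7b.yml" ;; # assumed filename
    grok)    echo "configs/grok-1.yml" ;;       # assumed filename
    *)       echo "configs/${alias}.yml" ;;     # fall back to the literal name
  esac
}

resolve_model_config 70b      # -> configs/llama2-70b.yml
resolve_model_config mymodel  # -> configs/mymodel.yml
```

The fallback branch is what makes the scheme extensible: dropping a new config file into configs/ makes its basename usable as a model name without touching the launcher.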

Shared filesystem. Multi-node jobs need all nodes to reach the output directory. Either clone this repo onto a shared filesystem (the default outputs/ dir works as-is), or set JOB_WORKSPACE to a shared path.

# Common options
JOB_WORKSPACE=/shared/maxtext_jobs submit.sh 70b -N 1            # Shared FS for outputs
submit.sh 70b -N 1 -- remat_policy=full per_device_batch_size=2  # MaxText args after --
submit.sh 70b -N 1 -- _env_NCCL_DEBUG=INFO                       # Env var overrides
STAGE_TIMEOUTS="preflight:600,train:86400" submit.sh 70b -N 24   # Per-stage timeouts
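The `_env_` prefix and STAGE_TIMEOUTS are simple textual conventions. The sketch below shows one way the splitting and lookup could work; the helper names are hypothetical and not submit.sh's actual internals.

```shell
#!/usr/bin/env bash
# Illustrative only: split args after `--` into env-var overrides
# (the _env_ prefix convention) and plain MaxText args.
split_args() {
  ENV_OVERRIDES=(); MAXTEXT_ARGS=()
  local arg
  for arg in "$@"; do
    if [[ "$arg" == _env_* ]]; then
      ENV_OVERRIDES+=("${arg#_env_}")   # e.g. NCCL_DEBUG=INFO
    else
      MAXTEXT_ARGS+=("$arg")
    fi
  done
}

# Look up one stage's timeout in a "stage:secs,stage:secs" string.
stage_timeout() {
  local stages="$1" stage="$2" entry
  IFS=',' read -ra entries <<< "$stages"
  for entry in "${entries[@]}"; do
    [[ "${entry%%:*}" == "$stage" ]] && { echo "${entry#*:}"; return; }
  done
  echo "0"  # no explicit timeout configured for this stage
}

split_args remat_policy=full _env_NCCL_DEBUG=INFO per_device_batch_size=2
echo "${ENV_OVERRIDES[0]}"                       # NCCL_DEBUG=INFO
echo "${MAXTEXT_ARGS[@]}"                        # remat_policy=full per_device_batch_size=2
stage_timeout "preflight:600,train:86400" train  # 86400
```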

Observe

RAY=1 adds full-stack observability with no measurable steady-state throughput impact. The stack is fully self-contained in Docker: all components are installed automatically at container startup, so no cluster-wide setup is required. Powered by Ray and Prometheus with an extensible plugin system, the pipeline collects GPU, host, network, and TensorBoard training metrics into a single Prometheus TSDB. The TSDB is persisted to the shared filesystem and queryable both live and after the job ends, making it the ground-truth data layer for automated reasoning and diagnosis.

RAY=1 submit.sh 70b -N 8

Live dashboards

Dashboard      Port                                What it shows
TensorBoard    6006                                Training loss curves, learning rate schedules
Ray Dashboard  8265                                Actor status, live stack traces, flame graphs
Prometheus     9190 (auto-increments if occupied)  GPU, host, network, and training metrics — unified TSDB
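The "auto-increments if occupied" behavior can be sketched as a simple probe loop. This is a hypothetical helper, not the actual launcher code, and it relies on bash's /dev/tcp redirection to test whether anything is listening.

```shell
#!/usr/bin/env bash
# Sketch: probe upward from a base port until one is free.
find_free_port() {
  local port="$1"
  # bash's /dev/tcp connect succeeds only when something is listening,
  # so a failed connect means the port is available.
  while (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; do
    port=$((port + 1))
  done
  echo "$port"
}

find_free_port 9190   # prints 9190, or the next free port above it
```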

Post-run browsing

utils/prometheus.sh list                                          # List jobs with saved metrics
utils/prometheus.sh view outputs/12345-JAX-llama2-70b/prometheus  # Browse metrics after the job ends

Telegram notifications

Push notifications for job state changes and hang detection — works with any Slurm job, no RAY=1 required. See Notifications for setup.

utils/slurm_job_monitor.sh -j <slurm_job_id>
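The core idea behind such a monitor is a transition-detection loop: watch a stream of job states and notify only when the state changes. The sketch below is illustrative; `monitor_job` and `notify` are stand-ins, while the real utils/slurm_job_monitor.sh queries Slurm directly and pushes Telegram messages.

```shell
#!/usr/bin/env bash
notify() { echo "NOTIFY: $1"; }   # stand-in for a Telegram send

# Read job states line by line and report each transition.
monitor_job() {
  local prev="" state
  while read -r state; do
    if [ "$state" != "$prev" ]; then
      notify "job state: ${prev:-NONE} -> ${state}"
      prev="$state"
    fi
  done
}

# Simulated state sequence as a job starts and finishes:
# emits NONE->PENDING, PENDING->RUNNING, RUNNING->COMPLETED.
printf '%s\n' PENDING RUNNING RUNNING COMPLETED | monitor_job
```

Deduplicating on transitions rather than reporting every poll is what keeps notifications low-noise on long-running jobs.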

See Observability for the full story.

Agentic workflows

The skills/ directory contains structured instructions for AI agents (Cursor, Claude Code). Each skill encodes the methodology of senior systems engineers — not just which commands to run, but how to interpret results, distinguish symptoms from root causes, and trace causal chains across the full stack. Skills ship for performance analysis, job failure triage, and TSDB-based incident diagnosis. See skills/README.md for the full list.

Learn more

Launch

Topic                      Description
Job Submission             JOB_WORKSPACE, full submit.sh syntax, local runs, checkpointing, model aliases, environment config, artifacts
Kubernetes job submission  k8s_submit.sh usage, PVC setup, per-rank logs, differences from Slurm, direct-container runs
Model Configs              Available models, adding new configs (patterns, section layout), resolution rules, CLI overrides

Observe

Topic          Description
Observability  Real-time monitoring, post-run diagnostics, unified TSDB, custom metrics plugins, design rationale
Notifications  Telegram setup, programmable messaging (telegram_bot.sh), automated job monitoring
Performance    Profiling (xplane traces, HLO dumps), analysis (TraceLens, IRLens), tuning
Debugging      Core dumps, crash reproduction, unresponsive nodes
Tooling        Performance analysis, log tailing, job monitoring, Prometheus inspection, artifact cleanup

Adapt

Topic          Description
Architecture   Training flow, artifact system, observability pipeline, orchestrator extensibility
Extensibility  Layer map; how to swap schedulers, runtimes, training frameworks, or GPU vendors

Acknowledgements

  • Documentation compiled with the assistance of Cursor and Claude-4.6-opus-high.
  • Observability stack implemented in collaboration with Cursor and Claude-4.6-opus-high.
  • AI skill framework and skills built with Cursor and Claude-4.6-opus-high.
  • Some utility scripts developed using ChatGPT 5/5.1.

License

This project is licensed under the MIT License.
