Toolkit for launching and observing MaxText — a JAX-based LLM training framework — on Slurm-managed AMD GPU clusters. One command runs distributed training with model selection, container setup, and multi-node coordination handled automatically. A carefully designed observability stack (via RAY=1) records GPU state, host health, network diagnostics, and training metrics into a single time-series database (TSDB) — essential for monitoring long-running production training, diagnosing incidents, and day-to-day tuning and validation.
An extensible AI skill framework equips AI assistants with out-of-the-box domain expertise across the full training lifecycle — from profiling and performance analysis to production incident diagnosis — with more skills added over time.
This is a reference implementation of a layered launch architecture where each tier — orchestration, container, training — can be adapted independently without cascading changes, including swapping the GPU vendor. See Architecture for the full training flow and Extensibility for what each swap involves.
For Kubernetes job submission, see Kubernetes job submission.
Clone the repo onto a Slurm-managed GPU cluster and run:
```shell
submit.sh 70b -N 1            # Train llama2-70b on 1 node
run_local.sh 70b -- steps=10  # Run locally (no Slurm)
RAY=1 submit.sh 70b -N 1      # With full-stack observability
```

70b, mixtral, grok, etc. are short names that resolve to config files in configs/. To train a model not already listed, add a config.
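As a rough illustration (with assumed filenames — the actual resolution rules are documented in Model Configs), short-name resolution amounts to matching the alias against config filenames:

```shell
# Hypothetical illustration of alias resolution: match a short name like "70b"
# against config filenames (stand-in demo directory; not the toolkit's real logic).
mkdir -p demo_configs
touch demo_configs/llama2-70b.yml demo_configs/mixtral-8x7b.yml
alias=70b
match=$(ls demo_configs | grep -- "$alias" | head -n 1)
echo "alias $alias -> demo_configs/$match"
rm -r demo_configs
```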
Shared filesystem. Multi-node jobs need all nodes to reach the output directory. Either clone this repo onto a shared filesystem (the default outputs/ dir works as-is), or set JOB_WORKSPACE to a shared path.
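A quick sanity check before submitting (a hypothetical helper, not part of the toolkit): write and read back a probe file in the workspace. On a multi-node job you would run the same probe on every allocated node, e.g. via srun, to confirm they all see the same path.

```shell
# Hypothetical pre-submit check: confirm the output workspace is writable.
# JOB_WORKSPACE defaults to the repo-local outputs/ dir, as described above.
JOB_WORKSPACE="${JOB_WORKSPACE:-outputs}"
mkdir -p "$JOB_WORKSPACE"
probe="$JOB_WORKSPACE/.fs_probe_$$"
echo ok > "$probe"
result=$(cat "$probe")
rm -f "$probe"
echo "workspace $JOB_WORKSPACE: $result"
```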
```shell
# Common options
JOB_WORKSPACE=/shared/maxtext_jobs submit.sh 70b -N 1            # Shared FS for outputs
submit.sh 70b -N 1 -- remat_policy=full per_device_batch_size=2  # MaxText args after --
submit.sh 70b -N 1 -- _env_NCCL_DEBUG=INFO                       # Env var overrides
STAGE_TIMEOUTS="preflight:600,train:86400" submit.sh 70b -N 24   # Per-stage timeouts
```

RAY=1 adds full-stack observability with no measurable steady-state throughput impact. The stack is fully self-contained in the Docker-based setup — all necessary components are installed automatically at container startup, with no cluster-wide setup required. Powered by Ray and Prometheus with an extensible plugin system, the pipeline collects GPU, host, network, and TensorBoard training metrics into a single Prometheus TSDB — persisted to the shared filesystem and queryable both live and after the job ends, making the TSDB the ground-truth data layer for automated reasoning and diagnosis.
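The STAGE_TIMEOUTS value is a comma-separated list of stage:seconds pairs. A minimal bash sketch of parsing such a string (hypothetical — the real submit.sh parsing may differ):

```shell
# Hypothetical sketch of parsing a STAGE_TIMEOUTS string into per-stage limits.
STAGE_TIMEOUTS="preflight:600,train:86400"
IFS=',' read -ra pairs <<< "$STAGE_TIMEOUTS"
for pair in "${pairs[@]}"; do
  stage="${pair%%:*}"   # stage name, e.g. "preflight"
  secs="${pair##*:}"    # timeout in seconds
  echo "stage=$stage timeout=${secs}s"
done
```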
```shell
RAY=1 submit.sh 70b -N 8
```

| Dashboard | What it shows | Port |
|---|---|---|
| TensorBoard | Training loss curves, learning rate schedules | 6006 |
| Ray Dashboard | Actor status, live stack traces, flame graphs | 8265 |
| Prometheus | GPU, host, network, and training metrics — unified TSDB | 9190 (auto-increments if occupied) |
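The "auto-increments if occupied" behavior noted for the Prometheus port can be sketched as follows (hypothetical logic, with a stand-in list of busy ports instead of a real socket check):

```shell
# Sketch of first-free-port selection starting at 9190 (hypothetical;
# "occupied" stands in for ports detected as busy on the host).
occupied="9190 9191"
port=9190
while echo " $occupied " | grep -q " $port "; do
  port=$((port + 1))
done
echo "Prometheus will bind port $port"
```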
```shell
utils/prometheus.sh list                                          # List jobs with saved metrics
utils/prometheus.sh view outputs/12345-JAX-llama2-70b/prometheus  # Browse metrics after the job ends
```

Push notifications for job state changes and hang detection — works with any Slurm job, no RAY=1 required. See Notifications for setup.
```shell
utils/slurm_job_monitor.sh -j <slurm_job_id>
```

See Observability for the full story.
The skills/ directory contains structured instructions for AI agents (Cursor, Claude Code). Each skill encodes the methodology of senior systems engineers — not just which commands to run, but how to interpret results, distinguish symptoms from root causes, and trace causal chains across the full stack. Skills ship for performance analysis, job failure triage, and TSDB-based incident diagnosis. See skills/README.md for the full list.
Launch
| Topic | Description |
|---|---|
| Job Submission | JOB_WORKSPACE, full submit.sh syntax, local runs, checkpointing, model aliases, environment config, artifacts |
| Kubernetes job submission | k8s_submit.sh usage, PVC setup, per-rank logs, differences from Slurm, direct-container runs |
| Model Configs | Available models, adding new configs (patterns, section layout), resolution rules, CLI overrides |
Observe
| Topic | Description |
|---|---|
| Observability | Real-time monitoring, post-run diagnostics, unified TSDB, custom metrics plugins, design rationale |
| Notifications | Telegram setup, programmable messaging (telegram_bot.sh), automated job monitoring |
| Performance | Profiling (xplane traces, HLO dumps), analysis (TraceLens, IRLens), tuning |
| Debugging | Core dumps, crash reproduction, unresponsive nodes |
| Tooling | Performance analysis, log tailing, job monitoring, Prometheus inspection, artifact cleanup |
Adapt
| Topic | Description |
|---|---|
| Architecture | Training flow, artifact system, observability pipeline, orchestrator extensibility |
| Extensibility | Layer map; how to swap schedulers, runtimes, training frameworks, or GPU vendors |
- Documentation compiled with the assistance of Cursor and Claude-4.6-opus-high.
- Observability stack implemented in collaboration with Cursor and Claude-4.6-opus-high.
- AI skill framework and skills built with Cursor and Claude-4.6-opus-high.
- Some utility scripts developed using ChatGPT 5/5.1.
This project is licensed under the MIT License.