Toolkit for launching and observing MaxText — a JAX-based LLM training framework — on Slurm-managed AMD GPU clusters. One command runs distributed training with model selection, container setup, and multi-node coordination handled automatically. A carefully designed observability stack (via RAY=1) records GPU state, host health, network diagnostics, and training metrics into a single time-series database (TSDB) — essential for monitoring long-running production training, diagnosing incidents, and day-to-day tuning and validation.
An extensible AI skill framework equips AI assistants with out-of-the-box domain expertise across the full training lifecycle — from profiling and performance analysis to production incident diagnosis — with more skills added over time.
This is a reference implementation of a layered launch architecture where each tier — orchestration, container, training — can be adapted independently without cascading changes, including swapping the GPU vendor. See Architecture for the full training flow and Extensibility for what each swap involves.
For Kubernetes job submission, see Kubernetes job submission.
Clone the repo onto a Slurm-managed GPU cluster and run:
```shell
submit.sh 70b -N 1            # Train llama2-70b on 1 node
run_local.sh 70b -- steps=10  # Run locally (no Slurm)
RAY=1 submit.sh 70b -N 1      # With full-stack observability
```

70b, mixtral, grok, etc. are short names that resolve to config files in configs/. To train a model not already listed, add a config.
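As a rough illustration (with assumed filenames — the actual resolution rules are documented in Model Configs), short-name resolution amounts to matching the alias against config filenames:

```shell
# Hypothetical illustration of alias resolution: match a short name like "70b"
# against config filenames (stand-in demo directory; not the toolkit's real logic).
mkdir -p demo_configs
touch demo_configs/llama2-70b.yml demo_configs/mixtral-8x7b.yml
alias=70b
match=$(ls demo_configs | grep -- "$alias" | head -n 1)
echo "alias $alias -> demo_configs/$match"
rm -r demo_configs
```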
Shared filesystem. Multi-node jobs need all nodes to reach the output directory. Either clone this repo onto a shared filesystem (the default outputs/ dir works as-is), or set JOB_WORKSPACE to a shared path.
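A quick sanity check before submitting (a hypothetical helper, not part of the toolkit): write and read back a probe file in the workspace. On a multi-node job you would run the same probe on every allocated node, e.g. via srun, to confirm they all see the same path.

```shell
# Hypothetical pre-submit check: confirm the output workspace is writable.
# JOB_WORKSPACE defaults to the repo-local outputs/ dir, as described above.
JOB_WORKSPACE="${JOB_WORKSPACE:-outputs}"
mkdir -p "$JOB_WORKSPACE"
probe="$JOB_WORKSPACE/.fs_probe_$$"
echo ok > "$probe"
result=$(cat "$probe")
rm -f "$probe"
echo "workspace $JOB_WORKSPACE: $result"
```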
```shell
# Common options
JOB_WORKSPACE=/shared/maxtext_jobs submit.sh 70b -N 1            # Shared FS for outputs
submit.sh 70b -N 1 -- remat_policy=full per_device_batch_size=2  # MaxText args after --
submit.sh 70b -N 1 -- _env_NCCL_DEBUG=INFO                       # Env var overrides
STAGE_TIMEOUTS="preflight:600,train:86400" submit.sh 70b -N 24   # Per-stage timeouts
```

RAY=1 adds full-stack observability with no measurable steady-state throughput impact. The stack is fully self-contained in the Docker-based setup — all necessary components are installed automatically at container startup, with no cluster-wide setup required. Powered by Ray and Prometheus with an extensible plugin system, the pipeline collects GPU, host, network, and TensorBoard training metrics into a single Prometheus TSDB — persisted to the shared filesystem and queryable both live and after the job ends, making the TSDB the ground-truth data layer for automated reasoning and diagnosis.
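The STAGE_TIMEOUTS value is a comma-separated list of stage:seconds pairs. A minimal bash sketch of parsing such a string (hypothetical — the real submit.sh parsing may differ):

```shell
# Hypothetical sketch of parsing a STAGE_TIMEOUTS string into per-stage limits.
STAGE_TIMEOUTS="preflight:600,train:86400"
IFS=',' read -ra pairs <<< "$STAGE_TIMEOUTS"
for pair in "${pairs[@]}"; do
  stage="${pair%%:*}"   # stage name, e.g. "preflight"
  secs="${pair##*:}"    # timeout in seconds
  echo "stage=$stage timeout=${secs}s"
done
```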
```shell
RAY=1 submit.sh 70b -N 8
```

| Dashboard | What it shows | Port |
|---|---|---|
| TensorBoard | Training loss curves, learning rate schedules | 6006 |
| Ray Dashboard | Actor status, live stack traces, flame graphs | 8265 |
| Prometheus | GPU, host, network, and training metrics — unified TSDB | 9190 (auto-increments if occupied) |
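The "auto-increments if occupied" behavior noted for the Prometheus port can be sketched as follows (hypothetical logic, with a stand-in list of busy ports instead of a real socket check):

```shell
# Sketch of first-free-port selection starting at 9190 (hypothetical;
# "occupied" stands in for ports detected as busy on the host).
occupied="9190 9191"
port=9190
while echo " $occupied " | grep -q " $port "; do
  port=$((port + 1))
done
echo "Prometheus will bind port $port"
```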
```shell
utils/prometheus.sh list                                          # List jobs with saved metrics
utils/prometheus.sh view outputs/12345-JAX-llama2-70b/prometheus  # Browse metrics after the job ends
```

Push notifications for job state changes and hang detection — works with any Slurm job, no RAY=1 required. See Notifications for setup.
```shell
utils/slurm_job_monitor.sh -j <slurm_job_id>
```

See Observability for the full story.
The skills/ directory contains structured instructions for AI agents (Cursor, Claude Code). Each skill encodes the methodology of senior systems engineers — not just which commands to run, but how to interpret results, distinguish symptoms from root causes, and trace causal chains across the full stack. Skills ship for performance analysis, job failure triage, and TSDB-based incident diagnosis. See skills/README.md for the full list.
Launch
| Topic | Description |
|---|---|
| Job Submission | JOB_WORKSPACE, full submit.sh syntax, local runs, checkpointing, model aliases, environment config, artifacts |
| Kubernetes job submission | k8s_submit.sh usage, PVC setup, per-rank logs, differences from Slurm, direct-container runs |
| Model Configs | Available models, adding new configs (patterns, section layout), resolution rules, CLI overrides |
Observe
| Topic | Description |
|---|---|
| Observability | Real-time monitoring, post-run diagnostics, unified TSDB, custom metrics plugins, design rationale |
| Notifications | Telegram setup, programmable messaging (telegram_bot.sh), automated job monitoring |
| Performance | Profiling (xplane traces, HLO dumps), analysis (TraceLens, IRLens), tuning |
| Debugging | Core dumps, crash reproduction, unresponsive nodes |
| Tooling | Performance analysis, log tailing, job monitoring, Prometheus inspection, artifact cleanup |
Adapt
| Topic | Description |
|---|---|
| Architecture | Training flow, artifact system, observability pipeline, orchestrator extensibility |
| Extensibility | Layer map; how to swap schedulers, runtimes, training frameworks, or GPU vendors |
- Documentation compiled with the assistance of Cursor and Claude-4.6-opus-high.
- Observability stack implemented in collaboration with Cursor and Claude-4.6-opus-high.
- AI skill framework and skills built with Cursor and Claude-4.6-opus-high.
- Some utility scripts developed using ChatGPT 5/5.1.
This project is licensed under the MIT License.