
Commit 8d9a1d6

Merge pull request #2402 from melissawm:docs-reorg
PiperOrigin-RevId: 825605905
2 parents 3d51c99 + 05a95b7 · commit 8d9a1d6


47 files changed: +401 −496 lines

docs/conf.py

Lines changed: 4 additions & 3 deletions
````diff
@@ -36,6 +36,7 @@
 extensions = [
     "myst_nb",
     "sphinx_design",
+    "sphinx_copybutton",
 ]

 templates_path = ["_templates"]
@@ -59,7 +60,7 @@

 # Remove specific documents from ToC
 exclude_patterns = [
-    "guides/run_maxtext_via_multihost_job.md",
-    "guides/run_maxtext_via_multihost_runner.md",
-    "guides/llm_calculator.ipynb",
+    "guides/run_maxtext/run_maxtext_via_multihost_job.md",
+    "guides/run_maxtext/run_maxtext_via_multihost_runner.md",
+    "explanations/llm_calculator.ipynb",
 ]
````

docs/explanations.md

Lines changed: 4 additions & 2 deletions
````diff
@@ -19,9 +19,11 @@
 ```{toctree}
 :maxdepth: 1

-explanations/steps_model.md
+explanations/jax_ai_libraries_chosen.md
+explanations/alternatives.md
+explanations/checkpoints.md
 explanations/quantization.md
 explanations/sharding.md
-explanations/data_pipeline_perf.md
 explanations/tiling.md
+explanations/performance_metrics.md
 ```
````
Lines changed: 1 addition & 1 deletion
````diff
@@ -14,7 +14,7 @@
 limitations under the License.
 -->

-# Comparison to Alternatives
+# Comparison to alternatives

 MaxText is heavily inspired by [MinGPT](https://github.com/karpathy/minGPT)/[NanoGPT](https://github.com/karpathy/nanoGPT), elegant standalone GPT implementations written in PyTorch and targeting Nvidia GPUs. MaxText is more complex, supporting more industry standard models and scaling to tens of thousands of chips. Ultimately MaxText has an MFU more than three times the [17%](https://twitter.com/karpathy/status/1613250489097027584?cxt=HHwWgIDUhbixteMsAAAA) reported most recently with that codebase, is massively scalable and implements a key-value cache for efficient auto-regressive decoding.

````
Lines changed: 8 additions & 6 deletions
````diff
@@ -16,7 +16,7 @@

 # Checkpoints

-## Checkpoint Formats
+## Checkpoint formats

 Checkpoint formats in MaxText can be categorized along two axes: whether they include **training states** (e.g., optimizer properties) and whether the model's parameter weights are **stacked** or **unstacked** (aka scanned/unscanned). This results in the four types summarized below:

@@ -27,13 +27,13 @@ Checkpoint formats in MaxText can be categorized along two axes: whether they in

 We discuss these two axes respectively:

-### Training States
+### Training states

 Checkpoints with a **training state** contain more than just the model's parameter weights. They also include the **optimizer state** (e.g., momentum values), which is essential for resuming a training run exactly where it left off. These "training checkpoints" are typically saved as snapshots during training to allow for recovery if the process is interrupted.

 In contrast, **inference checkpoints** contain only the parameter weights. We also call them parameter only/param-only checkpoints. This is the format most commonly used for sharing models on public platforms like HuggingFace, as they are smaller and ready for immediate use in inference or for fine-tuning.

-### Stacked Checkpoints and JAX Scan Function
+### Stacked checkpoints and JAX scan function

 The concept of stacked vs. unstacked checkpoints is specific to JAX-based models that use the `jax.lax.scan` function ([doc](https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.scan.html)). `scan` is a powerful JAX feature that compiles sequential operations (like the layers of a Transformer) into a single, highly optimized kernel, avoiding the overhead of a Python for-loop.

@@ -78,11 +78,11 @@ In MaxText, we treat **Stacked Inference Checkpoints** as the default format for

 ---

-## Using Checkpoints in Practice
+## Using checkpoints in practice

 Beyond understanding the formats, it's crucial to know how to use checkpoints in your training workflows. MaxText uses flags in the configuration file or on the command line to manage checkpoints.

-### Saving Checkpoints During Training
+### Saving checkpoints during training

 MaxText automatically saves checkpoints periodically during a training run. These are **Stacked Training Checkpoints** that contain the full state needed to resume.

@@ -97,4 +97,6 @@ Furthermore, MaxText supports emergency checkpointing, which saves a local copy
 - `local_checkpoint_directory`: The local path for storing emergency checkpoints.
 - `local_checkpoint_period`: The interval, in training steps, for saving local checkpoints.

-More configs about checkpoints can be found in [here](https://github.com/AI-Hypercomputer/maxtext/blob/518a87037abb2497a2514ff0c8ffc263c69c6f9f/MaxText/configs/base.yml#L23-L65).
+More configs about checkpoints can be found in [here](https://github.com/AI-Hypercomputer/maxtext/blob/fafdeaa14183a8f5ca7b9f7b7542ce1655237574/src/MaxText/configs/base.yml#L23-L65).
+
+For practical guides on checkpointing, please refer to [](../guides/checkpointing_solutions.md).
````
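
The stacked-vs-unstacked distinction in the checkpoints page above follows directly from how `jax.lax.scan` consumes parameters. Below is a minimal sketch, illustrative only and not MaxText's actual layer code, of a scanned stack of layers whose weights live in one array with a leading `num_layers` axis:

```python
import jax
import jax.numpy as jnp

num_layers, d_model = 4, 8

# "Stacked" weights: one (num_layers, d_model, d_model) array rather than a
# separate (d_model, d_model) array per layer.
stacked_w = jax.random.normal(jax.random.PRNGKey(0), (num_layers, d_model, d_model))

def layer(x, w):
  # Stand-in for one transformer layer; scan feeds it one slice of stacked_w.
  return jnp.tanh(x @ w), None

def forward(x, stacked_w):
  # scan iterates over the leading axis of stacked_w in a single compiled loop,
  # which is why scanned models naturally save their weights in stacked form.
  x, _ = jax.lax.scan(layer, x, stacked_w)
  return x

x = jnp.ones((d_model,))
print(forward(x, stacked_w).shape)  # (8,)

# An unstacked (unscanned) checkpoint would instead store per-layer arrays
# w_0, ..., w_3; stacking them back is just jnp.stack([w_0, ..., w_3]).
```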

docs/guides/jax_ai_libraries_chosen.md renamed to docs/explanations/jax_ai_libraries_chosen.md

Lines changed: 8 additions & 15 deletions
````diff
@@ -1,4 +1,4 @@
-# The JAX Ecosystem in MaxText: An Opinionated Guide
+# The JAX ecosystem in MaxText: an opinionated guide

 MaxText is built on a curated stack of JAX libraries, each chosen for a specific purpose. This document provides an opinionated view on *why* MaxText uses the following key components of the JAX ecosystem:

@@ -11,11 +11,9 @@ MaxText is built on a curated stack of JAX libraries, each chosen for a specific

 This stack isn't just a random collection of tools; it represents a design philosophy centered around **explicitness, composability, and performance at scale**.

-
 This document provides an opinionated view on *why* MaxText uses these specific libraries, explaining the design decisions that make them ideal for building and training large-scale models.

-
-## Flax: For Functional Model Definition
+## Flax: For functional model definition

 **What is it?** Flax is a high-performance neural network library for JAX that is designed to be flexible, explicit, and easy to use.

@@ -27,8 +25,7 @@ With its latest generation API, NNX, Flax provides a modern, object-oriented (OO

 For more information on using Flax, please refer to https://github.com/google/flax

-
-## Optax: For Composable Optimization
+## Optax: For composable optimization

 **What is it?** Optax is a gradient processing and optimization library for JAX. It reimagines the optimizer as a series of composable functional transformations.

@@ -38,8 +35,7 @@ For more information on using Flax, please refer to https://github.com/google/fl

 For more information on using Optax, please refer to https://github.com/google-deepmind/optax

-
-## Orbax: For Robust Checkpointing
+## Orbax: For robust checkpointing

 **What is it?** Orbax is a library for checkpointing JAX programs, designed for large-scale, potentially unreliable environments.

@@ -54,8 +50,7 @@ For massive models, saving and loading state is a critical part of the training

 For more information on using Orbax, please refer to https://github.com/google/orbax

-
-## Grain: For Deterministic, Multi-Host Data Loading
+## Grain: For deterministic, multi-host data loading

 **What is it?** Grain is a high-performance data loading library designed for deterministic, global shuffle and multi-host data loading.

@@ -67,8 +62,7 @@ Its APIs are explicitly designed for the multi-host paradigm, simplifying the pr

 For more information on using Grain, please refer to https://github.com/google/grain and the grain guide in maxtext located at https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/data_input_grain.md

-
-## Qwix: For Native JAX Quantization
+## Qwix: For native JAX quantization

 **What is it?** Qwix is a Jax quantization library supporting Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ)

@@ -79,8 +73,7 @@ We chose Qwix because it provides the necessary primitives **natively within the

 For more information on how to quantize your model using Qwix, please refer to https://github.com/google/qwix

-
-## Tunix: For Comprehensive Post-Training
+## Tunix: For comprehensive post-training

 **What is it?** Tunix is a JAX-based library designed for a wide range of post-training tasks, including Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA.

@@ -95,4 +88,4 @@ MaxText leverages Tunix as its core library for post-training, offering a unifie

 We chose Tunix because it provides a **comprehensive, performant, and JAX-native solution for the entire post-training lifecycle**. Its integration with libraries like vLLM and its alignment with the NNX ecosystem make it a powerful tool for both full model adaptation and parameter-efficient tuning.

-For more information on using Tunix, please refer to https://github.com/google/tunix
+For more information on using Tunix, please refer to https://github.com/google/tunix
````
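
The Optax entry above describes the optimizer as a series of composable gradient transformations. A minimal sketch of that composability, using standard Optax APIs rather than MaxText's actual optimizer setup:

```python
import jax
import jax.numpy as jnp
import optax

params = {"w": jnp.ones((4, 4)), "b": jnp.zeros((4,))}

# Each element is a GradientTransformation; optax.chain composes them in order.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),   # clip gradients first
    optax.adamw(learning_rate=1e-3),  # then apply the AdamW update rule
)
opt_state = optimizer.init(params)

def loss_fn(p):
  return jnp.sum(p["w"] ** 2) + jnp.sum(p["b"] ** 2)

grads = jax.grad(loss_fn)(params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```
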
Lines changed: 8 additions & 7 deletions
````diff
@@ -14,7 +14,8 @@
 limitations under the License.
 -->

-# Performance Metrics
+(performance-metrics)=
+# Performance metrics

 ## MFU

@@ -57,11 +58,11 @@

 Hence, MFU is the fraction of peak hardware performance actually utilized by the model, and can be expressed in different units — step time, throughput, or raw flops/s.

-### MaxText Calculating + Reporting
-In MaxText, we sum all of the matmuls performed in one step, see [calculate_tflops_per_device](https://github.com/AI-Hypercomputer/maxtext/blob/e969faabbb571285a51545530f34d8f0a9f237e9/MaxText/maxtext_utils.py#L297)
-and divide it by the measured (via python `time.time()`) step time. In each step we print the resulting Model Flops per second [`per_device_tflops_per_sec`](https://github.com/AI-Hypercomputer/maxtext/blob/e969faabbb571285a51545530f34d8f0a9f237e9/MaxText/metric_logger.py#L193-L194). One can calculate the MFU by dividing this number by the peak tflops of the hardware (e.g., $918e^{12}$ FLOPS/s for Trillium).
+### MaxText calculating + reporting
+In MaxText, we sum all of the matmuls performed in one step, see [calculate_tflops_per_device](https://github.com/AI-Hypercomputer/maxtext/blob/fafdeaa14183a8f5ca7b9f7b7542ce1655237574/src/MaxText/maxtext_utils.py#L454)
+and divide it by the measured (via python `time.time()`) step time. In each step we print the resulting Model Flops per second [`per_device_tflops_per_sec`](https://github.com/AI-Hypercomputer/maxtext/blob/fafdeaa14183a8f5ca7b9f7b7542ce1655237574/src/MaxText/metric_logger.py#L211-L213). One can calculate the MFU by dividing this number by the peak tflops of the hardware (e.g., $918e^{12}$ FLOPS/s for Trillium).

-### Causal Attention
+### Causal attention
 Due to causality only half of the (query, key) pairs need to be computed, those with query_idx >= key_idx. This accounts for the fact only prior tokens can be used to predict future ones. Prior to https://github.com/AI-Hypercomputer/maxtext/pull/1988 MaxText did not account for sparsity for theoretical flops, and used

 Attention Flops ~= 4 * sequence^2 * batch * heads * head_dim
@@ -98,6 +99,6 @@ $$\begin{align*}

 This shows any of step time, tokens/s or MFU can be used to determine how long training will take and are proportionally (or inversely proportionally) related. MFU is most useful to compare across different models/hardwares and while optimizing performance, whereas step time or tokens/second may be more useful when these are fixed.

-## Why not Hardware Flops?
+## Why not hardware flops?

-Hardware (e.g., XLA reported) FLOPs do not accurately reflect computation efficiency as they depend on the program / implementation, not just on the model and its inherent computations (higher hardware FLOPs does not necessarily mean less room for improvement). For example, they include remat and potentially auxiliary operations (such as reshaping for dropping moe [here](https://github.com/AI-Hypercomputer/maxtext/blob/4b6142950aff5d9ba42d830efc5ce4c4ac9d4135/MaxText/layers/moe.py#L1267)), which are an implementation detail and not part of the model. In addition, XLA reported FLOPs may not be accurate with pallas kernels. Hardware flops utilization is not (inversely) proportional to step time as opposed to MFU, since hardware flops can change with implementation details like remat policies.
+Hardware (e.g., XLA reported) FLOPs do not accurately reflect computation efficiency as they depend on the program / implementation, not just on the model and its inherent computations (higher hardware FLOPs does not necessarily mean less room for improvement). For example, they include remat and potentially auxiliary operations (such as reshaping for dropping moe [here](https://github.com/AI-Hypercomputer/maxtext/blob/fafdeaa14183a8f5ca7b9f7b7542ce1655237574/src/MaxText/layers/moe.py#L1544)), which are an implementation detail and not part of the model. In addition, XLA reported FLOPs may not be accurate with pallas kernels. Hardware flops utilization is not (inversely) proportional to step time as opposed to MFU, since hardware flops can change with implementation details like remat policies.
````
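
The MFU bookkeeping described in the performance-metrics diff above reduces to a short calculation. A minimal sketch with illustrative numbers (MaxText derives the per-device model TFLOPs analytically in `calculate_tflops_per_device` and measures step time with `time.time()`; the step FLOPs and step time below are assumed values, not measurements):

```python
PEAK_TFLOPS_PER_DEVICE = 918.0      # e.g., Trillium peak, in TFLOP/s

per_device_tflops_per_step = 250.0  # analytic model FLOPs for one step (illustrative)
measured_step_time_s = 0.5          # wall-clock step time (illustrative)

# Reported each step as per_device_tflops_per_sec.
per_device_tflops_per_sec = per_device_tflops_per_step / measured_step_time_s

# MFU = achieved model FLOP/s divided by the hardware's peak FLOP/s.
mfu = per_device_tflops_per_sec / PEAK_TFLOPS_PER_DEVICE
print(f"MFU = {mfu:.1%}")           # 500 / 918 ≈ 54.5%
```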

docs/explanations/steps_model.md

Lines changed: 0 additions & 33 deletions
This file was deleted.

docs/guides.md

Lines changed: 2 additions & 10 deletions
````diff
@@ -19,26 +19,18 @@
 ```{toctree}
 :maxdepth: 1

-guides/checkpoints.md
+guides/run_maxtext.md
 guides/custom_model.md
-guides/run_maxtext_localhost.md
-guides/run_maxtext_via_xpk.md
-guides/run_maxtext_via_pathways.md
 guides/data_input_pipeline.md
-guides/single_host_gpu.md
 guides/knowledge_distillation.md
 guides/gcp_workload_observability.md
 guides/monitor_goodput.md
 guides/use_vertex_ai_tensorboard.md
 guides/features_and_diagnostics.md
 guides/pallas_kernels_performance.md
-guides/performance_metrics.md
 guides/understand_logs_and_metrics.md
-guides/checkpointing_solutions/gcs_checkpointing.md
-guides/checkpointing_solutions/emergency_checkpointing.md
-guides/checkpointing_solutions/multi_tier_checkpointing.md
-guides/jax_ai_libraries_chosen.md
 guides/xprof_user_guide.md
+guides/checkpointing_solutions.md
 guides/megascale_hang_playbook.md
 guides/multimodal.md
 ```
````
Lines changed: 9 additions & 0 deletions
````diff
@@ -0,0 +1,9 @@
+# Checkpointing solutions
+
+```{toctree}
+:maxdepth: 1
+
+checkpointing_solutions/gcs_checkpointing.md
+checkpointing_solutions/emergency_checkpointing.md
+checkpointing_solutions/multi_tier_checkpointing.md
+```
````
