
Commit 8d752a7

dlwh and moojink authored
Grug-native template cleanup and legacy path retirement (#3054)
## Scope

This PR is the full grug-native/template transition slice on this branch (not just a small cleanup).

## What changed

- Promote `experiments/grug/base/` as the canonical grug edit surface:
  - add/expand `experiments/grug/base/model.py`, `experiments/grug/base/train.py`, `experiments/grug/base/launch.py`
  - add `experiments/grug/README.md` and package init files
- Retire legacy Grugformer-era library paths and the old comparison script:
  - remove `lib/levanter/src/levanter/grug/main.py`
  - remove `lib/levanter/src/levanter/grug/data.py`
  - remove `lib/levanter/src/levanter/models/grug_wrapper.py`
  - remove `experiments/speedrun/grugformer_vs_hackable_125m/grugformer_vs_hackable_125m.py`
- Remove the obsolete grugformer-focused test suite and replace it with template-focused coverage:
  - remove `lib/levanter/tests/grug/test_grugformer*.py`
  - add/update `tests/test_grug_base_template.py`
- Add the callback state-adapter path needed by grug template training:
  - add `lib/levanter/src/levanter/callbacks/state_adapter.py`
  - update callbacks/tensorstore callback wiring and tests
- Include supporting runtime/parity adjustments used by the template flow:
  - `lib/levanter/src/levanter/eval.py`
  - `lib/levanter/src/levanter/utils/jax_utils.py` (+ tests)
  - minor compatibility updates in `lib/levanter/src/levanter/compat/hf_checkpoints.py`
  - minor fused CE API touch in `lib/levanter/src/levanter/kernels/pallas/fused_cross_entropy_loss/api.py`
- Update project/docs guidance to the template-first grug workflow:
  - `.agents/projects/grugformer.md`
  - `docs/recipes/change_grug.md`
  - `docs/reports/grug-archive.md`

## Validation run on this branch

- `uv run pytest tests/test_eval.py` (from `lib/levanter`)
- `uv run pytest tests/test_grug_base_template.py`
- `uv run python infra/pre-commit.py --all-files`

## Notes

- This PR intentionally contains the accumulated grug-native transition work on `codex/grug-native-template-cleanup`.
- Local scratch/monitoring files were not included.
---------

Co-authored-by: Moo Jin Kim <moojink@stanford.edu>
1 parent cf99ce5 commit 8d752a7

29 files changed: +1514 −1711 lines

.agents/projects/grugformer.md

Lines changed: 32 additions & 211 deletions
Large diffs are not rendered by default.

docs/recipes/change_grug.md

Lines changed: 51 additions & 82 deletions
````diff
@@ -1,112 +1,81 @@
-# Recipe: Changing Grug (Experiment → Canonical)
+# Recipe: Changing Grug (Template-First)
 
-Grug is meant to be “grug-simple” and easy to hack, but we still want a single, trustworthy “best guess” implementation in `levanter.grug`.
+Grug is intentionally template-first: the canonical edit surface lives in `experiments/grug/base/`, not in a shared `levanter.grug` trainer stack.
 
 This recipe describes the workflow for:
 
-1) trying changes safely in a speedrun experiment, and
-2) upstreaming successful ideas into the canonical core (and cleaning up old experiments).
+1. trying a change in an experiment copy, and
+2. upstreaming it into the base template when it proves out.
 
-## Source Of Truth vs Experiments
+## Source Of Truth
 
-- **Source of truth:** `lib/levanter/src/levanter/grug/`
-  - This is the “best guess” model. It should stay small, readable, and testable.
-- **Evolving experiment:** `experiments/speedrun/nano_arch_ablations/00_baseline/main.py`
-  - This is the *living* entrypoint that is expected to evolve as we learn.
-- **One-off experiments:** under `experiments/speedrun/…`
-  - These are snapshots / specialized edit surfaces (e.g. attention sinks).
+- **Canonical template:** `experiments/grug/base/`
+  - `model.py`
+  - `train.py`
+  - `launch.py`
+- **Variants:** `experiments/grug/<variant>/`
+  - copy from `base` and modify locally (for example MoE).
+- **One-off speedruns:** `experiments/speedrun/...`
+  - useful for exploration, not canonical.
 
-We try not to let one-off scripts become the canonical implementation.
+## Workflow
 
-## When You Want To Try Something
+### 1) Pick one change bucket
 
-### 1) Decide what you’re changing
+Keep each pass scoped to one bucket:
 
-Most changes fall into one bucket:
+- attention/masking
+- block wiring/norm ordering
+- MLP/activation
+- loss kernel behavior
+- optimizer/training loop behavior
 
-- **Attention** (masking semantics, kernels, sinks/aux, layout/sharding)
-- **Block** (residual wiring, normalization order, pre/post-norm)
-- **MLP** (activation, GLU variants, gating, dimension choices)
-- **Loss** (large-vocab CE, z-loss, label smoothing, logit soft-cap)
-- **Optimizer** (Adam, Muon, etc.)
+### 2) Experiment in a copy
 
-Try to change **one bucket at a time**. Optimizer isn't really (currently) addressed by Grug, but we'll get there.
+- Copy `experiments/grug/base` to a new variant directory.
+- Keep edits local and explicit (copy/paste over abstraction).
+- Avoid introducing reusable framework surface unless there's clear repeated use.
 
-### 2) Create an experiment entrypoint
+### 3) Record the experiment
 
-Start from:
+Update `docs/reports/grug-archive.md` with:
 
-- `experiments/speedrun/nano_arch_ablations/00_baseline/main.py`
+- path
+- commit SHA (when known)
+- purpose
+- status (`active`, `superseded`, `deleted`)
 
-Recommended workflow:
+### 4) Upstream to base if it wins
 
-1. Copy the file to a new experiment (or branch the baseline if the change is small):
-   - Example: `experiments/speedrun/<idea>/main.py`
-2. Keep the edit surface explicit:
-   - If you’re changing attention, keep the change in one local `attention()` or `attn_fn` block.
-   - If you’re changing the MLP, keep it local to an `mlp()` helper.
-3. Avoid introducing new abstractions (this is a speedrun file; copy/paste is fine).
+Port the successful change back into:
 
-### 3) Register the experiment in the archive
+- `experiments/grug/base/model.py`
+- `experiments/grug/base/train.py`
+- `experiments/grug/base/launch.py`
 
-Add an entry to:
+Keep it grug-style:
 
-- `docs/reports/grug-archive.md`
+- plain JAX arrays and explicit sharding
+- Equinox modules with `init` + `__call__`
+- minimal config knobs
+- keep legibility first; if a block gets hard to read, introduce a small local helper instead of adding framework indirection
 
-Record:
-- the experiment path,
-- the commit SHA (once known),
-- what you changed and why,
-- the intended “status” (`active`, `superseded`, `deleted`).
+### 5) Delete stale paths
 
-## When You Want To Adopt Something As Canonical
+After upstreaming:
 
-### 1) Port to `levanter.grug`
+- delete superseded experiment code,
+- keep only the archive trail in `docs/reports/grug-archive.md`.
 
-Move the change into one of the core files:
+### 6) Validate
 
-- `lib/levanter/src/levanter/grug/attention.py`
-- `lib/levanter/src/levanter/grug/model.py`
-- `lib/levanter/src/levanter/grug/loss.py`
+Run the relevant checks:
 
-Keep the “grug” style:
-- Equinox modules for model components (`Transformer`, `Block`, `MLP`, `RMSNorm`, attention),
-- explicit module pattern per component: `@staticmethod init(...)` + `__call__(...)`,
-- model API on module methods (`Transformer.init`, `Transformer.__call__`, `Transformer.logits`, `Transformer.next_token_loss`),
-- explicit sharding when needed (and loud failures otherwise).
-
-### 2) Update/extend tests
-
-Add or adjust tests to lock the intended surface:
-
-- `lib/levanter/tests/grug/test_grugformer_core.py`
-- `lib/levanter/tests/grug/test_grugformer_model_loss.py`
-- `lib/levanter/tests/grug/test_grugformer_compilation.py`
-
-The goal is:
-- shapes don’t regress,
-- `jit` still works,
-- sharding doesn’t explode,
-- mask semantics remain correct.
-
-### 3) Clean up old experiments
-
-After merging a canonical improvement:
-
-- If an experiment is now redundant and not referenced, **delete it** and mark it `deleted` in `docs/reports/grug-archive.md`.
-- If an experiment represents a meaningful historical run, keep it but mark it `superseded`, and point to the canonical change (or the new experiment).
-Do this only if it's not going to be a maintenance burden.
-
-Prefer “archive entry + deletion” over keeping lots of old code in-tree.
-
-### 4) Run repo checks
-
-Before sending the PR:
-
-```sh
+```bash
 uv run python infra/pre-commit.py --all-files
+uv run pytest tests/test_grug_base_template.py
 ```
 
-## Notes / Inspiration
+Add any additional focused tests needed for behavior changes.
 
-This workflow is inspired by projects like `modded-nanogpt`: keep a small, readable core, iterate quickly via “hackable” entrypoints, and regularly upstream what works.
+This workflow is inspired by modded-nanogpt: iterate quickly in copy-paste experiments, then upstream only what stays simple and useful.
````
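The recipe's "grug style" bullet asks for Equinox modules exposing `init` + `__call__`. As a hedged, dependency-free illustration of that pattern (this `RMSNorm` is a hypothetical stand-in, not the repo's implementation, and plain Python lists stand in for arrays):

```python
# Hypothetical sketch of the "init + __call__" module pattern the recipe
# describes; plain Python, no Equinox/JAX, all names illustrative only.
import math
from dataclasses import dataclass


@dataclass
class RMSNorm:
    weight: list  # per-dimension scale, initialized to ones

    @staticmethod
    def init(dim: int) -> "RMSNorm":
        # Construction lives in a static init, keeping the dataclass trivial.
        return RMSNorm(weight=[1.0] * dim)

    def __call__(self, x: list, eps: float = 1e-6) -> list:
        # y_i = w_i * x_i / rms(x), the usual RMSNorm formula.
        rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
        return [w * v / rms for w, v in zip(self.weight, x)]


norm = RMSNorm.init(4)
out = norm([1.0, 2.0, 3.0, 4.0])
```

The point of the pattern is that each component is constructed via `Component.init(...)` and applied via `component(x)`, with no hidden framework state in between.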

docs/reports/grug-archive.md

Lines changed: 13 additions & 35 deletions
````diff
@@ -1,61 +1,39 @@
 # Grug Archive: Experiments and Snapshots
 
-This file is a lightweight “paper trail” for Grug-related experiments, inspired by the idea of keeping a runnable history without letting a pile of one-off scripts become the de facto source of truth.
+This file is the paper trail for grug experiments.
 
 ## Principles
 
-- **`levanter.grug` is the source of truth.** Speedrun files are snapshots/entrypoints, not the canonical implementation.
-- **Every experiment should be attributable to a commit.** If an experiment is removed or superseded, it should be clear what replaced it and why.
-- **Prefer deletion over permanent snapshots.** If a script is dead, delete it and record the last known-good commit here.
-- **Keep diffs small.** When an experiment is kept “alive”, update it to track the current core rather than forking the entire model.
-
-## When Grug Core Changes
-
-When a change in `levanter.grug` is likely to affect results, performance, or semantics:
-
-1. Update the experiment(s) that should track “best guess”.
-2. For experiments that no longer make sense:
-   - delete them, or
-   - mark them superseded and point to the replacement.
-3. Update the corresponding entry in this archive (and any linked issue).
+- `experiments/grug/base/` is the canonical template.
+- Speedrun files are exploratory and may be deleted after upstreaming.
+- Prefer deletion over long-term maintenance of stale experiment code.
 
 ## Entry Template
 
-Copy/paste this block for new experiments:
-
 ```text
 ### <experiment-id>
-- Path: `<repo-relative-path>`
+- Path: <repo-relative-path>
 - Introduced: <commit-sha>
 - Last known-good: <commit-sha>
 - Status: active | superseded | deleted
 - Purpose: <one line>
-- Notes: <optional; what changed, how to reproduce, caveats>
-- Superseded by: <experiment-id or commit-sha; optional>
-- Issue: <url or issue id; optional>
+- Superseded by: <path or commit; optional>
+- Issue: <url/id; optional>
 ```
 
 ## Experiments
 
-### grugformer-attnsink
-- Path: `experiments/speedrun/grugformer_attnsink/grugformer_attn_sink.py`
-- Introduced: TBD
-- Last known-good: TBD
-- Status: active
-- Purpose: “Hackable” Grug attention-sink variant; intended edit surface for sinks/aux.
-- Notes: Keep this file short; copy/paste local modifications rather than growing new abstractions.
-
-### grugformer-starter-speedrun
-- Path: `experiments/speedrun/grugformer_starter/grugformer_speedrun.py`
+### grug-base-template
+- Path: `experiments/grug/base/`
 - Introduced: TBD
 - Last known-good: TBD
 - Status: active
-- Purpose: Minimal starter speedrun for Grug; convenient baseline for quick iteration.
+- Purpose: canonical grug template (model/train/launch).
 
 ### grugformer-vs-hackable-125m
 - Path: `experiments/speedrun/grugformer_vs_hackable_125m/grugformer_vs_hackable_125m.py`
 - Introduced: TBD
 - Last known-good: TBD
-- Status: active
-- Purpose: Head-to-head comparison between Hackable Transformer and Grugformer (no sinks).
-
+- Status: deleted
+- Purpose: historical head-to-head comparison.
+- Superseded by: template-first workflow centered on `experiments/grug/base/`.
````
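For concreteness, a filled-in entry following the template above might look like this (everything here is hypothetical, including the experiment id; the SHA placeholders are left unfilled on purpose):

```text
### grug-moe (hypothetical example)
- Path: `experiments/grug/moe/`
- Introduced: <commit-sha>
- Last known-good: <commit-sha>
- Status: active
- Purpose: MoE variant copied from `experiments/grug/base/`.
```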

experiments/grug/README.md

Lines changed: 116 additions & 0 deletions
````diff
@@ -0,0 +1,116 @@
+# Grug Layout and Usage
+
+`experiments/grug/` is template-first. You edit experiment code here directly.
+
+## Directory layout
+
+- `base/model.py`: model config and model implementation (`init` + `__call__` + loss method).
+- `base/train.py`: train loop, optimizer step, callbacks, eval/checkpoint wiring.
+- `base/launch.py`: experiment config and execution entrypoint (`ExecutorStep` + resources).
+
+## Entry-point guide
+
+- Start in `base/launch.py` for normal run edits.
+- `GrugBaseLaunchConfig` is the user-facing knob surface (model/data/optimizer/trainer/eval/run metadata).
+- `versioned(...)` marks config values that should affect executor step version/hash.
+- `this_output_path()` resolves to the current step's output root.
+- `run_grug(...)` in `base/train.py` is the runtime entry point used by the `ExecutorStep`.
+- `P` in train/model code is the usual JAX alias for `PartitionSpec`; see the JAX explicit sharding tutorial: [Explicit Sharding (JAX)](https://docs.jax.dev/en/latest/notebooks/explicit-sharding.html).
+
+## How to use it
+
+1. Copy `experiments/grug/base` to a new variant directory (for example `experiments/grug/moe`).
+2. Make model/training changes in that variant, not in shared trainer libraries.
+3. Set run knobs in `<variant>/launch.py` (run id, data mix, optimizer, TPU type).
+4. Launch from the variant's `launch.py` entrypoint.
+
+## Quickstart launch
+
+Local executor run:
+
+```bash
+uv run python experiments/grug/base/launch.py
+```
+
+Ray cluster run:
+
+```bash
+uv run lib/marin/src/marin/run/ray_run.py \
+  --env_vars WANDB_API_KEY=${WANDB_API_KEY} \
+  -- python experiments/grug/base/launch.py
+```
+
+## Common edit points
+
+- Architecture changes: `experiments/grug/base/model.py`
+- Train-loop and callback behavior: `experiments/grug/base/train.py`
+- Run config, resources, and launch wiring: `experiments/grug/base/launch.py`
+- Copy-paste variant workflow: duplicate `experiments/grug/base/` into `experiments/grug/<variant>/` and edit there.
+
+## Trainer knobs people ask about
+
+- `z_loss_weight` in `GrugTrainerConfig`: weight on the logsumexp stabilization term in LM loss.
+- `ema_beta` in `GrugTrainerConfig`: exponential moving average (EMA) coefficient for eval/checkpoint model; `None` disables EMA.
+
+## Checkpoints and resume
+
+- Checkpoints are written to `<output_path>/checkpoints` by default in `base/launch.py`.
+- `run_grug` restores from `trainer.load_checkpoint_path` when set, otherwise tries the run checkpoint path.
+- If `trainer.load_checkpoint=True` and no checkpoint is found, startup fails; otherwise it starts from scratch.
+
+## Environment variables you will likely use
+
+- `WANDB_API_KEY`: required for W&B logging in the default launch config.
+- `GRUG_RUN_ID`: overrides the default run id.
+- `FERRY_DATE`: appended to run id for ferry-style launches.
+
+## Where outputs show up
+
+- Training/eval metrics: tracker backend (default W&B).
+- Checkpoints: `<output_path>/checkpoints`.
+- Profiler traces (if enabled): `<trainer.log_dir>/<run_id>/profiler`.
+- Executor step outputs: `this_output_path()` root for the step.
+
+## Logged metrics
+
+- `train/loss`: training loss for the just-completed step.
+- `global_step`: completed optimizer step index.
+- `run_progress`: completed fraction of the configured run (`step / total_steps`).
+- `optim/*`: optimizer hyperparameters from Optax state (for example `optim/learning_rate`).
+- `throughput/duration`: step wall-clock duration (after loss is materialized).
+- `throughput/examples_per_second`: examples processed per second for the current batch size.
+- `throughput/tokens_per_second`: tokens processed per second.
+- `throughput/total_tokens`: cumulative tokens processed so far (schedule-aware).
+- `throughput/gflops_per_second`: model FLOP throughput from analytic FLOPs-per-example.
+- `throughput/total_gflops`: cumulative model FLOPs (analytic).
+- `throughput/mfu`: model FLOP utilization as percent of theoretical hardware FLOPs.
+- `throughput/hook_time`: callback/logging overhead time after each step.
+- `throughput/loading_time`: dataloader wait time for the current step.
+- `throughput/flops_per_token_analytic`: analytic FLOPs per token summary value.
+- `throughput/flops_per_example_analytic`: analytic FLOPs per example summary value.
+- `throughput/flops_per_example`: FLOPs-per-example value used by throughput callback.
+- `throughput/device_kind`: accelerator type string from JAX device info.
+- `throughput/theoretical_flops_per_device`: theoretical peak FLOPs per device.
+- `throughput/theoretical_flops`: theoretical peak FLOPs across all devices.
+- `mixture/stage`: current data-mixture stage index.
+- `mixture/weight/<dataset_name>`: effective sampling weight per dataset in the active stage.
+- `eval/loss`, `eval/loading_time`, `eval/total_time`: tagged-eval loss and timing for current model.
+- `eval/ema/*`: same eval metrics for EMA weights when EMA is enabled.
+- `eval/macro_loss`: macro average loss across tags when multiple tags exist.
+- `eval/<tag>/loss`, `eval/<tag>/micro_loss`, `eval/<tag>/macro_loss`: per-tag loss views.
+- `eval/bpb`, `eval/macro_bpb`, `eval/<tag>/bpb`, `eval/<tag>/macro_bpb`: bits-per-byte metrics when tokenizer/BPB logging is enabled.
+- `grad/*`, `params/*`, `updates/*`, `opt_state/*`: optional watch metrics (norms/histograms) when watch is enabled.
+
+## What should stay consistent
+
+- Keep core training/eval metrics aligned with classic Levanter (`train/loss`, `throughput/*`, `eval/*`).
+- Prefer shared helpers only for generic infrastructure; keep variant behavior local to the template.
+
+## Further guidance
+
+- Grug principles: [`/.agents/projects/grugformer.md`](../../.agents/projects/grugformer.md)
+- Change workflow: [`/docs/recipes/change_grug.md`](../../docs/recipes/change_grug.md)
+- Executor mechanics: [`/docs/explanations/executor.md`](../../docs/explanations/executor.md)
+- Executor tutorial: [`/docs/tutorials/executor-101.md`](../../docs/tutorials/executor-101.md)
+- TPU debug workflow: [`/docs/dev-guide/dev_tpu.md`](../../docs/dev-guide/dev_tpu.md)
+- Cluster launch details: [`/docs/tutorials/tpu-cluster-setup.md`](../../docs/tutorials/tpu-cluster-setup.md)
````
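The trainer knobs and throughput metrics described in this README boil down to simple arithmetic. The sketch below uses the standard formulations for an EMA update, a z-loss term, and MFU; these are assumptions about what the knobs compute, not code read from the grug source, and all function names are illustrative:

```python
# Assumed formulas behind `ema_beta`, `z_loss_weight`, and `throughput/mfu`;
# standard textbook formulations, not the repo's actual implementation.
import math


def ema_update(ema_params, params, beta):
    # EMA of weights: ema <- beta * ema + (1 - beta) * current.
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema_params, params)]


def z_loss(logits, weight):
    # Logsumexp stabilization term: weight * logsumexp(logits)^2.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return weight * lse * lse


def mfu_percent(flops_per_example, examples_per_second, theoretical_flops):
    # Model FLOP utilization as a percent of theoretical hardware FLOPs.
    return 100.0 * flops_per_example * examples_per_second / theoretical_flops


ema = ema_update([0.0, 0.0], [1.0, 2.0], beta=0.9)  # approx [0.1, 0.2]
zl = z_loss([0.0, 0.0], weight=1e-4)                # weight * ln(2)^2
mfu = mfu_percent(1e12, 100.0, 4 * 275e12)          # percent, 4-device example
```

Setting `ema_beta=None` in the config disables the EMA branch entirely; a larger `beta` makes the EMA weights trail the live weights more slowly.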

experiments/grug/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -0,0 +1,2 @@
+# Copyright 2025 The Marin Authors
+# SPDX-License-Identifier: Apache-2.0
```

experiments/grug/base/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -0,0 +1,2 @@
+# Copyright 2025 The Marin Authors
+# SPDX-License-Identifier: Apache-2.0
```
