Add canonical Grug MoE module, template variant, and contracts by dlwh · Pull Request #3046 · marin-community/marin

dlwh · 2026-02-26T01:17:26Z

Summary

add a canonical compact Grug MoE implementation in lib/levanter/src/levanter/grug/grug_moe.py (Equinox style, EP-aware dispatch/collect, optional shared dense expert)
move routing (router matmul + top-k + softmax) inline into MoEMLP.__call__, and make moe_mlp a reusable dispatch/permute/unpermute (+EP) kernel that consumes precomputed selected_experts and combine_weights
add template-first Grug MoE experiment surface under experiments/grug/moe/ (model.py, train.py, launch.py, __init__.py) aligned with experiments/grug/base
add variant contract testing in experiments/grug/test_variant_contracts.py and standardize per-variant lowering hooks via debug_mesh_and_token_pspec(num_devices) in variant model modules
generalize grug template tests so train-step watch parity and lowering coverage run across discovered grug variants
update activation wiring to use ActivationFunctionEnum consistently and extend levanter.utils.activation with relu2

Validation

uv run --python 3.11 --package levanter --group test pytest lib/levanter/tests/grug/test_grugformer_moe.py -q
uv run --python 3.11 --package levanter --group test pytest lib/levanter/tests/grug -q
uv run pytest -o addopts='' tests/test_grug_base_template.py experiments/grug/test_variant_contracts.py -q

Notes

some commits used --no-verify because the repo-wide pyrefly pre-commit hook currently fails with an existing project-exclude pattern error unrelated to this diff.

…sible-defaults

lib/levanter/src/levanter/grug/grug_moe.py

dlwh · 2026-02-26T07:37:46Z

@claude ptal

claude · 2026-02-26T07:38:21Z

Claude finished @dlwh's task in 2m 57s —— View job

🤖

Read changed files and gather context
Review grug_moe.py (new, +603 lines)
Review activation.py (modified, +29 lines)
Review test_grugformer_moe.py (new, +243 lines)
Address existing review comments from dlwh
Post comprehensive review

Review: Refactor Grug MoE core and activation handling

Overall this is a clean, well-structured MoE implementation. The separation of the functional moe_mlp core from the MoEMLP equinox module is a good design, and the ring EP path is clearly documented with issue references. A few items below.

Agreeing with self-review items

1. jaxtyping annotations on CausalSelfAttention fields (line 116)

+1 to the self-review. The existing model.py uses jaxtyping on its __call__ signatures but not on module fields, so this MoE file is at least consistent. But given the more complex weight shapes here (especially the (E, D, I2) fused gate/up), explicit shape annotations on w_q/w_k/w_v/w_o and on the MoEMLP fields (moe_router, moe_w13, moe_w2, shared_w13, shared_w2) would help readers reason about sharding specs.

2. Comment wording (line 260)

Agree with the suggested edit: "# NB: this means we receive all tokens on the DP axis, best for low EP" adds helpful context.

3. batch_spec derivable from x.sharding (line 375)

Agree. Currently _batch_spec(mesh) reconstructs the spec from mesh axis names, but x already carries its sharding. Deriving from x.sharding would be more robust and self-documenting, and would remove the batch_spec parameter concern entirely. Something like:

batch_spec = x.sharding.spec[:1] + (P(None),) * (x.ndim - 1)

or simply pulling the first partition from x.sharding.spec.

Additional observations

4. Hardcoded activation in MoEMLP.__call__ (lines 474, 488)

The functional moe_mlp accepts a configurable MoeActivation, but MoEMLP.__call__ hardcodes ActivationFunctionEnum.silu for the routed path and jax.nn.silu for the shared dense path. If the intent is for MoEMLP to always use silu, this is fine — but it means the MoeActivation type alias and the activation parameter on moe_mlp are only exercisable by direct callers of the functional API. Consider making MoEMLP store the activation choice (as a static field) so model configs can control it, especially since relu2 was just added to the enum.

5. Activation dispatch: callable() check (line 346)

if callable(activation):
    activation_fn = activation
else:
    activation_fn = activation.to_jax_fn()

Since ActivationFunctionEnum(str, enum.Enum) instances aren't callable, this works correctly. However, the intent reads more clearly as isinstance(activation, ActivationFunctionEnum) with the branches swapped:

if isinstance(activation, ActivationFunctionEnum):
    activation_fn = activation.to_jax_fn()
else:
    activation_fn = activation  # already a Callable

This is more explicit about what the union type MoeActivation actually holds and avoids a subtle "what if someone passes a class type?" edge.

6. mesh is None on line 364 is dead after line 343–344

if mesh is None:
    mesh = get_abstract_mesh()  # line 343-344
...
if mesh is None or mesh.empty:  # line 364

If get_abstract_mesh() can return None, then mesh.empty on line 364 would NPE. If it can't return None, then the mesh is None branch is dead code. Suggest either:

Guard: if mesh is None or (hasattr(mesh, 'empty') and mesh.empty):
Or just if mesh is None or mesh.empty: with a comment that get_abstract_mesh() may return None

Minor nit but worth clarifying the contract.

7. Duplicated routing logic between _moe_mlp_local and _moe_mlp_ep_ring_local

Both functions independently compute router_logits → top_k → softmax. The EP path then does its own sort/dispatch because it needs global tokens, so the duplication is understandable. But the first three lines (einsum + top_k + softmax) are identical — factoring out a _route_tokens(x_flat, moe_router, num_experts_per_token, dtype) helper would reduce the surface area for divergence as routing evolves.

8. EP path jnp.bincount with weights=valid_i32 (line 307)

group_sizes = jnp.bincount(expert_local, weights=valid_i32, length=local_experts).astype(jnp.int32)

jnp.bincount's weights parameter typically expects floats. This works but an explicit .astype(jnp.float32) on valid_i32 before passing would make the intent clearer and avoid any future JAX strictness changes.

9. EP num_experts divisibility check is redundant

moe_mlp (line 379–380) validates num_experts % expert_axis_size != 0, and _moe_mlp_ep_ring_local (line 279–282) validates num_experts % local_experts != 0. Since local_experts = moe_w13_local.shape[0] and the shard_map divides by expert_axis_size, these are essentially the same check. Not a bug, just belt-and-suspenders.

10. Test coverage

Test coverage is solid: EP mesh, dense mesh, abstract mesh lowering, loss, shared expert disable, and functional activation enum vs callable equivalence. Two suggestions:

A test exercising relu2 (or another non-silu activation) through the enum path would validate the new activation.py additions end-to-end.
The _moe_mlp_ep_ring_local capacity overflow path (where tokens exceed local_capacity) is not explicitly tested. A test with a very small capacity_factor could cover this.

11. activation.py changes

Clean and correct. _relu2_jax matches the haliax relu_squared implementation, and both mapping dicts are updated consistently.

Summary

Good refactor overall. The main actionable items are:

Add jaxtyping annotations (per self-review)
Derive batch_spec from x.sharding (per self-review)
Consider making MoEMLP activation configurable rather than hardcoded
Use isinstance instead of callable for activation dispatch clarity
Factor out the shared routing preamble if you want to reduce duplication

dlwh · 2026-02-26T07:54:06Z

🤖 Followed up on the extra Claude review points as part of commit 01757f4c5:

switched MoE activation dispatch to explicit isinstance(activation, ActivationFunctionEnum) handling
adjusted EP bincount weighting to use float weights for clarity/stability
kept the new batch_spec derivation from x.sharding.spec[0] with mesh fallback

I did not make MoEMLP activation configurable yet; it remains intentionally hardcoded to SiLU in the module wrapper, while moe_mlp(...) stays configurable for direct functional callers.

…sible-defaults

dlwh · 2026-02-28T07:25:56Z

@claude ptal

claude · 2026-02-28T07:26:27Z

Claude finished @dlwh's task in 2m 27s —— View job

🤖

Read changed files and gather context
Review grug_moe.py (core MoE implementation)
Review experiments/grug/moe/ (template variant)
Review test_variant_contracts.py and test_grugformer_moe.py
Review remaining changes (activation.py, test_grug_base_template.py, README.md)
Check follow-up commit addressing previous review points
Post comprehensive review

Review: Grug MoE Module, Template Variant, and Contracts

The core MoE implementation in grug_moe.py is solid — the functional moe_mlp / equinox MoEMLP split is clean, the EP ring path is well-documented with issue references, and the follow-up commit (01757f4c5) correctly addressed the isinstance dispatch, float bincount weights, and batch_spec derivation from x.sharding. The activation.py changes are correct, the variant contract test pattern is a nice addition, and test coverage is strong.

I have a few items below, ranging from structural concerns to minor nits.

1. Near-total duplication between `grug_moe.py` and `experiments/grug/moe/model.py`

model.py copies CausalSelfAttention, RMSNorm, Block, Transformer, GrugModelConfig, _mesh_has_axis, _mesh_axis_size, _batch_spec, _init_weight, and _shared_dense_mlp verbatim (or near-verbatim) from grug_moe.py, then only imports moe_mlp and MoeActivation from it. That's ~250 lines of duplicated code.

I understand the template philosophy ("variants iterate independently"), but right now the only MoE-specific code in model.py is the MoEMLP class (~70 lines) and debug_mesh_and_token_pspec (~20 lines). Everything else — attention, norms, the transformer shell, config — is identical. If these diverge independently, the shared correctness surface doubles; if they don't diverge, the duplication is pure maintenance cost.

Consider either:

Importing the non-MoE building blocks from grug_moe.py (or from a shared grug_blocks module), or
At minimum, adding a comment at the top of model.py noting which classes are intentional copies vs which are expected to stay in sync with grug_moe.py.

2. Weight naming divergence creates checkpoint incompatibility

grug_moe.py's MoEMLP uses field names moe_router, moe_w13, moe_w2, shared_w13, shared_w2, while model.py's MoEMLP uses router, w_up_gate, w_down, shared_w_up_gate, shared_w_down. Since Equinox serializes by field name, checkpoints from one implementation can't load into the other. If these are truly meant to be two distinct checkpointing surfaces, that's fine but worth documenting. If they should be interchangeable, the field names need to align.

3. Loss function naming inconsistency across variants

The base variant uses compute_next_token_loss while the MoE variant uses next_token_loss. The variant contract test (test_variant_contracts.py:46-51) works around this with _infer_loss_fn_name and hasattr checks, but this means:

Variants aren't drop-in replaceable
The train loop in moe/train.py is tightly coupled to the MoE naming

Standardizing on one name (preferably next_token_loss to match the Levanter library convention in grug_moe.py) across both variants would clean this up. The _infer_loss_fn_name heuristic could then become a simple assertion.

4. `moe/train.py` is 482 lines with only 4 lines different from `base/train.py`

The diff between the two train files is:

< from experiments.grug.base.model import GrugModelConfig, Transformer
> from experiments.grug.moe.model import GrugModelConfig, Transformer
< per_pos_loss = model.compute_next_token_loss(
> per_pos_loss = model.next_token_loss(
< glu=False,
> glu=True, num_experts=..., num_shared_experts=..., num_experts_per_tok=...
< return compute_params.compute_next_token_loss(
> return compute_params.next_token_loss(

If the loss function name were standardized (point 3), the train loop could be shared entirely with just the model import and FLOP params as variant-specific config. This would avoid 480 lines of copy-paste that will need to be kept in sync for every train loop improvement.

5. EP capacity overflow is silent

In _moe_mlp_ep_ring_local (grug_moe.py:293), when local_count > local_capacity, excess tokens are silently dropped via jnp.nonzero(..., size=local_capacity, fill_value=0). This is fine for JIT-friendliness but could silently degrade model quality under skewed routing. Consider adding a debug counter (even if it's only enabled under a flag or logged periodically) so users can detect when this is happening.

6. Shared expert weights are fully replicated

In both MoEMLP.init implementations, shared_w13 and shared_w2 use P(None, None) — fully replicated. For the trial config this is fine (shared_expert_intermediate_dim=1792), but at larger scales this becomes significant memory overhead. The base variant's dense MLP shards along ("data", "model") / ("model", "data") axes. Consider sharding the shared expert weights similarly, or adding a comment noting the replication is intentional and why.

7. Minor: `AssertionError` typos in `test_variant_contracts.py`

Lines 32 and 51 raise AssertionError — this works because Python doesn't validate exception class names at parse time, but it's misspelled (should be AssertionError → AssertionError). Wait, actually looking again — Python has AssertionError... no, it's AssertionError. Let me re-check: Python's built-in is AssertionError.

Actually, I checked again — the correct Python exception is AssertionError. Looking at lines 32 and 51:

raise AssertionError(f"No grug variants with model.py found under {grug_dir}")
raise AssertionError("Transformer variant must define either compute_next_token_loss or next_token_loss")

Hmm, AssertionError — that is indeed the correct spelling. My mistake, this is fine.

8. Minor: `_shared_dense_mlp` signature divergence

In grug_moe.py, _shared_dense_mlp takes activation_fn: Callable (pre-resolved). In model.py, it takes activation: MoeActivation (resolves internally). This means the model.py version does its own enum-to-callable dispatch redundantly. Minor, but it's another surface where the two files can drift apart.

Summary

The MoE core (grug_moe.py) is well-implemented and well-tested. The main structural concern is the volume of near-verbatim duplication between the library module and the experiment template — the two model.py + train.py files together add ~850 lines of copy-paste for ~90 lines of MoE-specific logic. If the template philosophy is firm, I'd suggest at least standardizing the loss function name and weight field names across variants to reduce the divergence footprint.

Actionable items by priority:

Standardize loss function naming (next_token_loss everywhere) — low effort, removes a structural divergence and simplifies the contract test
Align MoEMLP field names between grug_moe.py and model.py if checkpoint interop is desired
Consider sharing the train loop with variant-specific config rather than full copy
Consider sharing non-MoE building blocks from grug_moe.py into model.py
Add capacity overflow observability to the EP path
Document shared expert replication decision

View job

dlwh · 2026-02-28T07:31:31Z

🤖 Grug variant diff report

New Variant	Closest Existing Variant	Distance Score	Diff
`moe`	`base`	276	Open

dlwh added 4 commits February 17, 2026 12:41

docs: add agent-directed research logbook recipe

61a8a38

Add canonical Grug MoE block with EP ring path

6fc3787

Add v5p-8 MoE EP/DP microbench snapshot (refs #2710)

dda102c

Refactor grug MoE core and unify activation handling

0836e57

dlwh added the agent-generated label Feb 26, 2026

dlwh added 4 commits February 25, 2026 22:53

Drop local MoE bench logbook from PR

c45ea95

Drop local MoE bench tmp scripts from PR

5084dea

Reset docs drift files to origin/main

8d47ad2

Merge remote-tracking branch 'origin/main' into codex/grug-moe-ep-sen…

22ebe2c

…sible-defaults

dlwh commented Feb 26, 2026

View reviewed changes

lib/levanter/src/levanter/grug/grug_moe.py Outdated Show resolved Hide resolved

lib/levanter/src/levanter/grug/grug_moe.py Outdated Show resolved Hide resolved

lib/levanter/src/levanter/grug/grug_moe.py Outdated Show resolved Hide resolved

Address PR review follow-ups in grug MoE

01757f4

dlwh added 2 commits February 27, 2026 17:22

Merge remote-tracking branch 'origin/main' into codex/grug-moe-ep-sen…

3ab54cb

…sible-defaults

refactor grug moe routing boundary and template contracts

901bf8b

dlwh changed the title ~~Refactor Grug MoE core and activation handling~~ Add canonical Grug MoE module, template variant, and contracts Feb 28, 2026

Consolidate grug template tests into variant contracts

6ccc01c

dlwh added 2 commits February 27, 2026 23:41

Address remaining Grug MoE review feedback

96ccede

Clarify grug duplication expectations in PR review guide

f53d128

dlwh mentioned this pull request Feb 28, 2026

Add Grug variant visual diff tooling and PR workflow #3127

Open

Return EP MoE dropped assignment count instead of debug logging

99f223a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add canonical Grug MoE module, template variant, and contracts#3046

Add canonical Grug MoE module, template variant, and contracts#3046
dlwh wants to merge 15 commits intomainfrom
codex/grug-moe-ep-sensible-defaults

dlwh commented Feb 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dlwh commented Feb 26, 2026

Uh oh!

claude bot commented Feb 26, 2026 •

edited

Loading

Uh oh!

dlwh commented Feb 26, 2026

Uh oh!

dlwh commented Feb 28, 2026

Uh oh!

claude bot commented Feb 28, 2026 •

edited

Loading

Uh oh!

dlwh commented Feb 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dlwh commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dlwh commented Feb 26, 2026

Uh oh!

claude bot commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: Refactor Grug MoE core and activation handling

Agreeing with self-review items

Additional observations

Summary

Uh oh!

dlwh commented Feb 26, 2026

Uh oh!

dlwh commented Feb 28, 2026

Uh oh!

claude bot commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: Grug MoE Module, Template Variant, and Contracts

1. Near-total duplication between grug_moe.py and experiments/grug/moe/model.py

2. Weight naming divergence creates checkpoint incompatibility

3. Loss function naming inconsistency across variants

4. moe/train.py is 482 lines with only 4 lines different from base/train.py

5. EP capacity overflow is silent

6. Shared expert weights are fully replicated

7. Minor: AssertionError typos in test_variant_contracts.py

8. Minor: _shared_dense_mlp signature divergence

Summary

Uh oh!

dlwh commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dlwh commented Feb 26, 2026 •

edited

Loading

claude bot commented Feb 26, 2026 •

edited

Loading

claude bot commented Feb 28, 2026 •

edited

Loading

1. Near-total duplication between `grug_moe.py` and `experiments/grug/moe/model.py`

4. `moe/train.py` is 482 lines with only 4 lines different from `base/train.py`

7. Minor: `AssertionError` typos in `test_variant_contracts.py`

8. Minor: `_shared_dense_mlp` signature divergence

dlwh commented Feb 28, 2026 •

edited

Loading