MoE speedrun: Mixtral load balance + v4 smoke preset #3094

Open
pc0618 wants to merge 7 commits into main from pc0618/pr-moe-speedrun-wandb-mfu

Conversation


@pc0618 pc0618 commented Feb 27, 2026

Summary

  • Mixtral: add equilibrium-bias load balancing + router fp32 option.
  • Speedrun: add olmoe_s preset for v4 smoke (conservative cross_entropy_block_size=512).
  • Speedrun: honor --seq-len for training by wiring train_seq_len.
  • Mixtral HF export: omit router_bias from state dict for compatibility.
  • Grugformer MoE: avoid v4 auto-axis sharding failures.
  • Speedrun/W&B: refresh logging defaults + archive/profiling flags for grugformer_moe runs.
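The equilibrium-bias load balancing mentioned above can be sketched as a per-expert bias that nudges the router toward uniform expert load without an auxiliary loss: the bias shifts top-k expert *selection* only, and is updated against the observed load. This is a minimal NumPy sketch of that idea, not the PR's actual Mixtral code; `update_router_bias`, `route`, and the update rule are illustrative names and choices.

```python
import numpy as np

def update_router_bias(bias, tokens_per_expert, lr=1e-3):
    # Sketch of an equilibrium-style bias update (hypothetical, not the
    # PR's exact rule): lower the bias of overloaded experts and raise it
    # for underloaded ones, so routing drifts toward uniform load.
    target = tokens_per_expert.mean()
    error = target - tokens_per_expert          # > 0 when underloaded
    return bias + lr * np.sign(error)

def route(logits, bias, k=2):
    # The bias affects which experts are *selected*; combine weights
    # would still come from the raw (optionally fp32) router logits.
    biased = logits + bias                      # (tokens, experts)
    return np.argsort(-biased, axis=-1)[:, :k]
```

Keeping the bias out of the combine weights is what makes this auxiliary-loss-free: gradients through the router are untouched, only the discrete selection shifts.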

Smoke test (Levanter, v4-8)

  • Command:
    uv run python -m marin.run.ray_run --cluster infra/marin-us-central2.yaml --tpu v4-8 --env_vars WANDB_MODE=online -- python experiments/speedrun/olmoe_1b7b_nemotron_40b.py --model olmoe_s --tpu-type v4-8 --global-batch-size 32 --seq-len 1024 --num-train-steps 20 --dataset nemotron_cc --run-suffix pr-smoke-v4-8-b32-s1024-t20-20260227-000049
  • Artifacts:
  • Result highlights: 20 steps, global_bs=32, seq_len=1024, model_size=158.69M params; MFU (model_flops / peak_hw_flops) ≈ 0.68%.
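The MFU number above is model flops over peak hardware flops. A minimal sketch of that calculation, assuming the standard dense estimate of 6·N flops per token (forward + backward); the PR's exact flop accounting, the step time, and the v4-8 peak-flops constant are assumptions here, not taken from the run:

```python
def approx_mfu(n_params, global_bs, seq_len, step_time_s, peak_flops):
    # 6 * N flops per token is the usual dense fwd+bwd estimate; for an
    # MoE this overcounts unless n_params is the *active* parameter count.
    tokens_per_step = global_bs * seq_len
    model_flops = 6 * n_params * tokens_per_step
    return model_flops / (step_time_s * peak_flops)
```

With the run's shapes (158.69M params, global_bs=32, seq_len=1024), the measured step time and the chosen peak-flops figure fully determine the reported ≈0.68%.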

- Document grugformer MoE entrypoints in docs/reports/grug-archive.md

- Add CLI switches for profiling, jaxpr/HLO artifact logging, and perfetto link generation

- Default to legacy axis resources + non-explicit mesh axes for higher MFU parity with Levanter MoE runs

- Use cached Nemotron Llama3 tokenized components in olmoe_1b7b speedrun and allow CE block-size override
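The profiling/artifact CLI switches described above could be wired up roughly as follows. This is a hedged sketch with argparse; the flag names (`--profile`, `--log-jaxpr-hlo`, `--perfetto-link`) are hypothetical stand-ins and may not match the actual speedrun CLI spelling.

```python
import argparse

def add_debug_flags(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Hypothetical flag names mirroring the PR description.
    parser.add_argument("--profile", action="store_true",
                        help="capture a profiler trace for the run")
    parser.add_argument("--log-jaxpr-hlo", action="store_true",
                        help="archive jaxpr/HLO dumps as run artifacts")
    parser.add_argument("--perfetto-link", action="store_true",
                        help="emit a Perfetto UI link for the captured trace")
    return parser
```

Keeping these as opt-in store_true flags means the default smoke-test invocation stays unchanged.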