Reference
- MoE Odyssey #6: Optimal Allocation for Equilibrium (Quantile Balancing):
https://datasets.osmarks.net/kexue/site/11619-MoE-Odyssey-6.-Optimal-Allocation-for-Equilibrium.html
Objective
Run and evaluate a true Quantile Balancing (QB) load-balancing experiment for OLMoE, replacing the previous equilibrium proxy, which produced misleading auxiliary-loss dynamics.
Context
We previously observed an unhealthy pattern in the old stab4-equilibrium implementation:
- train/equilibrium_lb_bias_loss became strongly negative and distorted interpretation of the total train/loss.
- MoE routing could remain collapsed (high moe/load_violation_max) despite apparent loss improvement.
A corrected implementation is now in-tree:
- Alternating quantile QB target (alpha, beta, b*) per MoE Odyssey
- Bias-target objective (0.5 * scale * ||b - stop_grad(b*)||^2) instead of signed linear surrogate
- Config knob: equilibrium_lb_iterations
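The two pieces above can be sketched in plain Python. This is a minimal illustration, not the in-tree code: it assumes the QB target comes from alternating quantile updates on per-token thresholds alpha and per-expert thresholds beta (each set to the midpoint between the k-th and (k+1)-th largest residual score), and that b* = -beta. The helper names (qb_target, bias_target_loss, top_k_loads), the midpoint tie-break, and the sign convention are assumptions; the iterations argument plays the role of equilibrium_lb_iterations.

```python
def kth_largest_midpoint(values, k):
    """Midpoint between the k-th and (k+1)-th largest entries (1-indexed)."""
    ordered = sorted(values, reverse=True)
    return 0.5 * (ordered[k - 1] + ordered[k])


def qb_target(scores, k, iterations):
    """Alternate per-token (alpha) and per-expert (beta) quantile updates.

    scores: n rows of m router scores. Returns (alpha, beta, b_star),
    where b_star = -beta is an assumed sign convention.
    """
    n, m = len(scores), len(scores[0])
    assert k < m and (n * k) % m == 0, "sketch needs k < m and an integer capacity"
    capacity = n * k // m  # tokens per expert under perfect balance
    alpha, beta = [0.0] * n, [0.0] * m
    for _ in range(iterations):
        # Per-token threshold: separates the top-k experts from the rest.
        alpha = [
            kth_largest_midpoint([scores[i][j] - beta[j] for j in range(m)], k)
            for i in range(n)
        ]
        # Per-expert threshold: separates the top-`capacity` tokens from the rest.
        beta = [
            kth_largest_midpoint([scores[i][j] - alpha[i] for i in range(n)], capacity)
            for j in range(m)
        ]
    return alpha, beta, [-x for x in beta]


def bias_target_loss(b, b_star, scale):
    """Return 0.5 * scale * ||b - b*||^2 and its gradient with respect to b.

    b_star is treated as a constant, which is what stop_grad expresses.
    """
    diff = [bi - ti for bi, ti in zip(b, b_star)]
    loss = 0.5 * scale * sum(d * d for d in diff)
    grad = [scale * d for d in diff]
    return loss, grad


def top_k_loads(scores, bias, k):
    """Count tokens routed to each expert by top-k over score + bias."""
    m = len(bias)
    loads = [0] * m
    for row in scores:
        ranked = sorted(range(m), key=lambda j: row[j] + bias[j], reverse=True)
        for j in ranked[:k]:
            loads[j] += 1
    return loads


# Toy case: every token prefers expert 0 before any bias is applied.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
_, _, toy_b_star = qb_target(toy_scores, k=1, iterations=5)
toy_loss, toy_grad = bias_target_loss([0.0, 0.0], toy_b_star, scale=0.01)
```

Because the objective is a squared distance to a fixed target, it is non-negative by construction, so it cannot drift to large negative values the way the earlier signed linear surrogate did. On the toy scores, routing with the target bias moves the per-expert loads from fully collapsed to evenly split.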
Experiment Plan
- Run OLMoE-M stab4 on nemotron_cc with:
  - TPU: v5litepod-64
  - Seq len: 4096
  - Global batch: 128
  - Token target: 40B
  - LRs: 7.5e-4, 1e-3, 2e-3, 3e-3
  - equilibrium_lb_loss_scale=0.01
  - equilibrium_lb_iterations=5
- Compare against prior stability baselines (olmoe_m, olmoe_m_stab3) on the same dataset/compute envelope.
- Evaluate the first 2k, 10k, and 50k step windows before deciding on full-run continuation for all LRs.
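For orientation, the step budget implied by these settings can be worked out directly. The sketch below assumes the global batch of 128 counts sequences of 4096 tokens each; the dictionary keys learning_rate, seq_len, and global_batch are hypothetical names used for illustration, and only equilibrium_lb_loss_scale and equilibrium_lb_iterations are taken from the plan.

```python
SEQ_LEN = 4096
GLOBAL_BATCH = 128  # assumed to mean sequences per step
TOKEN_TARGET = 40_000_000_000
LEARNING_RATES = [7.5e-4, 1e-3, 2e-3, 3e-3]
EVAL_WINDOWS = [2_000, 10_000, 50_000]  # step windows from the plan

TOKENS_PER_STEP = SEQ_LEN * GLOBAL_BATCH
TOTAL_STEPS = -(-TOKEN_TARGET // TOKENS_PER_STEP)  # ceiling division


def sweep_configs():
    """One config dict per learning rate (key names are hypothetical)."""
    return [
        {
            "learning_rate": lr,
            "seq_len": SEQ_LEN,
            "global_batch": GLOBAL_BATCH,
            "equilibrium_lb_loss_scale": 0.01,
            "equilibrium_lb_iterations": 5,
        }
        for lr in LEARNING_RATES
    ]


def window_tokens():
    """Tokens consumed by the end of each evaluation window."""
    return {steps: steps * TOKENS_PER_STEP for steps in EVAL_WINDOWS}


if __name__ == "__main__":
    print("tokens per step:", TOKENS_PER_STEP)
    print("steps to reach token target:", TOTAL_STEPS)
    for steps, tokens in window_tokens().items():
        print(f"{steps} steps = {tokens / TOKEN_TARGET:.1%} of the token target")
```

Under that assumption each step consumes 524,288 tokens, the 40B target corresponds to about 76.3k steps, and the 50k-step window covers roughly two thirds of a full run.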
Primary Metrics
- Training behavior:
  - train/loss
  - train/router_z_loss
  - train/equilibrium_lb_bias_loss (should be well-behaved, non-pathological)
- MoE balance:
  - moe/load_violation_max
  - moe/equilibrium_rel_load_violation_max
  - moe/equilibrium_quantile_prob_mean
  - per-layer expert load histograms / routing entropy
- Reliability:
  - no OOM / vmem failures
  - stable throughput / no chronic runtime-env failures
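The exact definitions behind these metric names are not given here. As a rough sketch, a maximum load violation and a normalized routing entropy could be computed from per-expert token counts as follows; the formula (max load minus mean load, divided by mean load) and both function names are assumptions, not the in-tree definitions.

```python
import math


def load_violation_max(loads):
    """Relative excess of the busiest expert over the mean load (assumed formula)."""
    mean = sum(loads) / len(loads)
    return (max(loads) - mean) / mean


def normalized_routing_entropy(loads):
    """Shannon entropy of the load distribution, scaled to the range [0, 1]."""
    total = sum(loads)
    probs = [count / total for count in loads if count > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(loads))


if __name__ == "__main__":
    for name, loads in [("balanced", [2, 2]), ("collapsed", [4, 0])]:
        print(name, load_violation_max(loads), normalized_routing_entropy(loads))
```

With this convention a perfectly balanced layer has violation 0 and entropy 1, while a layer that routes everything to one of two experts has violation 1 and entropy 0.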
Success Criteria
- No pathological loss artifact from equilibrium term (no large negative drift masking CE behavior).
- MoE load balance trends improve vs pre-fix run (downward or clearly lower violation trajectory).
- Stable training + W&B logging online for sweep runs.
- At least one LR candidate demonstrates healthy early-phase convergence and routing behavior.
Current Run Artifact
- Running fixed sweep submission: raysubmit_sF79814LVXC1Jxe6
- Current W&B run (first LR): https://wandb.ai/marin-community/olmoe_m/runs/s4096_b128_euw4_v5l-174603-d4ba1c
Tasks
- Compare QB-fixed stab4 vs prior stab3/base on matched intervals.
- Summarize whether QB should remain in stab4 default for future 40B-token sweeps.