Experiment: OLMoE Quantile Balancing (MoE Odyssey) on Nemotron #3124

@pc0618

Reference

Objective

Run and evaluate a true Quantile Balancing (QB) load-balancing experiment for OLMoE, replacing the previous equilibrium proxy, which produced misleading auxiliary-loss dynamics.

Context

We previously observed an unhealthy pattern in the old stab4-equilibrium implementation:

  • train/equilibrium_lb_bias_loss became strongly negative, which pulled total train/loss down and made it hard to interpret.
  • MoE routing could remain collapsed (high moe/load_violation_max) even while the loss appeared to improve.

A corrected implementation is now in-tree:

  • Alternating quantile QB target (alpha, beta, b*) per MoE Odyssey
  • Bias-target objective (0.5 * scale * ||b - stop_grad(b*)||^2) instead of the signed linear surrogate
  • Config knob: equilibrium_lb_iterations
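
The bias-target objective above can be sketched in plain Python to show why it cannot go negative. This is a minimal illustration, not the in-tree code: the function names, the example bias values, and the targets are all hypothetical, and the alternating quantile computation that produces b* is not reproduced here. Because b* sits under stop_grad, it is treated as a constant and the gradient flows only into b.

```python
def bias_target_loss(b, b_star, scale):
    """0.5 * scale * ||b - b*||^2, with b* treated as a constant target."""
    return 0.5 * scale * sum((bi - ti) ** 2 for bi, ti in zip(b, b_star))


def bias_target_grad(b, b_star, scale):
    """Gradient w.r.t. b only; stop_grad means nothing flows into b*."""
    return [scale * (bi - ti) for bi, ti in zip(b, b_star)]


# Hypothetical per-expert router biases and QB targets (illustrative values).
b = [0.30, -0.10, 0.00, -0.20]
b_star = [0.10, 0.00, 0.00, -0.10]

loss = bias_target_loss(b, b_star, scale=0.01)
grad = bias_target_grad(b, b_star, scale=0.01)
print(f"loss={loss:.6f} grad={grad}")
```

For any non-negative scale the loss is a scaled squared norm, so it is bounded below by zero and reaches zero only when b matches b*. That is the property the old signed surrogate lacked.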

Experiment Plan

  1. Run OLMoE-M stab4 on nemotron_cc with:
    • TPU: v5litepod-64
    • Seq len: 4096
    • Global batch: 128
    • Token target: 40B
    • LRs: 7.5e-4, 1e-3, 2e-3, 3e-3
    • equilibrium_lb_loss_scale=0.01
    • equilibrium_lb_iterations=5
  2. Compare against prior stability baselines (olmoe_m, olmoe_m_stab3) on the same dataset and compute envelope.
  3. Evaluate the first 2k-, 10k-, and 50k-step windows before deciding whether to continue the full runs for all LRs.
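
As a quick sanity check on the plan above, the step budget and the token coverage of each evaluation window follow directly from the sequence length, global batch, and token target. The sketch below also lays out the LR sweep as a list of dicts. Only equilibrium_lb_loss_scale and equilibrium_lb_iterations are names from this issue; the other keys (learning_rate, seq_len, global_batch) are hypothetical placeholders and may not match the real config schema.

```python
import math

SEQ_LEN = 4096
GLOBAL_BATCH = 128
TOKEN_TARGET = 40_000_000_000  # 40B tokens

tokens_per_step = SEQ_LEN * GLOBAL_BATCH  # tokens consumed per optimizer step
total_steps = math.ceil(TOKEN_TARGET / tokens_per_step)

# Tokens seen by the end of each early evaluation window.
windows = {steps: steps * tokens_per_step for steps in (2_000, 10_000, 50_000)}

LEARNING_RATES = [7.5e-4, 1e-3, 2e-3, 3e-3]
sweep = [
    {
        "learning_rate": lr,                # hypothetical key name
        "seq_len": SEQ_LEN,                 # hypothetical key name
        "global_batch": GLOBAL_BATCH,       # hypothetical key name
        "equilibrium_lb_loss_scale": 0.01,  # from the plan
        "equilibrium_lb_iterations": 5,     # from the plan
    }
    for lr in LEARNING_RATES
]

print(f"tokens/step={tokens_per_step:,} total_steps={total_steps:,}")
for steps, tokens in windows.items():
    print(f"{steps:>6} steps -> {tokens / 1e9:.2f}B tokens "
          f"({tokens / TOKEN_TARGET:.1%} of target)")
```

With these numbers a full run is about 76k steps, and the 50k-step window already covers roughly two thirds of the 40B-token target, so the continuation decision that matters most for compute is the one at 10k steps.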

Primary Metrics

  • Training behavior:
    • train/loss
    • train/router_z_loss
    • train/equilibrium_lb_bias_loss (should stay non-negative and bounded, since the objective is now a squared norm)
  • MoE balance:
    • moe/load_violation_max
    • moe/equilibrium_rel_load_violation_max
    • moe/equilibrium_quantile_prob_mean
    • per-layer expert load histograms / routing entropy
  • Reliability:
    • no OOM / vmem failures
    • stable throughput / no chronic runtime-env failures
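
This issue does not spell out how the balance metrics are computed, so the following is only an illustration of the kind of quantity involved. It assumes a max relative load violation defined as the busiest expert's share of tokens divided by the uniform share, minus one, and a routing entropy normalized by log(num_experts). Both definitions and both function names are assumptions and may differ from what moe/load_violation_max actually logs.

```python
import math


def max_rel_load_violation(counts):
    """Assumed definition: max_e(load_e / mean_load) - 1; 0.0 means perfectly balanced."""
    mean_load = sum(counts) / len(counts)
    return max(counts) / mean_load - 1.0


def normalized_routing_entropy(counts):
    """Entropy of the expert-load distribution divided by log(num_experts)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip empty experts (0 * log 0 = 0)
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


balanced = [100, 100, 100, 100]  # hypothetical per-expert token counts
collapsed = [400, 0, 0, 0]       # all tokens routed to one expert

for name, counts in (("balanced", balanced), ("collapsed", collapsed)):
    print(name,
          round(max_rel_load_violation(counts), 3),
          round(normalized_routing_entropy(counts), 3))
```

Under these assumed definitions a collapsed layer shows a large violation and near-zero entropy together, which is why the histograms and entropy are worth reading alongside the scalar violation metric.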

Success Criteria

  • No pathological loss artifact from the equilibrium term (no large negative drift masking CE behavior).
  • MoE load balance improves vs. the pre-fix run (violation metrics trend downward or sit clearly lower).
  • Training is stable and W&B logging stays online for all sweep runs.
  • At least one LR candidate demonstrates healthy early-phase convergence and routing behavior.

Current Run Artifact

Tasks

  • Compare QB-fixed stab4 vs prior stab3/base on matched intervals.
  • Summarize whether QB should remain in the stab4 default for future 40B-token sweeps.
