grug/moe: restore aux-loss metrics + add smoke launcher #3229
Conversation
Claude finished @dlwh's task in 1m 8s

Review
Restores router aux-loss metrics (load balancing, z-loss, entropy, per-layer histograms) through the MoE forward pass and train loop, adds a configurable TPU type for launches, and provides a smoke script. Fulfills the stated objectives. Findings:
No other bugs, correctness issues, or guideline violations found. The aux-loss computation (coefficient-weighted sum added to the CE loss, with raw and weighted terms logged separately) follows the MaxText/Megatron pattern described in comments. Router stats (…)
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
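As a reader aid, the coefficient-weighted aux-loss pattern the review describes can be sketched as follows. This is a minimal illustration only: the coefficient values and metric names are made up, not taken from the PR.

```python
# Sketch of the MaxText/Megatron-style pattern described in the review: each
# router aux loss is scaled by its coefficient and added to the CE loss,
# while raw and weighted values are logged separately. All names and
# coefficient values here are illustrative, not the PR's actual code.
def combine_losses(ce_loss, load_balance_loss, z_loss,
                   lb_coef=0.01, z_coef=1e-3):
    weighted_lb = lb_coef * load_balance_loss
    weighted_z = z_coef * z_loss
    metrics = {
        "train/ce_loss": ce_loss,
        "router/load_balance/raw": load_balance_loss,
        "router/load_balance/weighted": weighted_lb,
        "router/z_loss/raw": z_loss,
        "router/z_loss/weighted": weighted_z,
    }
    total = ce_loss + weighted_lb + weighted_z
    return total, metrics
```

Logging the raw terms alongside the weighted ones lets you retune coefficients later without re-deriving the unscaled values from the plots.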
@dlwh-golem will you try out this branch on your local machine, modifying launch.py to gpus=1 and probably a smaller batch? Just get it to run for like 10 steps. Change it to TinyStories, like in the tutorial GPU run.
I updated
What changed for the requested local smoke setup:
Run attempt:
So the branch now matches the requested 1-GPU/TinyStories/10-step smoke config, but I couldn't validate the 10 training steps in this runtime due to missing CUDA JAX.
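For context, the requested smoke setup would look roughly like the following. This is a hypothetical sketch: the field names are illustrative, not launch.py's actual API.

```python
# Hypothetical overrides matching the requested 1-GPU / TinyStories / 10-step
# smoke config; key names are illustrative, not launch.py's real parameters.
SMOKE_OVERRIDES = {
    "gpus": 1,                            # single local GPU instead of a pod
    "dataset": "roneneldan/TinyStories",  # as in the tutorial GPU run
    "train_batch_size": 8,                # smaller batch for one GPU
    "num_train_steps": 10,                # just verify the loop runs
}
```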
@dlwh-golem you need to install the GPU packages. Follow the tutorial for GPUs.
Followed the GPU tutorial setup path and installed the GPU Python stack:
Then I retried the PR smoke run:
Result:
Quick check on this worker also shows … If you want, next I can run the same command outside the sandbox on the host runtime (or on a different GPU host) to confirm it reaches 10 steps.
@dlwh-golem try one more time. I fixed some config on your end.
Sorry, I hit an internal failure while handling this mention and couldn't complete the request. Please mention …
4 similar comments
Retried just now after syncing to the latest branch commit (…):

```shell
UV_CACHE_DIR=/tmp/uv-cache \
HF_HOME=/tmp/hf-home \
HF_DATASETS_CACHE=/tmp/hf-home/datasets \
HUGGINGFACE_HUB_CACHE=/tmp/hf-home/hub \
WANDB_MODE=disabled \
uv run python experiments/grug/moe/launch.py --prefix /tmp/grug-moe-local-smoke
```

Result is still the same failure in this harness runtime:
I also attempted to rerun it outside the sandbox (so it can use host GPU drivers directly), but that escalation was not approved in this run, so I’m blocked from validating the 10 training steps here. Exact next action needed:
@dlwh-golem will you try out this branch on your local machine? Be sure you install the GPU packages following the GPU tutorial.
I retried on this branch locally and followed the GPU tutorial path.

```shell
UV_CACHE_DIR=/tmp/uv-cache \
HF_HOME=/tmp/hf-home \
HF_DATASETS_CACHE=/tmp/hf-home/datasets \
HUGGINGFACE_HUB_CACHE=/tmp/hf-home/hub \
WANDB_MODE=disabled \
uv run python experiments/grug/moe/launch.py --prefix /tmp/grug-moe-local-smoke
```

Result:
So GPU package setup is now fine; the current blocker is mesh config for the MoE path (missing `expert` mesh axis).
@dlwh-golem ok, fix that by adding `expert: 1` to the mesh in either train.py or launch.py, and then keep fixing things until it runs or you're out of ideas.
Blocked on environment permissions before I can apply the fix. I attempted to do exactly what you asked:
But this run is in a read-only sandbox, and escalation requests are being rejected, so:
Planned code change (not yet applied due to the block):
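A rough sketch of what adding an `expert: 1` axis to the device mesh might look like. Only the `expert` axis comes from the PR discussion; the other axis names and the reshape are illustrative, not the actual train.py/launch.py code.

```python
import jax
import numpy as np
from jax.sharding import Mesh

# Sketch: give the device mesh a trivial "expert" axis of size 1 so MoE
# sharding rules that reference that axis can resolve on a single device.
# Axis names other than "expert" are made up for illustration.
devices = np.array(jax.devices()).reshape(1, 1, -1)  # (expert, data, model)
mesh = Mesh(devices, axis_names=("expert", "data", "model"))
```

A size-1 axis changes no data placement; it only lets sharding specs that name `expert` type-check on a single-GPU smoke run.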
Exact next action needed:
@dlwh-golem please try again. Continue prior work; first read the full PR comments/review thread before making changes.
Read the full PR thread/reviews first, then continued from prior work. Pushed one incremental fix commit to this PR branch:
What I changed in
Re-run command (same as before):
Current blocker after these fixes:
I tried additional sharding-alignment edits and reruns, but they did not resolve this cleanly, so I reverted those speculative changes and left only the two confirmed fixes above.
@dlwh-golem I fixed that issue, but now there's a Triton lowering thing. Can you at least minimize it, and ideally fix it?
Minimized to a single call site:
So the minimal workaround is to force this path to XLA in the grug loss:

```python
loss = fused_cross_entropy_loss_and_logsumexp_penalty(
    ...,
    implementation="xla",
)
```

I attempted to apply exactly that edit in … Exact next action needed:
@dlwh-golem I thought we gave you full auto.
You’re right to expect full-auto. This runner is still enforcing manual escalation and is rejecting every write command, so I can’t execute the fix. What I attempted immediately after your note (all rejected by the runner):
Exact next action needed to unblock me:
Summary
- `experiments/grug/moe/model.py`: restored router aux-loss metrics (load balancing, z-loss, entropy, per-layer histograms)
- `GRUG_MOE_TPU_TYPE` (default `v6e-8`) to configure the launch TPU type
- added `experiments/grug/moe/smoke_v6e8_aux_losses.py` for small aux-loss smoke launches
- merged `origin/main` into this branch

Validation

```shell
./infra/pre-commit.py --all-files
```
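The configurable TPU type in the summary suggests an environment-variable override pattern. Here is a minimal sketch: only the variable name `GRUG_MOE_TPU_TYPE` and the `v6e-8` default come from the PR summary; the function and surrounding code are illustrative.

```python
import os

# Sketch of how a launcher might read the TPU type: env var overrides the
# default. GRUG_MOE_TPU_TYPE and "v6e-8" come from the PR summary; the
# function itself is illustrative, not the actual launch.py code.
def resolve_tpu_type() -> str:
    return os.environ.get("GRUG_MOE_TPU_TYPE", "v6e-8")
```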