Conversation

@thuwzt (Contributor) commented Dec 20, 2024

Fix for the issue #1330 : [BUG] MoE load balancing loss is accumulated twice when using activation checkpointing

@taehwakkwon

Thank you for fixing this issue. I have a question regarding the aux loss.

Although the value appears to be doubled, I believe the actual training remains the same, since the aux loss is not accumulated into the optimizer. The reason I'm asking is that we previously had recompute turned on for a period, and now we can turn it off, so I was unsure whether we need to double moe-aux-loss-coeff. I don't think it's necessary, because the doubled value is only added to aux_losses_tracker:

tracker[name]["values"][layer_number - 1] += loss.detach()  # Aggregate the loss for the layer.
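For context, here is a minimal, self-contained sketch of the behavior discussed above (illustrative names and a toy aux loss, not the actual Megatron-LM code): the detached tracker update sits inside the layer forward, so when the forward is re-executed under activation checkpointing the logged value doubles, while the gradient path is untouched.

```python
import torch
from torch.utils.checkpoint import checkpoint

NUM_LAYERS = 2  # hypothetical

# Simplified stand-in for the aux_losses_tracker (structure only, names illustrative).
tracker = {"load_balancing_loss": {"values": torch.zeros(NUM_LAYERS)}}

def moe_layer(hidden, layer_number):
    # Stand-in for the router's load-balancing aux loss.
    aux_loss = hidden.float().pow(2).mean()

    # Logging path: detached, so it never contributes a gradient.
    tracker["load_balancing_loss"]["values"][layer_number - 1] += aux_loss.detach()

    # Training path (kept trivial here): the real code attaches the scaled aux
    # loss to the graph exactly once.
    return hidden * 1.0

# Layer 1: plain forward, the logging line runs once per step.
x1 = torch.ones(4, 8, requires_grad=True)
moe_layer(x1, layer_number=1).sum().backward()

# Layer 2: forward wrapped in activation checkpointing (what --moe-layer-recompute does).
# The function body is re-executed during backward, so the detached logging line
# runs twice and the *displayed* value is doubled; the gradient path is unchanged.
x2 = torch.ones(4, 8, requires_grad=True)
checkpoint(moe_layer, x2, 2, use_reentrant=True).sum().backward()

print(tracker["load_balancing_loss"]["values"])  # tensor([1., 2.])
```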

@thuwzt (Contributor, Author) commented Jan 8, 2025

@taehwakkwon Yes, you understand correctly. This is purely a display bug in the tracker and logger; the compute flow is unaffected whether or not --moe-layer-recompute is used.

@github-actions bot commented Mar 9, 2025

Marking as stale. No activity in 60 days.

@github-actions github-actions bot added the stale No activity in 60 days on issue or PR label Mar 9, 2025
@sbhavani sbhavani added bug Something isn't working module: debugging and removed stale No activity in 60 days on issue or PR labels Jul 25, 2025
@sbhavani (Contributor) commented Jan 6, 2026

Thanks @thuwzt for identifying and fixing this MoE aux loss double accumulation bug (#1330). Your fix has been incorporated via internal MR !2574 (commit e6d56d6), where you are credited as a co-author.
As part of our new year cleanup, we're closing this PR. Thanks again for your contribution!
