
Conversation


@kinjalpatel27 kinjalpatel27 commented Oct 17, 2025

What does this PR do?

Bug fix

Overview:
For the amax sync test, set the weights explicitly to constant values based on each rank, which ensures each rank produces different amax values without synchronization.
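
A minimal sketch of the approach, reconstructed from the review notes further down in this thread (the helper name `init_rank_dependent_weights` is illustrative, not the test's literal structure):

```python
import torch

def init_rank_dependent_weights(model, rank):
    """Fill every trainable weight matrix (excluding embeddings) with a
    constant derived from the rank and the weight's index, so each rank
    sees different amax values before synchronization."""
    weight_idx = 0
    for name, param in model.named_parameters():
        # Target weight matrices only: skip embeddings, biases, and 1-D params.
        if "weight" in name and "embedding" not in name and param.requires_grad and param.dim() >= 2:
            with torch.no_grad():
                # Rank- and index-dependent constant (formula quoted in the review below).
                param.fill_(0.1 + rank * 0.5 + weight_idx * 0.05)
            weight_idx += 1
    # Rank-aware prompt tokens on GPU, cast to long for embedding lookup.
    return (torch.ones((2, model.max_sequence_length)) * 0.05 + rank * 0.5).cuda().long()
```

Constant-filled weights make each tensor's amax equal its fill value, so no two ranks agree until synchronization runs.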

Usage

pytest -svvv gpu/torch/quantization/plugins/test_megatron.py::test_expert_parallel_sync

Testing

Ran the pytest command above.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: NA
  • Did you add or update any necessary documentation?: NA
  • Did you update Changelog?: NA

Additional Information

Summary by CodeRabbit

  • New Features

    • Added mention of MCore MoE support for PTQ/QAT/QAD in release notes.
  • Tests

    • Enhanced distributed quantization tests to use rank-aware data initialization, improving validation of activation-statistic synchronization across expert-parallel model configurations.


coderabbitai bot commented Oct 17, 2025

Walkthrough

Test initialization was changed to set per-parameter, rank-dependent weight values (excluding embeddings) and to create a rank-scaled prompt_tokens; this produces distinct amax values across ranks to exercise amax synchronization in expert-parallel quantization tests.

Changes

  • Test initialization refactor — tests/gpu/torch/quantization/plugins/test_megatron.py: Replaced the single prompt_tokens initialization with per-parameter weight initialization: iterates model parameters, targets weight matrices (names containing "weight" and not "embedding"), fills trainable params with dim ≥ 2 using rank- and index-dependent constants, and sets prompt_tokens to a rank-aware long CUDA tensor. This produces distinct amax values across ranks for synchronization testing.
  • Changelog update — CHANGELOG.rst: Added a bullet under "New Features" noting support for MCore MoE PTQ/QAT/QAD. No other structural changes.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Process
    participant Model as Model (per-rank)
    participant CUDA as CUDA tensors
    participant Sync as amax Sync/Check

    Note over Test,Model: Initialization phase (per-rank)
    Test->>Model: iterate parameters
    alt param is weight && not embedding && dim >= 2
        Test->>Model: fill param with rank- and index-dependent constant
        Model-->>CUDA: parameter placed on device
    else other params
        Test->>Model: leave unchanged or default init
    end

    Note over Test,CUDA: prompt_tokens setup
    Test->>CUDA: create rank-aware sequence -> to long CUDA tensor

    Note over Test,Sync: validation phase
    Test->>Sync: compute local amax values
    Sync->>Sync: aggregate/compare amax across ranks
    Sync-->>Test: synchronization result (pass/fail)
```
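
The validation phase in the diagram can be sketched conceptually: amax synchronization amounts to a MAX reduction over the parallel group, so an already-synced tensor equals the group-wide maximum on every rank. A hedged sketch of that idea (this is not modelopt's actual sync code; `amax_is_synced` is a hypothetical helper):

```python
import torch
import torch.distributed as dist

def amax_is_synced(local_amax: torch.Tensor, group=None) -> bool:
    """Return True if this rank's amax already equals the group-wide max.

    With rank-dependent constant weights, this check should fail on at
    least one rank before synchronization and pass on all ranks after.
    """
    synced = local_amax.clone()
    # amax synchronization is, conceptually, a MAX all-reduce over the group.
    dist.all_reduce(synced, op=dist.ReduceOp.MAX, group=group)
    return torch.equal(local_amax, synced)
```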

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through tensors, row by row,
Per-rank I painted weights to show,
Amaxes shimmer, not the same,
Experts nod — they join the game,
Testing hops, and off I go! 🎩🥕

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The pull request title "Updated amax_sync test to set const weights based on rank" accurately captures the main change in the changeset. The primary modification updates the _test_expert_model_parallel_amax_sync test function to initialize weight tensors with rank-dependent constant values instead of a single prompt_tokens initialization, which aligns directly with what the title describes. The title is specific, concise (10 words), and uses clear technical terminology that a reviewer can quickly understand. While the PR also includes a changelog entry as a secondary change, the core substance of the PR is the test modification, which the title appropriately highlights.
  • Docstring Coverage — ✅ Passed: No functions found in the changes; docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bed2ecd and d7dd6ad.

📒 Files selected for processing (1)
  • CHANGELOG.rst (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/gpu/torch/quantization/plugins/test_megatron.py (1)

691-691: Consider improving prompt_tokens variation across ranks.

The current formula (torch.ones(...) * 0.05 + rank * 0.5).long() produces identical values for consecutive rank pairs (ranks 0-1 get 0, ranks 2-3 get 1, etc.). While this doesn't break the test (since weights already differ), you could improve variation:

```diff
-    prompt_tokens = (torch.ones((2, model.max_sequence_length)) * 0.05 + rank * 0.5).cuda().long()
+    prompt_tokens = (torch.ones((2, model.max_sequence_length)) * rank).cuda().long()
```

This creates unique token indices for each rank: 0, 1, 2, 3, etc., providing better per-rank distinction.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ef9954 and bed2ecd.

📒 Files selected for processing (1)
  • tests/gpu/torch/quantization/plugins/test_megatron.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/gpu/torch/quantization/plugins/test_megatron.py (2)
modelopt/torch/utils/distributed.py (2)
  • rank (68-72)
  • rank (200-202)
modelopt/torch/quantization/src/tensor_quant_mx.cu (4)
  • cuda (75-101)
  • cuda (75-76)
  • cuda (105-137)
  • cuda (105-105)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
tests/gpu/torch/quantization/plugins/test_megatron.py (1)

677-690: Excellent approach to create rank-dependent weight values.

The weight initialization logic correctly:

  • Filters for trainable weight matrices (excluding embeddings and biases)
  • Assigns distinct constant values per rank and parameter index
  • Ensures different amax values across ranks for proper synchronization testing

The filtering conditions are appropriate and the rank-dependent formula 0.1 + (rank * 0.5) + (weight_idx * 0.05) effectively creates the needed variation.
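
As a quick illustration of the variation this formula produces (hypothetical case: two ranks, two qualifying weights):

```python
# 0.1 + rank * 0.5 + weight_idx * 0.05 for ranks 0-1 and weight indices 0-1:
for rank in range(2):
    print(rank, [round(0.1 + rank * 0.5 + i * 0.05, 2) for i in range(2)])
# rank 0 -> [0.1, 0.15], rank 1 -> [0.6, 0.65]
# A constant-filled weight's amax equals its fill value, so no two ranks
# share an amax before synchronization.
```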


codecov bot commented Oct 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.39%. Comparing base (6ef9954) to head (d7dd6ad).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
```text
@@            Coverage Diff             @@
##             main     #451      +/-   ##
==========================================
+ Coverage   73.38%   73.39%   +0.01%
==========================================
  Files         180      180
  Lines       17942    17976      +34
==========================================
+ Hits        13166    13194      +28
- Misses       4776     4782       +6
```

☔ View full report in Codecov by Sentry.

@kinjalpatel27 kinjalpatel27 merged commit d19f4f5 into main Oct 21, 2025
27 checks passed
@kinjalpatel27 kinjalpatel27 deleted the kinjal/fix_test_amax_sync branch October 21, 2025 19:32