[peft] fix: Add replica_id handling for dense LoRA adapters with TP > 1 #2252

Open

yaoyu-33 wants to merge 2 commits into main from fix/dense-lora-adapter-tp-replica-id

Conversation

yaoyu-33 (Contributor) commented Feb 6, 2026

Summary

When using LoRA adapters on dense layers (non-MoE) with TP > 1, only one TP shard was being saved during checkpointing. This caused a significant training-inference loss discrepancy because the other TP ranks loaded zero or uninitialized adapter weights.

Root Cause

In ParallelLinearAdapter.sharded_state_dict(), there was special replica_id handling for expert adapters (added in PR #1564, related to verl-project/verl#4303), but dense adapters never received the equivalent fix.

For dense adapters with TP > 1 (illustrated in the sketch after this list):

  • Each TP rank generates a ShardedTensor with the same replica_id
  • Shards should be distinguished by global_offset, but during the PEFT-filtered checkpoint save, TP shards are incorrectly deduplicated
  • Result: Only TP rank 0's shard is saved
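
The collision can be illustrated with a small, self-contained sketch. The key name and shapes are hypothetical; it only uses megatron.core's ShardedTensor.from_rank_offsets to build checkpoint metadata, so no distributed setup is required:

```python
import torch
from megatron.core.dist_checkpointing import ShardedTensor


def adapter_shard(tp_rank: int, tp_size: int) -> ShardedTensor:
    # Hypothetical adapter weight whose dim 0 is sharded across TP ranks.
    local_weight = torch.zeros(16, 32)
    return ShardedTensor.from_rank_offsets(
        "adapter.linear_out.weight",  # same key on every TP rank
        local_weight,
        (0, tp_rank, tp_size),        # (dim, rank offset, fragmentation): offsets differ per rank
        replica_id=0,                 # identical on every rank: this is the collision
    )


shards = [adapter_shard(rank, 2) for rank in range(2)]
assert shards[0].global_offset != shards[1].global_offset  # shards cover different slices
assert shards[0].replica_id == shards[1].replica_id        # yet they look like replicas to the dedup pass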

Solution

Add replica_id handling for dense adapters, mirroring the existing handling for expert adapters. When TP > 1, the replica_id is adjusted to include the TP rank, ensuring each TP shard is correctly identified and saved during PEFT-filtered checkpoint saves.
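
A minimal sketch of the kind of adjustment described here, not the actual patch: it assumes replica_id is a 3-tuple (as the walkthrough below suggests) and that the adapter's sharded state dicts are plain dicts of ShardedTensors; the helper name _add_tp_rank_to_replica_id is hypothetical.

```python
from megatron.core import parallel_state
from megatron.core.dist_checkpointing import ShardedTensor


def _add_tp_rank_to_replica_id(sharded_sd: dict) -> None:
    """Sketch: fold the TP rank into replica_id so dense-adapter TP shards
    survive deduplication during a PEFT-filtered checkpoint save."""
    tp_size = parallel_state.get_tensor_model_parallel_world_size()
    if tp_size <= 1:
        return  # TP = 1: leave replica_id untouched
    tp_rank = parallel_state.get_tensor_model_parallel_rank()
    for sh_ten in sharded_sd.values():
        if isinstance(sh_ten, ShardedTensor):
            dim0, _, dim2 = sh_ten.replica_id  # assumed 3-tuple layout
            sh_ten.replica_id = (dim0, tp_rank, dim2)
```

In ParallelLinearAdapter.sharded_state_dict() this kind of adjustment would be applied to both linear_in_sd and linear_out_sd, mirroring the expert-adapter branch added in PR #1564.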

Changes

  • Add replica_id adjustment for dense adapters with TP > 1 in ParallelLinearAdapter.sharded_state_dict()
  • Add unit tests for the fix covering (a rough sketch of the TP > 1 case follows this list):
    • Dense adapters with TP > 1 (replica_id correctly updated)
    • Dense adapters with TP = 1 (no change to replica_id)
    • Expert adapters (still use EP-based replica_id, not affected)
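
A rough shape for the first of those tests, reusing the _add_tp_rank_to_replica_id sketch above and patching parallel_state so no real TP group is needed; the actual tests live in tests/unit_tests/peft/test_utils.py and exercise ParallelLinearAdapter directly.

```python
import torch
from unittest import mock
from megatron.core.dist_checkpointing import ShardedTensor


def test_tp_rank_folded_into_replica_id():
    # Build a shard as TP rank 1 of 2 with the original (0, 0, 0) replica_id.
    sh_ten = ShardedTensor.from_rank_offsets(
        "adapter.linear_in.weight", torch.zeros(8, 8), (0, 1, 2), replica_id=(0, 0, 0)
    )
    with mock.patch(
        "megatron.core.parallel_state.get_tensor_model_parallel_world_size", return_value=2
    ), mock.patch(
        "megatron.core.parallel_state.get_tensor_model_parallel_rank", return_value=1
    ):
        # _add_tp_rank_to_replica_id is the hypothetical sketch helper shown above.
        _add_tp_rank_to_replica_id({"weight": sh_ten})
    assert sh_ten.replica_id == (0, 1, 0)  # TP rank now distinguishes the shard
```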

Testing

  • Unit tests added
  • Functional tests with multi-GPU TP > 1 (requires CI)

Summary by CodeRabbit

  • Bug Fixes

    • Fixed replica ID assignment for adapter checkpoints when using tensor model parallel configurations, ensuring correct deduplication during checkpoint saving.
  • Tests

    • Added comprehensive test coverage for dense and expert adapter scenarios across different parallel configurations.

When using LoRA adapters on dense layers (non-MoE) with TP > 1, only one
TP shard was being saved during checkpointing. This caused significant
training-inference loss discrepancy because other TP ranks loaded
zero/uninitialized adapter weights.

The fix for expert adapters already existed (PR #1564, related to
verl-project/verl#4303), but dense adapters never received the
equivalent fix.

This commit adds replica_id handling for dense adapters similar to
expert adapters. When TP > 1, the replica_id is adjusted to include
the TP rank, ensuring each TP shard is correctly identified and saved
during PEFT-filtered checkpoint saves.

Changes:
- Add replica_id adjustment for dense adapters with TP > 1 in
  ParallelLinearAdapter.sharded_state_dict()
- Add unit tests for the fix covering:
  - Dense adapters with TP > 1 (replica_id correctly updated)
  - Dense adapters with TP = 1 (no change to replica_id)
  - Expert adapters (still use EP-based replica_id, not affected)

copy-pr-bot bot commented Feb 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


yaoyu-33 commented Feb 6, 2026

/ok to test acaef1a


coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

This pull request modifies PEFT state dict handling for dense adapters with tensor model parallel (TP) size greater than 1, adjusting replica_id values for correct checkpoint deduplication. Comprehensive test coverage is added for parallel linear adapter state dict scenarios across different parallelism configurations.

Changes

PEFT State Dict Logic (src/megatron/bridge/peft/utils.py)
Removed the in-function import of parallel_state and added a new else branch for dense adapters with TP > 1 that adjusts replica_id to (original_dim0, tp_rank, original_dim2) across linear_in_sd and linear_out_sd, so shards are deduplicated correctly when saving PEFT-filtered checkpoints.

Test Coverage (tests/unit_tests/peft/test_utils.py)
Added three new test cases for ParallelLinearAdapter sharded state dict handling: one for dense adapters with TP > 1 verifying that replica_id includes the TP rank, one for TP = 1 verifying that replica_id is unchanged, and one for expert adapters verifying that the EP-based calculation is unaffected by the dense adapter fix.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 62.50%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Test Results For Major Changes: ✅ Passed. The PR qualifies as a minor change: a small, focused code change (+14 lines) with comprehensive unit tests (+157 lines) covering the key scenarios, though functional tests are pending in CI.
  • Title check: ✅ Passed. The title clearly and concisely describes the main change, fixing replica_id handling for dense LoRA adapters when TP > 1, which is the primary purpose of the changeset.


…with TP=2

Add TestLoRAFinetuneTP2 class with two tests:
- test_lora_save_and_resume_tp2: End-to-end save/resume test with TP=2
- test_lora_weights_preserved_after_save_load_tp2: Explicit verification that
  loaded adapter weights exactly match saved weights on all TP ranks

The second test specifically catches the replica_id bug (see the condensed sketch after this list) by:
1. Capturing adapter weights before checkpoint save
2. Loading checkpoint into fresh model
3. Comparing loaded vs saved weights
4. Failing with clear error if loaded weights are all zeros (bug symptom)
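
A condensed sketch of that verification logic; the model handles and the "adapter" parameter-naming convention are assumptions, and the real test runs the full TP=2 finetune/save/load cycle:

```python
import torch


def check_adapter_weights_preserved(model_before, model_after) -> None:
    # Capture adapter weights before the checkpoint save (assumed "adapter" naming).
    saved = {
        name: p.detach().clone()
        for name, p in model_before.named_parameters()
        if "adapter" in name
    }
    # Compare against the weights loaded into a fresh model on this TP rank.
    for name, p in model_after.named_parameters():
        if name not in saved:
            continue
        if torch.all(p == 0) and not torch.all(saved[name] == 0):
            raise AssertionError(
                f"{name} loaded as all zeros; classic symptom of the replica_id dedup bug"
            )
        torch.testing.assert_close(p, saved[name])
```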

yaoyu-33 commented Feb 6, 2026

/ok to test 8094b2b

yaoyu-33 changed the title from "fix(peft): Add replica_id handling for dense LoRA adapters with TP > 1" to "[peft] fix: Add replica_id handling for dense LoRA adapters with TP > 1" on Feb 6, 2026

priyatham-resolve left a comment


LGTM. Thanks for the fix

return pretrain_checkpoint_dir, pretrain_tensorboard_dir, lora_checkpoint_dir, lora_tensorboard_dir


class TestLoRAFinetuneTP2:


nit: This seems to duplicate all the helper methods from TestLoRAFinetune above. Could share them via a base class or mixin to reduce the ~300 lines of boilerplate? Not a blocker.
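
For illustration, a minimal sketch of the kind of mixin refactor this nit suggests; the class and method names here are hypothetical:

```python
class _LoRAFinetuneHelpers:
    """Hypothetical mixin collecting the helper methods shared by both test classes."""

    def run_pretrain(self, tmp_path, tp_size: int = 1):
        ...  # shared pretraining setup

    def run_lora_finetune(self, checkpoint_dir, tp_size: int = 1):
        ...  # shared LoRA finetune / save / resume logic


class TestLoRAFinetune(_LoRAFinetuneHelpers):
    ...  # existing TP=1 tests


class TestLoRAFinetuneTP2(_LoRAFinetuneHelpers):
    ...  # TP=2 tests reuse the same helpers with tp_size=2
```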
