
Enabling TP Comm Overlap and Packed Sequencing Configs for LLAMA3 70B… #2247

Open

rhmukundan wants to merge 2 commits into main from rmukundan/llama3_lora_tp_overlap_packed_seq

Conversation

@rhmukundan
Contributor

rhmukundan commented Feb 5, 2026

Enables TP Comm Overlap when TP > 1 (for GB200 and H100) and adds packed sequencing configs (for GB200). A minimal sketch of the gating idea is shown below.
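Conceptually, the change gates the overlap on the tensor-parallel size. The following is a minimal sketch of that idea, assuming a recipe-style config object; the helper name and attribute paths are illustrative assumptions, not the exact code in `llama3_llm_finetune.py`:

```python
# Illustrative sketch only: `enable_tp_comm_overlap`, the `recipe` layout, and
# the attribute path are assumptions, not the PR's exact code.
SUPPORTED_GPUS = ("gb200", "gb300", "h100")

def enable_tp_comm_overlap(recipe, gpu: str, tp_size: int):
    """Turn on TP communication overlap only when tensor parallelism is active."""
    # Overlapping TP all-gather/reduce-scatter with GEMMs only helps when
    # activations are actually sharded across ranks, i.e. when tp_size > 1.
    if gpu in SUPPORTED_GPUS and tp_size > 1:
        recipe.trainer.strategy.tp_comm_overlap = True
    return recipe
```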

Summary by CodeRabbit

  • Performance Improvements
    • Enabled communication overlap optimization for gb200, gb300, and h100 GPU configurations to enhance fine-tuning efficiency.
    • Added CUDA graph compatibility improvements for gb200 and gb300 setups with updated sequence handling and padding configurations.

… LoRA

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rhmukundan self-assigned this Feb 5, 2026
rhmukundan enabled auto-merge (squash) February 5, 2026 22:40
@coderabbitai
Contributor

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Configuration modifications to enable communication overlap and CUDA graph compatibility for Llama 3 models across multiple hardware platforms (gb200, gb300, h100). Changes affect llama3_70b variants for both SFT and LORA training modes.

Changes

Cohort / File(s): Llama3 Configuration Enhancement — scripts/performance/configs/llama/llama3_llm_finetune.py
Summary: Enables the comm_overlap setting for llama3_70b SFT and LORA configurations (gb200, gb300, h100). For the gb200/gb300 variants, adds CUDA graph compatibility through pad_cu_seqlens and pad_to_max_length adjustments to support packed sequences.
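CUDA graphs replay a fixed sequence of kernel launches, so every captured iteration must see identical tensor shapes; padding the cu_seqlens index tensor and padding packed samples to the maximum length keeps those shapes static. A minimal sketch of that adjustment follows, where the recipe attribute paths are assumptions for illustration rather than the exact structure in the changed file:

```python
# Illustrative sketch: the `recipe` attribute paths below are assumptions,
# not the exact structure used in llama3_llm_finetune.py.
def enable_packed_seq_cudagraph_compat(recipe, gpu: str):
    """Make packed-sequence batches shape-static so CUDA graphs can be captured."""
    if gpu in ("gb200", "gb300"):
        # Pad the cu_seqlens tensor to a fixed number of entries so attention
        # kernels see a constant-shape index tensor every step.
        recipe.data.packed_sequence_specs.pad_cu_seqlens = True
        # Pad each packed sample to the configured max length so activation
        # shapes do not vary between training iterations.
        recipe.data.dataset_kwargs["pad_to_max_length"] = True
    return recipe
```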

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

Test Results For Major Changes: ⚠️ Warning
Explanation: The PR enables significant tensor-parallel communication overlap optimizations for LLAMA3 70B across multiple hardware platforms but lacks test results, performance metrics, and regression-testing validation.
Resolution: Add performance benchmarks comparing enabled vs. disabled TP overlap, test results on the targeted hardware, convergence metrics for packed sequencing, and before/after throughput numbers.

✅ Passed checks (3 passed)

Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The title clearly summarizes the main changes (enabling TP communication overlap and packed sequencing configs for LLAMA3 70B), which directly aligns with the changeset modifying the llama3_70b configuration file.
Docstring Coverage: ✅ Passed. No functions were found in the changed file, so the docstring coverage check was skipped.


malay-nagda added the labels performance, performance/release (Performance items related with NeMo release), performance/optimize (Performance optimization tracking), and r0.3.0 (Cherry-pick label for r0.3.0 release branch) on Feb 6, 2026
malay-nagda added this to the 26.02 milestone on Feb 6, 2026
@rhmukundan
Contributor Author

/ok to test 89d9b2a
