
Enabling TP Comm Overlap and Packed Sequencing Configs for LLAMA3 70B… #2247

Open

rhmukundan wants to merge 2 commits into main from rmukundan/llama3_lora_tp_overlap_packed_seq

Conversation

@rhmukundan
Contributor

rhmukundan commented Feb 5, 2026

Enables TP Comm Overlap when TP > 1 (for GB200 and H100) and adds packed sequencing configs (for GB200). A minimal sketch of the gating idea is shown below.
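Conceptually, the change gates the overlap on the tensor-parallel size. The following is a minimal sketch of that idea, assuming a recipe-style config object; the helper name and attribute paths are illustrative assumptions, not the exact code in `llama3_llm_finetune.py`:

```python
# Illustrative sketch only: `enable_tp_comm_overlap`, the `recipe` layout, and
# the attribute path are assumptions, not the PR's exact code.
SUPPORTED_GPUS = ("gb200", "gb300", "h100")

def enable_tp_comm_overlap(recipe, gpu: str, tp_size: int):
    """Turn on TP communication overlap only when tensor parallelism is active."""
    # Overlapping TP all-gather/reduce-scatter with GEMMs only helps when
    # activations are actually sharded across ranks, i.e. when tp_size > 1.
    if gpu in SUPPORTED_GPUS and tp_size > 1:
        recipe.trainer.strategy.tp_comm_overlap = True
    return recipe
```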

Summary by CodeRabbit

  • Performance Improvements
    • Enabled communication overlap optimization for gb200, gb300, and h100 GPU configurations to enhance fine-tuning efficiency.
    • Added CUDA graph compatibility improvements for gb200 and gb300 setups with updated sequence handling and padding configurations.

… LoRA

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rhmukundan self-assigned this Feb 5, 2026
rhmukundan enabled auto-merge (squash) February 5, 2026 22:40
@coderabbitai
Contributor

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Configuration modifications to enable communication overlap and CUDA graph compatibility for Llama 3 models across multiple hardware platforms (gb200, gb300, h100). Changes affect llama3_70b variants for both SFT and LORA training modes.

Changes

Cohort / File(s): Llama3 Configuration Enhancement — scripts/performance/configs/llama/llama3_llm_finetune.py
Summary: Enables the comm_overlap setting for llama3_70b SFT and LORA configurations (gb200, gb300, h100). For the gb200/gb300 variants, adds CUDA graph compatibility through pad_cu_seqlens and pad_to_max_length adjustments to support packed sequences.
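CUDA graphs replay a fixed sequence of kernel launches, so every captured iteration must see identical tensor shapes; padding the cu_seqlens index tensor and padding packed samples to the maximum length keeps those shapes static. A minimal sketch of that adjustment follows, where the recipe attribute paths are assumptions for illustration rather than the exact structure in the changed file:

```python
# Illustrative sketch: the `recipe` attribute paths below are assumptions,
# not the exact structure used in llama3_llm_finetune.py.
def enable_packed_seq_cudagraph_compat(recipe, gpu: str):
    """Make packed-sequence batches shape-static so CUDA graphs can be captured."""
    if gpu in ("gb200", "gb300"):
        # Pad the cu_seqlens tensor to a fixed number of entries so attention
        # kernels see a constant-shape index tensor every step.
        recipe.data.packed_sequence_specs.pad_cu_seqlens = True
        # Pad each packed sample to the configured max length so activation
        # shapes do not vary between training iterations.
        recipe.data.dataset_kwargs["pad_to_max_length"] = True
    return recipe
```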

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed

❌ Failed checks (1 warning)

Test Results For Major Changes: ⚠️ Warning
Explanation: The PR enables significant tensor-parallel communication overlap optimizations for LLAMA3 70B across multiple hardware platforms but lacks test results, performance metrics, and regression-testing validation.
Resolution: Add performance benchmarks comparing enabled vs. disabled TP overlap, test results on the targeted hardware, convergence metrics for packed sequencing, and before/after throughput numbers.

✅ Passed checks (3 passed)

Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
Title Check: ✅ Passed. The title clearly summarizes the main changes (enabling TP communication overlap and packed sequencing configs for LLAMA3 70B), which directly aligns with the changeset modifying the llama3_70b configuration file.
Docstring Coverage: ✅ Passed. No functions were found in the changed file, so the docstring coverage check was skipped.


malay-nagda added the labels performance, performance/release (Performance items related with NeMo release), performance/optimize (Performance optimization tracking), and r0.3.0 (Cherry-pick label for r0.3.0 release branch) on Feb 6, 2026
malay-nagda added this to the 26.02 milestone on Feb 6, 2026
@rhmukundan
Contributor Author

/ok to test 89d9b2a
