Conversation

@HIT-cwh (Collaborator) commented Jan 4, 2026

Concat the micro-batch inputs to ensure the all-gather operation is executed only once.
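For illustration only, here is a minimal PyTorch sketch of the idea; the helper name and shapes are hypothetical, not xtuner's actual API. Concatenating the micro-batch token ids along the sequence dimension means that whatever all-gather the embedding (or, symmetrically, the lm_head) triggers runs once instead of once per micro-batch:

```python
import torch

def embed_micro_batches(embed: torch.nn.Embedding, input_ids_list: list[torch.Tensor]) -> list[torch.Tensor]:
    """Hypothetical helper: one embedding lookup for all micro-batches."""
    sizes = [ids.size(1) for ids in input_ids_list]   # per-micro-batch sequence lengths
    cat_ids = torch.cat(input_ids_list, dim=1)        # (1, total_len)
    cat_emb = embed(cat_ids)                          # single forward pass -> single all-gather
    return list(torch.split(cat_emb, sizes, dim=1))   # back to per-micro-batch chunks
```

The PR applies the same pattern at the output end by concatenating hidden states before the lm_head.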

Copilot AI (Contributor) left a comment

Pull request overview

This PR enhances performance by concatenating micro-batch inputs before passing them through embedding layers and the language model head, rather than processing each micro-batch separately. This optimization reduces the number of forward passes through these operations.

Key changes:

  • Renamed SequenceContext.pack() to SequenceContext.cat() to better reflect the concatenation operation
  • Added cat() class methods to BaseLossKwargs, BaseLossContext, and CELossContext to support concatenating loss contexts (a rough sketch follows this list)
  • Refactored _micro_batch_forward() in the MoE model to concatenate inputs before embedding and lm_head operations
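
As a rough sketch of what the new cat() class methods might look like (a toy stand-in, not the actual BaseLossKwargs implementation; field names are made up), the core idea is to concatenate every tensor field along dim=1, mirroring how chunk() split them:

```python
from dataclasses import dataclass, fields

import torch

@dataclass
class ToyLossKwargs:  # stand-in for BaseLossKwargs; field names are made up
    labels: torch.Tensor
    loss_weight: torch.Tensor

    @classmethod
    def cat(cls, kwargs_list: list["ToyLossKwargs"]) -> "ToyLossKwargs":
        assert len(kwargs_list) > 0, "need at least one micro-batch"
        # Concatenate each tensor field along dim=1, reversing the chunk() split.
        merged = {
            f.name: torch.cat([getattr(k, f.name) for k in kwargs_list], dim=1)
            for f in fields(cls)
        }
        return cls(**merged)
```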

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| xtuner/v1/data_proto/sequence_context.py | Renamed method from pack to cat for clarity |
| xtuner/v1/loss/base_loss_ctx.py | Added cat() class methods to support concatenating loss contexts across micro-batches |
| xtuner/v1/rl/base/controller.py | Updated method call from pack to cat following the rename |
| xtuner/v1/model/moe/moe.py | Refactored micro-batch forward to concatenate inputs before embeddings and lm_head, reducing forward pass overhead |


cat_loss_ctx = CELossContext.cat(loss_ctx_list)
loss, (logits, extra_info) = self.lm_head(cat_hidden_states, cat_loss_ctx) # type: ignore

# Aggregate losses (mean across micro-batches)
Copilot AI commented Jan 4, 2026

The comment says "Aggregate losses (mean across micro-batches)" but the code calls loss.sum(). If the loss returned by lm_head after concatenating all micro-batches is already a scalar (which is typical), calling .sum() on a scalar tensor is redundant and may be misleading. The comment also says "mean" but the code uses "sum". Consider clarifying whether the loss should be summed or averaged, and update either the code or the comment accordingly.

Suggested change
# Aggregate losses (mean across micro-batches)
# Aggregate loss value (using sum across micro-batches or scalar loss as returned)
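
For context, a tiny illustration with made-up numbers of why the sum-vs-mean distinction can matter when micro-batches contain different numbers of tokens:

```python
import torch

# Per-micro-batch mean losses and token counts (hypothetical values).
losses = torch.tensor([2.0, 1.0])
tokens = torch.tensor([100.0, 300.0])

summed = losses.sum()                                         # 3.0
token_weighted_mean = (losses * tokens).sum() / tokens.sum()  # 1.25
print(summed.item(), token_weighted_mean.item())
```

Whether the concatenated lm_head call already performs the right token-level reduction depends on how CELossContext reduces the loss, which is what the comment asks to clarify.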


# Aggregate losses (mean across micro-batches)
output["loss"] = torch.stack(loss_list).sum() if loss_list else None
loss: torch.Tensor
Copilot AI commented Jan 4, 2026

The type annotation loss: torch.Tensor on line 415 is redundant since loss was already assigned on line 412. This type annotation doesn't provide any new information and could be confusing as it appears between the assignment and usage of the variable. Consider removing this redundant annotation.

Suggested change
loss: torch.Tensor

Collaborator

Please modify the type annotation of LMHead.__call__ so that the type of loss can be inferred as torch.Tensor automatically.

Collaborator (Author)

If loss_ctx is not None, the call method will return a torch.Tensor; otherwise, it will return None. How should I modify the type hint? @HAOCHENYE
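
One common way to express that relationship to a type checker is typing.overload. A minimal sketch, with hypothetical signatures that only mimic the (loss, (logits, extra_info)) return structure:

```python
from typing import Optional, overload

import torch

class CELossContext:  # placeholder for the real xtuner class
    ...

class LMHead:
    @overload
    def __call__(self, hidden_states: torch.Tensor, loss_ctx: CELossContext) -> tuple[torch.Tensor, tuple]: ...
    @overload
    def __call__(self, hidden_states: torch.Tensor, loss_ctx: None = ...) -> tuple[None, tuple]: ...
    def __call__(self, hidden_states: torch.Tensor, loss_ctx: Optional[CELossContext] = None) -> tuple[Optional[torch.Tensor], tuple]:
        # The real implementation lives in xtuner; this stub only shows the typing.
        raise NotImplementedError
```

With overloads like these, loss is inferred as torch.Tensor at call sites that pass a loss context and as None otherwise.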

@HAOCHENYE (Collaborator) commented Jan 6, 2026

Overview (Reviewed by Claude Code)

This PR refactors the MoE (Mixture of Experts) model's micro-batch forward pass to improve efficiency by:

  1. Renaming pack() → cat() for SequenceContext (more accurate naming)
  2. Adding a cat() method to BaseLossKwargs and BaseLossContext classes
  3. Major optimization: Consolidating tensor operations in _micro_batch_forward() to reduce redundant computations

Analysis

✅ Positive Changes

  1. Method Naming Improvement (sequence_context.py)
    - Renaming pack → cat is more semantically accurate since the method concatenates tensors using torch.cat
  2. New cat() Methods (base_loss_ctx.py)
    - Well-structured implementation that mirrors the existing chunk() method
    - Correctly uses dim=1 to reverse the chunk operation (see the round-trip check after this list)
    - Proper assertions for input validation
  3. Performance Optimization (moe.py)
    - The old code computed embeddings per micro-batch in a loop, then concatenated later
    - New code concatenates input tensors first, then does a single embedding lookup
    - This reduces kernel launch overhead and improves GPU utilization
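
A quick way to see the dim=1 reversal mentioned in item 2 (toy shapes, not the real tensors):

```python
import torch

x = torch.randn(1, 12, 8)                        # (batch, seq, hidden), made-up shape
chunks = list(torch.chunk(x, 3, dim=1))          # what a chunk() split hands to micro-batches
assert torch.equal(torch.cat(chunks, dim=1), x)  # cat() along dim=1 restores the original
```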

⚠️ Potential Issues & Suggestions

  1. Variable Shadowing (moe.py:419)
     loss, (logits, extra_info) = self.lm_head(cat_hidden_states, cat_loss_ctx)
     ...
     loss: torch.Tensor  # This type annotation shadows the variable
     output["loss"] = loss.sum()
    - The loss: torch.Tensor annotation is redundant and potentially confusing
    - Suggestion: Remove the type annotation line
  2. Unused Variable (moe.py:427-428)
     final_logits = logits
    - final_logits is assigned but the logic for handling it was removed. Need to verify it's still used downstream.
  3. Logic Change Risk (moe.py:406-408)
     cat_hidden_states = torch.cat(hidden_states_list, dim=1)
     cat_hidden_states = self.norm(cat_hidden_states)
    - This concatenates hidden_states_list again after it was already modified in the MoE branch
    - The old code normalized per micro-batch; the new code normalizes the concatenated tensor
    - Question: Are these mathematically equivalent? LayerNorm across longer sequences vs. per micro-batch may differ slightly (a quick numerical check follows this list)
  4. Missing Assertion (moe.py:350-351)
     if seq_ctx_list[0].input_ids is None:
    - No check that seq_ctx_list is non-empty before accessing [0]
    - Suggestion: Add an assertion or guard
  5. Comment Language Consistency (base_loss_ctx.py)
    - Comments are in Chinese (收集所有 tensor 字段名..., i.e. "collect all tensor field names...")
    - Consider using English for consistency with the rest of the codebase
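
On the question in item 3: LayerNorm and RMSNorm normalize each token over the hidden dimension only, so normalizing the concatenated sequence should be numerically equivalent to normalizing each micro-batch and then concatenating. A quick check with torch.nn.LayerNorm as a stand-in (toy shapes):

```python
import torch

torch.manual_seed(0)
norm = torch.nn.LayerNorm(64)
a, b = torch.randn(1, 5, 64), torch.randn(1, 7, 64)   # two toy micro-batches

per_micro_batch = torch.cat([norm(a), norm(b)], dim=1)
on_concatenated = norm(torch.cat([a, b], dim=1))
print(torch.allclose(per_micro_batch, on_concatenated, atol=1e-6))  # True
```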

🔍 Questions for Author

  1. Has this been benchmarked to confirm the performance improvement?
  2. Are there unit tests covering the new cat() methods?
  3. Has the numerical equivalence been verified (especially for the LayerNorm change)?

Summary

| Aspect | Assessment |
| --- | --- |
| Code Correctness | ⚠️ Minor concerns (variable shadowing, potential semantic difference in norm) |
| Code Style | ✅ Good overall, minor comment language inconsistency |
| Performance | ✅ Should improve through batched operations |
| Test Coverage | ❓ Unknown - tests not visible in diff |
| Security | ✅ No concerns |

Recommendation: Approve with minor revisions - address the variable shadowing and verify numerical equivalence of the LayerNorm change.



output["extra_info"] = moe_extra_info

# Return logits for all micro-batches
final_logits = logits
Collaborator

remove the variable final_logits

Collaborator (Author)

When constructing MoEModelOutputs, final_logits will be utilized.

@HIT-cwh HIT-cwh merged commit 2aa70dd into InternLM:main Jan 7, 2026
3 of 4 checks passed