
Conversation

@IzzyPutterman IzzyPutterman (Collaborator) commented Dec 3, 2025

Summary by CodeRabbit

Release Notes

  • New Features
    • Added speculative decoding support for DeepSeekV3-based models, enabling accelerated inference through optimized draft model pathways
    • Extended model configuration compatibility with additional variants for broader model support


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline, ensuring that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

@IzzyPutterman IzzyPutterman requested review from a team as code owners December 3, 2025 19:49
@coderabbitai bot (Contributor) commented Dec 3, 2025

📝 Walkthrough

Walkthrough

Adds four new classes to support Eagle3 speculative decoding with DeepSeek V3 architecture: Eagle3DeepSeekV3Attention, Eagle3DeepSeekV3DecoderLayer, Eagle3DeepSeekV3DraftModel, and Eagle3DeepSeekV3ForCausalLM. Updates get_draft_model to select the new draft model variant based on configuration. Adds a kimi_k2 alias in config registry.

Changes

  • tensorrt_llm/_torch/models/modeling_speculative.py (Eagle3 DeepSeek V3 speculative decoding support): Added four new classes: Eagle3DeepSeekV3Attention (inherits MLA for multi-head latent attention), Eagle3DeepSeekV3DecoderLayer (with first-layer input dimensionality doubling), Eagle3DeepSeekV3DraftModel (decoder model for draft generation), and Eagle3DeepSeekV3ForCausalLM (causal LM wrapper with weight loading and Eagle3 FC projection support). Modified get_draft_model to instantiate Eagle3DeepSeekV3ForCausalLM when the draft config architecture matches. Added the MLA import.
  • tensorrt_llm/_torch/pyexecutor/config_utils.py (configuration registry): Added a kimi_k2 alias mapping to DeepseekV3Config in _CONFIG_REGISTRY for an additional model_type lookup path.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • modeling_speculative.py: Four new interrelated classes with MLA-based attention, custom decoder layer with first-layer doubling, and draft model with weight mapping logic require careful review of initialization, forward pass signatures, and weight loading behavior
  • config_utils.py: Trivial registry addition, minimal risk

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 7.69%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check ⚠️ Warning: The PR description is empty; it only contains the template structure with no content in the Description or Test Coverage sections. Fill in the Description section explaining what changes are made and why, and the Test Coverage section listing the relevant tests that safeguard the changes.
✅ Passed checks (1 passed)
  • Title check ✅ Passed: The title clearly identifies the main feature: adding MLA-based Eagle speculative decoding support.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/config_utils.py (1)

1-2: Missing NVIDIA copyright header.

Per the coding guidelines, all TensorRT-LLM code files should contain an NVIDIA copyright header at the top of the file.

Add a copyright header at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import transformers

As per coding guidelines, all **/*.py files require the NVIDIA copyright header.

tensorrt_llm/_torch/models/modeling_speculative.py (1)

81-87: Incorrect return type annotation on __init__.

The Eagle3DecoderLayer.__init__ has return type annotation -> Tuple[torch.Tensor, torch.Tensor] which is incorrect for a constructor. This should be -> None. Note: The new Eagle3DeepSeekV3DecoderLayer.__init__ correctly uses -> None.

     def __init__(
         self,
         model_config: LlamaConfig,
         layer_idx: int = 0,
         is_first_layer: bool = True,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
+    ) -> None:
         super().__init__()
🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (2)

426-436: Clarify the assertion's intent or convert to a more descriptive error.

The assertion assert self.embed_tokens is not None will fail if the model is used before load_weights_from_target_model is called (when hidden_size_in == config.hidden_size). While this appears intentional as a guard, the error message is opaque.

Consider a clearer error message:

-        assert self.embed_tokens is not None
+        if self.embed_tokens is None:
+            raise RuntimeError(
+                "embed_tokens is None. Ensure load_weights_from_target_model "
+                "is called before forward when sharing embeddings with target model."
+            )

553-568: Consider extracting duplicated apply_eagle3_fc method.

The apply_eagle3_fc method is identical between Eagle3DeepSeekV3ForCausalLM and Eagle3ForCausalLM. Consider extracting to a shared mixin or base class to reduce duplication.

Also applies to: 777-792
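The suggested extraction can be sketched as follows. The mixin name and the method body are assumptions; only the class structure is the point:

```python
# Hedged sketch of the refactor: hoist the duplicated apply_eagle3_fc into a
# shared mixin that both causal-LM wrappers inherit, so one implementation
# replaces two identical copies.
class Eagle3FCMixin:
    def apply_eagle3_fc(self, hidden_states):
        # Placeholder for the shared FC projection over target hidden states.
        return [h * 2 for h in hidden_states]

class Eagle3ForCausalLM(Eagle3FCMixin):
    pass

class Eagle3DeepSeekV3ForCausalLM(Eagle3FCMixin):
    pass

# Both classes pick up the single shared method.
assert Eagle3ForCausalLM().apply_eagle3_fc([1.0]) == [2.0]
assert Eagle3DeepSeekV3ForCausalLM().apply_eagle3_fc([1.0]) == [2.0]
```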

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1964bc and 7d8b346.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/config_utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/_torch/pyexecutor/config_utils.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/_torch/pyexecutor/config_utils.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-09-18T05:41:54.073Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7120
File: tensorrt_llm/llmapi/llm.py:530-532
Timestamp: 2025-09-18T05:41:54.073Z
Learning: For Kimi k2 model support, the team is initially focusing on the PyTorch backend path where the model directory structure remains consistent, assuming built model directories contain relevant HF config files.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/config_utils.py
🪛 Ruff (0.14.7)
tensorrt_llm/_torch/models/modeling_speculative.py

429-431: Avoid specifying long messages outside the exception class

(TRY003)


504-504: Unused method argument: kwargs

(ARG002)


523-523: Unused method argument: weight_mapper

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/_torch/pyexecutor/config_utils.py (1)

34-37: LGTM!

The kimi_k2 registry entry correctly follows the existing pattern for deepseek_v32, mapping to the same DeepseekV3Config class. This aligns with the learning that Kimi k2 model support focuses on the PyTorch backend path.

tensorrt_llm/_torch/models/modeling_speculative.py (5)

159-218: LGTM!

The Eagle3DeepSeekV3Attention correctly adapts the MLA attention mechanism for Eagle3 speculative decoding. The first-layer projection override with doubled input size (for concatenated [embeds, hidden_states]) follows the same pattern as Eagle3Attention but properly handles MLA-specific parameters like kv_a_proj_with_mqa.
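The doubled first-layer input mentioned above can be illustrated with a pure-Python sketch (values and names are illustrative):

```python
# The first decoder layer consumes the concatenation of token embeddings and
# target-model hidden states, so its input projection sees 2 * hidden_size
# features instead of hidden_size.
hidden_size = 4
embeds = [0.1, 0.2, 0.3, 0.4]           # token embeddings
target_hidden = [1.0, 2.0, 3.0, 4.0]    # hidden states from the target model

first_layer_input = embeds + target_hidden  # concatenation, not elementwise add
assert len(first_layer_input) == 2 * hidden_size
```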


220-311: LGTM!

The Eagle3DeepSeekV3DecoderLayer properly implements the DeepSeekV3-based decoder layer with MLA attention. The structure correctly mirrors Eagle3DecoderLayer while adapting for MLA specifics, and the aux_stream is properly passed through to enable async MLA operations.


428-431: Static analysis: long exception message outside exception class (TRY003).

This is a minor style suggestion. The current pattern is consistent with other code in this file (e.g., line 666-668). Can be addressed if the team adopts stricter linting.


11-11: LGTM!

The MLA import is correctly added alongside Attention to support the new Eagle3DeepSeekV3Attention class.


523-544: Unused weight_mapper parameter differs from sibling class pattern.

The weight_mapper parameter is unused here, unlike Eagle3ForCausalLM.load_weights which passes it to super().load_weights(). If DeepseekV3WeightLoader handles weight mapping internally, consider documenting this or removing the parameter to avoid confusion.

Comment on lines 946 to 949
model_class = Eagle3DeepSeekV3ForCausalLM if draft_config.pretrained_config.architectures[
    0] == "Eagle3DeepSeekV3ForCausalLM" else Eagle3ForCausalLM
return model_class(draft_config,
                   model_config.pretrained_config.num_hidden_layers)

⚠️ Potential issue | 🟡 Minor

Add defensive check for architectures attribute access.

Accessing draft_config.pretrained_config.architectures[0] may raise AttributeError if architectures is None or IndexError if it's an empty list.

     if spec_dec_mode.is_eagle3_one_model():
-        model_class = Eagle3DeepSeekV3ForCausalLM if draft_config.pretrained_config.architectures[
-            0] == "Eagle3DeepSeekV3ForCausalLM" else Eagle3ForCausalLM
+        architectures = getattr(draft_config.pretrained_config, 'architectures', None) or []
+        model_class = Eagle3DeepSeekV3ForCausalLM if (
+            architectures and architectures[0] == "Eagle3DeepSeekV3ForCausalLM"
+        ) else Eagle3ForCausalLM
         return model_class(draft_config,
                            model_config.pretrained_config.num_hidden_layers)


# Register Eagle3 DeepSeekV3 model for CausalLM
@register_auto_model("Eagle3DeepSeekV3ForCausalLM")
class Eagle3DeepSeekV3ForCausalLM(
Member
@IzzyPutterman Thanks for the support! Can we have test cases added for the changes?

Collaborator Author
We don't have a public checkpoint yet. I can do a dummy ckpt, but I don't know how much that actually tests.

Collaborator
Loading a dummy checkpoint + doing one forward pass is better than nothing. It's a decent smoke test that can catch basic issues.
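That smoke-test pattern can be sketched with a toy pure-Python stand-in for the real model (every name and shape here is hypothetical): build a dummy checkpoint with the expected tensor shapes, run one forward pass, and assert only on the output shape.

```python
# Dummy-checkpoint smoke test: random weights with the right shapes are
# enough to catch wiring bugs (mismatched dimensions, missing keys).
import random

def make_dummy_checkpoint(hidden_size, vocab_size):
    def mat(rows, cols):
        return [[random.uniform(-0.02, 0.02) for _ in range(cols)]
                for _ in range(rows)]
    return {
        # first-layer fc consumes concatenated [embeds, hidden]: 2 * hidden_size
        "fc.weight": mat(hidden_size, 2 * hidden_size),
        "lm_head.weight": mat(vocab_size, hidden_size),
    }

def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def forward(ckpt, embeds, target_hidden):
    h = matvec(ckpt["fc.weight"], embeds + target_hidden)
    return matvec(ckpt["lm_head.weight"], h)

hidden, vocab = 8, 32
ckpt = make_dummy_checkpoint(hidden, vocab)
logits = forward(ckpt, [0.1] * hidden, [0.2] * hidden)
assert len(logits) == vocab  # shapes line up end to end
```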

@kaiyux kaiyux requested review from lfr-0531 and mikeiovine December 4, 2025 01:19
Signed-off-by: Izzy Putterman <[email protected]>
@IzzyPutterman IzzyPutterman requested review from a team as code owners December 31, 2025 00:58
@IzzyPutterman IzzyPutterman requested a review from hlu1 December 31, 2025 00:58
@IzzyPutterman (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30208 [ run ] triggered by Bot. Commit: c548b8e

@tensorrt-cicd (Collaborator)

PR_Github #30208 [ run ] completed with state FAILURE. Commit: c548b8e
/LLM/main/L0_MergeRequest_PR pipeline #23253 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@IzzyPutterman (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30218 [ run ] triggered by Bot. Commit: 7742ccb

@tensorrt-cicd (Collaborator)

PR_Github #30218 [ run ] completed with state SUCCESS. Commit: 7742ccb
/LLM/main/L0_MergeRequest_PR pipeline #23261 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@IzzyPutterman (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30252 [ run ] triggered by Bot. Commit: 7742ccb

@tensorrt-cicd (Collaborator)

PR_Github #30252 [ run ] completed with state SUCCESS. Commit: 7742ccb
/LLM/main/L0_MergeRequest_PR pipeline #23291 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@IzzyPutterman (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30293 [ run ] triggered by Bot. Commit: 7742ccb

)


class Eagle3DeepSeekV3DecoderLayer(DecoderLayer):
Collaborator

Looks very similar to the components from non-MLA eagle. Is it possible to reduce the code duplication by injecting the attention part into a common decoder layer class?

Collaborator Author

Cleaned up and merged impl



@tensorrt-cicd (Collaborator)

PR_Github #30293 [ run ] completed with state SUCCESS. Commit: 7742ccb
/LLM/main/L0_MergeRequest_PR pipeline #23327 completed with status: 'SUCCESS'

@IzzyPutterman IzzyPutterman force-pushed the iputterman/kimi-eagle branch 2 times, most recently from a15044d to 9f1790a on December 31, 2025 at 23:49
Signed-off-by: Izzy Putterman <[email protected]>
@IzzyPutterman IzzyPutterman force-pushed the iputterman/kimi-eagle branch from 9f1790a to bae71e2 on January 1, 2026 at 00:05
@IzzyPutterman (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30307 [ run ] triggered by Bot. Commit: bae71e2

@tensorrt-cicd (Collaborator)

PR_Github #30307 [ run ] completed with state SUCCESS. Commit: bae71e2
/LLM/main/L0_MergeRequest_PR pipeline #23341 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@IzzyPutterman (Collaborator Author)

/bot run

2 similar comments
@IzzyPutterman (Collaborator Author)

/bot run

@IzzyPutterman (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #30319 [ run ] triggered by Bot. Commit: bae71e2
