
[Spec][Ngram] 4/N: Remove max_match_window_size and min_match_window_size, matching all suffixes of the Trie#21225

Open
kpham-sgl wants to merge 3 commits into sgl-project:main from kpham-sgl:kp/expand-trie-match

Conversation

@kpham-sgl (Collaborator)

Motivation

Part of Ngram refactoring series #21052
Following #21186

Modifications

  • Remove min_match_window_size and max_match_window_size from NGRAM speculative decoding across the trie, Python plumbing, config, and docs.
  • Match all suffixes with continuations up to max_trie_depth instead of restricting matches to a configurable window range.
  • Update tests to cover the new suffix-matching behavior and long-context cases.
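
The matching change described above can be sketched with a toy trie (a minimal illustration only; the real sglang implementation lives in C++ in trie.cpp, and the names `NgramTrie`, `insert`, and `draft` here are hypothetical):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode


class NgramTrie:
    """Toy n-gram trie: insert every suffix, then match any suffix of the context."""

    def __init__(self, max_trie_depth):
        self.root = TrieNode()
        self.max_trie_depth = max_trie_depth

    def insert(self, tokens):
        # Insert every suffix of `tokens`, truncated to max_trie_depth,
        # so a lookup can succeed starting from any position.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_trie_depth]:
                node = node.children.setdefault(tok, TrieNode())

    def draft(self, context, num_draft_tokens):
        # Try suffixes of the recent context from longest to shortest.
        # No min/max match window: only max_trie_depth caps the search.
        tail = context[-self.max_trie_depth:]
        for start in range(len(tail)):
            node = self._walk(tail[start:])
            if node is not None and node.children:
                return self._continuation(node, num_draft_tokens)
        return []

    def _walk(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def _continuation(self, node, k):
        # Greedily follow stored children to produce up to k draft tokens.
        out = []
        while node.children and len(out) < k:
            tok, node = next(iter(node.children.items()))
            out.append(tok)
        return out
```

For example, after `insert([1, 2, 3, 4, 5])` with `max_trie_depth=4`, calling `draft([9, 1, 2, 3], 2)` fails on the full suffix `[9, 1, 2, 3]`, falls back to `[1, 2, 3]`, and proposes `[4]`.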

Accuracy Tests

Passed: python3 -m pytest test/registered/spec/utils/test_ngram_corpus.py -q

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the Ngram speculative decoding system by simplifying its matching logic. It removes the min_match_window_size and max_match_window_size parameters, allowing the system to consistently match all available suffixes within the trie up to the max_trie_depth. This change aims to streamline configuration and potentially enhance the efficiency of the Ngram cache by focusing on the maximum trie depth for matching.

Highlights

  • Ngram Parameter Removal: Removed min_match_window_size and max_match_window_size parameters from Ngram speculative decoding across all relevant code, configuration, and documentation.
  • Suffix Matching Logic Update: Modified the Ngram matching mechanism to match all suffixes of the Trie up to max_trie_depth, eliminating the previous restriction to a configurable window range.
  • Test Coverage Expansion: Updated existing tests and added new ones to validate the new suffix-matching behavior and ensure correct functionality in long-context scenarios.



@gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request refactors the N-gram speculative decoding by removing min_match_window_size and max_match_window_size. Instead, it now matches all suffixes up to max_trie_depth. The changes are consistently applied across the C++ implementation, Python plumbing, configuration, documentation, and tests. The core logic in trie.cpp is updated to insert and match all suffixes, and a new test case is added to verify this behavior. My review found one minor inconsistency between the updated documentation and the code regarding the default value of speculative_num_draft_tokens, for which I've provided a suggestion. Overall, this is a good simplification and improvement of the N-gram speculation logic.

Comment on lines +3050 to 3054

    self.speculative_num_draft_tokens = 12
    logger.warning(
        "speculative_num_draft_tokens is set to 12 by default for ngram speculative decoding. "
        "You can override this by explicitly setting --speculative-num-draft-tokens."
    )
Severity: medium

There's a small inconsistency between the code and the documentation for the default value of speculative_num_draft_tokens. The documentation in docs/advanced_features/speculative_decoding.md states that the default is min(--speculative-ngram-max-trie-depth, 12), but the code here hardcodes it to 12. While this is correct for the default max_trie_depth, it can be misleading if a user sets a custom max_trie_depth less than 12. To align with the documentation and provide more intuitive behavior, it would be better to calculate the default dynamically.

Suggested change

    - self.speculative_num_draft_tokens = 12
    - logger.warning(
    -     "speculative_num_draft_tokens is set to 12 by default for ngram speculative decoding. "
    -     "You can override this by explicitly setting --speculative-num-draft-tokens."
    - )
    + self.speculative_num_draft_tokens = min(self.speculative_ngram_max_trie_depth, 12)
    + logger.warning(
    +     f"speculative_num_draft_tokens is set to {self.speculative_num_draft_tokens} by default for ngram speculative decoding. "
    +     "You can override this by explicitly setting --speculative-num-draft-tokens."
    + )

      for req in batch.reqs:
          check_token = self._efficient_concat_last_n(
    -         req.origin_input_ids, req.output_ids, self.max_match_window_size
    +         req.origin_input_ids, req.output_ids, self.max_trie_depth
@kpham-sgl (Collaborator, Author) commented on Mar 23, 2026
There should be no reason to match a suffix longer than self.max_trie_depth
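
For context, the helper in the hunk above could look like this (a hedged sketch of what a helper like `_efficient_concat_last_n` might do; the actual sglang implementation may differ):

```python
def efficient_concat_last_n(a, b, n):
    """Return the last n elements of a + b without building the full concatenation.

    Sketch only: `a` stands in for origin_input_ids and `b` for output_ids.
    """
    if n <= 0:
        return []
    need_from_a = n - len(b)
    if need_from_a <= 0:
        # b alone already covers the window.
        return b[-n:]
    # Take the tail of a and all of b.
    return a[-need_from_a:] + b
```

With the PR applied, `n` would be `self.max_trie_depth` rather than the removed `self.max_match_window_size`, since longer suffixes can never match deeper than the trie stores.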

      if self.speculative_num_draft_tokens is None:
    -     self.speculative_num_draft_tokens = (
    -         self.speculative_ngram_max_match_window_size
    +     self.speculative_num_draft_tokens = 12
@kpham-sgl (Collaborator, Author)

What's a better default value here? How should we tie this value to max_trie_depth?


Labels

documentation (Improvements or additions to documentation), lora, speculative-decoding
