[Spec][Ngram] 4/N: Remove max_match_window_size and min_match_window_size, matching all suffixes of the Trie#21225
[Spec][Ngram] 4/N: Remove max_match_window_size and min_match_window_size, matching all suffixes of the Trie#21225kpham-sgl wants to merge 3 commits intosgl-project:mainfrom
max_match_window_size and min_match_window_size, matching all suffixes of the Trie#21225Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refines the Ngram speculative decoding system by simplifying its matching logic. It removes the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the N-gram speculative decoding by removing min_match_window_size and max_match_window_size. Instead, it now matches all suffixes up to max_trie_depth. The changes are consistently applied across the C++ implementation, Python plumbing, configuration, documentation, and tests. The core logic in trie.cpp is updated to insert and match all suffixes, and a new test case is added to verify this behavior. My review found one minor inconsistency between the updated documentation and the code regarding the default value of speculative_num_draft_tokens, for which I've provided a suggestion. Overall, this is a good simplification and improvement of the N-gram speculation logic.
| self.speculative_num_draft_tokens = 12 | ||
| logger.warning( | ||
| "speculative_num_draft_tokens is set to 12 by default for ngram speculative decoding. " | ||
| "You can override this by explicitly setting --speculative-num-draft-tokens." | ||
| ) |
There was a problem hiding this comment.
There's a small inconsistency between the code and the documentation for the default value of speculative_num_draft_tokens. The documentation in docs/advanced_features/speculative_decoding.md states that the default is min(--speculative-ngram-max-trie-depth, 12), but the code here hardcodes it to 12. While this is correct for the default max_trie_depth, it can be misleading if a user sets a custom max_trie_depth less than 12. To align with the documentation and provide more intuitive behavior, it would be better to calculate the default dynamically.
| self.speculative_num_draft_tokens = 12 | |
| logger.warning( | |
| "speculative_num_draft_tokens is set to 12 by default for ngram speculative decoding. " | |
| "You can override this by explicitly setting --speculative-num-draft-tokens." | |
| ) | |
| self.speculative_num_draft_tokens = min(self.speculative_ngram_max_trie_depth, 12) | |
| logger.warning( | |
| f"speculative_num_draft_tokens is set to {self.speculative_num_draft_tokens} by default for ngram speculative decoding. " | |
| "You can override this by explicitly setting --speculative-num-draft-tokens." | |
| ) |
| for req in batch.reqs: | ||
| check_token = self._efficient_concat_last_n( | ||
| req.origin_input_ids, req.output_ids, self.max_match_window_size | ||
| req.origin_input_ids, req.output_ids, self.max_trie_depth |
There was a problem hiding this comment.
There should be no reason to match a suffix longer than self.max_trie_depth
| if self.speculative_num_draft_tokens is None: | ||
| self.speculative_num_draft_tokens = ( | ||
| self.speculative_ngram_max_match_window_size | ||
| self.speculative_num_draft_tokens = 12 |
There was a problem hiding this comment.
Whats a better default value here? How to tie this value to max_trie_depth
Motivation
Part of Ngram refactoring series #21052
Following #21186
Modifications
min_match_window_sizeandmax_match_window_sizefrom NGRAM speculative decoding across the trie, Python plumbing, config, and docs.max_trie_depthinstead of restricting matches to a configurable window range.Accuracy Tests
Passed
python3 -m pytest test/registered/spec/utils/test_ngram_corpus.py -qBenchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci