new: decouple colbert query and document tokenizer #556
Conversation
📝 Walkthrough
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
fastembed/late_interaction/colbert.py (2)
91-93: Prefer an explicit runtime check over `assert` for library code.

`assert` can be stripped with Python's -O flag; raise a clear error instead so misuse is obvious in all environments.

```diff
-        assert self.query_tokenizer is not None
-        encoded = self.query_tokenizer.encode_batch([query])
+        if self.query_tokenizer is None:
+            raise RuntimeError("Query tokenizer is not initialized. Call load_onnx_model() first.")
+        encoded = self.query_tokenizer.encode_batch([query])
```
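As an illustration of the suggested pattern, here is a minimal, self-contained sketch; the `Encoder` class and its stub tokenizer attribute are hypothetical stand-ins, not the real fastembed classes:

```python
# Minimal sketch of the check-then-raise pattern suggested in the review.
# "Encoder" is a hypothetical stand-in; only the idiom mirrors the diff above.
class Encoder:
    def __init__(self) -> None:
        self.query_tokenizer = None  # populated later by a load step

    def embed_query(self, query: str):
        # An `assert` here would vanish under `python -O`; this explicit
        # check fails loudly in every environment.
        if self.query_tokenizer is None:
            raise RuntimeError(
                "Query tokenizer is not initialized. Call load_onnx_model() first."
            )
        return self.query_tokenizer.encode_batch([query])
```

Calling `embed_query` before the tokenizer is loaded now raises a descriptive `RuntimeError` regardless of interpreter flags.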
186-204: Use the query tokenizer's own special-token map to derive its MASK id (robustness).

You're configuring `query_tokenizer` padding with `pad_id=self.mask_token_id`, which was derived from the document tokenizer's map. These will typically match (same files), but coupling to the doc tokenizer is unnecessary and could break if token additions diverge. Derive the MASK id from the query tokenizer that you just loaded.

```diff
-        self.query_tokenizer, _ = load_tokenizer(model_dir=self._model_dir)
+        self.query_tokenizer, query_special_token_to_id = load_tokenizer(model_dir=self._model_dir)
         assert self.tokenizer is not None
         self.mask_token_id = self.special_token_to_id[self.MASK_TOKEN]
         self.pad_token_id = self.tokenizer.padding["pad_id"]
         self.skip_list = {
             self.tokenizer.encode(symbol, add_special_tokens=False).ids[0]
             for symbol in string.punctuation
         }
         current_max_length = self.tokenizer.truncation["max_length"]
         # ensure not to overflow after adding document-marker
         self.tokenizer.enable_truncation(max_length=current_max_length - 1)
-        self.query_tokenizer.enable_truncation(max_length=current_max_length - 1)
-        self.query_tokenizer.enable_padding(
-            pad_token=self.MASK_TOKEN,
-            pad_id=self.mask_token_id,
-            length=self.MIN_QUERY_LENGTH,
-        )
+        self.query_tokenizer.enable_truncation(max_length=current_max_length - 1)
+        # Derive MASK id from the query tokenizer's own map (fallback to token_to_id for safety)
+        query_mask_token_id = query_special_token_to_id.get(self.MASK_TOKEN)
+        if query_mask_token_id is None:
+            query_mask_token_id = self.query_tokenizer.token_to_id(self.MASK_TOKEN)  # type: ignore[union-attr]
+        self.query_tokenizer.enable_padding(
+            pad_token=self.MASK_TOKEN,
+            pad_id=query_mask_token_id,
+            length=self.MIN_QUERY_LENGTH,
+        )
```

Notes:
- Keeping `self.mask_token_id` as-is preserves the document path behavior; we only de-couple the query path.
- This maintains the invariant "query length before marker = MIN_QUERY_LENGTH" and "after marker = MIN_QUERY_LENGTH + 1".
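The lookup-with-fallback logic can be sketched in isolation. In this toy sketch, a plain dict stands in for the special-token map returned by `load_tokenizer`, and `ToyTokenizer` is a hypothetical stand-in for a `tokenizers.Tokenizer` (only the `token_to_id` name mirrors the real API):

```python
# Toy sketch of the MASK-id derivation with fallback, under the assumption
# that the special-token map may or may not contain the mask token.
MASK_TOKEN = "[MASK]"


class ToyTokenizer:
    """Hypothetical stand-in for tokenizers.Tokenizer."""

    def __init__(self, vocab: dict) -> None:
        self._vocab = vocab

    def token_to_id(self, token: str):
        # Returns None for unknown tokens, like the real API
        return self._vocab.get(token)


def derive_mask_id(special_token_to_id: dict, tokenizer: ToyTokenizer) -> int:
    # Prefer the query tokenizer's own special-token map...
    mask_id = special_token_to_id.get(MASK_TOKEN)
    if mask_id is None:
        # ...and fall back to a direct vocabulary lookup for safety.
        mask_id = tokenizer.token_to_id(MASK_TOKEN)
    if mask_id is None:
        raise ValueError(f"{MASK_TOKEN!r} not found in tokenizer vocabulary")
    return mask_id
```

Either source can supply the id, so the query path never silently inherits an id from a differently-configured document tokenizer.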
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
fastembed/late_interaction/colbert.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/late_interaction/colbert.py (1)
fastembed/common/preprocessor_utils.py (1)
load_tokenizer (21-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on windows-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.9.x on windows-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (2)
fastembed/late_interaction/colbert.py (2)
5-7: LGTM: imports reflect the new design (separate query tokenizer).

Importing `Tokenizer` and `load_tokenizer` is appropriate for the decoupled query/document tokenization paths.
172-173: LGTM: explicit `query_tokenizer` attribute improves clarity and thread-safety of configuration.

Initializing it to `None` and configuring once during model load avoids the previous "Already borrowed" issue from per-call reconfiguration.
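The configure-once design can be sketched as follows. This is a toy model, not the fastembed implementation: `StubTokenizer`, `Model`, and `load` are hypothetical names; the point is that queries only read tokenizer state after load, so concurrent calls never trigger a mutation-during-use conflict:

```python
# Toy sketch: configure the query tokenizer once at load time, then serve
# queries from many threads without ever reconfiguring shared state.
from concurrent.futures import ThreadPoolExecutor


class StubTokenizer:
    """Hypothetical stand-in for a real tokenizer object."""

    def __init__(self) -> None:
        self.pad_length = 0  # mutable configuration

    def enable_padding(self, length: int) -> None:
        self.pad_length = length

    def encode(self, text: str) -> list:
        tokens = text.split()
        # read-only use of the configuration set at load time
        return tokens + ["[MASK]"] * max(0, self.pad_length - len(tokens))


class Model:
    MIN_QUERY_LENGTH = 31

    def __init__(self) -> None:
        self.query_tokenizer = None

    def load(self) -> None:
        # configure exactly once, before any concurrent use
        self.query_tokenizer = StubTokenizer()
        self.query_tokenizer.enable_padding(length=self.MIN_QUERY_LENGTH)

    def embed_query(self, query: str) -> list:
        # no reconfiguration here: safe to call from many threads
        return self.query_tokenizer.encode(query)


model = Model()
model.load()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(model.embed_query, ["hello world"] * 8))
```

Because `enable_padding` runs only inside `load`, worker threads share the tokenizer purely as a reader, which is the property the per-call reconfiguration previously broke.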
Decouple the ColBERT query and document tokenizer in order to avoid problems with multithreading, such as the "Already borrowed" error.