feat: add public tokenize method for TextEmbedding class #564
Conversation
📝 Walkthrough
This PR adds a public `tokenize` method for the TextEmbedding class and related embedding classes.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- fastembed/text/text_embedding.py (2 hunks)
- fastembed/text/text_embedding_base.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/text/text_embedding.py (5)
- fastembed/text/text_embedding_base.py (1): tokenize (23-24)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-171)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1): tokenize (48-49)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
fastembed/text/text_embedding_base.py (5)
- fastembed/text/text_embedding.py (1): tokenize (166-179)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-171)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1): tokenize (48-49)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (4)
fastembed/text/text_embedding_base.py (2)
2-2: LGTM: Import is correct. The `Encoding` type from the `tokenizers` library is the appropriate return type for tokenization methods.
23-24: LGTM: Abstract method design is appropriate. The abstract `tokenize` method properly defines the interface contract for subclasses. The flexible signature, accepting both single strings and iterables along with `**kwargs`, is well designed.
fastembed/text/text_embedding.py (2)
3-3: LGTM: Import is consistent. The import matches the base class and is correct for the return type annotation.
166-179: Implementation is correct; all registered models have a `tokenize` method available. Verification confirms that all six models in `EMBEDDINGS_REGISTRY` (OnnxTextEmbedding, CLIPOnnxEmbedding, PooledEmbedding, PooledNormalizedEmbedding, JinaEmbeddingV3, CustomTextEmbedding) inherit directly or indirectly from OnnxTextEmbedding, which implements the `tokenize` method. The delegation pattern and string normalization logic are sound.
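For context, a minimal usage sketch of the public API this review describes. The model name is only an example; the behavior follows the signature reviewed above (a single string or an iterable of strings, returning a list of `tokenizers.Encoding`):

```python
from fastembed import TextEmbedding

# Example model; any entry in EMBEDDINGS_REGISTRY should behave the same way.
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

# A single string is normalized internally before delegation to the ONNX model.
encodings = model.tokenize("hello world")
print(encodings[0].ids)     # integer token ids from the underlying tokenizer
print(encodings[0].tokens)  # corresponding token strings
```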
Actionable comments posted: 1
📜 Review details
📒 Files selected for processing (11)
- fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
- fastembed/sparse/sparse_embedding_base.py (2 hunks)
- fastembed/sparse/sparse_text_embedding.py (2 hunks)
- fastembed/sparse/splade_pp.py (2 hunks)
- tests/test_late_interaction_embeddings.py (1 hunk)
- tests/test_late_interaction_multimodal.py (1 hunk)
- tests/test_sparse_embeddings.py (1 hunk)
- tests/test_text_onnx_embeddings.py (1 hunk)
✅ Files skipped from review due to trivial changes (1)
- tests/test_late_interaction_multimodal.py
🧰 Additional context used
🧬 Code graph analysis (10)
tests/test_late_interaction_embeddings.py (3)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-132)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- tests/utils.py (1): delete_model_cache (11-39)
fastembed/late_interaction/late_interaction_text_embedding.py (4)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-135)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (96-109)
- fastembed/text/text_embedding.py (1): tokenize (166-179)
fastembed/sparse/sparse_embedding_base.py (8)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-132)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-135)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (96-109)
- fastembed/sparse/splade_pp.py (1): tokenize (140-153)
- fastembed/text/text_embedding_base.py (1): tokenize (23-24)
- fastembed/text/text_embedding.py (1): tokenize (166-179)
fastembed/sparse/splade_pp.py (3)
- fastembed/sparse/sparse_embedding_base.py (1): tokenize (48-49)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (96-109)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
fastembed/late_interaction/late_interaction_embedding_base.py (3)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-132)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-135)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
tests/test_sparse_embeddings.py (5)
- tests/test_late_interaction_embeddings.py (1): test_tokenize (298-315)
- tests/test_text_onnx_embeddings.py (1): test_tokenize (181-198)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (96-109)
- fastembed/sparse/splade_pp.py (1): tokenize (140-153)
- tests/utils.py (1): delete_model_cache (11-39)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (3)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-132)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-135)
tests/test_text_onnx_embeddings.py (6)
- tests/test_late_interaction_embeddings.py (1): test_tokenize (298-315)
- tests/test_late_interaction_multimodal.py (1): test_tokenize (107-123)
- tests/test_sparse_embeddings.py (1): test_tokenize (250-267)
- fastembed/text/text_embedding.py (2): TextEmbedding (17-230), tokenize (166-179)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
- tests/utils.py (1): delete_model_cache (11-39)
fastembed/sparse/sparse_text_embedding.py (5)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-132)
- fastembed/sparse/sparse_embedding_base.py (1): tokenize (48-49)
- fastembed/sparse/splade_pp.py (1): tokenize (140-153)
- fastembed/text/text_embedding.py (1): tokenize (166-179)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-171)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): tokenize (83-84)
🪛 GitHub Actions: type-checkers
fastembed/sparse/splade_pp.py
[error] 140-140: Signature of "tokenize" incompatible with supertype "OnnxTextModel"
[error] 153-153: Item "None" of "Any | None" has no attribute "encode_batch"
🪛 Ruff (0.14.1)
fastembed/sparse/splade_pp.py
140-140: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (11)
tests/test_text_onnx_embeddings.py (1)
180-198: LGTM! The test correctly validates the `tokenize` method for both single string and list inputs, ensuring encodings have non-empty ids (a sketch of this test pattern appears after this list).
fastembed/late_interaction/late_interaction_embedding_base.py (1)
3-3: LGTM! Correctly adds the abstract `tokenize` interface to the base class, establishing the contract for subclasses. Also applies to: 24-25
tests/test_sparse_embeddings.py (1)
249-267: LGTM! The test correctly validates the `tokenize` method for sparse embeddings with both single string and list inputs.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
3-3: LGTM! Correctly adds the abstract `tokenize` interface to the multimodal embedding base class. Also applies to: 25-26
fastembed/late_interaction/late_interaction_text_embedding.py (1)
4-4: LGTM! The `tokenize` implementation correctly normalizes input and delegates to the underlying model. Documentation is clear and complete. Also applies to: 119-132
tests/test_late_interaction_embeddings.py (1)
297-315: LGTM! The test correctly validates the `tokenize` method for late interaction embeddings, including the model-specific `is_doc` parameter for document vs query tokenization.
fastembed/sparse/sparse_embedding_base.py (1)
6-6: LGTM! Correctly adds the abstract `tokenize` interface to the sparse embedding base class, maintaining consistency across embedding types. Also applies to: 48-49
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
4-4: LGTM! The `tokenize` implementation correctly follows the established pattern: it normalizes input and delegates to the underlying model, with proper documentation. Also applies to: 122-135
fastembed/sparse/sparse_text_embedding.py (2)
4-4: LGTM! The import is necessary for the `tokenize` method's return type annotation.
96-109: LGTM! The implementation correctly delegates to the underlying model's `tokenize` method and follows the same pattern as other embedding classes.
fastembed/sparse/splade_pp.py (1)
4-4: LGTM! The import is necessary for the `tokenize` method's return type annotation.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
fastembed/late_interaction/colbert.py (1)
83-88: Pass **kwargs through to tokenizer methods. The `**kwargs` parameter is declared but never passed to the underlying tokenization methods (`_tokenize_documents` and `_tokenize_query`), which in turn don't pass them to `encode_batch`. This breaks the documented interface contract that allows callers to pass additional arguments to the tokenizer. Apply this diff to pass kwargs through the call chain:

```diff
 def tokenize(self, documents: list[str], is_doc: bool = True, **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
     return (
-        self._tokenize_documents(documents=documents)
+        self._tokenize_documents(documents=documents, **kwargs)
         if is_doc
-        else self._tokenize_query(query=next(iter(documents)))
+        else self._tokenize_query(query=next(iter(documents)), **kwargs)
     )
```

Then update the helper methods:

```diff
-def _tokenize_query(self, query: str) -> list[Encoding]:
+def _tokenize_query(self, query: str, **kwargs: Any) -> list[Encoding]:
     assert self.query_tokenizer is not None
-    encoded = self.query_tokenizer.encode_batch([query])
+    encoded = self.query_tokenizer.encode_batch([query], **kwargs)
     return encoded

-def _tokenize_documents(self, documents: list[str]) -> list[Encoding]:
+def _tokenize_documents(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    encoded = self.tokenizer.encode_batch(documents)  # type: ignore[union-attr]
+    encoded = self.tokenizer.encode_batch(documents, **kwargs)  # type: ignore[union-attr]
     return encoded
```
♻️ Duplicate comments (1)
fastembed/sparse/splade_pp.py (1)
140-156: Critical issues from the previous review remain unaddressed. The past review feedback has not been fully addressed:
- Critical: `**kwargs` is still not passed to `tokenizer.encode_batch()` (line 156), breaking the documented interface that states "Additional arguments passed to the tokenizer".
- Major: The signature still doesn't match the supertype `OnnxTextModel`, which expects `def tokenize(self, documents: list[str], **kwargs: Any)`. The `type: ignore[override]` suppresses the error but doesn't fix the design issue. Per the architecture shown in the relevant snippets, the model class (SpladePP) should match `OnnxTextModel`'s signature with `list[str]`, while the wrapper class (SparseTextEmbedding) handles the `Union[str, Iterable[str]]` normalization.
- Minor: Using `assert` for the None check (line 155) can be bypassed with Python's `-O` flag. A `RuntimeError` would be more robust.

Apply this diff to fix all issues (as suggested in the previous review):

```diff
-    def tokenize(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
+    def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
         """
         Tokenize input texts using the model's tokenizer.

         Args:
-            texts: String or list of strings to tokenize
+            documents: List of strings to tokenize
             **kwargs: Additional arguments passed to the tokenizer

         Returns:
             List of tokenizer Encodings
         """
-        if isinstance(texts, str):
-            texts = [texts]
-        if not isinstance(texts, list):
-            texts = list(texts)
-        assert self.tokenizer is not None
-        return self.tokenizer.encode_batch(texts)
+        if self.tokenizer is None:
+            raise RuntimeError("Tokenizer not initialized")
+        return self.tokenizer.encode_batch(documents, **kwargs)
```

Note: The single-string normalization logic is already handled by `SparseTextEmbedding.tokenize` (lines 105-108 in the relevant snippets), so removing it here maintains proper separation of concerns.
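As a concrete illustration of what forwarding `**kwargs` enables (a sketch; `add_special_tokens` is an existing keyword of `encode_batch` in the `tokenizers` library, and the model name is just an example):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# With kwargs forwarded by tokenize(), callers could control special-token handling:
encodings = tokenizer.encode_batch(["hello world"], add_special_tokens=False)
print(encodings[0].tokens)  # no [CLS]/[SEP] wrappers
```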
🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)
163-172: Consider documenting or utilizing the `kwargs` parameter. The `**kwargs` parameter is unused in this implementation, though it appears in the standardized interface. Consider either:
- Forwarding kwargs to the tokenizer if supported: `encoded = self.tokenizer.encode_batch(texts_query, **kwargs)`
- Adding a docstring to clarify the parameter is reserved for future use or API consistency:

```python
def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
    """
    Tokenize input documents for ColPali queries.

    Args:
        documents: List of text queries to tokenize
        **kwargs: Reserved for future use (currently unused)

    Returns:
        List of tokenizer Encodings
    """
```
📜 Review details
📒 Files selected for processing (3)
- fastembed/late_interaction/colbert.py (1 hunk)
- fastembed/late_interaction_multimodal/colpali.py (1 hunk)
- fastembed/sparse/splade_pp.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
fastembed/late_interaction_multimodal/colpali.py (5)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/sparse/splade_pp.py (1): tokenize (140-156)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): tokenize (83-84)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1): tokenize (48-49)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
fastembed/late_interaction/colbert.py (3)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-172)
- fastembed/sparse/splade_pp.py (1): tokenize (140-156)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
fastembed/sparse/splade_pp.py (3)
- fastembed/sparse/sparse_embedding_base.py (1): tokenize (48-49)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (96-109)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
🪛 Ruff (0.14.1)
fastembed/late_interaction_multimodal/colpali.py
163-163: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction/colbert.py
83-83: Unused method argument: kwargs
(ARG002)
fastembed/sparse/splade_pp.py
140-140: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (2)
fastembed/sparse/splade_pp.py (1)
4-4: LGTM! The `Encoding` import is correctly added to support the new `tokenize` method's return type annotation.
fastembed/late_interaction_multimodal/colpali.py (1)
4-4: LGTM! Type annotations are correct. The `Encoding` import and method signature with `type: ignore[override]` properly align with the standardized tokenize interface being introduced across embedding classes. Also applies to: 163-163
Actionable comments posted: 4
📜 Review details
📒 Files selected for processing (4)
- fastembed/late_interaction/token_embeddings.py (1 hunk)
- fastembed/sparse/bm42.py (1 hunk)
- fastembed/sparse/minicoil.py (1 hunk)
- fastembed/text/onnx_embedding.py (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (4)
fastembed/late_interaction/token_embeddings.py (2)
- fastembed/text/onnx_embedding.py (1): OnnxTextEmbedding (186-330)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): LateInteractionTextEmbeddingBase (10-76)
fastembed/sparse/bm42.py (2)
- fastembed/sparse/sparse_embedding_base.py (2): SparseTextEmbeddingBase (35-92), SparseEmbedding (14-32)
- fastembed/text/onnx_text_model.py (1): OnnxTextModel (17-154)
fastembed/sparse/minicoil.py (2)
- fastembed/sparse/sparse_embedding_base.py (2): SparseTextEmbeddingBase (35-92), SparseEmbedding (14-32)
- fastembed/text/onnx_text_model.py (1): OnnxTextModel (17-154)
fastembed/text/onnx_embedding.py (2)
- fastembed/text/text_embedding_base.py (1): TextEmbeddingBase (9-75)
- fastembed/text/onnx_text_model.py (1): OnnxTextModel (17-154)
Actionable comments posted: 2
♻️ Duplicate comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)
170-171: Replace assertion with explicit error handling. The assertion can be silently disabled in optimized Python mode (the `-O` flag). Since `self.tokenizer` can be `None` when `lazy_load=True` and `load_onnx_model()` hasn't been called, use an explicit exception instead:

```diff
-    assert self.tokenizer is not None
+    if self.tokenizer is None:
+        raise RuntimeError("Tokenizer not initialized. Call load_onnx_model() first or set lazy_load=False.")
     encoded = self.tokenizer.encode_batch(texts_query)
```
📜 Review details
📒 Files selected for processing (16)
- fastembed/late_interaction/colbert.py (1 hunk)
- fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
- fastembed/late_interaction_multimodal/colpali.py (1 hunk)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunk)
- fastembed/sparse/sparse_embedding_base.py (2 hunks)
- fastembed/sparse/sparse_text_embedding.py (2 hunks)
- fastembed/sparse/splade_pp.py (2 hunks)
- fastembed/text/onnx_embedding.py (3 hunks)
- fastembed/text/onnx_text_model.py (1 hunk)
- fastembed/text/text_embedding.py (2 hunks)
- fastembed/text/text_embedding_base.py (2 hunks)
- tests/test_sparse_embeddings.py (1 hunk)
- tests/test_text_onnx_embeddings.py (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (7)
- fastembed/sparse/sparse_embedding_base.py
- tests/test_text_onnx_embeddings.py
- tests/test_sparse_embeddings.py
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
- fastembed/sparse/sparse_text_embedding.py
- fastembed/text/text_embedding_base.py
- fastembed/late_interaction/late_interaction_text_embedding.py
🧰 Additional context used
🧬 Code graph analysis (9)
fastembed/text/text_embedding.py (3)
- fastembed/text/onnx_embedding.py (1): tokenize (324-334)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
- fastembed/text/text_embedding_base.py (1): tokenize (23-24)
fastembed/late_interaction_multimodal/colpali.py (4)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): tokenize (83-84)
- fastembed/sparse/splade_pp.py (1): tokenize (140-153)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (8)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-172)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/text/onnx_embedding.py (1): tokenize (324-334)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
fastembed/late_interaction/colbert.py (3)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (4)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-172)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): tokenize (83-84)
fastembed/text/onnx_text_model.py (4)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/text/onnx_embedding.py (1): tokenize (324-334)
- fastembed/text/text_embedding.py (1): tokenize (166-177)
fastembed/late_interaction/late_interaction_embedding_base.py (6)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-172)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): tokenize (83-84)
fastembed/text/onnx_embedding.py (2)
- fastembed/text/text_embedding_base.py (2): TextEmbeddingBase (9-75), tokenize (23-24)
- fastembed/text/onnx_text_model.py (2): OnnxTextModel (17-154), tokenize (71-72)
fastembed/sparse/splade_pp.py (14)
- fastembed/late_interaction/colbert.py (1): tokenize (83-88)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-172)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): tokenize (83-84)
- fastembed/sparse/sparse_embedding_base.py (1): tokenize (48-49)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (96-107)
- fastembed/text/onnx_embedding.py (1): tokenize (324-334)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
- fastembed/text/text_embedding.py (1): tokenize (166-177)
- fastembed/text/text_embedding_base.py (1): tokenize (23-24)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1): tokenize (48-49)
🪛 GitHub Actions: type-checkers
fastembed/late_interaction_multimodal/onnx_multimodal_model.py
[error] 84-84: mypy: Item "None" of "Any | None" has no attribute "encode_batch" [union-attr]
[error] 84-84: mypy: Found 1 error in 1 file (checked 57 source files)
🪛 Ruff (0.14.1)
fastembed/late_interaction_multimodal/colpali.py
163-163: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py
83-83: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction/colbert.py
83-83: Unused method argument: kwargs
(ARG002)
fastembed/text/onnx_text_model.py
71-71: Unused method argument: kwargs
(ARG002)
fastembed/sparse/splade_pp.py
152-152: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (10)
fastembed/sparse/splade_pp.py (1)
140-153: LGTM! The tokenize implementation is correct, with a proper None check and kwargs forwarding. The past review concerns have been addressed.
fastembed/text/text_embedding.py (1)
166-177: LGTM! The tokenize method correctly delegates to the underlying model with proper documentation. The implementation follows the pattern established across the codebase.
fastembed/late_interaction/late_interaction_embedding_base.py (2)
3-4: LGTM! Import added to support the new tokenize method signature.
24-25: LGTM! Abstract method correctly declares the tokenize interface for subclasses to implement.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
3-3: LGTM! Import added to support the new tokenize method signature.
25-26: LGTM! Abstract method correctly declares the tokenize interface for subclasses to implement.
fastembed/text/onnx_embedding.py (2)
3-3: LGTM! Import added to support the new tokenize method return type.
324-334: LGTM! The tokenize method correctly delegates to the base `OnnxTextModel.tokenize`. The implementation is straightforward and well documented.
fastembed/late_interaction/colbert.py (1)
83-88: LGTM! The parameter rename from `documents` to `texts` maintains consistency with other tokenize methods. The `type: ignore[override]` is appropriate since this implementation has an additional `is_doc` parameter that intentionally differs from the base class signature (see the usage sketch after this list).
fastembed/late_interaction_multimodal/colpali.py (1)
163-163: The unused `kwargs` parameter is required by the base class signature. While static analysis flags `**kwargs` as unused, it's necessary to match the `LateInteractionMultimodalEmbeddingBase.tokenize` signature. This is not an issue.
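For illustration, a hypothetical call pattern for the `is_doc` behavior noted above (the model name is an example; the flag is forwarded to ColBERT's tokenize through the wrapper's `**kwargs`):

```python
from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding(model_name="colbert-ir/colbertv2.0")

doc_encodings = model.tokenize(["a passage to index"], is_doc=True)
query_encodings = model.tokenize("what is fastembed?", is_doc=False)
```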
Actionable comments posted: 1
🧹 Nitpick comments (1)
fastembed/sparse/bm25.py (1)
151-156: Move the inner class definition outside the method. The `SimpleEncoding` class is currently defined inside the `tokenize` method, meaning it's recreated on every call. For better performance, especially with high-frequency tokenization, define it at the class or module level (see the suggested fix in the previous comment).
📜 Review details
📒 Files selected for processing (4)
- fastembed/sparse/bm25.py (2 hunks)
- fastembed/sparse/bm42.py (3 hunks)
- fastembed/sparse/minicoil.py (3 hunks)
- tests/test_sparse_embeddings.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- fastembed/sparse/minicoil.py
- fastembed/sparse/bm42.py
- tests/test_sparse_embeddings.py
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/sparse/bm25.py (2)
- fastembed/text/onnx_text_model.py (1): tokenize (71-72)
- fastembed/sparse/utils/tokenizer.py (2): tokenize (8-12), tokenize (83-120)
🪛 Ruff (0.14.1)
fastembed/sparse/bm25.py
140-140: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (1)
fastembed/sparse/bm25.py (1)
140-140: The unused `kwargs` parameter is acceptable for API consistency. The static analysis tool flagged `kwargs` as unused, but this is likely intentional to maintain a consistent `tokenize` signature across all embedding classes (as seen in `OnnxTextModel.tokenize`). No action needed.
fastembed/sparse/bm25.py (outdated diff context):

```python
import mmh3
import numpy as np
from py_rust_stemmers import SnowballStemmer
from tokenizers import Encoding
```
Type annotation is incorrect and breaks the Encoding interface contract.
The method signature claims to return `list[Encoding]`, but the custom `SimpleEncoding` class doesn't properly implement the `Encoding` interface:
- `Encoding.ids` should be `list[int]` (token IDs), but here `self.ids = tokens` assigns a `list[str]`
- The `type: ignore` comments on lines 157-158 suppress type checker errors that correctly identify this mismatch
- Downstream code expecting actual `Encoding` objects with integer IDs will fail at runtime

Since BM25 uses string tokens rather than learned vocabulary IDs, consider one of these approaches:
Option 1 (Recommended): Return a proper protocol or custom type:

```diff
+from typing import Protocol
+
+class TokenizationResult(Protocol):
+    """Protocol for tokenization results."""
+    tokens: list[str]
+    ids: list[str]  # For BM25, tokens serve as IDs
+    attention_mask: list[int]
+
+# Move SimpleEncoding to class level
+class SimpleEncoding:
+    def __init__(self, tokens: list[str]):
+        self.tokens = tokens
+        self.ids = tokens
+        self.attention_mask = [1] * len(tokens)
+
 class Bm25(SparseTextEmbeddingBase):
     ...
-    def tokenize(self, texts: list[str], **kwargs: Any) -> list[Encoding]:
+    def tokenize(self, texts: list[str], **kwargs: Any) -> list[SimpleEncoding]:
         """Tokenize texts using SimpleTokenizer.
-        Returns a list of simple Encoding-like objects with token strings.
+        Returns a list of SimpleEncoding objects with token strings.
         Note: BM25 uses a simple word tokenizer, not a learned tokenizer.
+        Unlike standard Encoding objects, ids contain string tokens, not integers.
         """
         result = []
         for text in texts:
             tokens = self.tokenizer.tokenize(text)
-
-            # Create a simple object that mimics Encoding interface
-            class SimpleEncoding:
-                def __init__(self, tokens: list[str]):
-                    self.tokens = tokens
-                    self.ids = tokens  # For BM25, tokens are the IDs
-                    self.attention_mask = [1] * len(tokens)
-
-            result.append(SimpleEncoding(tokens))  # type: ignore[arg-type]
-        return result  # type: ignore[return-value]
+            result.append(SimpleEncoding(tokens))
+        return result
```

Option 2: Hash tokens to integers to match the `Encoding.ids` contract:

```diff
 def tokenize(self, texts: list[str], **kwargs: Any) -> list[Encoding]:
     """Tokenize texts using SimpleTokenizer and hash to integer IDs."""
     result = []
     for text in texts:
         tokens = self.tokenizer.tokenize(text)
         class SimpleEncoding:
             def __init__(self, tokens: list[str]):
                 self.tokens = tokens
-                self.ids = tokens
+                self.ids = [Bm25.compute_token_id(t) for t in tokens]
                 self.attention_mask = [1] * len(tokens)
-        result.append(SimpleEncoding(tokens))  # type: ignore[arg-type]
-        return result  # type: ignore[return-value]
+        result.append(SimpleEncoding(tokens))
+    return result
```

Also applies to: 140-158
🤖 Prompt for AI Agents
In fastembed/sparse/bm25.py around lines 10 and 140-158, the function currently
claims to return list[Encoding] but the SimpleEncoding class assigns string
tokens to self.ids (violating Encoding.ids: list[int]) and suppresses type
errors with type: ignore; fix by not using tokenizers.Encoding as the return
type: define a lightweight custom Protocol or dataclass (e.g.,
SimpleTokenEncoding) with fields matching how BM25 uses them (tokens: list[str],
offsets, attention_mask, etc.), update the function signature to return
List[SimpleTokenEncoding], remove the type: ignore comments, and adjust any
downstream code to use the new type; alternatively, if you must keep the
Encoding type, convert/hash tokens to integer IDs and ensure Encoding.ids is
list[int] and all Encoding interface fields are implemented consistently.
fastembed/sparse/bm25.py (outdated diff context):

```python
class SimpleEncoding:
    def __init__(self, tokens: list[str]):
        self.tokens = tokens
        self.ids = tokens  # For BM25, tokens are the IDs
        self.attention_mask = [1] * len(tokens)
```
not sure yet if we want to mimic Encoding here
but we definitely don't want to create a class in a loop
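A minimal sketch of that hoisting (an assumption about the eventual shape, not the committed fix): define the container once at module level instead of per call.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleEncoding:
    """Encoding-like container for BM25's word tokenizer, defined once at module level."""
    tokens: list[str]
    ids: list[str] = field(init=False)            # for BM25, tokens double as IDs
    attention_mask: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.ids = self.tokens
        self.attention_mask = [1] * len(self.tokens)
```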
Actionable comments posted: 3
🧹 Nitpick comments (2)
fastembed/late_interaction/colbert.py (1)
86-93: Consider forwarding kwargs for consistency. The `**kwargs` parameter is declared but not forwarded to the internal `_tokenize_documents` and `_tokenize_query` methods. While these methods don't currently use kwargs, forwarding them would maintain consistency with other tokenize implementations in this PR and enable future customization without changing the signature. If you'd like to forward kwargs for consistency, apply this diff:

```diff
 def _tokenize(
     self, documents: list[str], is_doc: bool = True, **kwargs: Any
 ) -> list[Encoding]:
     return (
-        self._tokenize_documents(documents=documents)
+        self._tokenize_documents(documents=documents, **kwargs)
         if is_doc
-        else self._tokenize_query(query=next(iter(documents)))
+        else self._tokenize_query(query=next(iter(documents)), **kwargs)
     )
```

And update the helper methods:

```diff
-def _tokenize_query(self, query: str) -> list[Encoding]:
+def _tokenize_query(self, query: str, **kwargs: Any) -> list[Encoding]:
     assert self.query_tokenizer is not None
     encoded = self.query_tokenizer.encode_batch([query])
     return encoded

-def _tokenize_documents(self, documents: list[str]) -> list[Encoding]:
+def _tokenize_documents(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
     encoded = self.tokenizer.encode_batch(documents)  # type: ignore[union-attr]
     return encoded
```

fastembed/sparse/sparse_embedding_base.py (1)
47-48: Add a docstring and tests for the new public API method. The `tokenize` method establishes a public API interface but lacks documentation. Consider adding a docstring that explains the expected parameters, return value structure, and intended usage pattern for when implementations become available. Additionally, the PR checklist indicates no tests were added for this feature. While the current implementation is a stub, adding tests that verify the NotImplementedError behavior would establish a baseline for future implementations.
Apply this diff to add a docstring:

```diff
 def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
+    """
+    Tokenizes a list of documents.
+
+    Args:
+        documents (list[str]): The list of documents to tokenize.
+        **kwargs: Additional keyword arguments for tokenization.
+
+    Returns:
+        dict[str, Any]: Tokenized representation of the documents.
+
+    Raises:
+        NotImplementedError: Tokenization is not yet implemented for sparse embeddings.
+    """
     raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
```
📜 Review details
📒 Files selected for processing (21)
- fastembed/late_interaction/colbert.py (1 hunk)
- fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
- fastembed/late_interaction_multimodal/colpali.py (1 hunk)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunk)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunk)
- fastembed/sparse/bm25.py (1 hunk)
- fastembed/sparse/bm42.py (1 hunk)
- fastembed/sparse/minicoil.py (1 hunk)
- fastembed/sparse/sparse_embedding_base.py (1 hunk)
- fastembed/sparse/sparse_text_embedding.py (2 hunks)
- fastembed/sparse/splade_pp.py (2 hunks)
- fastembed/text/onnx_embedding.py (2 hunks)
- fastembed/text/onnx_text_model.py (1 hunk)
- fastembed/text/text_embedding.py (2 hunks)
- fastembed/text/text_embedding_base.py (2 hunks)
- tests/test_late_interaction_embeddings.py (2 hunks)
- tests/test_sparse_embeddings.py (1 hunk)
- tests/test_text_onnx_embeddings.py (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (11)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py
- fastembed/sparse/splade_pp.py
- tests/test_sparse_embeddings.py
- fastembed/text/onnx_embedding.py
- tests/test_text_onnx_embeddings.py
- fastembed/sparse/bm42.py
- fastembed/text/text_embedding_base.py
- fastembed/sparse/bm25.py
- tests/test_late_interaction_embeddings.py
- fastembed/late_interaction/late_interaction_embedding_base.py
🧰 Additional context used
🧬 Code graph analysis (10)
fastembed/sparse/sparse_embedding_base.py (6)
- fastembed/sparse/bm25.py (1): tokenize (139-140)
- fastembed/sparse/bm42.py (1): tokenize (133-134)
- fastembed/sparse/minicoil.py (1): tokenize (141-142)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (95-96)
- fastembed/sparse/splade_pp.py (1): tokenize (139-140)
- fastembed/sparse/utils/tokenizer.py (1): tokenize (8-12)
fastembed/late_interaction/colbert.py (6)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (2): tokenize (163-164), _tokenize (166-175)
- fastembed/text/onnx_embedding.py (1): tokenize (324-325)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): _tokenize (83-86)
- fastembed/text/onnx_text_model.py (1): _tokenize (71-72)
fastembed/text/text_embedding.py (7)
- fastembed/late_interaction/colbert.py (1): tokenize (83-84)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-164)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1): tokenize (48-49)
- fastembed/text/onnx_embedding.py (1): tokenize (324-325)
- fastembed/text/text_embedding_base.py (1): tokenize (23-24)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (4)
- fastembed/late_interaction/colbert.py (1): _tokenize (86-93)
- fastembed/late_interaction_multimodal/colpali.py (1): _tokenize (166-175)
- fastembed/text/onnx_text_model.py (1): _tokenize (71-72)
- fastembed/common/onnx_model.py (1): OnnxOutputContext (20-23)
fastembed/late_interaction_multimodal/colpali.py (3)
- fastembed/late_interaction/colbert.py (1): _tokenize (86-93)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): _tokenize (83-86)
- fastembed/text/onnx_text_model.py (1): _tokenize (71-72)
fastembed/late_interaction/late_interaction_text_embedding.py (4)
- fastembed/late_interaction/colbert.py (1): tokenize (83-84)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/text/text_embedding.py (1): tokenize (166-177)
fastembed/sparse/minicoil.py (5)
- fastembed/sparse/bm25.py (1): tokenize (139-140)
- fastembed/sparse/bm42.py (1): tokenize (133-134)
- fastembed/sparse/sparse_embedding_base.py (1): tokenize (47-48)
- fastembed/sparse/sparse_text_embedding.py (1): tokenize (95-96)
- fastembed/sparse/splade_pp.py (1): tokenize (139-140)
fastembed/sparse/sparse_text_embedding.py (2)
- fastembed/common/model_description.py (1): SparseModelDescription (44-46)
- fastembed/sparse/sparse_embedding_base.py (1): tokenize (47-48)
fastembed/text/onnx_text_model.py (4)
- fastembed/late_interaction/colbert.py (1): _tokenize (86-93)
- fastembed/late_interaction_multimodal/colpali.py (1): _tokenize (166-175)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1): _tokenize (83-86)
- fastembed/common/onnx_model.py (2): onnx_embed (119-120), OnnxOutputContext (20-23)
fastembed/rerank/cross_encoder/onnx_text_model.py (6)
- fastembed/late_interaction/colbert.py (1): tokenize (83-84)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-164)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/text/onnx_embedding.py (1): tokenize (324-325)
- fastembed/text/text_embedding.py (1): tokenize (166-177)
🪛 Ruff (0.14.3)
fastembed/late_interaction/colbert.py
87-87: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py
85-85: Avoid specifying long messages outside the exception class
(TRY003)
fastembed/late_interaction_multimodal/colpali.py
166-166: Unused method argument: kwargs
(ARG002)
fastembed/text/onnx_text_model.py
71-71: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (4)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
119-130: LGTM! The tokenize method implementation is clean and consistent with the pattern established across other embedding classes in this PR.
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
48-49: LGTM! The change properly forwards `**kwargs` to `encode_batch`, enabling customizable tokenization behavior for reranking pairs.
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
83-86: LGTM! The `_tokenize` method properly guards against an uninitialized tokenizer and correctly forwards `**kwargs` to `encode_batch`.
fastembed/sparse/sparse_text_embedding.py (1)
1-7: The imports are not accidentally included. The git diff shows they were reorganized from their original positions in main (where they were at the bottom of the imports section) to follow PEP 8 style conventions (stdlib imports first, then project imports). Both imports are actively used: `warnings` on line 67 for a deprecation warning, and `SparseModelDescription` on lines 48-49 in the `_list_supported_models` method type hints. This reorganization is a legitimate cleanup that accompanies the tokenize method addition.
Likely an incorrect or invalid review comment.
fastembed/late_interaction_multimodal/colpali.py:

```diff
 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
     texts_query: list[str] = []
     for query in documents:
         query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
         query += "\n"
         texts_query.append(query)
-    encoded = self.tokenizer.encode_batch(texts_query)  # type: ignore[union-attr]
+    assert self.tokenizer is not None
+    encoded = self.tokenizer.encode_batch(texts_query)
     return encoded
```
Forward kwargs to enable tokenization customization.
The **kwargs parameter is declared but not forwarded to encode_batch at line 174. This prevents callers from customizing tokenization behavior. Other implementations in this PR forward kwargs (e.g., onnx_multimodal_model.py).
Apply this diff:
```diff
 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
     texts_query: list[str] = []
     for query in documents:
         query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
         query += "\n"
         texts_query.append(query)
     assert self.tokenizer is not None
-    encoded = self.tokenizer.encode_batch(texts_query)
+    encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
     return encoded
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
    texts_query: list[str] = []
    for query in documents:
        query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
        query += "\n"
        texts_query.append(query)
    assert self.tokenizer is not None
    encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
    return encoded
```
🧰 Tools
🪛 Ruff (0.14.3)
166-166: Unused method argument: kwargs
(ARG002)
🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colpali.py around lines 166 to 175, the
method _tokenize accepts **kwargs but does not pass them to
self.tokenizer.encode_batch; update the call to forward kwargs (e.g.,
self.tokenizer.encode_batch(texts_query, **kwargs)) while keeping the existing
assertion that tokenizer is not None so callers can customize tokenization
options like truncation, padding, or return_tensors.
```python
def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
```
🛠️ Refactor suggestion | 🟠 Major
Remove duplicate implementation or delegate to self.model.
The tokenize method duplicates the base class implementation exactly. Since SparseTextEmbedding follows a delegation pattern for other methods (embed delegates to self.model.embed(), query_embed delegates to self.model.query_embed()), the tokenize method should either:
- Preferred: Delegate to `self.model.tokenize()` for consistency with the existing pattern, allowing concrete implementations to provide their own tokenization logic in the future.
- Alternative: Remove this override entirely and inherit the base class implementation.
The current implementation breaks the delegation pattern and creates unnecessary duplication.
Option 1 (Preferred): Delegate to self.model
```diff
 def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
-    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
+    """
+    Tokenizes a list of documents.
+
+    Args:
+        documents (list[str]): The list of documents to tokenize.
+        **kwargs: Additional keyword arguments for tokenization.
+
+    Returns:
+        dict[str, Any]: Tokenized representation of the documents.
+    """
+    return self.model.tokenize(documents, **kwargs)
```

Option 2 (Alternative): Remove the override
```diff
-def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
-    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
-```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In fastembed/sparse/sparse_text_embedding.py around lines 95-96, the tokenize
override duplicates the base class implementation and breaks the delegation
pattern; change the method to delegate to the underlying model by returning
self.model.tokenize(documents, **kwargs) (preferred), or remove this override
entirely so the class inherits the base implementation; ensure to pass through
kwargs and return the result unchanged.
```python
def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]
```
Forward kwargs to enable tokenization customization.
The **kwargs parameter is declared but not forwarded to encode_batch. This breaks customization options that callers might pass through the tokenization chain. Other implementations in this PR (e.g., onnx_multimodal_model.py, onnx_text_model.py in rerank) forward kwargs.
Apply this diff:
```diff
 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]
+    return self.tokenizer.encode_batch(documents, **kwargs)  # type:ignore[union-attr]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
    return self.tokenizer.encode_batch(documents, **kwargs)  # type:ignore[union-attr]
```
🧰 Tools
🪛 Ruff (0.14.3)
71-71: Unused method argument: kwargs
(ARG002)
🤖 Prompt for AI Agents
In fastembed/text/onnx_text_model.py around lines 71 to 72, the _tokenize method
accepts **kwargs but does not forward them to tokenizer.encode_batch; update the
return to call self.tokenizer.encode_batch(documents, **kwargs) (preserving the
existing type-ignore if needed) so callers can customize tokenization options
and match other implementations in this PR.
Actionable comments posted: 0
🧹 Nitpick comments (3)
fastembed/text/text_embedding_base.py (1)
23-24: Consider alignment with existing abstract method patterns. The new `tokenize` method includes a descriptive message in `NotImplementedError`, while other abstract methods in this class (lines 33, 70) raise `NotImplementedError()` without a message. For consistency, consider either:
- Removing the message: `raise NotImplementedError()`
- Adding the message to other abstract methods

Additionally, consider adding a docstring to document the expected behavior for subclasses, similar to the docstrings in concrete implementations like `TextEmbedding.tokenize` (fastembed/text/text_embedding.py, lines 165-176). Both styles are sketched after these comments.
fastembed/late_interaction/late_interaction_embedding_base.py (1)
24-25: Consider alignment with existing abstract method patterns. Similar to the comment on TextEmbeddingBase, the new `tokenize` method includes a descriptive message in `NotImplementedError`, while the `embed` method (line 34) raises `NotImplementedError()` without a message. Consider standardizing the approach across abstract methods in this class. Additionally, consider adding a docstring to document the expected behavior for subclasses, matching the pattern used in concrete implementations like `LateInteractionTextEmbedding.tokenize` (fastembed/late_interaction/late_interaction_text_embedding.py, lines 118-129).
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
25-26: Consider alignment with existing patterns and verify test coverage. Similar to the other base classes, the new `tokenize` method includes a descriptive message in `NotImplementedError`, while the `embed_text` and `embed_image` methods (lines 50, 72) raise `NotImplementedError()` without messages. Consider standardizing the approach. Additionally, consider adding a docstring matching the pattern in `LateInteractionMultimodalEmbedding.tokenize` (fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py, lines 121-132).
Test coverage: The PR checklist indicates "Added tests for the feature: ⛔", but the AI summary mentions tests were added in tests/test_late_interaction_embeddings.py and tests/test_late_interaction_multimodal.py. Please verify that the new public `tokenize` API has adequate test coverage across all embedding types.
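For illustration, the two abstract-method styles these comments contrast (a sketch; the error message text is illustrative, not taken from the diff):

```python
from typing import Any
from tokenizers import Encoding

class TextEmbeddingBase:
    # Style used by the existing abstract methods: bare exception, no message.
    def embed(self, documents: Any, **kwargs: Any) -> Any:
        raise NotImplementedError()

    # Style introduced by this PR: descriptive message.
    def tokenize(self, texts: Any, **kwargs: Any) -> list[Encoding]:
        raise NotImplementedError("Tokenize is not implemented for this model.")
```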
📜 Review details
📒 Files selected for processing (3)
- fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
- fastembed/text/text_embedding_base.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
fastembed/text/text_embedding_base.py (9)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-164)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/late_interaction/colbert.py (1): tokenize (83-84)
- fastembed/text/onnx_embedding.py (1): tokenize (324-325)
- fastembed/text/text_embedding.py (1): tokenize (166-177)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1): tokenize (48-49)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (6)
- fastembed/late_interaction/late_interaction_embedding_base.py (1): tokenize (24-25)
- fastembed/text/text_embedding_base.py (1): tokenize (23-24)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/colpali.py (1): tokenize (163-164)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/late_interaction/colbert.py (1): tokenize (83-84)
fastembed/late_interaction/late_interaction_embedding_base.py (7)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1): tokenize (25-26)
- fastembed/text/text_embedding_base.py (1): tokenize (23-24)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): tokenize (119-130)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1): tokenize (122-133)
- fastembed/late_interaction/colbert.py (1): tokenize (83-84)
- fastembed/text/onnx_embedding.py (1): tokenize (324-325)
- fastembed/text/text_embedding.py (1): tokenize (166-177)
🔇 Additional comments (3)
fastembed/text/text_embedding_base.py (1)
2-2: LGTM! Import is necessary for the type hint. The `Encoding` import from `tokenizers` is correctly added to support the return type annotation of the new `tokenize` method.
fastembed/late_interaction/late_interaction_embedding_base.py (1)
3-3: LGTM! Import is necessary for the type hint. The `Encoding` import from `tokenizers` is correctly added to support the return type annotation of the new `tokenize` method.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
3-3: LGTM! Import is necessary for the type hint. The `Encoding` import from `tokenizers` is correctly added to support the return type annotation of the new `tokenize` method.
We've decided not to expose `tokenize` for sparse embeddings.
- chore: remove tokenize impl from sparse embedding
- chore: remove tokenize test
- fix: return type hint

Co-authored-by: George Panchuk <[email protected]>
Force-pushed fe6d0dd to 683eba2
…ed token counts; raise NotImplemented for sparse
Revert "…non-padded token counts; raise NotImplemented for sparse" (this reverts commit a26aad6)
closing in favour of #583
Expose a `tokenize` method instead of using `embedding_model.model.tokenize`.

All Submissions:
New Feature Submissions:
- Have you installed `pre-commit` with `pip3 install pre-commit` and set up hooks with `pre-commit install`?
New models submission:
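For reference, the caller-facing change described above, as a sketch (`embedding_model` stands for any `TextEmbedding` instance):

```python
# Before this PR: reaching into the private model attribute.
encodings = embedding_model.model.tokenize(["hello world"])

# After this PR: public method on the wrapper class.
encodings = embedding_model.tokenize(["hello world"])
```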