
Conversation

@dancixx dancixx commented Oct 21, 2025

Expose tokenize method instead of using embedding_model.model.tokenize.
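
Illustrative usage, as a rough sketch (the model name is just an example, not part of this PR):

    from fastembed import TextEmbedding

    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

    # before: reach through the wrapper into the underlying model
    encodings = model.model.tokenize(["hello world"])

    # after: call the public method on the wrapper directly
    encodings = model.tokenize("hello world")  # a single string is normalized to a list
    print(encodings[0].ids)  # token ids from the returned tokenizers.Encoding objects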

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • Does your submission pass the existing tests?
  • Have you added tests for your feature?
  • Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

  • Have you added an explanation of why it's important to include this model?
  • Have you added tests for the new model? Were canonical values for tests computed via the original model?
  • Have you added the code snippet for how canonical values were computed?
  • Have you successfully run tests with your changes locally?

@dancixx dancixx requested a review from joein October 21, 2025 15:37
coderabbitai bot commented Oct 21, 2025

📝 Walkthrough

Walkthrough

This PR adds a public tokenize API across multiple embedding classes and their bases. Base classes (TextEmbeddingBase, LateInteractionTextEmbeddingBase, LateInteractionMultimodalEmbeddingBase, SparseTextEmbeddingBase) receive a tokenize method (raising NotImplementedError). Dense/late-interaction/onnx implementations delegate to underlying model tokenizers and return list[Encoding]. Sparse implementations add tokenize stubs returning dict[str, Any] and currently raise NotImplementedError. Several model-level methods were renamed/redispatched (e.g., public tokenize → private _tokenize in some ONNX/multimodal modules). Tests for tokenize were added for multiple model types.
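
In sketch form, the shape described above (abbreviated, not the verbatim source; the sparse error message matches the stub quoted later in this review):

    from typing import Any
    from tokenizers import Encoding

    class TextEmbeddingBase:
        # dense base classes declare the interface and raise by default
        def tokenize(self, documents, **kwargs: Any) -> list[Encoding]:
            raise NotImplementedError()

    class SparseTextEmbeddingBase:
        # sparse bases stub the method with a dict[str, Any] return type
        def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
            raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")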

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas needing extra attention:
    • ColBERT refactor: removal of is_doc from public signature and addition of _tokenize delegator.
    • Consistency of return types between dense (list[Encoding]) and sparse (dict[str, Any]) tokenizers and callers.
    • Visibility changes: public tokenize → private _tokenize in ONNX / multimodal modules and updated call sites.
    • Tests added for tokenization behavior (validate CI cleanup and model-specific assumptions).
    • New NotImplementedError stubs in sparse classes (ensure callers expect/handle this).

Possibly related PRs

  • MiniCOIL v1 #513 — Modifies MiniCOIL implementation; relates to this PR which added a tokenize stub to minicoil.py.

Suggested reviewers

  • generall

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 22.81%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately describes the main feature being added: a public tokenize method exposed on the TextEmbedding class instead of requiring access to the underlying model.
  • Description check: ✅ Passed. The description clearly states the purpose of the PR - exposing a public tokenize method on the TextEmbedding class rather than using embedding_model.model.tokenize - and includes relevant submission checklist items.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba1f605 and 221a331.

📒 Files selected for processing (2)
  • fastembed/text/text_embedding.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/text/text_embedding.py (5)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-171)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/text/text_embedding_base.py (5)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-171)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (4)
fastembed/text/text_embedding_base.py (2)

2-2: LGTM: Import is correct.

The Encoding type from the tokenizers library is the appropriate return type for tokenization methods.


23-24: LGTM: Abstract method design is appropriate.

The abstract tokenize method properly defines the interface contract for subclasses. The flexible signature accepting both single strings and iterables with **kwargs is well-designed.

fastembed/text/text_embedding.py (2)

3-3: LGTM: Import is consistent.

The import matches the base class and is correct for the return type annotation.


166-179: Implementation is correct; all registered models have tokenize method available.

Verification confirms that all six models in EMBEDDINGS_REGISTRY (OnnxTextEmbedding, CLIPOnnxEmbedding, PooledEmbedding, PooledNormalizedEmbedding, JinaEmbeddingV3, CustomTextEmbedding) inherit directly or indirectly from OnnxTextEmbedding, which implements the tokenize method. The delegation pattern and string normalization logic are sound.
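
For illustration, the delegation and string-normalization pattern verified here looks roughly like this (a sketch, not the verbatim source):

    from typing import Any, Iterable, Union
    from tokenizers import Encoding

    class TextEmbedding:
        # self.model is the concrete implementation picked from EMBEDDINGS_REGISTRY
        def tokenize(self, documents: Union[str, Iterable[str]], **kwargs: Any) -> list[Encoding]:
            if isinstance(documents, str):
                documents = [documents]  # normalize a single string to a list
            return self.model.tokenize(list(documents), **kwargs)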

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 221a331 and 555a362.

📒 Files selected for processing (11)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/sparse/sparse_embedding_base.py (2 hunks)
  • fastembed/sparse/sparse_text_embedding.py (2 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
  • tests/test_late_interaction_embeddings.py (1 hunks)
  • tests/test_late_interaction_multimodal.py (1 hunks)
  • tests/test_sparse_embeddings.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/test_late_interaction_multimodal.py
🧰 Additional context used
🧬 Code graph analysis (10)
tests/test_late_interaction_embeddings.py (3)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
tests/utils.py (1)
  • delete_model_cache (11-39)
fastembed/late_interaction/late_interaction_text_embedding.py (4)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/sparse/sparse_embedding_base.py (8)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/sparse/splade_pp.py (3)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction/late_interaction_embedding_base.py (3)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
tests/test_sparse_embeddings.py (5)
tests/test_late_interaction_embeddings.py (1)
  • test_tokenize (298-315)
tests/test_text_onnx_embeddings.py (1)
  • test_tokenize (181-198)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
tests/utils.py (1)
  • delete_model_cache (11-39)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
tests/test_text_onnx_embeddings.py (6)
tests/test_late_interaction_embeddings.py (1)
  • test_tokenize (298-315)
tests/test_late_interaction_multimodal.py (1)
  • test_tokenize (107-123)
tests/test_sparse_embeddings.py (1)
  • test_tokenize (250-267)
fastembed/text/text_embedding.py (2)
  • TextEmbedding (17-230)
  • tokenize (166-179)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
tests/utils.py (1)
  • delete_model_cache (11-39)
fastembed/sparse/sparse_text_embedding.py (5)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-171)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
🪛 GitHub Actions: type-checkers
fastembed/sparse/splade_pp.py

[error] 140-140: Signature of "tokenize" incompatible with supertype "OnnxTextModel"


[error] 153-153: Item "None" of "Any | None" has no attribute "encode_batch"

🪛 Ruff (0.14.1)
fastembed/sparse/splade_pp.py

140-140: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
🔇 Additional comments (11)
tests/test_text_onnx_embeddings.py (1)

180-198: LGTM!

The test correctly validates the tokenize method for both single string and list inputs, ensuring encodings have non-empty ids.
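
The test described is roughly of this shape (a sketch; model name and inputs are illustrative):

    from fastembed import TextEmbedding

    def test_tokenize():
        model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
        single = model.tokenize("hello world")
        batch = model.tokenize(["hello world", "foo bar"])
        assert len(single) == 1 and len(batch) == 2
        assert all(len(encoding.ids) > 0 for encoding in single + batch)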

fastembed/late_interaction/late_interaction_embedding_base.py (1)

3-3: LGTM!

Correctly adds abstract tokenize interface to the base class, establishing the contract for subclasses.

Also applies to: 24-25

tests/test_sparse_embeddings.py (1)

249-267: LGTM!

The test correctly validates the tokenize method for sparse embeddings with both single string and list inputs.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

3-3: LGTM!

Correctly adds abstract tokenize interface to the multimodal embedding base class.

Also applies to: 25-26

fastembed/late_interaction/late_interaction_text_embedding.py (1)

4-4: LGTM!

The tokenize implementation correctly normalizes input and delegates to the underlying model. Documentation is clear and complete.

Also applies to: 119-132

tests/test_late_interaction_embeddings.py (1)

297-315: LGTM!

The test correctly validates the tokenize method for late interaction embeddings, including the model-specific is_doc parameter for document vs query tokenization.
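
A sketch of what such a test exercises (model name and texts are illustrative; is_doc is the ColBERT-specific flag mentioned above):

    from fastembed import LateInteractionTextEmbedding

    model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
    doc_encodings = model.tokenize(["a passage to index"])                # document tokenizer path
    query_encodings = model.tokenize(["what is colbert?"], is_doc=False)  # query tokenizer path
    assert all(len(e.ids) > 0 for e in doc_encodings + query_encodings)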

fastembed/sparse/sparse_embedding_base.py (1)

6-6: LGTM!

Correctly adds abstract tokenize interface to the sparse embedding base class, maintaining consistency across embedding types.

Also applies to: 48-49

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

4-4: LGTM!

The tokenize implementation correctly follows the established pattern: normalizes input and delegates to the underlying model with proper documentation.

Also applies to: 122-135

fastembed/sparse/sparse_text_embedding.py (2)

4-4: LGTM!

The import is necessary for the tokenize method's return type annotation.


96-109: LGTM!

The implementation correctly delegates to the underlying model's tokenize method and follows the same pattern as other embedding classes.

fastembed/sparse/splade_pp.py (1)

4-4: LGTM!

The import is necessary for the tokenize method's return type annotation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fastembed/late_interaction/colbert.py (1)

83-88: Pass **kwargs through to tokenizer methods.

The **kwargs parameter is declared but never passed to the underlying tokenization methods (_tokenize_documents and _tokenize_query), which in turn don't pass them to encode_batch. This breaks the documented interface contract that allows callers to pass additional arguments to the tokenizer.

Apply this diff to pass kwargs through the call chain:

 def tokenize(self, documents: list[str], is_doc: bool = True, **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
     return (
-        self._tokenize_documents(documents=documents)
+        self._tokenize_documents(documents=documents, **kwargs)
         if is_doc
-        else self._tokenize_query(query=next(iter(documents)))
+        else self._tokenize_query(query=next(iter(documents)), **kwargs)
     )

Then update the helper methods:

-def _tokenize_query(self, query: str) -> list[Encoding]:
+def _tokenize_query(self, query: str, **kwargs: Any) -> list[Encoding]:
     assert self.query_tokenizer is not None
-    encoded = self.query_tokenizer.encode_batch([query])
+    encoded = self.query_tokenizer.encode_batch([query], **kwargs)
     return encoded

-def _tokenize_documents(self, documents: list[str]) -> list[Encoding]:
+def _tokenize_documents(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    encoded = self.tokenizer.encode_batch(documents)  # type: ignore[union-attr]
+    encoded = self.tokenizer.encode_batch(documents, **kwargs)  # type: ignore[union-attr]
     return encoded
♻️ Duplicate comments (1)
fastembed/sparse/splade_pp.py (1)

140-156: Critical issues from previous review remain unaddressed.

The past review feedback has not been fully addressed:

  1. Critical: **kwargs is still not passed to tokenizer.encode_batch() (line 156), breaking the documented interface that states "Additional arguments passed to the tokenizer".

  2. Major: The signature still doesn't match the supertype OnnxTextModel, which expects def tokenize(self, documents: list[str], **kwargs: Any). The type: ignore[override] suppresses the error but doesn't fix the design issue. Per the architecture shown in relevant snippets, the model class (SpladePP) should match OnnxTextModel's signature with list[str], while the wrapper class (SparseTextEmbedding) handles the Union[str, Iterable[str]] normalization.

  3. Minor: Using assert for the None check (line 155) can be bypassed with Python's -O flag. A RuntimeError would be more robust.

Apply this diff to fix all issues (as suggested in the previous review):

-    def tokenize(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
+    def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
         """
         Tokenize input texts using the model's tokenizer.
 
         Args:
-            texts: String or list of strings to tokenize
+            documents: List of strings to tokenize
             **kwargs: Additional arguments passed to the tokenizer
 
         Returns:
             List of tokenizer Encodings
         """
-        if isinstance(texts, str):
-            texts = [texts]
-        if not isinstance(texts, list):
-            texts = list(texts)
-        assert self.tokenizer is not None
-        return self.tokenizer.encode_batch(texts)
+        if self.tokenizer is None:
+            raise RuntimeError("Tokenizer not initialized")
+        return self.tokenizer.encode_batch(documents, **kwargs)

Note: The single-string normalization logic is already handled by SparseTextEmbedding.tokenize (lines 105-108 in relevant snippets), so removing it here maintains proper separation of concerns.

🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)

163-172: Consider documenting or utilizing the kwargs parameter.

The **kwargs parameter is unused in this implementation, though it appears in the standardized interface. Consider either:

  1. Forwarding kwargs to the tokenizer if supported:
     encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
  2. Adding a docstring to clarify the parameter is reserved for future use or API consistency:
def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
    """
    Tokenize input documents for ColPali queries.
    
    Args:
        documents: List of text queries to tokenize
        **kwargs: Reserved for future use (currently unused)
        
    Returns:
        List of tokenizer Encodings
    """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 555a362 and 514735f.

📒 Files selected for processing (3)
  • fastembed/late_interaction/colbert.py (1 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (1 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
fastembed/late_interaction_multimodal/colpali.py (5)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-156)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction/colbert.py (3)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-156)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/sparse/splade_pp.py (3)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
🪛 Ruff (0.14.1)
fastembed/late_interaction_multimodal/colpali.py

163-163: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction/colbert.py

83-83: Unused method argument: kwargs

(ARG002)

fastembed/sparse/splade_pp.py

140-140: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (2)
fastembed/sparse/splade_pp.py (1)

4-4: LGTM!

The Encoding import is correctly added to support the new tokenize method's return type annotation.

fastembed/late_interaction_multimodal/colpali.py (1)

4-4: LGTM! Type annotations are correct.

The Encoding import and method signature with type: ignore[override] properly align with the standardized tokenize interface being introduced across embedding classes.

Also applies to: 163-163

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 514735f and b8c0eef.

📒 Files selected for processing (4)
  • fastembed/late_interaction/token_embeddings.py (1 hunks)
  • fastembed/sparse/bm42.py (1 hunks)
  • fastembed/sparse/minicoil.py (1 hunks)
  • fastembed/text/onnx_embedding.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
fastembed/late_interaction/token_embeddings.py (2)
fastembed/text/onnx_embedding.py (1)
  • OnnxTextEmbedding (186-330)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • LateInteractionTextEmbeddingBase (10-76)
fastembed/sparse/bm42.py (2)
fastembed/sparse/sparse_embedding_base.py (2)
  • SparseTextEmbeddingBase (35-92)
  • SparseEmbedding (14-32)
fastembed/text/onnx_text_model.py (1)
  • OnnxTextModel (17-154)
fastembed/sparse/minicoil.py (2)
fastembed/sparse/sparse_embedding_base.py (2)
  • SparseTextEmbeddingBase (35-92)
  • SparseEmbedding (14-32)
fastembed/text/onnx_text_model.py (1)
  • OnnxTextModel (17-154)
fastembed/text/onnx_embedding.py (2)
fastembed/text/text_embedding_base.py (1)
  • TextEmbeddingBase (9-75)
fastembed/text/onnx_text_model.py (1)
  • OnnxTextModel (17-154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)

170-171: Replace assertion with explicit error handling.

The assertion can be silently disabled in optimized Python mode (-O flag). Since self.tokenizer can be None when lazy_load=True and load_onnx_model() hasn't been called, use an explicit exception instead:

-        assert self.tokenizer is not None
+        if self.tokenizer is None:
+            raise RuntimeError("Tokenizer not initialized. Call load_onnx_model() first or set lazy_load=False.")
         encoded = self.tokenizer.encode_batch(texts_query)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bef37cb and c9c0759.

📒 Files selected for processing (16)
  • fastembed/late_interaction/colbert.py (1 hunks)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)
  • fastembed/sparse/sparse_embedding_base.py (2 hunks)
  • fastembed/sparse/sparse_text_embedding.py (2 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
  • fastembed/text/onnx_embedding.py (3 hunks)
  • fastembed/text/onnx_text_model.py (1 hunks)
  • fastembed/text/text_embedding.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
  • tests/test_sparse_embeddings.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • fastembed/sparse/sparse_embedding_base.py
  • tests/test_text_onnx_embeddings.py
  • tests/test_sparse_embeddings.py
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
  • fastembed/sparse/sparse_text_embedding.py
  • fastembed/text/text_embedding_base.py
  • fastembed/late_interaction/late_interaction_text_embedding.py
🧰 Additional context used
🧬 Code graph analysis (9)
fastembed/text/text_embedding.py (3)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction_multimodal/colpali.py (4)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (8)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction/colbert.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (4)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/text/onnx_text_model.py (4)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/late_interaction/late_interaction_embedding_base.py (6)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/text/onnx_embedding.py (2)
fastembed/text/text_embedding_base.py (2)
  • TextEmbeddingBase (9-75)
  • tokenize (23-24)
fastembed/text/onnx_text_model.py (2)
  • OnnxTextModel (17-154)
  • tokenize (71-72)
fastembed/sparse/splade_pp.py (14)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-107)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
🪛 GitHub Actions: type-checkers
fastembed/late_interaction_multimodal/onnx_multimodal_model.py

[error] 84-84: mypy: Item "None" of "Any | None" has no attribute "encode_batch" [union-attr]


[error] 84-84: mypy: Found 1 error in 1 file (checked 57 source files)

🪛 Ruff (0.14.1)
fastembed/late_interaction_multimodal/colpali.py

163-163: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction_multimodal/onnx_multimodal_model.py

83-83: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction/colbert.py

83-83: Unused method argument: kwargs

(ARG002)

fastembed/text/onnx_text_model.py

71-71: Unused method argument: kwargs

(ARG002)

fastembed/sparse/splade_pp.py

152-152: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (10)
fastembed/sparse/splade_pp.py (1)

140-153: LGTM!

The tokenize implementation is correct with proper None check and kwargs forwarding. The past review concerns have been addressed.

fastembed/text/text_embedding.py (1)

166-177: LGTM!

The tokenize method correctly delegates to the underlying model with proper documentation. The implementation follows the pattern established across the codebase.

fastembed/late_interaction/late_interaction_embedding_base.py (2)

3-4: LGTM!

Import added to support the new tokenize method signature.


24-25: LGTM!

Abstract method correctly declares the tokenize interface for subclasses to implement.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)

3-3: LGTM!

Import added to support the new tokenize method signature.


25-26: LGTM!

Abstract method correctly declares the tokenize interface for subclasses to implement.

fastembed/text/onnx_embedding.py (2)

3-3: LGTM!

Import added to support the new tokenize method return type.


324-334: LGTM!

The tokenize method correctly delegates to the base OnnxTextModel.tokenize. The implementation is straightforward and well-documented.

fastembed/late_interaction/colbert.py (1)

83-88: LGTM!

The parameter rename from documents to texts maintains consistency with other tokenize methods. The type: ignore[override] is appropriate since this implementation has an additional is_doc parameter that intentionally differs from the base class signature.

fastembed/late_interaction_multimodal/colpali.py (1)

163-163: The unused kwargs parameter is required by the base class signature.

While static analysis flags **kwargs as unused, it's necessary to match the LateInteractionMultimodalEmbeddingBase.tokenize signature. This is not an issue.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
fastembed/sparse/bm25.py (1)

151-156: Move the inner class definition outside the method.

The SimpleEncoding class is currently defined inside the tokenize method, meaning it's recreated on every call. For better performance, especially with high-frequency tokenization, define it at the class level or module level (see suggested fix in previous comment).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 710b5b2 and 0c47f30.

📒 Files selected for processing (4)
  • fastembed/sparse/bm25.py (2 hunks)
  • fastembed/sparse/bm42.py (3 hunks)
  • fastembed/sparse/minicoil.py (3 hunks)
  • tests/test_sparse_embeddings.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • fastembed/sparse/minicoil.py
  • fastembed/sparse/bm42.py
  • tests/test_sparse_embeddings.py
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/sparse/bm25.py (2)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/sparse/utils/tokenizer.py (2)
  • tokenize (8-12)
  • tokenize (83-120)
🪛 Ruff (0.14.1)
fastembed/sparse/bm25.py

140-140: Unused method argument: kwargs

(ARG002)

🔇 Additional comments (1)
fastembed/sparse/bm25.py (1)

140-140: Unused kwargs parameter is acceptable for API consistency.

The static analysis tool flagged kwargs as unused, but this is likely intentional to maintain a consistent tokenize signature across all embedding classes (as seen in OnnxTextModel.tokenize). No action needed.

import mmh3
import numpy as np
from py_rust_stemmers import SnowballStemmer
from tokenizers import Encoding

⚠️ Potential issue | 🔴 Critical

Type annotation is incorrect and breaks the Encoding interface contract.

The method signature claims to return list[Encoding], but the custom SimpleEncoding class doesn't properly implement the Encoding interface:

  1. Encoding.ids should be list[int] (token IDs), but here self.ids = tokens assigns a list[str]
  2. The type: ignore comments on lines 157-158 suppress type checker errors that correctly identify this mismatch
  3. Downstream code expecting actual Encoding objects with integer IDs will fail at runtime

Since BM25 uses string tokens rather than learned vocabulary IDs, consider one of these approaches:

Option 1 (Recommended): Return a proper protocol or custom type:

+from typing import Protocol
+
+class TokenizationResult(Protocol):
+    """Protocol for tokenization results."""
+    tokens: list[str]
+    ids: list[str]  # For BM25, tokens serve as IDs
+    attention_mask: list[int]
+
+# Move SimpleEncoding to class level
+class SimpleEncoding:
+    def __init__(self, tokens: list[str]):
+        self.tokens = tokens
+        self.ids = tokens
+        self.attention_mask = [1] * len(tokens)
+
 class Bm25(SparseTextEmbeddingBase):
     ...
-    def tokenize(self, texts: list[str], **kwargs: Any) -> list[Encoding]:
+    def tokenize(self, texts: list[str], **kwargs: Any) -> list[SimpleEncoding]:
         """Tokenize texts using SimpleTokenizer.
         
-        Returns a list of simple Encoding-like objects with token strings.
+        Returns a list of SimpleEncoding objects with token strings.
         Note: BM25 uses a simple word tokenizer, not a learned tokenizer.
+        Unlike standard Encoding objects, ids contain string tokens, not integers.
         """
         result = []
         for text in texts:
             tokens = self.tokenizer.tokenize(text)
-
-            # Create a simple object that mimics Encoding interface
-            class SimpleEncoding:
-                def __init__(self, tokens: list[str]):
-                    self.tokens = tokens
-                    self.ids = tokens  # For BM25, tokens are the IDs
-                    self.attention_mask = [1] * len(tokens)
-
-            result.append(SimpleEncoding(tokens))  # type: ignore[arg-type]
-        return result  # type: ignore[return-value]
+            result.append(SimpleEncoding(tokens))
+        return result

Option 2: Hash tokens to integers to match Encoding.ids contract:

     def tokenize(self, texts: list[str], **kwargs: Any) -> list[Encoding]:
         """Tokenize texts using SimpleTokenizer and hash to integer IDs."""
         result = []
         for text in texts:
             tokens = self.tokenizer.tokenize(text)
             
             class SimpleEncoding:
                 def __init__(self, tokens: list[str]):
                     self.tokens = tokens
-                    self.ids = tokens
+                    self.ids = [Bm25.compute_token_id(t) for t in tokens]
                     self.attention_mask = [1] * len(tokens)
             
-            result.append(SimpleEncoding(tokens))  # type: ignore[arg-type]
-        return result  # type: ignore[return-value]
+            result.append(SimpleEncoding(tokens))
+        return result

Also applies to: 140-158

🤖 Prompt for AI Agents
In fastembed/sparse/bm25.py around lines 10 and 140-158, the function currently
claims to return list[Encoding] but the SimpleEncoding class assigns string
tokens to self.ids (violating Encoding.ids: list[int]) and suppresses type
errors with type: ignore; fix by not using tokenizers.Encoding as the return
type: define a lightweight custom Protocol or dataclass (e.g.,
SimpleTokenEncoding) with fields matching how BM25 uses them (tokens: list[str],
offsets, attention_mask, etc.), update the function signature to return
List[SimpleTokenEncoding], remove the type: ignore comments, and adjust any
downstream code to use the new type; alternatively, if you must keep the
Encoding type, convert/hash tokens to integer IDs and ensure Encoding.ids is
list[int] and all Encoding interface fields are implemented consistently.

Comment on lines 151 to 155
    class SimpleEncoding:
        def __init__(self, tokens: list[str]):
            self.tokens = tokens
            self.ids = tokens  # For BM25, tokens are the IDs
            self.attention_mask = [1] * len(tokens)

Member

not sure yet if we want to mimic Encoding here
but we definitely don't want to create a class in a loop


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
fastembed/late_interaction/colbert.py (1)

86-93: Consider forwarding kwargs for consistency.

The **kwargs parameter is declared but not forwarded to the internal _tokenize_documents and _tokenize_query methods. While these methods don't currently use kwargs, forwarding them would maintain consistency with other tokenize implementations in this PR and enable future customization without changing the signature.

If you'd like to forward kwargs for consistency, apply this diff:

 def _tokenize(
         self, documents: list[str], is_doc: bool = True, **kwargs: Any
     ) -> list[Encoding]:
         return (
-            self._tokenize_documents(documents=documents)
+            self._tokenize_documents(documents=documents, **kwargs)
             if is_doc
-            else self._tokenize_query(query=next(iter(documents)))
+            else self._tokenize_query(query=next(iter(documents)), **kwargs)
         )

And update the helper methods:

-def _tokenize_query(self, query: str) -> list[Encoding]:
+def _tokenize_query(self, query: str, **kwargs: Any) -> list[Encoding]:
     assert self.query_tokenizer is not None
     encoded = self.query_tokenizer.encode_batch([query])
     return encoded

-def _tokenize_documents(self, documents: list[str]) -> list[Encoding]:
+def _tokenize_documents(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
     encoded = self.tokenizer.encode_batch(documents)  # type: ignore[union-attr]
     return encoded
fastembed/sparse/sparse_embedding_base.py (1)

47-48: Add docstring and tests for the new public API method.

The tokenize method establishes a public API interface but lacks documentation. Consider adding a docstring that explains the expected parameters, return value structure, and intended usage pattern for when implementations become available.

Additionally, the PR checklist indicates no tests were added for this feature. While the current implementation is a stub, adding tests that verify the NotImplementedError behavior would establish a baseline for future implementations.

Apply this diff to add a docstring:

 def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
+    """
+    Tokenizes a list of documents.
+    
+    Args:
+        documents (list[str]): The list of documents to tokenize.
+        **kwargs: Additional keyword arguments for tokenization.
+    
+    Returns:
+        dict[str, Any]: Tokenized representation of the documents.
+    
+    Raises:
+        NotImplementedError: Tokenization is not yet implemented for sparse embeddings.
+    """
     raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c1e853 and 7bb1654.

📒 Files selected for processing (21)
  • fastembed/late_interaction/colbert.py (1 hunks)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunks)
  • fastembed/sparse/bm25.py (1 hunks)
  • fastembed/sparse/bm42.py (1 hunks)
  • fastembed/sparse/minicoil.py (1 hunks)
  • fastembed/sparse/sparse_embedding_base.py (1 hunks)
  • fastembed/sparse/sparse_text_embedding.py (2 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
  • fastembed/text/onnx_embedding.py (2 hunks)
  • fastembed/text/onnx_text_model.py (1 hunks)
  • fastembed/text/text_embedding.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
  • tests/test_late_interaction_embeddings.py (2 hunks)
  • tests/test_sparse_embeddings.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py
  • fastembed/sparse/splade_pp.py
  • tests/test_sparse_embeddings.py
  • fastembed/text/onnx_embedding.py
  • tests/test_text_onnx_embeddings.py
  • fastembed/sparse/bm42.py
  • fastembed/text/text_embedding_base.py
  • fastembed/sparse/bm25.py
  • tests/test_late_interaction_embeddings.py
  • fastembed/late_interaction/late_interaction_embedding_base.py
🧰 Additional context used
🧬 Code graph analysis (10)
fastembed/sparse/sparse_embedding_base.py (6)
fastembed/sparse/bm25.py (1)
  • tokenize (139-140)
fastembed/sparse/bm42.py (1)
  • tokenize (133-134)
fastembed/sparse/minicoil.py (1)
  • tokenize (141-142)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (95-96)
fastembed/sparse/splade_pp.py (1)
  • tokenize (139-140)
fastembed/sparse/utils/tokenizer.py (1)
  • tokenize (8-12)
fastembed/late_interaction/colbert.py (6)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (2)
  • tokenize (163-164)
  • _tokenize (166-175)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • _tokenize (83-86)
fastembed/text/onnx_text_model.py (1)
  • _tokenize (71-72)
fastembed/text/text_embedding.py (7)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (4)
fastembed/late_interaction/colbert.py (1)
  • _tokenize (86-93)
fastembed/late_interaction_multimodal/colpali.py (1)
  • _tokenize (166-175)
fastembed/text/onnx_text_model.py (1)
  • _tokenize (71-72)
fastembed/common/onnx_model.py (1)
  • OnnxOutputContext (20-23)
fastembed/late_interaction_multimodal/colpali.py (3)
fastembed/late_interaction/colbert.py (1)
  • _tokenize (86-93)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • _tokenize (83-86)
fastembed/text/onnx_text_model.py (1)
  • _tokenize (71-72)
fastembed/late_interaction/late_interaction_text_embedding.py (4)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/sparse/minicoil.py (5)
fastembed/sparse/bm25.py (1)
  • tokenize (139-140)
fastembed/sparse/bm42.py (1)
  • tokenize (133-134)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (47-48)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (95-96)
fastembed/sparse/splade_pp.py (1)
  • tokenize (139-140)
fastembed/sparse/sparse_text_embedding.py (2)
fastembed/common/model_description.py (1)
  • SparseModelDescription (44-46)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (47-48)
fastembed/text/onnx_text_model.py (4)
fastembed/late_interaction/colbert.py (1)
  • _tokenize (86-93)
fastembed/late_interaction_multimodal/colpali.py (1)
  • _tokenize (166-175)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • _tokenize (83-86)
fastembed/common/onnx_model.py (2)
  • onnx_embed (119-120)
  • OnnxOutputContext (20-23)
fastembed/rerank/cross_encoder/onnx_text_model.py (6)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
🪛 Ruff (0.14.3)
fastembed/late_interaction/colbert.py

87-87: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction_multimodal/onnx_multimodal_model.py

85-85: Avoid specifying long messages outside the exception class

(TRY003)

fastembed/late_interaction_multimodal/colpali.py

166-166: Unused method argument: kwargs

(ARG002)

fastembed/text/onnx_text_model.py

71-71: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (4)
fastembed/late_interaction/late_interaction_text_embedding.py (1)

119-130: LGTM!

The tokenize method implementation is clean and consistent with the pattern established across other embedding classes in this PR.

fastembed/rerank/cross_encoder/onnx_text_model.py (1)

48-49: LGTM!

The change properly forwards **kwargs to encode_batch, enabling customizable tokenization behavior for reranking pairs.
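
In sketch form, the forwarding amounts to something like this (the pair-based signature is assumed from the linked snippet, not quoted from it):

    def tokenize(self, pairs: list[tuple[str, str]], **kwargs: Any) -> list[Encoding]:
        return self.tokenizer.encode_batch(pairs, **kwargs)  # kwargs now reach encode_batch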

fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

83-86: LGTM!

The _tokenize method properly guards against uninitialized tokenizer and correctly forwards **kwargs to encode_batch.

fastembed/sparse/sparse_text_embedding.py (1)

1-7: The imports are not accidentally included.

The git diff shows they were reorganized from their original positions in main (where they were at the bottom of the imports section) to follow PEP 8 style conventions (stdlib imports first, then project imports). Both imports are actively used: warnings on line 67 for a deprecation warning, and SparseModelDescription on lines 48-49 in the _list_supported_models method type hints. This reorganization is a legitimate cleanup that accompanies the tokenize method addition.

Likely an incorrect or invalid review comment.

Comment on lines +166 to 175
     def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
         texts_query: list[str] = []
         for query in documents:
             query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
             query += "\n"

             texts_query.append(query)
-        encoded = self.tokenizer.encode_batch(texts_query)  # type: ignore[union-attr]
+        assert self.tokenizer is not None
+        encoded = self.tokenizer.encode_batch(texts_query)
         return encoded

⚠️ Potential issue | 🟠 Major

Forward kwargs to enable tokenization customization.

The **kwargs parameter is declared but not forwarded to encode_batch at line 174. This prevents callers from customizing tokenization behavior. Other implementations in this PR forward kwargs (e.g., onnx_multimodal_model.py).

Apply this diff:

 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
     texts_query: list[str] = []
     for query in documents:
         query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
         query += "\n"

         texts_query.append(query)
     assert self.tokenizer is not None
-    encoded = self.tokenizer.encode_batch(texts_query)
+    encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
     return encoded
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
         texts_query: list[str] = []
         for query in documents:
             query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
             query += "\n"
             texts_query.append(query)
-        encoded = self.tokenizer.encode_batch(texts_query)  # type: ignore[union-attr]
-        assert self.tokenizer is not None
-        encoded = self.tokenizer.encode_batch(texts_query)
+        assert self.tokenizer is not None
+        encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
         return encoded
🧰 Tools
🪛 Ruff (0.14.3)

166-166: Unused method argument: kwargs

(ARG002)

🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colpali.py around lines 166 to 175, the
method _tokenize accepts **kwargs but does not pass them to
self.tokenizer.encode_batch; update the call to forward kwargs (e.g.,
self.tokenizer.encode_batch(texts_query, **kwargs)) while keeping the existing
assertion that tokenizer is not None so callers can customize tokenization
options like truncation, padding, or return_tensors.
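
For context on what forwarding buys: tokenizers' encode_batch accepts switches such as add_special_tokens, so once **kwargs flow through, callers can toggle behavior per call. A small standalone sketch (the model name is just an example pulled from the HF hub, unrelated to this PR's code):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
docs = ["hello world", "fastembed exposes tokenize"]

# add_special_tokens is one of the options **kwargs forwarding unlocks.
with_special = tokenizer.encode_batch(docs, add_special_tokens=True)
without_special = tokenizer.encode_batch(docs, add_special_tokens=False)

# BERT-style tokenizers wrap a single sequence in [CLS]/[SEP],
# so the special-token variant is exactly two ids longer.
assert len(with_special[0].ids) == len(without_special[0].ids) + 2
```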

Comment on lines +95 to +96
def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")

🛠️ Refactor suggestion | 🟠 Major

Remove duplicate implementation or delegate to self.model.

The tokenize method duplicates the base class implementation exactly. Since SparseTextEmbedding follows a delegation pattern for other methods (embed delegates to self.model.embed(), query_embed delegates to self.model.query_embed()), the tokenize method should either:

  1. Preferred: Delegate to self.model.tokenize() for consistency with the existing pattern, allowing concrete implementations to provide their own tokenization logic in the future.
  2. Alternative: Remove this override entirely and inherit the base class implementation.

The current implementation breaks the delegation pattern and creates unnecessary duplication.

Option 1 (Preferred): Delegate to self.model

 def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
-    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
+    """
+    Tokenizes a list of documents.
+    
+    Args:
+        documents (list[str]): The list of documents to tokenize.
+        **kwargs: Additional keyword arguments for tokenization.
+    
+    Returns:
+        dict[str, Any]: Tokenized representation of the documents.
+    """
+    return self.model.tokenize(documents, **kwargs)

Option 2 (Alternative): Remove the override

-def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
-    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
-

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In fastembed/sparse/sparse_text_embedding.py around lines 95-96, the tokenize
override duplicates the base class implementation and breaks the delegation
pattern; change the method to delegate to the underlying model by returning
self.model.tokenize(documents, **kwargs) (preferred), or remove this override
entirely so the class inherits the base implementation; ensure to pass through
kwargs and return the result unchanged.
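
To make the delegation pattern concrete, a reduced sketch — the class names mirror the discussion, but this is not the library's code:

```python
from typing import Any


class _SparseBackend:
    """Stand-in for the concrete model stored in self.model."""

    def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
        raise NotImplementedError(
            "Tokenize method for sparse embeddings is not implemented yet."
        )


class SparseTextEmbeddingSketch:
    def __init__(self) -> None:
        self.model = _SparseBackend()

    def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
        # Delegate like embed/query_embed do; a future backend can then
        # provide real tokenization without touching this class.
        return self.model.tokenize(documents, **kwargs)
```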

Comment on lines +71 to +72
def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]

⚠️ Potential issue | 🟠 Major

Forward kwargs to enable tokenization customization.

The **kwargs parameter is declared but not forwarded to encode_batch. This breaks customization options that callers might pass through the tokenization chain. Other implementations in this PR (e.g., onnx_multimodal_model.py, onnx_text_model.py in rerank) forward kwargs.

Apply this diff:

 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]
+    return self.tokenizer.encode_batch(documents, **kwargs)  # type:ignore[union-attr]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]
+    return self.tokenizer.encode_batch(documents, **kwargs)  # type:ignore[union-attr]
🧰 Tools
🪛 Ruff (0.14.3)

71-71: Unused method argument: kwargs

(ARG002)

🤖 Prompt for AI Agents
In fastembed/text/onnx_text_model.py around lines 71 to 72, the _tokenize method
accepts **kwargs but does not forward them to tokenizer.encode_batch; update the
return to call self.tokenizer.encode_batch(documents, **kwargs) (preserving the
existing type-ignore if needed) so callers can customize tokenization options
and match other implementations in this PR.
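
A quick end-to-end check, assuming this branch is installed and the diff above is applied. The model name is one of fastembed's supported models, and whether kwargs propagate from the public tokenize down to _tokenize is an assumption about this PR's wiring:

```python
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
docs = ["kwargs should reach encode_batch"]

default = model.tokenize(docs)
trimmed = model.tokenize(docs, add_special_tokens=False)  # forwarded kwarg

# With the fix, the flag reaches encode_batch and strips [CLS]/[SEP].
assert len(trimmed[0].ids) < len(default[0].ids)
```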

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
fastembed/text/text_embedding_base.py (1)

23-24: Consider alignment with existing abstract method patterns.

The new tokenize method includes a descriptive message in NotImplementedError, while other abstract methods in this class (lines 33, 70) raise NotImplementedError() without a message. For consistency, consider either:

  • Removing the message: raise NotImplementedError()
  • Adding the message to other abstract methods

Additionally, consider adding a docstring to document the expected behavior for subclasses, similar to the docstrings in concrete implementations like TextEmbedding.tokenize (fastembed/text/text_embedding.py, lines 165-176).
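
A sketch of the standardized shape the nitpick points toward — bare NotImplementedError() plus a contract docstring. Illustrative only, not the file's actual contents:

```python
from typing import Any

from tokenizers import Encoding


class TextEmbeddingBaseSketch:
    def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
        """Tokenize documents into tokenizers.Encoding objects.

        The base class only defines the contract; concrete embeddings
        override this with their model-specific tokenizer.

        Args:
            documents (list[str]): The documents to tokenize.
            **kwargs: Implementation-specific tokenization options.

        Returns:
            list[Encoding]: One Encoding per input document.
        """
        raise NotImplementedError()
```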

fastembed/late_interaction/late_interaction_embedding_base.py (1)

24-25: Consider alignment with existing abstract method patterns.

Similar to the comment on TextEmbeddingBase, the new tokenize method includes a descriptive message in NotImplementedError, while the embed method (line 34) raises NotImplementedError() without a message. Consider standardizing the approach across abstract methods in this class.

Additionally, consider adding a docstring to document the expected behavior for subclasses, matching the pattern used in concrete implementations like LateInteractionTextEmbedding.tokenize (fastembed/late_interaction/late_interaction_text_embedding.py, lines 118-129).

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

25-26: Consider alignment with existing patterns and verify test coverage.

Similar to the other base classes, the new tokenize method includes a descriptive message in NotImplementedError, while embed_text and embed_image methods (lines 50, 72) raise NotImplementedError() without messages. Consider standardizing the approach.

Additionally, consider adding a docstring matching the pattern in LateInteractionMultimodalEmbedding.tokenize (fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py, lines 121-132).

Test coverage: The PR checklist indicates "Added tests for the feature: ⛔", but the AI summary mentions tests were added in tests/test_late_interaction_embeddings.py and tests/test_late_interaction_multimodal.py. Please verify that the new public tokenize API has adequate test coverage across all embedding types.
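
One possible shape for that coverage, sketched here since the test files themselves aren't shown in this review (the model choice and exact assertions are assumptions):

```python
from fastembed import TextEmbedding


def test_tokenize_returns_one_encoding_per_document() -> None:
    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
    docs = ["hello world", "fastembed exposes tokenize"]

    encodings = model.tokenize(docs)

    assert len(encodings) == len(docs)
    # Each Encoding should carry non-empty token ids for its document.
    assert all(len(encoding.ids) > 0 for encoding in encodings)
```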

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bb1654 and fe6d0dd.

📒 Files selected for processing (3)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
fastembed/text/text_embedding_base.py (9)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (6)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_embedding_base.py (7)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (3)
fastembed/text/text_embedding_base.py (1)

2-2: LGTM! Import is necessary for the type hint.

The Encoding import from tokenizers is correctly added to support the return type annotation of the new tokenize method.

fastembed/late_interaction/late_interaction_embedding_base.py (1)

3-3: LGTM! Import is necessary for the type hint.

The Encoding import from tokenizers is correctly added to support the return type annotation of the new tokenize method.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

3-3: LGTM! Import is necessary for the type hint.

The Encoding import from tokenizers is correctly added to support the return type annotation of the new tokenize method.

@joein
Member

joein commented Nov 10, 2025

We've decided not to expose tokenize, but just provide token_count
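
For readers following the thread: a hypothetical sketch of a token_count built on top of the private _tokenize — the name and signature are guesses at the direction, not the API that #583 actually shipped:

```python
from typing import Any

from tokenizers import Encoding


def token_count(model: Any, documents: list[str], **kwargs: Any) -> list[int]:
    """Hypothetical helper: non-padded token counts per document."""
    encodings: list[Encoding] = model._tokenize(documents, **kwargs)
    # Summing the attention mask ignores padding tokens, matching the
    # "non-padded token counts" wording in the commit messages below.
    return [sum(encoding.attention_mask) for encoding in encodings]
```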

@dancixx dancixx force-pushed the feat/expose-tokenize-method branch from fe6d0dd to 683eba2 on November 18, 2025 13:08

Commit messages from the timeline (titles truncated):
  • …ed token counts; raise NotImplemented for sparse
  • …non-padded token counts; raise NotImplemented for sparse"

    This reverts commit a26aad6.
@joein
Member

joein commented Dec 7, 2025

closing in favour of #583

@joein joein closed this Dec 7, 2025