
Conversation

@dancixx dancixx commented Oct 21, 2025

Expose tokenize method instead of using embedding_model.model.tokenize.
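
Illustrative usage, as a rough sketch (the model name is just an example, not part of this PR):

    from fastembed import TextEmbedding

    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

    # before: reach through the wrapper into the underlying model
    encodings = model.model.tokenize(["hello world"])

    # after: call the public method on the wrapper directly
    encodings = model.tokenize("hello world")  # a single string is normalized to a list
    print(encodings[0].ids)  # token ids from the returned tokenizers.Encoding objects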

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • Does your submission pass the existing tests?
  • Have you added tests for your feature?
  • Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

  • Have you added an explanation of why it's important to include this model?
  • Have you added tests for the new model? Were canonical values for tests computed via the original model?
  • Have you added the code snippet for how canonical values were computed?
  • Have you successfully run tests with your changes locally?

@dancixx dancixx requested a review from joein October 21, 2025 15:37
coderabbitai bot commented Oct 21, 2025

📝 Walkthrough

Walkthrough

This PR adds a public tokenize API across multiple embedding classes and their bases. Base classes (TextEmbeddingBase, LateInteractionTextEmbeddingBase, LateInteractionMultimodalEmbeddingBase, SparseTextEmbeddingBase) receive a tokenize method (raising NotImplementedError). Dense/late-interaction/onnx implementations delegate to underlying model tokenizers and return list[Encoding]. Sparse implementations add tokenize stubs returning dict[str, Any] and currently raise NotImplementedError. Several model-level methods were renamed/redispatched (e.g., public tokenize → private _tokenize in some ONNX/multimodal modules). Tests for tokenize were added for multiple model types.
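
In sketch form, the shape described above (abbreviated, not the verbatim source; the sparse error message matches the stub quoted later in this review):

    from typing import Any
    from tokenizers import Encoding

    class TextEmbeddingBase:
        # dense base classes declare the interface and raise by default
        def tokenize(self, documents, **kwargs: Any) -> list[Encoding]:
            raise NotImplementedError()

    class SparseTextEmbeddingBase:
        # sparse bases stub the method with a dict[str, Any] return type
        def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
            raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")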

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas needing extra attention:
    • ColBERT refactor: removal of is_doc from public signature and addition of _tokenize delegator.
    • Consistency of return types between dense (list[Encoding]) and sparse (dict[str, Any]) tokenizers and callers.
    • Visibility changes: public tokenize → private _tokenize in ONNX / multimodal modules and updated call sites.
    • Tests added for tokenization behavior (validate CI cleanup and model-specific assumptions).
    • New NotImplementedError stubs in sparse classes (ensure callers expect/handle this).

Possibly related PRs

  • MiniCOIL v1 #513 — Modifies MiniCOIL implementation; relates to this PR which added a tokenize stub to minicoil.py.

Suggested reviewers

  • generall

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 22.81%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately describes the main feature being added: a public tokenize method exposed on the TextEmbedding class instead of requiring access to the underlying model.
  • Description check: ✅ Passed. The description clearly states the purpose of the PR - exposing a public tokenize method on the TextEmbedding class rather than using embedding_model.model.tokenize - and includes relevant submission checklist items.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ba1f605 and 221a331.

📒 Files selected for processing (2)
  • fastembed/text/text_embedding.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/text/text_embedding.py (5)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-171)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/text/text_embedding_base.py (5)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-171)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (4)
fastembed/text/text_embedding_base.py (2)

2-2: LGTM: Import is correct.

The Encoding type from the tokenizers library is the appropriate return type for tokenization methods.


23-24: LGTM: Abstract method design is appropriate.

The abstract tokenize method properly defines the interface contract for subclasses. The flexible signature accepting both single strings and iterables with **kwargs is well-designed.

fastembed/text/text_embedding.py (2)

3-3: LGTM: Import is consistent.

The import matches the base class and is correct for the return type annotation.


166-179: Implementation is correct; all registered models have tokenize method available.

Verification confirms that all six models in EMBEDDINGS_REGISTRY (OnnxTextEmbedding, CLIPOnnxEmbedding, PooledEmbedding, PooledNormalizedEmbedding, JinaEmbeddingV3, CustomTextEmbedding) inherit directly or indirectly from OnnxTextEmbedding, which implements the tokenize method. The delegation pattern and string normalization logic are sound.
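
For illustration, the delegation and string-normalization pattern verified here looks roughly like this (a sketch, not the verbatim source):

    from typing import Any, Iterable, Union
    from tokenizers import Encoding

    class TextEmbedding:
        # self.model is the concrete implementation picked from EMBEDDINGS_REGISTRY
        def tokenize(self, documents: Union[str, Iterable[str]], **kwargs: Any) -> list[Encoding]:
            if isinstance(documents, str):
                documents = [documents]  # normalize a single string to a list
            return self.model.tokenize(list(documents), **kwargs)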

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 221a331 and 555a362.

📒 Files selected for processing (11)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/sparse/sparse_embedding_base.py (2 hunks)
  • fastembed/sparse/sparse_text_embedding.py (2 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
  • tests/test_late_interaction_embeddings.py (1 hunks)
  • tests/test_late_interaction_multimodal.py (1 hunks)
  • tests/test_sparse_embeddings.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/test_late_interaction_multimodal.py
🧰 Additional context used
🧬 Code graph analysis (10)
tests/test_late_interaction_embeddings.py (3)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
tests/utils.py (1)
  • delete_model_cache (11-39)
fastembed/late_interaction/late_interaction_text_embedding.py (4)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/sparse/sparse_embedding_base.py (8)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/sparse/splade_pp.py (3)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction/late_interaction_embedding_base.py (3)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
tests/test_sparse_embeddings.py (5)
tests/test_late_interaction_embeddings.py (1)
  • test_tokenize (298-315)
tests/test_text_onnx_embeddings.py (1)
  • test_tokenize (181-198)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
tests/utils.py (1)
  • delete_model_cache (11-39)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-135)
tests/test_text_onnx_embeddings.py (6)
tests/test_late_interaction_embeddings.py (1)
  • test_tokenize (298-315)
tests/test_late_interaction_multimodal.py (1)
  • test_tokenize (107-123)
tests/test_sparse_embeddings.py (1)
  • test_tokenize (250-267)
fastembed/text/text_embedding.py (2)
  • TextEmbedding (17-230)
  • tokenize (166-179)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
tests/utils.py (1)
  • delete_model_cache (11-39)
fastembed/sparse/sparse_text_embedding.py (5)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-132)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
fastembed/text/text_embedding.py (1)
  • tokenize (166-179)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-171)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
🪛 GitHub Actions: type-checkers
fastembed/sparse/splade_pp.py

[error] 140-140: Signature of "tokenize" incompatible with supertype "OnnxTextModel"


[error] 153-153: Item "None" of "Any | None" has no attribute "encode_batch"

🪛 Ruff (0.14.1)
fastembed/sparse/splade_pp.py

140-140: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
🔇 Additional comments (11)
tests/test_text_onnx_embeddings.py (1)

180-198: LGTM!

The test correctly validates the tokenize method for both single string and list inputs, ensuring encodings have non-empty ids.
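
The test described is roughly of this shape (a sketch; model name and inputs are illustrative):

    from fastembed import TextEmbedding

    def test_tokenize():
        model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
        single = model.tokenize("hello world")
        batch = model.tokenize(["hello world", "foo bar"])
        assert len(single) == 1 and len(batch) == 2
        assert all(len(encoding.ids) > 0 for encoding in single + batch)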

fastembed/late_interaction/late_interaction_embedding_base.py (1)

3-3: LGTM!

Correctly adds abstract tokenize interface to the base class, establishing the contract for subclasses.

Also applies to: 24-25

tests/test_sparse_embeddings.py (1)

249-267: LGTM!

The test correctly validates the tokenize method for sparse embeddings with both single string and list inputs.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

3-3: LGTM!

Correctly adds abstract tokenize interface to the multimodal embedding base class.

Also applies to: 25-26

fastembed/late_interaction/late_interaction_text_embedding.py (1)

4-4: LGTM!

The tokenize implementation correctly normalizes input and delegates to the underlying model. Documentation is clear and complete.

Also applies to: 119-132

tests/test_late_interaction_embeddings.py (1)

297-315: LGTM!

The test correctly validates the tokenize method for late interaction embeddings, including the model-specific is_doc parameter for document vs query tokenization.
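
A sketch of what such a test exercises (model name and texts are illustrative; is_doc is the ColBERT-specific flag mentioned above):

    from fastembed import LateInteractionTextEmbedding

    model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
    doc_encodings = model.tokenize(["a passage to index"])                # document tokenizer path
    query_encodings = model.tokenize(["what is colbert?"], is_doc=False)  # query tokenizer path
    assert all(len(e.ids) > 0 for e in doc_encodings + query_encodings)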

fastembed/sparse/sparse_embedding_base.py (1)

6-6: LGTM!

Correctly adds abstract tokenize interface to the sparse embedding base class, maintaining consistency across embedding types.

Also applies to: 48-49

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

4-4: LGTM!

The tokenize implementation correctly follows the established pattern: normalizes input and delegates to the underlying model with proper documentation.

Also applies to: 122-135

fastembed/sparse/sparse_text_embedding.py (2)

4-4: LGTM!

The import is necessary for the tokenize method's return type annotation.


96-109: LGTM!

The implementation correctly delegates to the underlying model's tokenize method and follows the same pattern as other embedding classes.

fastembed/sparse/splade_pp.py (1)

4-4: LGTM!

The import is necessary for the tokenize method's return type annotation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fastembed/late_interaction/colbert.py (1)

83-88: Pass **kwargs through to tokenizer methods.

The **kwargs parameter is declared but never passed to the underlying tokenization methods (_tokenize_documents and _tokenize_query), which in turn don't pass them to encode_batch. This breaks the documented interface contract that allows callers to pass additional arguments to the tokenizer.

Apply this diff to pass kwargs through the call chain:

 def tokenize(self, documents: list[str], is_doc: bool = True, **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
     return (
-        self._tokenize_documents(documents=documents)
+        self._tokenize_documents(documents=documents, **kwargs)
         if is_doc
-        else self._tokenize_query(query=next(iter(documents)))
+        else self._tokenize_query(query=next(iter(documents)), **kwargs)
     )

Then update the helper methods:

-def _tokenize_query(self, query: str) -> list[Encoding]:
+def _tokenize_query(self, query: str, **kwargs: Any) -> list[Encoding]:
     assert self.query_tokenizer is not None
-    encoded = self.query_tokenizer.encode_batch([query])
+    encoded = self.query_tokenizer.encode_batch([query], **kwargs)
     return encoded

-def _tokenize_documents(self, documents: list[str]) -> list[Encoding]:
+def _tokenize_documents(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    encoded = self.tokenizer.encode_batch(documents)  # type: ignore[union-attr]
+    encoded = self.tokenizer.encode_batch(documents, **kwargs)  # type: ignore[union-attr]
     return encoded
♻️ Duplicate comments (1)
fastembed/sparse/splade_pp.py (1)

140-156: Critical issues from previous review remain unaddressed.

The past review feedback has not been fully addressed:

  1. Critical: **kwargs is still not passed to tokenizer.encode_batch() (line 156), breaking the documented interface that states "Additional arguments passed to the tokenizer".

  2. Major: The signature still doesn't match the supertype OnnxTextModel, which expects def tokenize(self, documents: list[str], **kwargs: Any). The type: ignore[override] suppresses the error but doesn't fix the design issue. Per the architecture shown in relevant snippets, the model class (SpladePP) should match OnnxTextModel's signature with list[str], while the wrapper class (SparseTextEmbedding) handles the Union[str, Iterable[str]] normalization.

  3. Minor: Using assert for the None check (line 155) can be bypassed with Python's -O flag. A RuntimeError would be more robust.

Apply this diff to fix all issues (as suggested in the previous review):

-    def tokenize(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
+    def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
         """
         Tokenize input texts using the model's tokenizer.
 
         Args:
-            texts: String or list of strings to tokenize
+            documents: List of strings to tokenize
             **kwargs: Additional arguments passed to the tokenizer
 
         Returns:
             List of tokenizer Encodings
         """
-        if isinstance(texts, str):
-            texts = [texts]
-        if not isinstance(texts, list):
-            texts = list(texts)
-        assert self.tokenizer is not None
-        return self.tokenizer.encode_batch(texts)
+        if self.tokenizer is None:
+            raise RuntimeError("Tokenizer not initialized")
+        return self.tokenizer.encode_batch(documents, **kwargs)

Note: The single-string normalization logic is already handled by SparseTextEmbedding.tokenize (lines 105-108 in relevant snippets), so removing it here maintains proper separation of concerns.

🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)

163-172: Consider documenting or utilizing the kwargs parameter.

The **kwargs parameter is unused in this implementation, though it appears in the standardized interface. Consider either:

  1. Forwarding kwargs to the tokenizer if supported:
     encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
  2. Adding a docstring to clarify the parameter is reserved for future use or API consistency:
def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:  # type: ignore[override]
    """
    Tokenize input documents for ColPali queries.
    
    Args:
        documents: List of text queries to tokenize
        **kwargs: Reserved for future use (currently unused)
        
    Returns:
        List of tokenizer Encodings
    """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 555a362 and 514735f.

📒 Files selected for processing (3)
  • fastembed/late_interaction/colbert.py (1 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (1 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
fastembed/late_interaction_multimodal/colpali.py (5)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-156)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction/colbert.py (3)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-156)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/sparse/splade_pp.py (3)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-109)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
🪛 Ruff (0.14.1)
fastembed/late_interaction_multimodal/colpali.py

163-163: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction/colbert.py

83-83: Unused method argument: kwargs

(ARG002)

fastembed/sparse/splade_pp.py

140-140: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (2)
fastembed/sparse/splade_pp.py (1)

4-4: LGTM!

The Encoding import is correctly added to support the new tokenize method's return type annotation.

fastembed/late_interaction_multimodal/colpali.py (1)

4-4: LGTM! Type annotations are correct.

The Encoding import and method signature with type: ignore[override] properly align with the standardized tokenize interface being introduced across embedding classes.

Also applies to: 163-163

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 514735f and b8c0eef.

📒 Files selected for processing (4)
  • fastembed/late_interaction/token_embeddings.py (1 hunks)
  • fastembed/sparse/bm42.py (1 hunks)
  • fastembed/sparse/minicoil.py (1 hunks)
  • fastembed/text/onnx_embedding.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
fastembed/late_interaction/token_embeddings.py (2)
fastembed/text/onnx_embedding.py (1)
  • OnnxTextEmbedding (186-330)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • LateInteractionTextEmbeddingBase (10-76)
fastembed/sparse/bm42.py (2)
fastembed/sparse/sparse_embedding_base.py (2)
  • SparseTextEmbeddingBase (35-92)
  • SparseEmbedding (14-32)
fastembed/text/onnx_text_model.py (1)
  • OnnxTextModel (17-154)
fastembed/sparse/minicoil.py (2)
fastembed/sparse/sparse_embedding_base.py (2)
  • SparseTextEmbeddingBase (35-92)
  • SparseEmbedding (14-32)
fastembed/text/onnx_text_model.py (1)
  • OnnxTextModel (17-154)
fastembed/text/onnx_embedding.py (2)
fastembed/text/text_embedding_base.py (1)
  • TextEmbeddingBase (9-75)
fastembed/text/onnx_text_model.py (1)
  • OnnxTextModel (17-154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)

170-171: Replace assertion with explicit error handling.

The assertion can be silently disabled in optimized Python mode (-O flag). Since self.tokenizer can be None when lazy_load=True and load_onnx_model() hasn't been called, use an explicit exception instead:

-        assert self.tokenizer is not None
+        if self.tokenizer is None:
+            raise RuntimeError("Tokenizer not initialized. Call load_onnx_model() first or set lazy_load=False.")
         encoded = self.tokenizer.encode_batch(texts_query)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bef37cb and c9c0759.

📒 Files selected for processing (16)
  • fastembed/late_interaction/colbert.py (1 hunks)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)
  • fastembed/sparse/sparse_embedding_base.py (2 hunks)
  • fastembed/sparse/sparse_text_embedding.py (2 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
  • fastembed/text/onnx_embedding.py (3 hunks)
  • fastembed/text/onnx_text_model.py (1 hunks)
  • fastembed/text/text_embedding.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
  • tests/test_sparse_embeddings.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • fastembed/sparse/sparse_embedding_base.py
  • tests/test_text_onnx_embeddings.py
  • tests/test_sparse_embeddings.py
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
  • fastembed/sparse/sparse_text_embedding.py
  • fastembed/text/text_embedding_base.py
  • fastembed/late_interaction/late_interaction_text_embedding.py
🧰 Additional context used
🧬 Code graph analysis (9)
fastembed/text/text_embedding.py (3)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction_multimodal/colpali.py (4)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/sparse/splade_pp.py (1)
  • tokenize (140-153)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (8)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/late_interaction/colbert.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (4)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/text/onnx_text_model.py (4)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/late_interaction/late_interaction_embedding_base.py (6)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/text/onnx_embedding.py (2)
fastembed/text/text_embedding_base.py (2)
  • TextEmbeddingBase (9-75)
  • tokenize (23-24)
fastembed/text/onnx_text_model.py (2)
  • OnnxTextModel (17-154)
  • tokenize (71-72)
fastembed/sparse/splade_pp.py (14)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-88)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • tokenize (83-84)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (48-49)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (96-107)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-334)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
🪛 GitHub Actions: type-checkers
fastembed/late_interaction_multimodal/onnx_multimodal_model.py

[error] 84-84: mypy: Item "None" of "Any | None" has no attribute "encode_batch" [union-attr]


[error] 84-84: mypy: Found 1 error in 1 file (checked 57 source files)

🪛 Ruff (0.14.1)
fastembed/late_interaction_multimodal/colpali.py

163-163: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction_multimodal/onnx_multimodal_model.py

83-83: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction/colbert.py

83-83: Unused method argument: kwargs

(ARG002)

fastembed/text/onnx_text_model.py

71-71: Unused method argument: kwargs

(ARG002)

fastembed/sparse/splade_pp.py

152-152: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (10)
fastembed/sparse/splade_pp.py (1)

140-153: LGTM!

The tokenize implementation is correct with proper None check and kwargs forwarding. The past review concerns have been addressed.

fastembed/text/text_embedding.py (1)

166-177: LGTM!

The tokenize method correctly delegates to the underlying model with proper documentation. The implementation follows the pattern established across the codebase.

fastembed/late_interaction/late_interaction_embedding_base.py (2)

3-4: LGTM!

Import added to support the new tokenize method signature.


24-25: LGTM!

Abstract method correctly declares the tokenize interface for subclasses to implement.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)

3-3: LGTM!

Import added to support the new tokenize method signature.


25-26: LGTM!

Abstract method correctly declares the tokenize interface for subclasses to implement.

fastembed/text/onnx_embedding.py (2)

3-3: LGTM!

Import added to support the new tokenize method return type.


324-334: LGTM!

The tokenize method correctly delegates to the base OnnxTextModel.tokenize. The implementation is straightforward and well-documented.

fastembed/late_interaction/colbert.py (1)

83-88: LGTM!

The parameter rename from documents to texts maintains consistency with other tokenize methods. The type: ignore[override] is appropriate since this implementation has an additional is_doc parameter that intentionally differs from the base class signature.

fastembed/late_interaction_multimodal/colpali.py (1)

163-163: The unused kwargs parameter is required by the base class signature.

While static analysis flags **kwargs as unused, it's necessary to match the LateInteractionMultimodalEmbeddingBase.tokenize signature. This is not an issue.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
fastembed/sparse/bm25.py (1)

151-156: Move the inner class definition outside the method.

The SimpleEncoding class is currently defined inside the tokenize method, meaning it's recreated on every call. For better performance, especially with high-frequency tokenization, define it at the class level or module level (see suggested fix in previous comment).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 710b5b2 and 0c47f30.

📒 Files selected for processing (4)
  • fastembed/sparse/bm25.py (2 hunks)
  • fastembed/sparse/bm42.py (3 hunks)
  • fastembed/sparse/minicoil.py (3 hunks)
  • tests/test_sparse_embeddings.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • fastembed/sparse/minicoil.py
  • fastembed/sparse/bm42.py
  • tests/test_sparse_embeddings.py
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/sparse/bm25.py (2)
fastembed/text/onnx_text_model.py (1)
  • tokenize (71-72)
fastembed/sparse/utils/tokenizer.py (2)
  • tokenize (8-12)
  • tokenize (83-120)
🪛 Ruff (0.14.1)
fastembed/sparse/bm25.py

140-140: Unused method argument: kwargs

(ARG002)

🔇 Additional comments (1)
fastembed/sparse/bm25.py (1)

140-140: Unused kwargs parameter is acceptable for API consistency.

The static analysis tool flagged kwargs as unused, but this is likely intentional to maintain a consistent tokenize signature across all embedding classes (as seen in OnnxTextModel.tokenize). No action needed.

import mmh3
import numpy as np
from py_rust_stemmers import SnowballStemmer
from tokenizers import Encoding

⚠️ Potential issue | 🔴 Critical

Type annotation is incorrect and breaks the Encoding interface contract.

The method signature claims to return list[Encoding], but the custom SimpleEncoding class doesn't properly implement the Encoding interface:

  1. Encoding.ids should be list[int] (token IDs), but here self.ids = tokens assigns a list[str]
  2. The type: ignore comments on lines 157-158 suppress type checker errors that correctly identify this mismatch
  3. Downstream code expecting actual Encoding objects with integer IDs will fail at runtime

Since BM25 uses string tokens rather than learned vocabulary IDs, consider one of these approaches:

Option 1 (Recommended): Return a proper protocol or custom type:

+from typing import Protocol
+
+class TokenizationResult(Protocol):
+    """Protocol for tokenization results."""
+    tokens: list[str]
+    ids: list[str]  # For BM25, tokens serve as IDs
+    attention_mask: list[int]
+
+# Move SimpleEncoding to class level
+class SimpleEncoding:
+    def __init__(self, tokens: list[str]):
+        self.tokens = tokens
+        self.ids = tokens
+        self.attention_mask = [1] * len(tokens)
+
 class Bm25(SparseTextEmbeddingBase):
     ...
-    def tokenize(self, texts: list[str], **kwargs: Any) -> list[Encoding]:
+    def tokenize(self, texts: list[str], **kwargs: Any) -> list[SimpleEncoding]:
         """Tokenize texts using SimpleTokenizer.
         
-        Returns a list of simple Encoding-like objects with token strings.
+        Returns a list of SimpleEncoding objects with token strings.
         Note: BM25 uses a simple word tokenizer, not a learned tokenizer.
+        Unlike standard Encoding objects, ids contain string tokens, not integers.
         """
         result = []
         for text in texts:
             tokens = self.tokenizer.tokenize(text)
-
-            # Create a simple object that mimics Encoding interface
-            class SimpleEncoding:
-                def __init__(self, tokens: list[str]):
-                    self.tokens = tokens
-                    self.ids = tokens  # For BM25, tokens are the IDs
-                    self.attention_mask = [1] * len(tokens)
-
-            result.append(SimpleEncoding(tokens))  # type: ignore[arg-type]
-        return result  # type: ignore[return-value]
+            result.append(SimpleEncoding(tokens))
+        return result

Option 2: Hash tokens to integers to match Encoding.ids contract:

     def tokenize(self, texts: list[str], **kwargs: Any) -> list[Encoding]:
         """Tokenize texts using SimpleTokenizer and hash to integer IDs."""
         result = []
         for text in texts:
             tokens = self.tokenizer.tokenize(text)
             
             class SimpleEncoding:
                 def __init__(self, tokens: list[str]):
                     self.tokens = tokens
-                    self.ids = tokens
+                    self.ids = [Bm25.compute_token_id(t) for t in tokens]
                     self.attention_mask = [1] * len(tokens)
             
-            result.append(SimpleEncoding(tokens))  # type: ignore[arg-type]
-        return result  # type: ignore[return-value]
+            result.append(SimpleEncoding(tokens))
+        return result

Also applies to: 140-158

🤖 Prompt for AI Agents
In fastembed/sparse/bm25.py around lines 10 and 140-158, the function currently
claims to return list[Encoding] but the SimpleEncoding class assigns string
tokens to self.ids (violating Encoding.ids: list[int]) and suppresses type
errors with type: ignore; fix by not using tokenizers.Encoding as the return
type: define a lightweight custom Protocol or dataclass (e.g.,
SimpleTokenEncoding) with fields matching how BM25 uses them (tokens: list[str],
offsets, attention_mask, etc.), update the function signature to return
List[SimpleTokenEncoding], remove the type: ignore comments, and adjust any
downstream code to use the new type; alternatively, if you must keep the
Encoding type, convert/hash tokens to integer IDs and ensure Encoding.ids is
list[int] and all Encoding interface fields are implemented consistently.

Comment on lines 151 to 155
    class SimpleEncoding:
        def __init__(self, tokens: list[str]):
            self.tokens = tokens
            self.ids = tokens  # For BM25, tokens are the IDs
            self.attention_mask = [1] * len(tokens)

Member

not sure yet if we want to mimic Encoding here
but we definitely don't want to create a class in a loop


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
fastembed/late_interaction/colbert.py (1)

86-93: Consider forwarding kwargs for consistency.

The **kwargs parameter is declared but not forwarded to the internal _tokenize_documents and _tokenize_query methods. While these methods don't currently use kwargs, forwarding them would maintain consistency with other tokenize implementations in this PR and enable future customization without changing the signature.

If you'd like to forward kwargs for consistency, apply this diff:

 def _tokenize(
         self, documents: list[str], is_doc: bool = True, **kwargs: Any
     ) -> list[Encoding]:
         return (
-            self._tokenize_documents(documents=documents)
+            self._tokenize_documents(documents=documents, **kwargs)
             if is_doc
-            else self._tokenize_query(query=next(iter(documents)))
+            else self._tokenize_query(query=next(iter(documents)), **kwargs)
         )

And update the helper methods:

-def _tokenize_query(self, query: str) -> list[Encoding]:
+def _tokenize_query(self, query: str, **kwargs: Any) -> list[Encoding]:
     assert self.query_tokenizer is not None
     encoded = self.query_tokenizer.encode_batch([query])
     return encoded

-def _tokenize_documents(self, documents: list[str]) -> list[Encoding]:
+def _tokenize_documents(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
     encoded = self.tokenizer.encode_batch(documents)  # type: ignore[union-attr]
     return encoded
fastembed/sparse/sparse_embedding_base.py (1)

47-48: Add docstring and tests for the new public API method.

The tokenize method establishes a public API interface but lacks documentation. Consider adding a docstring that explains the expected parameters, return value structure, and intended usage pattern for when implementations become available.

Additionally, the PR checklist indicates no tests were added for this feature. While the current implementation is a stub, adding tests that verify the NotImplementedError behavior would establish a baseline for future implementations.

Apply this diff to add a docstring:

 def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
+    """
+    Tokenizes a list of documents.
+    
+    Args:
+        documents (list[str]): The list of documents to tokenize.
+        **kwargs: Additional keyword arguments for tokenization.
+    
+    Returns:
+        dict[str, Any]: Tokenized representation of the documents.
+    
+    Raises:
+        NotImplementedError: Tokenization is not yet implemented for sparse embeddings.
+    """
     raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c1e853 and 7bb1654.

📒 Files selected for processing (21)
  • fastembed/late_interaction/colbert.py (1 hunks)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunks)
  • fastembed/sparse/bm25.py (1 hunks)
  • fastembed/sparse/bm42.py (1 hunks)
  • fastembed/sparse/minicoil.py (1 hunks)
  • fastembed/sparse/sparse_embedding_base.py (1 hunks)
  • fastembed/sparse/sparse_text_embedding.py (2 hunks)
  • fastembed/sparse/splade_pp.py (2 hunks)
  • fastembed/text/onnx_embedding.py (2 hunks)
  • fastembed/text/onnx_text_model.py (1 hunks)
  • fastembed/text/text_embedding.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
  • tests/test_late_interaction_embeddings.py (2 hunks)
  • tests/test_sparse_embeddings.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py
  • fastembed/sparse/splade_pp.py
  • tests/test_sparse_embeddings.py
  • fastembed/text/onnx_embedding.py
  • tests/test_text_onnx_embeddings.py
  • fastembed/sparse/bm42.py
  • fastembed/text/text_embedding_base.py
  • fastembed/sparse/bm25.py
  • tests/test_late_interaction_embeddings.py
  • fastembed/late_interaction/late_interaction_embedding_base.py
🧰 Additional context used
🧬 Code graph analysis (10)
fastembed/sparse/sparse_embedding_base.py (6)
fastembed/sparse/bm25.py (1)
  • tokenize (139-140)
fastembed/sparse/bm42.py (1)
  • tokenize (133-134)
fastembed/sparse/minicoil.py (1)
  • tokenize (141-142)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (95-96)
fastembed/sparse/splade_pp.py (1)
  • tokenize (139-140)
fastembed/sparse/utils/tokenizer.py (1)
  • tokenize (8-12)
fastembed/late_interaction/colbert.py (6)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (2)
  • tokenize (163-164)
  • _tokenize (166-175)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • _tokenize (83-86)
fastembed/text/onnx_text_model.py (1)
  • _tokenize (71-72)
fastembed/text/text_embedding.py (7)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (4)
fastembed/late_interaction/colbert.py (1)
  • _tokenize (86-93)
fastembed/late_interaction_multimodal/colpali.py (1)
  • _tokenize (166-175)
fastembed/text/onnx_text_model.py (1)
  • _tokenize (71-72)
fastembed/common/onnx_model.py (1)
  • OnnxOutputContext (20-23)
fastembed/late_interaction_multimodal/colpali.py (3)
fastembed/late_interaction/colbert.py (1)
  • _tokenize (86-93)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • _tokenize (83-86)
fastembed/text/onnx_text_model.py (1)
  • _tokenize (71-72)
fastembed/late_interaction/late_interaction_text_embedding.py (4)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/sparse/minicoil.py (5)
fastembed/sparse/bm25.py (1)
  • tokenize (139-140)
fastembed/sparse/bm42.py (1)
  • tokenize (133-134)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (47-48)
fastembed/sparse/sparse_text_embedding.py (1)
  • tokenize (95-96)
fastembed/sparse/splade_pp.py (1)
  • tokenize (139-140)
fastembed/sparse/sparse_text_embedding.py (2)
fastembed/common/model_description.py (1)
  • SparseModelDescription (44-46)
fastembed/sparse/sparse_embedding_base.py (1)
  • tokenize (47-48)
fastembed/text/onnx_text_model.py (4)
fastembed/late_interaction/colbert.py (1)
  • _tokenize (86-93)
fastembed/late_interaction_multimodal/colpali.py (1)
  • _tokenize (166-175)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
  • _tokenize (83-86)
fastembed/common/onnx_model.py (2)
  • onnx_embed (119-120)
  • OnnxOutputContext (20-23)
fastembed/rerank/cross_encoder/onnx_text_model.py (6)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
🪛 Ruff (0.14.3)
fastembed/late_interaction/colbert.py

87-87: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction_multimodal/onnx_multimodal_model.py

85-85: Avoid specifying long messages outside the exception class

(TRY003)

fastembed/late_interaction_multimodal/colpali.py

166-166: Unused method argument: kwargs

(ARG002)

fastembed/text/onnx_text_model.py

71-71: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (4)
fastembed/late_interaction/late_interaction_text_embedding.py (1)

119-130: LGTM!

The tokenize method implementation is clean and consistent with the pattern established across other embedding classes in this PR.

fastembed/rerank/cross_encoder/onnx_text_model.py (1)

48-49: LGTM!

The change properly forwards **kwargs to encode_batch, enabling customizable tokenization behavior for reranking pairs.
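
In sketch form, the forwarding amounts to something like this (the pair-based signature is assumed from the linked snippet, not quoted from it):

    def tokenize(self, pairs: list[tuple[str, str]], **kwargs: Any) -> list[Encoding]:
        return self.tokenizer.encode_batch(pairs, **kwargs)  # kwargs now reach encode_batch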

fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

83-86: LGTM!

The _tokenize method properly guards against uninitialized tokenizer and correctly forwards **kwargs to encode_batch.

fastembed/sparse/sparse_text_embedding.py (1)

1-7: The imports are not accidentally included.

The git diff shows they were reorganized from their original positions in main (where they were at the bottom of the imports section) to follow PEP 8 style conventions (stdlib imports first, then project imports). Both imports are actively used: warnings on line 67 for a deprecation warning, and SparseModelDescription on lines 48-49 in the _list_supported_models method type hints. This reorganization is a legitimate cleanup that accompanies the tokenize method addition.

Likely an incorrect or invalid review comment.

Comment on lines +166 to 175
     def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
         texts_query: list[str] = []
         for query in documents:
             query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
             query += "\n"

             texts_query.append(query)
-        encoded = self.tokenizer.encode_batch(texts_query)  # type: ignore[union-attr]
+        assert self.tokenizer is not None
+        encoded = self.tokenizer.encode_batch(texts_query)
         return encoded

⚠️ Potential issue | 🟠 Major

Forward kwargs to enable tokenization customization.

The **kwargs parameter is declared but not forwarded to encode_batch at line 174. This prevents callers from customizing tokenization behavior. Other implementations in this PR forward kwargs (e.g., onnx_multimodal_model.py).

Apply this diff:

 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
     texts_query: list[str] = []
     for query in documents:
         query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
         query += "\n"

         texts_query.append(query)
     assert self.tokenizer is not None
-    encoded = self.tokenizer.encode_batch(texts_query)
+    encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
     return encoded
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
         texts_query: list[str] = []
         for query in documents:
             query = self.BOS_TOKEN + self.QUERY_PREFIX + query + self.PAD_TOKEN * 10
             query += "\n"
             texts_query.append(query)
-        encoded = self.tokenizer.encode_batch(texts_query)  # type: ignore[union-attr]
-        assert self.tokenizer is not None
-        encoded = self.tokenizer.encode_batch(texts_query)
+        assert self.tokenizer is not None
+        encoded = self.tokenizer.encode_batch(texts_query, **kwargs)
         return encoded
🧰 Tools
🪛 Ruff (0.14.3)

166-166: Unused method argument: kwargs

(ARG002)

🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colpali.py around lines 166 to 175, the
method _tokenize accepts **kwargs but does not pass them to
self.tokenizer.encode_batch; update the call to forward kwargs (e.g.,
self.tokenizer.encode_batch(texts_query, **kwargs)) while keeping the existing
assertion that tokenizer is not None so callers can customize tokenization
options like truncation, padding, or return_tensors.
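
For context on what forwarding buys: tokenizers' encode_batch accepts switches such as add_special_tokens, so once **kwargs flow through, callers can toggle behavior per call. A small standalone sketch (the model name is just an example pulled from the HF hub, unrelated to this PR's code):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
docs = ["hello world", "fastembed exposes tokenize"]

# add_special_tokens is one of the options **kwargs forwarding unlocks.
with_special = tokenizer.encode_batch(docs, add_special_tokens=True)
without_special = tokenizer.encode_batch(docs, add_special_tokens=False)

# BERT-style tokenizers wrap a single sequence in [CLS]/[SEP],
# so the special-token variant is exactly two ids longer.
assert len(with_special[0].ids) == len(without_special[0].ids) + 2
```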

Comment on lines +95 to +96
def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")

🛠️ Refactor suggestion | 🟠 Major

Remove duplicate implementation or delegate to self.model.

The tokenize method duplicates the base class implementation exactly. Since SparseTextEmbedding follows a delegation pattern for other methods (embed delegates to self.model.embed(), query_embed delegates to self.model.query_embed()), the tokenize method should either:

  1. Preferred: Delegate to self.model.tokenize() for consistency with the existing pattern, allowing concrete implementations to provide their own tokenization logic in the future.
  2. Alternative: Remove this override entirely and inherit the base class implementation.

The current implementation breaks the delegation pattern and creates unnecessary duplication.

Option 1 (Preferred): Delegate to self.model

 def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
-    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
+    """
+    Tokenizes a list of documents.
+    
+    Args:
+        documents (list[str]): The list of documents to tokenize.
+        **kwargs: Additional keyword arguments for tokenization.
+    
+    Returns:
+        dict[str, Any]: Tokenized representation of the documents.
+    """
+    return self.model.tokenize(documents, **kwargs)

Option 2 (Alternative): Remove the override

-def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
-    raise NotImplementedError("Tokenize method for sparse embeddings is not implemented yet.")
-

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In fastembed/sparse/sparse_text_embedding.py around lines 95-96, the tokenize
override duplicates the base class implementation and breaks the delegation
pattern; change the method to delegate to the underlying model by returning
self.model.tokenize(documents, **kwargs) (preferred), or remove this override
entirely so the class inherits the base implementation; ensure to pass through
kwargs and return the result unchanged.
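
To make the delegation pattern concrete, a reduced sketch — the class names mirror the discussion, but this is not the library's code:

```python
from typing import Any


class _SparseBackend:
    """Stand-in for the concrete model stored in self.model."""

    def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
        raise NotImplementedError(
            "Tokenize method for sparse embeddings is not implemented yet."
        )


class SparseTextEmbeddingSketch:
    def __init__(self) -> None:
        self.model = _SparseBackend()

    def tokenize(self, documents: list[str], **kwargs: Any) -> dict[str, Any]:
        # Delegate like embed/query_embed do; a future backend can then
        # provide real tokenization without touching this class.
        return self.model.tokenize(documents, **kwargs)
```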

Comment on lines +71 to +72
def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]

⚠️ Potential issue | 🟠 Major

Forward kwargs to enable tokenization customization.

The **kwargs parameter is declared but not forwarded to encode_batch. This breaks customization options that callers might pass through the tokenization chain. Other implementations in this PR (e.g., onnx_multimodal_model.py, onnx_text_model.py in rerank) forward kwargs.

Apply this diff:

 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]
+    return self.tokenizer.encode_batch(documents, **kwargs)  # type:ignore[union-attr]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 def _tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
-    return self.tokenizer.encode_batch(documents)  # type:ignore[union-attr]
+    return self.tokenizer.encode_batch(documents, **kwargs)  # type:ignore[union-attr]
🧰 Tools
🪛 Ruff (0.14.3)

71-71: Unused method argument: kwargs

(ARG002)

🤖 Prompt for AI Agents
In fastembed/text/onnx_text_model.py around lines 71 to 72, the _tokenize method
accepts **kwargs but does not forward them to tokenizer.encode_batch; update the
return to call self.tokenizer.encode_batch(documents, **kwargs) (preserving the
existing type-ignore if needed) so callers can customize tokenization options
and match other implementations in this PR.
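
A quick end-to-end check, assuming this branch is installed and the diff above is applied. The model name is one of fastembed's supported models, and whether kwargs propagate from the public tokenize down to _tokenize is an assumption about this PR's wiring:

```python
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
docs = ["kwargs should reach encode_batch"]

default = model.tokenize(docs)
trimmed = model.tokenize(docs, add_special_tokens=False)  # forwarded kwarg

# With the fix, the flag reaches encode_batch and strips [CLS]/[SEP].
assert len(trimmed[0].ids) < len(default[0].ids)
```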

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
fastembed/text/text_embedding_base.py (1)

23-24: Consider alignment with existing abstract method patterns.

The new tokenize method includes a descriptive message in NotImplementedError, while other abstract methods in this class (lines 33, 70) raise NotImplementedError() without a message. For consistency, consider either:

  • Removing the message: raise NotImplementedError()
  • Adding the message to other abstract methods

Additionally, consider adding a docstring to document the expected behavior for subclasses, similar to the docstrings in concrete implementations like TextEmbedding.tokenize (fastembed/text/text_embedding.py, lines 165-176).
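
A sketch of the standardized shape the nitpick points toward — bare NotImplementedError() plus a contract docstring. Illustrative only, not the file's actual contents:

```python
from typing import Any

from tokenizers import Encoding


class TextEmbeddingBaseSketch:
    def tokenize(self, documents: list[str], **kwargs: Any) -> list[Encoding]:
        """Tokenize documents into tokenizers.Encoding objects.

        The base class only defines the contract; concrete embeddings
        override this with their model-specific tokenizer.

        Args:
            documents (list[str]): The documents to tokenize.
            **kwargs: Implementation-specific tokenization options.

        Returns:
            list[Encoding]: One Encoding per input document.
        """
        raise NotImplementedError()
```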

fastembed/late_interaction/late_interaction_embedding_base.py (1)

24-25: Consider alignment with existing abstract method patterns.

Similar to the comment on TextEmbeddingBase, the new tokenize method includes a descriptive message in NotImplementedError, while the embed method (line 34) raises NotImplementedError() without a message. Consider standardizing the approach across abstract methods in this class.

Additionally, consider adding a docstring to document the expected behavior for subclasses, matching the pattern used in concrete implementations like LateInteractionTextEmbedding.tokenize (fastembed/late_interaction/late_interaction_text_embedding.py, lines 118-129).

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

25-26: Consider alignment with existing patterns and verify test coverage.

Similar to the other base classes, the new tokenize method includes a descriptive message in NotImplementedError, while embed_text and embed_image methods (lines 50, 72) raise NotImplementedError() without messages. Consider standardizing the approach.

Additionally, consider adding a docstring matching the pattern in LateInteractionMultimodalEmbedding.tokenize (fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py, lines 121-132).

Test coverage: The PR checklist indicates "Added tests for the feature: ⛔", but the AI summary mentions tests were added in tests/test_late_interaction_embeddings.py and tests/test_late_interaction_multimodal.py. Please verify that the new public tokenize API has adequate test coverage across all embedding types.
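
One possible shape for that coverage, sketched here since the test files themselves aren't shown in this review (the model choice and exact assertions are assumptions):

```python
from fastembed import TextEmbedding


def test_tokenize_returns_one_encoding_per_document() -> None:
    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
    docs = ["hello world", "fastembed exposes tokenize"]

    encodings = model.tokenize(docs)

    assert len(encodings) == len(docs)
    # Each Encoding should carry non-empty token ids for its document.
    assert all(len(encoding.ids) > 0 for encoding in encodings)
```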

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bb1654 and fe6d0dd.

📒 Files selected for processing (3)
  • fastembed/late_interaction/late_interaction_embedding_base.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2 hunks)
  • fastembed/text/text_embedding_base.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
fastembed/text/text_embedding_base.py (9)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • tokenize (48-49)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (6)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • tokenize (24-25)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/colpali.py (1)
  • tokenize (163-164)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/late_interaction/late_interaction_embedding_base.py (7)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • tokenize (25-26)
fastembed/text/text_embedding_base.py (1)
  • tokenize (23-24)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • tokenize (119-130)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • tokenize (122-133)
fastembed/late_interaction/colbert.py (1)
  • tokenize (83-84)
fastembed/text/onnx_embedding.py (1)
  • tokenize (324-325)
fastembed/text/text_embedding.py (1)
  • tokenize (166-177)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on windows-latest test
  • GitHub Check: Python 3.10.x on windows-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.11.x on macos-latest test
  • GitHub Check: Python 3.10.x on macos-latest test
  • GitHub Check: Python 3.12.x on macos-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.11.x on windows-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (3)
fastembed/text/text_embedding_base.py (1)

2-2: LGTM! Import is necessary for the type hint.

The Encoding import from tokenizers is correctly added to support the return type annotation of the new tokenize method.

fastembed/late_interaction/late_interaction_embedding_base.py (1)

3-3: LGTM! Import is necessary for the type hint.

The Encoding import from tokenizers is correctly added to support the return type annotation of the new tokenize method.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

3-3: LGTM! Import is necessary for the type hint.

The Encoding import from tokenizers is correctly added to support the return type annotation of the new tokenize method.

@joein
Member

joein commented Nov 10, 2025

We've decided not to expose tokenize, but just provide token_count
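
For readers following the thread: a hypothetical sketch of a token_count built on top of the private _tokenize — the name and signature are guesses at the direction, not the API that #583 actually shipped:

```python
from typing import Any

from tokenizers import Encoding


def token_count(model: Any, documents: list[str], **kwargs: Any) -> list[int]:
    """Hypothetical helper: non-padded token counts per document."""
    encodings: list[Encoding] = model._tokenize(documents, **kwargs)
    # Summing the attention mask ignores padding tokens, matching the
    # "non-padded token counts" wording in the commit messages below.
    return [sum(encoding.attention_mask) for encoding in encodings]
```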

@dancixx dancixx force-pushed the feat/expose-tokenize-method branch from fe6d0dd to 683eba2 on November 18, 2025 13:08

Commit messages from the timeline (titles truncated):
  • …ed token counts; raise NotImplemented for sparse
  • …non-padded token counts; raise NotImplemented for sparse"

    This reverts commit a26aad6.
@joein
Member

joein commented Dec 7, 2025

closing in favour of #583

@joein joein closed this Dec 7, 2025