
Conversation

@joein
Member

@joein joein commented Dec 1, 2025

This PR adds token_count methods to all text embeddings.
The trickiest ones are Colbert and ColPali because of their internal preprocessing.

Colbert requires a query to be at least 32 tokens; however, there are other use cases, so
Colbert and ColPali have two modes: one that includes the preprocessing and one that does not.

#584
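
For illustration, a rough usage sketch of the new API (model name taken from the debug snippet removed later in this PR; flag names and defaults follow the review notes below, so adjust to the final signatures):

from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding("answerdotai/answerai-colbert-small-v1")

docs = [
    "hello world",
    "late interaction models produce one vector per token",
]

# Document mode (default): plain token counts of the documents.
print(model.token_count(docs))

# Query mode without the preprocessing extension: raw counts only,
# ignoring the 32-token minimum that Colbert applies to queries.
print(model.token_count(docs, is_doc=False, include_extension=False))

# Query mode including preprocessing: each query is padded to at least 32 tokens.
print(model.token_count(docs, is_doc=False, include_extension=True))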

@joein joein requested review from dancixx and tbung December 1, 2025 17:27
@coderabbitai

coderabbitai bot commented Dec 1, 2025

Warning

Rate limit exceeded

@joein has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 11 minutes and 3 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 21d4f45 and 5e9a426.

📒 Files selected for processing (5)
  • fastembed/late_interaction/colbert.py (2 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (2 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunks)
  • fastembed/sparse/bm42.py (1 hunks)
  • fastembed/text/onnx_text_model.py (1 hunks)
📝 Walkthrough

Walkthrough

This pull request adds a token_count API across many components: base classes, public wrappers, and concrete models for late-interaction, multimodal late-interaction, cross-encoders, sparse models, and ONNX text models. Concrete implementations batch inputs (via iter_batch), use the appropriate tokenizer/encoding functions, sum attention masks (or equivalent counts) and return total token counts. Tests exercising token counting were added for late-interaction, multimodal, sparse, cross-encoder, and ONNX text embeddings.
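
A minimal sketch of that shared counting pattern, assuming a loaded HuggingFace tokenizers.Tokenizer and the iter_batch helper from fastembed.common.utils (an approximation, not the exact code added in the PR):

from typing import Iterable, Union

from tokenizers import Tokenizer

from fastembed.common.utils import iter_batch


def token_count_sketch(
    tokenizer: Tokenizer, texts: Union[str, Iterable[str]], batch_size: int = 1024
) -> int:
    # Normalize a single string into a one-element list before batching.
    texts = [texts] if isinstance(texts, str) else texts
    total = 0
    for batch in iter_batch(texts, batch_size):
        # encode_batch returns one Encoding per text; summing its attention
        # mask counts the non-padding tokens of that text.
        total += sum(sum(enc.attention_mask) for enc in tokenizer.encode_batch(batch))
    return total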

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Changes are widely applied but follow a repeated delegation/implementation pattern, lowering cognitive load.
  • Areas needing closer review:
    • Concrete implementations in colbert.py, colpali.py, and ONNX model files (onnx_text_model.py, onnx_text_cross_encoder.py) to confirm correct tokenizer selection and attention_mask handling.
    • Consistency of parameter semantics across interfaces (e.g., is_doc, include_extension, pairs vs texts) and that defaults are sensible.
    • New tests to ensure they are stable and correctly exercise batching behavior.

Possibly related PRs

Suggested reviewers

  • tbung

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 40.32%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check — ❓ Inconclusive: The title 'new: add token count method' is vague and generic, lacking specificity about which components are affected or the scope of the changes. Consider a more descriptive title such as 'Add token_count method to text embedding classes' to better convey the change scope.
✅ Passed checks (1 passed)
  • Description check — ✅ Passed: The description clearly relates to the changeset, explaining that token_count methods are added to text embeddings with special handling for Colbert and ColPali.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
fastembed/late_interaction/colbert.py (1)

297-330: Remove debug scaffolding before merge.

The if __name__ == "__main__" block contains development/testing code that should be removed before merging. This includes commented-out code and print statements used for local debugging.

-
-
-if __name__ == "__main__":
-    te = Colbert("answerdotai/answerai-colbert-small-v1")
-
-    print(
-        te.token_count(
-            texts=[
-                "qwe",
-                "adsda ads qwe dasd dsad cxc czx as qw er tr y fg s a x  b h f d a s w e t y k,l; czcx asd ",
-            ]
-        )
-    )
-    print(
-        te.token_count(
-            texts=[
-                "qwe",
-                "adsda ads qwe dasd dsad cxc czx as qw er tr y fg s a x  b h f d a s w e t y k,l; czcx asd ",
-            ],
-            is_doc=False,
-        )
-    )
-    # data = []
-    # with open('../../training_data.csv', 'r') as f:
-    #     for i, line in enumerate(f):
-    #
-    #         if i == 0:
-    #             continue
-    #
-    #         data.append(line.rsplit(',', maxsplit=1)[0][1:-1])
-    # import time
-    # a = time.perf_counter()
-    # te.token_count(data, batch_size=1024)
-    # print(time.perf_counter() - a)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ba8990 and a38c309.

📒 Files selected for processing (25)
  • fastembed/late_interaction/colbert.py (3 hunks)
  • fastembed/late_interaction/late_interaction_embedding_base.py (1 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (1 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunks)
  • fastembed/rerank/cross_encoder/text_cross_encoder.py (1 hunks)
  • fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1 hunks)
  • fastembed/sparse/bm25.py (1 hunks)
  • fastembed/sparse/bm42.py (1 hunks)
  • fastembed/sparse/minicoil.py (1 hunks)
  • fastembed/sparse/sparse_embedding_base.py (1 hunks)
  • fastembed/sparse/sparse_text_embedding.py (1 hunks)
  • fastembed/sparse/splade_pp.py (1 hunks)
  • fastembed/text/onnx_embedding.py (1 hunks)
  • fastembed/text/onnx_text_model.py (1 hunks)
  • fastembed/text/text_embedding.py (1 hunks)
  • fastembed/text/text_embedding_base.py (1 hunks)
  • tests/test_late_interaction_embeddings.py (1 hunks)
  • tests/test_late_interaction_multimodal.py (1 hunks)
  • tests/test_sparse_embeddings.py (1 hunks)
  • tests/test_text_cross_encoder.py (1 hunks)
  • tests/test_text_onnx_embeddings.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (16)
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (7)
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
  • token_count (210-220)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-175)
fastembed/sparse/bm25.py (1)
  • token_count (271-278)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-142)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/sparse/sparse_text_embedding.py (7)
fastembed/sparse/bm25.py (1)
  • token_count (271-278)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/minicoil.py (1)
  • token_count (190-191)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-57)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
tests/test_text_onnx_embeddings.py (3)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-335)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/sparse/splade_pp.py (7)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-175)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/minicoil.py (1)
  • token_count (190-191)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/text/onnx_text_model.py (1)
  • _token_count (162-169)
fastembed/text/text_embedding_base.py (5)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-142)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-57)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-335)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/late_interaction/late_interaction_embedding_base.py (10)
fastembed/late_interaction/colbert.py (1)
  • token_count (99-125)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • token_count (155-175)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • token_count (80-87)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-175)
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
  • token_count (61-63)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-57)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-335)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (4)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-175)
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
  • token_count (61-63)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • _token_count (168-175)
fastembed/late_interaction/colbert.py (2)
fastembed/common/utils.py (1)
  • iter_batch (35-45)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-81)
fastembed/rerank/cross_encoder/onnx_text_model.py (2)
fastembed/text/onnx_text_model.py (1)
  • _token_count (162-169)
fastembed/common/utils.py (1)
  • iter_batch (35-45)
tests/test_late_interaction_embeddings.py (6)
fastembed/late_interaction/colbert.py (1)
  • token_count (99-125)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-81)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • token_count (155-175)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • token_count (166-184)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-142)
tests/test_sparse_embeddings.py (9)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-81)
fastembed/sparse/bm25.py (1)
  • token_count (271-278)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/minicoil.py (1)
  • token_count (190-191)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-142)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-57)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-335)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (9)
fastembed/late_interaction/colbert.py (1)
  • token_count (99-125)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-81)
fastembed/late_interaction_multimodal/colpali.py (1)
  • token_count (175-186)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • token_count (80-87)
fastembed/sparse/bm25.py (1)
  • token_count (271-278)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-142)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/late_interaction/late_interaction_text_embedding.py (10)
fastembed/late_interaction/colbert.py (1)
  • token_count (99-125)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-81)
fastembed/late_interaction_multimodal/colpali.py (1)
  • token_count (175-186)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • token_count (80-87)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-175)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-142)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-57)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-335)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/text/onnx_embedding.py (4)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/text/onnx_text_model.py (1)
  • _token_count (162-169)
tests/test_late_interaction_multimodal.py (4)
fastembed/late_interaction/colbert.py (1)
  • token_count (99-125)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-81)
fastembed/late_interaction_multimodal/colpali.py (1)
  • token_count (175-186)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • token_count (80-87)
fastembed/sparse/minicoil.py (7)
fastembed/sparse/bm42.py (1)
  • token_count (355-356)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-142)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-57)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-335)
fastembed/text/text_embedding.py (1)
  • token_count (216-226)
fastembed/text/onnx_text_model.py (1)
  • _token_count (162-169)
🪛 GitHub Actions: type-checkers
fastembed/late_interaction_multimodal/colpali.py

[error] 182-182: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]


[error] 183-183: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]

fastembed/late_interaction/colbert.py

[error] 107-107: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]


[error] 110-110: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]

fastembed/rerank/cross_encoder/onnx_text_model.py

[error] 172-172: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]

fastembed/text/onnx_text_model.py

[error] 164-164: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]


[error] 166-166: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]

fastembed/sparse/bm25.py

[error] 271-271: mypy: Signature of 'token_count' incompatible with supertype SparseTextEmbeddingBase (override)


[error] 273-273: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]

🪛 Ruff (0.14.6)
fastembed/sparse/bm25.py

271-271: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (20)
fastembed/late_interaction/late_interaction_text_embedding.py (1)

155-175: LGTM!

The token_count method correctly delegates to the underlying model implementation and follows the same pattern as existing embed and query_embed methods. The docstring clearly explains the parameters, including the special is_doc and include_extension flags for Colbert-style models.
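
The delegation pattern these comments refer to looks roughly like this (a simplified sketch; class and attribute names are approximations of the actual fastembed classes):

from typing import Any, Iterable, Union


class LateInteractionEmbeddingBaseSketch:
    def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
        # Base class only declares the contract; concrete models must override it.
        raise NotImplementedError()


class ColbertSketch(LateInteractionEmbeddingBaseSketch):
    def token_count(
        self,
        texts: Union[str, Iterable[str]],
        batch_size: int = 1024,
        is_doc: bool = True,
        include_extension: bool = False,
        **kwargs: Any,
    ) -> int:
        # Concrete model: tokenize the texts and sum the resulting token counts.
        ...


class LateInteractionTextEmbeddingSketch:
    model: LateInteractionEmbeddingBaseSketch

    def token_count(
        self,
        texts: Union[str, Iterable[str]],
        batch_size: int = 1024,
        is_doc: bool = True,
        include_extension: bool = False,
        **kwargs: Any,
    ) -> int:
        # Public wrapper: forward the call to whichever concrete model was selected.
        return self.model.token_count(
            texts,
            batch_size=batch_size,
            is_doc=is_doc,
            include_extension=include_extension,
            **kwargs,
        )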

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

80-87: LGTM!

The abstract token_count method is correctly defined with NotImplementedError. The signature appropriately omits is_doc since multimodal embeddings handle text differently than late-interaction text models.

fastembed/sparse/splade_pp.py (1)

56-58: LGTM!

The token_count method correctly delegates to _token_count from the OnnxTextModel parent class, consistent with the pattern used in Bm42 and MiniCOIL.

fastembed/text/text_embedding_base.py (1)

73-75: LGTM!

The abstract token_count method follows the same pattern as other abstract methods in this base class (get_embedding_size, embedding_size) and is correctly defined with NotImplementedError.

fastembed/late_interaction/late_interaction_embedding_base.py (1)

73-81: LGTM!

The abstract method signature correctly mirrors the concrete implementation in Colbert.token_count, with appropriate parameters for handling documents vs queries (is_doc) and optional preprocessing tokens (include_extension).

fastembed/sparse/bm42.py (1)

355-356: LGTM!

Clean delegation to the inherited _token_count method with a signature matching the base class.

fastembed/text/onnx_embedding.py (1)

334-335: LGTM!

Consistent delegation pattern matching the base class signature.

tests/test_late_interaction_multimodal.py (1)

106-121: LGTM!

Good test coverage validating:

  • Token count additivity across documents
  • Batching consistency with batch_size=1
  • include_extension=True correctly increases the token count

The test follows the established patterns in this file.

tests/test_text_onnx_embeddings.py (1)

208-219: LGTM!

The test properly validates token counting behavior by verifying that individual document token counts sum to the total, and that batch_size variations produce consistent results.

tests/test_sparse_embeddings.py (1)

303-322: LGTM!

The test correctly validates token counting across multiple sparse embedding models using the same sound approach: individual counts sum to total, with batch size consistency checks.

fastembed/text/text_embedding.py (1)

216-226: LGTM!

The token_count method follows the established delegation pattern, has clear documentation, and maintains consistency with the codebase's default batch size of 1024.

fastembed/rerank/cross_encoder/text_cross_encoder.py (1)

165-175: LGTM!

The token_count method appropriately handles query-document pairs for cross-encoders, following the established delegation pattern with clear documentation.

fastembed/sparse/sparse_embedding_base.py (1)

90-92: LGTM!

The abstract token_count method correctly defines the interface contract for sparse embedding subclasses.

fastembed/sparse/sparse_text_embedding.py (1)

132-142: LGTM!

The token_count implementation properly fulfills the abstract base class contract with clear documentation and consistent delegation.

fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)

210-220: LGTM!

The token_count method correctly delegates to the internal _token_count implementation with appropriate documentation for cross-encoder pair tokenization.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

166-184: LGTM!

The token_count method appropriately includes the include_extension parameter for late-interaction models, allowing users to control whether preprocessing tokens are counted. The implementation and documentation are clear and consistent.

fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)

60-63: LGTM!

The abstract method declaration follows the established pattern in the codebase and is consistent with similar abstract methods in SparseTextEmbeddingBase and TextEmbeddingBase. The signature correctly uses Iterable[tuple[str, str]] to match the cross-encoder's pair-based API.

tests/test_text_cross_encoder.py (1)

127-142: LGTM!

The test correctly validates token counting behavior for cross-encoder pairs, checking both additivity across pairs and batching consistency. The test follows the established patterns in this file.

fastembed/sparse/minicoil.py (1)

190-191: LGTM!

The implementation correctly delegates to the inherited _token_count method from OnnxTextModel, following the same pattern used in BM42 and SpladePP.

tests/test_late_interaction_embeddings.py (1)

321-344: LGTM!

Comprehensive test coverage for the token_count method, validating:

  • Additivity of token counts across documents
  • Extension tokens (+1 marker per document)
  • Batching consistency
  • Query mode with minimum length enforcement (32 tokens)

The test cases effectively exercise the is_doc and include_extension parameters.
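
A sketch of the shape such a test might take (assertion structure inferred from the summary above; the actual test in tests/test_late_interaction_embeddings.py may differ in names and fixtures):

from fastembed import LateInteractionTextEmbedding


def test_token_count_sketch() -> None:
    model = LateInteractionTextEmbedding("answerdotai/answerai-colbert-small-v1")
    docs = ["hello world", "a slightly longer document about late interaction models"]

    total = model.token_count(docs)
    # Additivity: per-document counts sum to the total.
    assert total == sum(model.token_count([doc]) for doc in docs)
    # Batching consistency: the batch size must not change the result.
    assert model.token_count(docs, batch_size=1) == total
    # Extension tokens: documents gain one marker token each (per the summary above).
    assert model.token_count(docs, include_extension=True) == total + len(docs)
    # Query mode with preprocessing: queries are padded to at least 32 tokens each.
    assert model.token_count(docs, is_doc=False, include_extension=True) >= 32 * len(docs)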

Comment on lines 271 to 278
def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
    token_num = 0
    texts = texts if isinstance(texts, list) else [texts]
    for text in texts:
        document = remove_non_alphanumeric(text)
        tokens = self.tokenizer.tokenize(document)
        token_num += len(tokens)
    return token_num

⚠️ Potential issue | 🔴 Critical

Fix signature incompatibility and type handling.

Two issues flagged by the pipeline:

  1. Signature mismatch: The method signature **kwargs doesn't match the base class SparseTextEmbeddingBase.token_count(texts, batch_size=1024). The kwargs parameter is also unused.

  2. Type error on line 273: When texts is an Iterable[str] (but not a list), wrapping it as [texts] creates a nested structure instead of a list of strings.

Apply this diff to fix both issues:

-    def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
+    def token_count(self, texts: Union[str, Iterable[str]], batch_size: int = 1024) -> int:
         token_num = 0
-        texts = texts if isinstance(texts, list) else [texts]
+        if isinstance(texts, str):
+            texts = [texts]
+        elif not isinstance(texts, list):
+            texts = list(texts)
         for text in texts:
             document = remove_non_alphanumeric(text)
             tokens = self.tokenizer.tokenize(document)
             token_num += len(tokens)
         return token_num

Note: If batch_size should be used for batching (consistent with other implementations), consider using iter_batch as done in other token_count implementations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Current:

def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
    token_num = 0
    texts = texts if isinstance(texts, list) else [texts]
    for text in texts:
        document = remove_non_alphanumeric(text)
        tokens = self.tokenizer.tokenize(document)
        token_num += len(tokens)
    return token_num

Suggested:

def token_count(self, texts: Union[str, Iterable[str]], batch_size: int = 1024) -> int:
    token_num = 0
    if isinstance(texts, str):
        texts = [texts]
    elif not isinstance(texts, list):
        texts = list(texts)
    for text in texts:
        document = remove_non_alphanumeric(text)
        tokens = self.tokenizer.tokenize(document)
        token_num += len(tokens)
    return token_num
🧰 Tools
🪛 GitHub Actions: type-checkers

[error] 271-271: mypy: Signature of 'token_count' incompatible with supertype SparseTextEmbeddingBase (override)


[error] 273-273: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]

🪛 Ruff (0.14.6)

271-271: Unused method argument: kwargs

(ARG002)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
fastembed/sparse/bm25.py (1)

271-278: Silence unused kwargs in Bm25.token_count while keeping the flexible signature

token_count correctly handles both a single string and an iterable of strings now, but **kwargs is never used and Ruff flags it (ARG002). Since this method still needs to accept extra options (e.g. batch_size) for compatibility with the base class and wrapper APIs, a simple rename is enough:

-    def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
+    def token_count(self, texts: Union[str, Iterable[str]], **_: Any) -> int:
         token_num = 0
         texts = [texts] if isinstance(texts, str) else texts
         for text in texts:
             document = remove_non_alphanumeric(text)
             tokens = self.tokenizer.tokenize(document)
             token_num += len(tokens)
         return token_num

This keeps the call surface flexible while making the linter happy.

fastembed/late_interaction_multimodal/colpali.py (1)

175-188: Make token_count robust to lazy loading and avoid relying on assert for self.tokenizer.

Two follow-ups here:

  1. Tokenizer initialization / lazy-load interaction (functional/robustness issue)
    • With lazy_load=True, load_onnx_model is not called in __init__, so self.tokenizer may not exist or may be None when token_count is invoked before any embedding call.
    • assert self.tokenizer is not None does not protect against the attribute being missing, and it is stripped when running Python with optimizations (python -O), which can reintroduce the None-dereference.
    • Suggest guarding explicitly and either lazily loading or failing fast with a clear error before using encode_batch or tokenize:
     def token_count(
         self,
         texts: Union[str, Iterable[str]],
         batch_size: int = 1024,
         include_extension: bool = False,
         **kwargs: Any,
     ) -> int:
-        token_num = 0
-        texts = [texts] if isinstance(texts, str) else texts
-        assert self.tokenizer is not None
-        tokenize_func = self.tokenize if include_extension else self.tokenizer.encode_batch
-        for batch in iter_batch(texts, batch_size):
-            token_num += sum([sum(encoding.attention_mask) for encoding in tokenize_func(batch)])
-        return token_num
+        token_num = 0
+        texts_iter: Iterable[str] = [texts] if isinstance(texts, str) else texts
+
+        # Ensure tokenizer is initialized for both branches (include_extension uses self.tokenize).
+        if getattr(self, "tokenizer", None) is None:
+            # Option A: eager load
+            self.load_onnx_model()
+            # Option B (if you prefer explicit failure):
+            # raise RuntimeError("Tokenizer not initialized; call `load_onnx_model` or embed once before `token_count`.")
+
+        assert self.tokenizer is not None
+        tokenize_func = (
+            self.tokenize
+            if include_extension
+            else self.tokenizer.encode_batch  # type: ignore[union-attr]
+        )
+        for batch in iter_batch(texts_iter, batch_size):
+            token_num += sum(
+                sum(encoding.attention_mask) for encoding in tokenize_func(batch)
+            )
+        return token_num

This keeps mypy happy via the assert but also makes the runtime behavior explicit and safer when used with lazy loading.

  2. Minor cleanups (optional)
    • Ruff’s ARG002 warning on kwargs can be silenced without changing behavior by e.g. _ = kwargs at the top of the method, or by renaming to _kwargs: Any.
    • The inner list in sum([sum(... ) for ...]) is unnecessary; the generator form shown above is marginally more efficient and matches patterns in the rest of the codebase.
🧹 Nitpick comments (1)
fastembed/late_interaction/colbert.py (1)

99-127: Unused kwargs parameter and logic clarification.

The kwargs parameter is accepted but not used (flagged by static analysis). This appears intentional for API consistency, but consider either:

  1. Prefixing with underscore (**_kwargs) to indicate intentional non-use
  2. Adding a brief comment explaining the parameter is for API compatibility

Otherwise, the implementation logic is correct—document mode sums attention masks directly, while query mode applies MIN_QUERY_LENGTH padding when include_extension=True.

     def token_count(
         self,
         texts: Union[str, Iterable[str]],
         batch_size: int = 1024,
         is_doc: bool = True,
         include_extension: bool = False,
-        **kwargs: Any,
+        **_: Any,
     ) -> int:
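
For reference, a rough reconstruction of the document/query counting logic described above (not the PR's actual implementation; the MIN_QUERY_LENGTH constant and the +1 document marker are inferred from the review comments and test summaries):

from typing import Any, Iterable, Union

from fastembed.common.utils import iter_batch

MIN_QUERY_LENGTH = 32  # assumed value, taken from the review comments


def colbert_token_count_sketch(
    tokenizer: Any,
    texts: Union[str, Iterable[str]],
    batch_size: int = 1024,
    is_doc: bool = True,
    include_extension: bool = False,
) -> int:
    texts = [texts] if isinstance(texts, str) else texts
    total = 0
    for batch in iter_batch(texts, batch_size):
        for encoding in tokenizer.encode_batch(list(batch)):
            count = sum(encoding.attention_mask)
            if include_extension:
                if is_doc:
                    # Documents gain one marker token during preprocessing.
                    count += 1
                else:
                    # Queries are padded up to the 32-token minimum.
                    count = max(count, MIN_QUERY_LENGTH)
            total += count
    return total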
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a38c309 and 21d4f45.

📒 Files selected for processing (20)
  • fastembed/late_interaction/colbert.py (3 hunks)
  • fastembed/late_interaction/late_interaction_embedding_base.py (1 hunks)
  • fastembed/late_interaction/late_interaction_text_embedding.py (1 hunks)
  • fastembed/late_interaction_multimodal/colpali.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1 hunks)
  • fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunks)
  • fastembed/rerank/cross_encoder/text_cross_encoder.py (1 hunks)
  • fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1 hunks)
  • fastembed/sparse/bm25.py (1 hunks)
  • fastembed/sparse/bm42.py (1 hunks)
  • fastembed/sparse/minicoil.py (1 hunks)
  • fastembed/sparse/sparse_embedding_base.py (1 hunks)
  • fastembed/sparse/sparse_text_embedding.py (1 hunks)
  • fastembed/sparse/splade_pp.py (1 hunks)
  • fastembed/text/onnx_embedding.py (1 hunks)
  • fastembed/text/onnx_text_model.py (1 hunks)
  • fastembed/text/text_embedding.py (1 hunks)
  • fastembed/text/text_embedding_base.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • fastembed/rerank/cross_encoder/text_cross_encoder_base.py
  • fastembed/text/onnx_text_model.py
🧰 Additional context used
🧬 Code graph analysis (8)
fastembed/late_interaction/late_interaction_embedding_base.py (10)
fastembed/late_interaction/colbert.py (1)
  • token_count (99-127)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • token_count (155-180)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • token_count (80-86)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-177)
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
  • token_count (61-63)
fastembed/sparse/bm42.py (1)
  • token_count (355-358)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-59)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-337)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/sparse/splade_pp.py (6)
fastembed/sparse/bm25.py (1)
  • token_count (271-278)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-337)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • _token_count (168-177)
fastembed/text/onnx_text_model.py (1)
  • _token_count (162-172)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (11)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-80)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • token_count (155-180)
fastembed/late_interaction_multimodal/colpali.py (1)
  • token_count (175-188)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • token_count (166-185)
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
  • token_count (210-222)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-177)
fastembed/sparse/bm25.py (1)
  • token_count (271-278)
fastembed/sparse/sparse_embedding_base.py (1)
  • token_count (90-92)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-144)
fastembed/text/text_embedding.py (1)
  • token_count (216-228)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/sparse/sparse_embedding_base.py (12)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-80)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • token_count (155-180)
fastembed/late_interaction_multimodal/colpali.py (1)
  • token_count (175-188)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • token_count (80-86)
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
  • token_count (210-222)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-177)
fastembed/sparse/bm25.py (1)
  • token_count (271-278)
fastembed/sparse/bm42.py (1)
  • token_count (355-358)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-144)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-59)
fastembed/text/text_embedding.py (1)
  • token_count (216-228)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (8)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-177)
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
  • token_count (61-63)
fastembed/sparse/bm42.py (1)
  • token_count (355-358)
fastembed/sparse/minicoil.py (1)
  • token_count (190-193)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-59)
fastembed/text/text_embedding.py (1)
  • token_count (216-228)
fastembed/rerank/cross_encoder/onnx_text_model.py (1)
  • _token_count (168-177)
fastembed/text/onnx_text_model.py (1)
  • _token_count (162-172)
fastembed/late_interaction_multimodal/colpali.py (3)
fastembed/common/utils.py (1)
  • iter_batch (35-45)
fastembed/late_interaction/colbert.py (2)
  • token_count (99-127)
  • tokenize (83-88)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • token_count (80-86)
fastembed/text/text_embedding.py (7)
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
  • token_count (210-222)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
  • token_count (165-177)
fastembed/sparse/bm42.py (1)
  • token_count (355-358)
fastembed/sparse/sparse_text_embedding.py (1)
  • token_count (132-144)
fastembed/sparse/splade_pp.py (1)
  • token_count (56-59)
fastembed/text/onnx_embedding.py (1)
  • token_count (334-337)
fastembed/text/text_embedding_base.py (1)
  • token_count (73-75)
fastembed/late_interaction/colbert.py (3)
fastembed/common/utils.py (1)
  • iter_batch (35-45)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • token_count (73-80)
fastembed/text/text_embedding.py (1)
  • token_count (216-228)
🪛 Ruff (0.14.6)
fastembed/late_interaction_multimodal/colpali.py

180-180: Unused method argument: kwargs

(ARG002)

fastembed/late_interaction/colbert.py

105-105: Unused method argument: kwargs

(ARG002)

fastembed/sparse/bm25.py

271-271: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Python 3.9.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.9.x on macos-latest test
  • GitHub Check: Python 3.9.x on windows-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (14)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)

165-177: Delegated token_count is consistent with existing cross‑encoder API

The implementation cleanly mirrors rerank/rerank_pairs, forwarding pairs, batch_size, and **kwargs to the underlying model. To avoid runtime surprises, double‑check that every class in CROSS_ENCODER_REGISTRY implements a compatible token_count method.

fastembed/text/text_embedding.py (1)

216-228: Public token_count on TextEmbedding matches sparse/ONNX patterns

This delegator keeps the high‑level API consistent with sparse embeddings and ONNX text models by forwarding texts, batch_size, and **kwargs to the underlying model. Please confirm that every class in EMBEDDINGS_REGISTRY implements token_count so this wrapper never falls back to the base NotImplementedError.

fastembed/sparse/minicoil.py (1)

190-193: MiniCOIL token_count wiring is straightforward and consistent

The method cleanly forwards to _token_count with texts, batch_size, and **kwargs, matching the pattern used in other ONNX‑backed sparse embedders.

fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)

210-222: ONNX cross‑encoder token_count is correctly delegated to _token_count

This method aligns with the abstract TextCrossEncoderBase.token_count and simply routes pairs and batch_size into the shared _token_count implementation. As long as tests cover a few representative models, this looks good.

fastembed/text/text_embedding_base.py (1)

73-75: Base token_count contract is clear and matches sparse base classes

Adding token_count(self, texts, **kwargs) with a NotImplementedError formalizes the interface for text embeddings and lines up with the sparse base class. Just ensure every concrete TextEmbeddingBase subclass now overrides this, otherwise callers may hit the base implementation at runtime.

fastembed/sparse/splade_pp.py (1)

56-59: SpladePP token_count follows the shared ONNX text model pattern

Delegating directly to _token_count with texts, batch_size, and **kwargs keeps SpladePP consistent with other sparse ONNX models and avoids duplicating tokenization logic.

fastembed/sparse/bm42.py (1)

355-358: Bm42 token_count is correctly hooked into the shared ONNX tokenizer path

The method simply forwards texts, batch_size, and **kwargs to _token_count, matching the pattern in other sparse ONNX models. With tests exercising sparse token_count, this should behave as expected.

fastembed/text/onnx_embedding.py (1)

334-338: LGTM!

The token_count method correctly delegates to the inherited _token_count implementation from OnnxTextModel, following the consistent delegation pattern used across other embedding classes in this PR.

fastembed/late_interaction/late_interaction_embedding_base.py (1)

72-80: LGTM!

The abstract token_count method establishes a consistent interface for subclasses, with a signature that aligns with other base classes like TextEmbeddingBase and SparseTextEmbeddingBase.

fastembed/late_interaction/late_interaction_text_embedding.py (1)

154-180: LGTM!

The token_count method correctly delegates to the underlying model with appropriate late-interaction-specific parameters (is_doc, include_extension). The docstring clearly explains the purpose of each parameter.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

165-185: LGTM!

The token_count method appropriately delegates to the underlying ColPali model with the include_extension parameter, which controls whether preprocessing tokens are counted. The signature correctly omits the is_doc parameter since multimodal embeddings handle text differently than pure late-interaction text models.

fastembed/sparse/sparse_text_embedding.py (1)

131-144: LGTM!

The token_count method follows the established delegation pattern, forwarding the call to the underlying sparse embedding model implementation. This is consistent with the approach used in other embedding wrapper classes.

fastembed/sparse/sparse_embedding_base.py (1)

89-92: LGTM!

The abstract token_count method provides a consistent interface contract for sparse embedding subclasses, matching the pattern used in TextEmbeddingBase and other base classes where batch_size is passed via **kwargs.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

80-86: Abstract token_count interface looks consistent and future-proof.

The new abstract token_count mirrors the other base classes (accepts texts plus **kwargs) and gives concrete multimodal models room to support batch_size, include_extension, etc., while keeping the public contract clear. No issues from an interface/typing standpoint.

@joein joein merged commit b718cc6 into main Dec 4, 2025
17 checks passed
@joein joein deleted the add-token-count branch December 4, 2025 04:29
kacperlukawski pushed a commit to kacperlukawski/fastembed that referenced this pull request Dec 15, 2025
* new: add token count method

* fix: fix mypy

* fix: load model in token_count

* fix: remove debug code
@coderabbitai coderabbitai bot mentioned this pull request Jan 9, 2026