new: add token count method #583
Conversation
📝 Walkthrough: This pull request adds a `token_count` method across FastEmbed's text, sparse, late-interaction, multimodal, and cross-encoder embedding classes.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 5
🧹 Nitpick comments (1)
fastembed/late_interaction/colbert.py (1)
297-330: Remove debug scaffolding before merge.
The `if __name__ == "__main__"` block contains development/testing code that should be removed before merging. This includes commented-out code and print statements used for local debugging.

```diff
-
-
-if __name__ == "__main__":
-    te = Colbert("answerdotai/answerai-colbert-small-v1")
-
-    print(
-        te.token_count(
-            texts=[
-                "qwe",
-                "adsda ads qwe dasd dsad cxc czx as qw er tr y fg s a x b h f d a s w e t y k,l; czcx asd ",
-            ]
-        )
-    )
-    print(
-        te.token_count(
-            texts=[
-                "qwe",
-                "adsda ads qwe dasd dsad cxc czx as qw er tr y fg s a x b h f d a s w e t y k,l; czcx asd ",
-            ],
-            is_doc=False,
-        )
-    )
-    # data = []
-    # with open('../../training_data.csv', 'r') as f:
-    #     for i, line in enumerate(f):
-    #         if i == 0:
-    #             continue
-    #         data.append(line.rsplit(',', maxsplit=1)[0][1:-1])
-    # import time
-    # a = time.perf_counter()
-    # te.token_count(data, batch_size=1024)
-    # print(time.perf_counter() - a)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
- fastembed/late_interaction/colbert.py (3 hunks)
- fastembed/late_interaction/late_interaction_embedding_base.py (1 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (1 hunks)
- fastembed/late_interaction_multimodal/colpali.py (2 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)
- fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1 hunks)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder.py (1 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1 hunks)
- fastembed/sparse/bm25.py (1 hunks)
- fastembed/sparse/bm42.py (1 hunks)
- fastembed/sparse/minicoil.py (1 hunks)
- fastembed/sparse/sparse_embedding_base.py (1 hunks)
- fastembed/sparse/sparse_text_embedding.py (1 hunks)
- fastembed/sparse/splade_pp.py (1 hunks)
- fastembed/text/onnx_embedding.py (1 hunks)
- fastembed/text/onnx_text_model.py (1 hunks)
- fastembed/text/text_embedding.py (1 hunks)
- fastembed/text/text_embedding_base.py (1 hunks)
- tests/test_late_interaction_embeddings.py (1 hunks)
- tests/test_late_interaction_multimodal.py (1 hunks)
- tests/test_sparse_embeddings.py (1 hunks)
- tests/test_text_cross_encoder.py (1 hunks)
- tests/test_text_onnx_embeddings.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: type-checkers
fastembed/late_interaction_multimodal/colpali.py
[error] 182-182: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]
[error] 183-183: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]
fastembed/late_interaction/colbert.py
[error] 107-107: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]
[error] 110-110: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]
fastembed/rerank/cross_encoder/onnx_text_model.py
[error] 172-172: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]
fastembed/text/onnx_text_model.py
[error] 164-164: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]
[error] 166-166: mypy: Item 'None' of 'Any | None' has no attribute 'encode_batch' [union-attr]
fastembed/sparse/bm25.py
[error] 271-271: mypy: Signature of 'token_count' incompatible with supertype SparseTextEmbeddingBase (override)
[error] 273-273: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]
🪛 Ruff (0.14.6)
fastembed/sparse/bm25.py
271-271: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.9.x on windows-latest test
- GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (20)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
155-175: LGTM! The `token_count` method correctly delegates to the underlying model implementation and follows the same pattern as the existing `embed` and `query_embed` methods. The docstring clearly explains the parameters, including the special `is_doc` and `include_extension` flags for Colbert-style models.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
80-87: LGTM! The abstract `token_count` method is correctly defined with `NotImplementedError`. The signature appropriately omits `is_doc` since multimodal embeddings handle text differently than late-interaction text models.
fastembed/sparse/splade_pp.py (1)
56-58: LGTM! The `token_count` method correctly delegates to `_token_count` from the `OnnxTextModel` parent class, consistent with the pattern used in `Bm42` and `MiniCOIL`.
fastembed/text/text_embedding_base.py (1)
73-75: LGTM! The abstract `token_count` method follows the same pattern as other abstract methods in this base class (`get_embedding_size`, `embedding_size`) and is correctly defined with `NotImplementedError`.
fastembed/late_interaction/late_interaction_embedding_base.py (1)
73-81: LGTM! The abstract method signature correctly mirrors the concrete implementation in `Colbert.token_count`, with appropriate parameters for handling documents vs queries (`is_doc`) and optional preprocessing tokens (`include_extension`).
fastembed/sparse/bm42.py (1)
355-356: LGTM! Clean delegation to the inherited `_token_count` method with a signature matching the base class.
fastembed/text/onnx_embedding.py (1)
334-335: LGTM! Consistent delegation pattern matching the base class signature.
tests/test_late_interaction_multimodal.py (1)
106-121: LGTM! Good test coverage validating:
- Token count additivity across documents
- Batching consistency with `batch_size=1`
- `include_extension=True` correctly increases the token count
The test follows the established patterns in this file.
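For readers skimming the review, a rough sketch of the assertion pattern described above; the model name, import path, and exact values are assumptions for illustration, not code taken from the PR:

```python
# Illustrative sketch only; model name and import path are assumed, not copied from the diff.
from fastembed import LateInteractionMultimodalEmbedding


def test_token_count_sketch() -> None:
    model = LateInteractionMultimodalEmbedding("Qdrant/colpali-v1.3-fp16")  # assumed model name
    docs = ["a short caption", "another caption about late interaction retrieval"]

    total = model.token_count(docs)
    # Additivity: counting documents one by one should match the combined count
    assert sum(model.token_count([doc]) for doc in docs) == total
    # Batching consistency: batch_size must not change the result
    assert model.token_count(docs, batch_size=1) == total
    # include_extension=True also counts preprocessing tokens, so the total grows
    assert model.token_count(docs, include_extension=True) > total
```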
tests/test_text_onnx_embeddings.py (1)
208-219: LGTM! The test properly validates token counting behavior by verifying that individual document token counts sum to the total, and that `batch_size` variations produce consistent results.
tests/test_sparse_embeddings.py (1)
303-322: LGTM! The test correctly validates token counting across multiple sparse embedding models using the same sound approach: individual counts sum to the total, with batch size consistency checks.
fastembed/text/text_embedding.py (1)
216-226: LGTM! The `token_count` method follows the established delegation pattern, has clear documentation, and maintains consistency with the codebase's default batch size of 1024.
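As a quick orientation for users of the new API, a minimal usage sketch of the dense-text entry point; the model name is an assumption, any registered model should behave the same way:

```python
# Minimal usage sketch; the model name is an assumption, not prescribed by the PR.
from fastembed import TextEmbedding

model = TextEmbedding("BAAI/bge-small-en-v1.5")
docs = ["FastEmbed is light and fast", "token_count helps estimate indexing cost"]

# Delegates to the underlying ONNX model; batch_size defaults to 1024 per this PR.
print(model.token_count(docs))
print(model.token_count(docs, batch_size=2))  # batching should not change the total
```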
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
165-175: LGTM! The `token_count` method appropriately handles query-document pairs for cross-encoders, following the established delegation pattern with clear documentation.
fastembed/sparse/sparse_embedding_base.py (1)
90-92: LGTM! The abstract `token_count` method correctly defines the interface contract for sparse embedding subclasses.
fastembed/sparse/sparse_text_embedding.py (1)
132-142: LGTM! The `token_count` implementation properly fulfills the abstract base class contract with clear documentation and consistent delegation.
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
210-220: LGTM! The `token_count` method correctly delegates to the internal `_token_count` implementation with appropriate documentation for cross-encoder pair tokenization.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
166-184: LGTM! The `token_count` method appropriately includes the `include_extension` parameter for late-interaction models, allowing users to control whether preprocessing tokens are counted. The implementation and documentation are clear and consistent.
fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1)
60-63: LGTM! The abstract method declaration follows the established pattern in the codebase and is consistent with similar abstract methods in `SparseTextEmbeddingBase` and `TextEmbeddingBase`. The signature correctly uses `Iterable[tuple[str, str]]` to match the cross-encoder's pair-based API.
tests/test_text_cross_encoder.py (1)
127-142: LGTM! The test correctly validates token counting behavior for cross-encoder pairs, checking both additivity across pairs and batching consistency. The test follows the established patterns in this file.
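A small usage sketch of the pair-based API referenced above; the cross-encoder model name is an assumption:

```python
# Illustrative sketch; the cross-encoder model name is an assumption, not taken from the PR.
from fastembed.rerank.cross_encoder import TextCrossEncoder

encoder = TextCrossEncoder("Xenova/ms-marco-MiniLM-L-6-v2")
pairs = [
    ("what is late interaction?", "Late interaction compares token-level embeddings."),
    ("what is bm25?", "BM25 is a lexical ranking function."),
]

total = encoder.token_count(pairs)
# Additivity across pairs, mirroring what the test above checks
assert sum(encoder.token_count([pair]) for pair in pairs) == total
```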
fastembed/sparse/minicoil.py (1)
190-191: LGTM! The implementation correctly delegates to the inherited `_token_count` method from `OnnxTextModel`, following the same pattern used in `BM42` and `SpladePP`.
tests/test_late_interaction_embeddings.py (1)
321-344: LGTM! Comprehensive test coverage for the `token_count` method, validating:
- Additivity of token counts across documents
- Extension tokens (+1 marker per document)
- Batching consistency
- Query mode with minimum length enforcement (32 tokens)
The test cases effectively exercise the `is_doc` and `include_extension` parameters.
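The listed behaviors, sketched as standalone assertions; the model name is reused from the debug snippet earlier in the review, and the per-document "+1 extension marker" and 32-token query minimum are taken from the test description above, not re-verified here:

```python
# Sketch of the behaviors listed above; not copied from the test file itself.
from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding("answerdotai/answerai-colbert-small-v1")
docs = ["hello world", "colbert counts tokens per document"]

base = model.token_count(docs)
assert model.token_count(docs, batch_size=1) == base                        # batching consistency
assert model.token_count(docs, include_extension=True) == base + len(docs)  # +1 marker per document
# Query mode enforces a minimum length of 32 tokens per text when preprocessing is included
assert model.token_count(["short query"], is_doc=False, include_extension=True) >= 32
```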
```python
    def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
        token_num = 0
        texts = texts if isinstance(texts, list) else [texts]
        for text in texts:
            document = remove_non_alphanumeric(text)
            tokens = self.tokenizer.tokenize(document)
            token_num += len(tokens)
        return token_num
```
Fix signature incompatibility and type handling.
Two issues flagged by the pipeline:
- Signature mismatch: the method signature (with `**kwargs`) doesn't match the base class `SparseTextEmbeddingBase.token_count(texts, batch_size=1024)`. The `kwargs` parameter is also unused.
- Type error on line 273: when `texts` is an `Iterable[str]` (but not a list), wrapping it as `[texts]` creates a nested structure instead of a list of strings.
Apply this diff to fix both issues:
```diff
-    def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
+    def token_count(self, texts: Union[str, Iterable[str]], batch_size: int = 1024) -> int:
         token_num = 0
-        texts = texts if isinstance(texts, list) else [texts]
+        if isinstance(texts, str):
+            texts = [texts]
+        elif not isinstance(texts, list):
+            texts = list(texts)
         for text in texts:
             document = remove_non_alphanumeric(text)
             tokens = self.tokenizer.tokenize(document)
             token_num += len(tokens)
         return token_num
```

Note: If `batch_size` should be used for batching (consistent with other implementations), consider using `iter_batch` as done in other `token_count` implementations.
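If batching is wanted, a possible shape for that variant is sketched below. This is illustrative only, not the committed fix: the real method would live on `Bm25`, use `self.tokenizer`, and strip punctuation via `remove_non_alphanumeric` as shown above; the stand-in tokenizer and function name here are assumptions.

```python
# Illustrative batched variant using iter_batch from fastembed/common/utils.py.
from typing import Iterable, Union

from fastembed.common.utils import iter_batch


class SimpleTokenizer:
    # Stand-in for Bm25's tokenizer; the real one also strips non-alphanumeric characters.
    @staticmethod
    def tokenize(text: str) -> list[str]:
        return text.split()


def batched_token_count(
    tokenizer: SimpleTokenizer, texts: Union[str, Iterable[str]], batch_size: int = 1024
) -> int:
    token_num = 0
    texts = [texts] if isinstance(texts, str) else texts
    for batch in iter_batch(texts, batch_size):
        for text in batch:
            token_num += len(tokenizer.tokenize(text))
    return token_num


print(batched_token_count(SimpleTokenizer(), ["a b c", "d e"], batch_size=1))  # -> 5
```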
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    def token_count(self, texts: Union[str, Iterable[str]], batch_size: int = 1024) -> int:
        token_num = 0
        if isinstance(texts, str):
            texts = [texts]
        elif not isinstance(texts, list):
            texts = list(texts)
        for text in texts:
            document = remove_non_alphanumeric(text)
            tokens = self.tokenizer.tokenize(document)
            token_num += len(tokens)
        return token_num
```
🧰 Tools
🪛 GitHub Actions: type-checkers
[error] 271-271: mypy: Signature of 'token_count' incompatible with supertype SparseTextEmbeddingBase (override)
[error] 273-273: mypy: List item 0 has incompatible type 'str | Iterable[str]'; expected 'str' [list-item]
🪛 Ruff (0.14.6)
271-271: Unused method argument: kwargs
(ARG002)
Actionable comments posted: 2
♻️ Duplicate comments (2)
fastembed/sparse/bm25.py (1)
271-278: Silence unused `kwargs` in `Bm25.token_count` while keeping the flexible signature.
`token_count` correctly handles both a single string and an iterable of strings now, but `**kwargs` is never used and Ruff flags it (ARG002). Since this method still needs to accept extra options (e.g. `batch_size`) for compatibility with the base class and wrapper APIs, a simple rename is enough:

```diff
-    def token_count(self, texts: Union[str, Iterable[str]], **kwargs: Any) -> int:
+    def token_count(self, texts: Union[str, Iterable[str]], **_: Any) -> int:
         token_num = 0
         texts = [texts] if isinstance(texts, str) else texts
         for text in texts:
             document = remove_non_alphanumeric(text)
             tokens = self.tokenizer.tokenize(document)
             token_num += len(tokens)
         return token_num
```

This keeps the call surface flexible while making the linter happy.
fastembed/late_interaction_multimodal/colpali.py (1)
175-188: Make `token_count` robust to lazy loading and avoid relying on `assert` for `self.tokenizer`.
Two follow-ups here:
- Tokenizer initialization / lazy-load interaction (functional/robustness issue)
  - With `lazy_load=True`, `load_onnx_model` is not called in `__init__`, so `self.tokenizer` may not exist or may be `None` when `token_count` is invoked before any embedding call. `assert self.tokenizer is not None` does not protect against the attribute being missing, and it is stripped when running Python with optimizations (`python -O`), which can reintroduce the `None`-dereference.
  - Suggest guarding explicitly and either lazily loading or failing fast with a clear error before using `encode_batch` or `tokenize`:

```diff
     def token_count(
         self,
         texts: Union[str, Iterable[str]],
         batch_size: int = 1024,
         include_extension: bool = False,
         **kwargs: Any,
     ) -> int:
-        token_num = 0
-        texts = [texts] if isinstance(texts, str) else texts
-        assert self.tokenizer is not None
-        tokenize_func = self.tokenize if include_extension else self.tokenizer.encode_batch
-        for batch in iter_batch(texts, batch_size):
-            token_num += sum([sum(encoding.attention_mask) for encoding in tokenize_func(batch)])
-        return token_num
+        token_num = 0
+        texts_iter: Iterable[str] = [texts] if isinstance(texts, str) else texts
+
+        # Ensure tokenizer is initialized for both branches (include_extension uses self.tokenize).
+        if getattr(self, "tokenizer", None) is None:
+            # Option A: eager load
+            self.load_onnx_model()
+            # Option B (if you prefer explicit failure):
+            # raise RuntimeError("Tokenizer not initialized; call `load_onnx_model` or embed once before `token_count`.")
+
+        assert self.tokenizer is not None
+        tokenize_func = (
+            self.tokenize
+            if include_extension
+            else self.tokenizer.encode_batch  # type: ignore[union-attr]
+        )
+        for batch in iter_batch(texts_iter, batch_size):
+            token_num += sum(
+                sum(encoding.attention_mask) for encoding in tokenize_func(batch)
+            )
+        return token_num
```

  This keeps mypy happy via the `assert` but also makes the runtime behavior explicit and safer when used with lazy loading.
- Minor cleanups (optional)
  - Ruff's ARG002 warning on `kwargs` can be silenced without changing behavior by e.g. `_ = kwargs` at the top of the method, or by renaming to `_kwargs: Any`.
  - The inner list in `sum([sum(...) for ...])` is unnecessary; the generator form shown above is marginally more efficient and matches patterns in the rest of the codebase.
🧹 Nitpick comments (1)
fastembed/late_interaction/colbert.py (1)
99-127: Unused `kwargs` parameter and logic clarification.
The `kwargs` parameter is accepted but not used (flagged by static analysis). This appears intentional for API consistency, but consider either:
- Prefixing with underscore (`**_kwargs`) to indicate intentional non-use
- Adding a brief comment explaining the parameter is for API compatibility
Otherwise, the implementation logic is correct: document mode sums attention masks directly, while query mode applies `MIN_QUERY_LENGTH` padding when `include_extension=True`.

```diff
     def token_count(
         self,
         texts: Union[str, Iterable[str]],
         batch_size: int = 1024,
         is_doc: bool = True,
         include_extension: bool = False,
-        **kwargs: Any,
+        **_: Any,
     ) -> int:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (20)
- fastembed/late_interaction/colbert.py (3 hunks)
- fastembed/late_interaction/late_interaction_embedding_base.py (1 hunks)
- fastembed/late_interaction/late_interaction_text_embedding.py (1 hunks)
- fastembed/late_interaction_multimodal/colpali.py (2 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)
- fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1 hunks)
- fastembed/rerank/cross_encoder/onnx_text_model.py (1 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder.py (1 hunks)
- fastembed/rerank/cross_encoder/text_cross_encoder_base.py (1 hunks)
- fastembed/sparse/bm25.py (1 hunks)
- fastembed/sparse/bm42.py (1 hunks)
- fastembed/sparse/minicoil.py (1 hunks)
- fastembed/sparse/sparse_embedding_base.py (1 hunks)
- fastembed/sparse/sparse_text_embedding.py (1 hunks)
- fastembed/sparse/splade_pp.py (1 hunks)
- fastembed/text/onnx_embedding.py (1 hunks)
- fastembed/text/onnx_text_model.py (1 hunks)
- fastembed/text/text_embedding.py (1 hunks)
- fastembed/text/text_embedding_base.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- fastembed/rerank/cross_encoder/text_cross_encoder_base.py
- fastembed/text/onnx_text_model.py
🧰 Additional context used
🪛 Ruff (0.14.6)
fastembed/late_interaction_multimodal/colpali.py
180-180: Unused method argument: kwargs
(ARG002)
fastembed/late_interaction/colbert.py
105-105: Unused method argument: kwargs
(ARG002)
fastembed/sparse/bm25.py
271-271: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Python 3.9.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.9.x on windows-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (14)
fastembed/rerank/cross_encoder/text_cross_encoder.py (1)
165-177: Delegated `token_count` is consistent with the existing cross-encoder API. The implementation cleanly mirrors `rerank`/`rerank_pairs`, forwarding `pairs`, `batch_size`, and `**kwargs` to the underlying model. To avoid runtime surprises, double-check that every class in `CROSS_ENCODER_REGISTRY` implements a compatible `token_count` method.
fastembed/text/text_embedding.py (1)
216-228: Public `token_count` on `TextEmbedding` matches sparse/ONNX patterns. This delegator keeps the high-level API consistent with sparse embeddings and ONNX text models by forwarding `texts`, `batch_size`, and `**kwargs` to the underlying model. Please confirm that every class in `EMBEDDINGS_REGISTRY` implements `token_count` so this wrapper never falls back to the base `NotImplementedError`.
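One way to run that confirmation locally might look roughly like this; it is a throwaway sketch, not PR code, and assumes `EMBEDDINGS_REGISTRY` holds the worker classes as the comment implies:

```python
# Throwaway verification sketch: ensure no registered class still inherits
# the NotImplementedError stub from TextEmbeddingBase.
from fastembed.text.text_embedding import TextEmbedding
from fastembed.text.text_embedding_base import TextEmbeddingBase

for cls in TextEmbedding.EMBEDDINGS_REGISTRY:
    overridden = cls.token_count is not TextEmbeddingBase.token_count
    assert overridden, f"{cls.__name__} does not override token_count"
print("all registered text embedding classes override token_count")
```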
fastembed/sparse/minicoil.py (1)
190-193: MiniCOIL `token_count` wiring is straightforward and consistent. The method cleanly forwards to `_token_count` with `texts`, `batch_size`, and `**kwargs`, matching the pattern used in other ONNX-backed sparse embedders.
fastembed/rerank/cross_encoder/onnx_text_cross_encoder.py (1)
210-222: ONNX cross-encoder `token_count` is correctly delegated to `_token_count`. This method aligns with the abstract `TextCrossEncoderBase.token_count` and simply routes `pairs` and `batch_size` into the shared `_token_count` implementation. As long as tests cover a few representative models, this looks good.
fastembed/text/text_embedding_base.py (1)
73-75: Base `token_count` contract is clear and matches sparse base classes. Adding `token_count(self, texts, **kwargs)` with a `NotImplementedError` formalizes the interface for text embeddings and lines up with the sparse base class. Just ensure every concrete `TextEmbeddingBase` subclass now overrides this, otherwise callers may hit the base implementation at runtime.
fastembed/sparse/splade_pp.py (1)
56-59: SpladePP `token_count` follows the shared ONNX text model pattern. Delegating directly to `_token_count` with `texts`, `batch_size`, and `**kwargs` keeps SpladePP consistent with other sparse ONNX models and avoids duplicating tokenization logic.
fastembed/sparse/bm42.py (1)
355-358: Bm42 `token_count` is correctly hooked into the shared ONNX tokenizer path. The method simply forwards `texts`, `batch_size`, and `**kwargs` to `_token_count`, matching the pattern in other sparse ONNX models. With tests exercising sparse `token_count`, this should behave as expected.
fastembed/text/onnx_embedding.py (1)
334-338: LGTM! The `token_count` method correctly delegates to the inherited `_token_count` implementation from `OnnxTextModel`, following the consistent delegation pattern used across other embedding classes in this PR.
fastembed/late_interaction/late_interaction_embedding_base.py (1)
72-80: LGTM! The abstract `token_count` method establishes a consistent interface for subclasses, with a signature that aligns with other base classes like `TextEmbeddingBase` and `SparseTextEmbeddingBase`.
fastembed/late_interaction/late_interaction_text_embedding.py (1)
154-180: LGTM! The `token_count` method correctly delegates to the underlying model with appropriate late-interaction-specific parameters (`is_doc`, `include_extension`). The docstring clearly explains the purpose of each parameter.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
165-185: LGTM! The `token_count` method appropriately delegates to the underlying `ColPali` model with the `include_extension` parameter, which controls whether preprocessing tokens are counted. The signature correctly omits the `is_doc` parameter since multimodal embeddings handle text differently than pure late-interaction text models.
fastembed/sparse/sparse_text_embedding.py (1)
131-144: LGTM! The `token_count` method follows the established delegation pattern, forwarding the call to the underlying sparse embedding model implementation. This is consistent with the approach used in other embedding wrapper classes.
fastembed/sparse/sparse_embedding_base.py (1)
89-92: LGTM! The abstract `token_count` method provides a consistent interface contract for sparse embedding subclasses, matching the pattern used in `TextEmbeddingBase` and other base classes where `batch_size` is passed via `**kwargs`.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
80-86: Abstract `token_count` interface looks consistent and future-proof. The new abstract `token_count` mirrors the other base classes (accepts `texts` plus `**kwargs`) and gives concrete multimodal models room to support `batch_size`, `include_extension`, etc., while keeping the public contract clear. No issues from an interface/typing standpoint.
* new: add token count method
* fix: fix mypy
* fix: load model in token_count
* fix: remove debug code
This PR adds `token_count` methods to all text embeddings.
The trickiest ones are `colbert` and `colpali` due to their internal preprocessing: Colbert requires a query to be at least 32 tokens. However, there might be different use cases, and for this reason Colbert and Colpali have 2 modes, one where preprocessing is included in the count and one where it is not.
#584
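For illustration, since the method is exposed on every text embedding family, the same call works on a sparse wrapper as well; the model name below is an assumption, not prescribed by the PR:

```python
# Usage sketch on a sparse model; "Qdrant/bm25" is an assumed model name.
from fastembed import SparseTextEmbedding

sparse_model = SparseTextEmbedding("Qdrant/bm25")
print(sparse_model.token_count(["counting tokens before indexing helps plan chunk sizes"]))
```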