
Commit a65676c

⚡️ Speed up function encoded_tokens_len by 70% in PR #231 (remove-tiktoken)
Here is an optimized version of your code. The existing bottleneck is minimal: the body is a single multiplication and a cast to `int`, which is already fast. A very minor optimization is still possible by replacing the float multiplication and the `int()` call with integer division. The `from __future__ import annotations` import could also be removed if the module's annotations do not rely on postponed evaluation (opt-in since Python 3.7); this commit leaves it in place. The optimized version below avoids the floating-point multiplication and conversion overhead and gives the same result as `int(len(s)*0.25)` for non-negative integer `len(s)`.
1 parent c4a24e8 commit a65676c
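As a quick sanity check on the reasoning above, here is a minimal sketch (not part of the commit; the sample string and the `timeit` setup are illustrative) that confirms the two expressions agree for an ordinary string and roughly compares their cost:

    import timeit

    s = "x" * 1000  # illustrative input; any string of realistic length behaves the same

    # Both expressions divide the character count by four and drop the remainder,
    # so they agree for realistic string lengths.
    assert int(len(s) * 0.25) == len(s) // 4

    # Rough micro-benchmark; absolute numbers vary by machine and Python version.
    float_version = timeit.timeit("int(len(s) * 0.25)", globals={"s": s})
    intdiv_version = timeit.timeit("len(s) // 4", globals={"s": s})
    print(f"int(len(s) * 0.25): {float_version:.3f}s   len(s) // 4: {intdiv_version:.3f}s")

On typical CPython builds the integer-division form is the faster of the two per call, though either form costs only a fraction of a microsecond.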

File tree: 1 file changed (+7, -3 lines)

codeflash/code_utils/code_utils.py

Lines changed: 7 additions & 3 deletions
@@ -10,10 +10,14 @@
 
 from codeflash.cli_cmds.console import logger
 
+
 def encoded_tokens_len(s: str) -> int:
-    '''Function for returning the approximate length of the encoded tokens
-    It's an approximation of BPE encoding (https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)'''
-    return int(len(s)*0.25)
+    """Function for returning the approximate length of the encoded tokens
+    It's an approximation of BPE encoding (https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
+    """
+    # Use integer division for better performance
+    return len(s) // 4
+
 
 def get_qualified_name(module_name: str, full_qualified_name: str) -> str:
     if not full_qualified_name:
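For reference, a small usage sketch of the helper after this change (the import path is assumed from the file location shown above; the sample strings are illustrative):

    from codeflash.code_utils.code_utils import encoded_tokens_len

    # Roughly one token per four characters, rounded down.
    print(encoded_tokens_len(""))          # 0
    print(encoded_tokens_len("hello"))     # 1, i.e. 5 // 4
    print(encoded_tokens_len("a" * 1000))  # 250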
