ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.
Install from PyPI:

- Check Python Packaging User Guide
- Run `pip install tocount==0.1`

Install from source:

- Download Version 0.1 or the latest source
- Run `pip install .`
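Either way, you can confirm that the package is importable and check the installed version using only the standard library (no assumptions about ToCount's internals):

```python
# Confirm the installation and report the installed version.
from importlib.metadata import version

import tocount  # raises ImportError if the install failed

print(version("tocount"))  # expected: 0.1
```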
| Model Name | Type | MAE | MSE | R² |
|---|---|---|---|---|
| RULE_BASED.UNIVERSAL | Rule-Based | 106.70 | 381,647.81 | 0.8175 |
| RULE_BASED.GPT_4 | Rule-Based | 152.34 | 571,795.89 | 0.7266 |
| RULE_BASED.GPT_3_5 | Rule-Based | 161.93 | 652,923.59 | 0.6878 |
ℹ️ The training and testing data are taken from LMSYS-Chat-1M [1] and WildChat [2].
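MAE, MSE, and R² are standard regression metrics, computed against token counts from a reference tokenizer. The sketch below shows how such an evaluation could be reproduced with scikit-learn; the sample texts and their "true" counts are illustrative placeholders, not the actual benchmark data:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from tocount import estimate_text_tokens, TextEstimator

# Illustrative evaluation pairs: (text, token count from a reference tokenizer).
# A real benchmark would use thousands of conversations, e.g. from [1] and [2].
samples = [
    ("How are you?", 4),
    ("Estimate the number of tokens in this sentence.", 9),
    ("ToCount provides a unified interface for estimation strategies.", 11),
]

y_true = [count for _, count in samples]
y_pred = [
    estimate_text_tokens(text, estimator=TextEstimator.RULE_BASED.UNIVERSAL)
    for text, _ in samples
]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R² :", r2_score(y_true, y_pred))
```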
```python
>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
```
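The same call works with the other estimators from the benchmark table, so you can compare strategies side by side. A small sketch (the prompt is illustrative):

```python
from tocount import estimate_text_tokens, TextEstimator

prompt = "Summarize the following article in three sentences."

# Compare the rule-based estimators from the benchmark table on one prompt.
for estimator in (
    TextEstimator.RULE_BASED.UNIVERSAL,
    TextEstimator.RULE_BASED.GPT_4,
    TextEstimator.RULE_BASED.GPT_3_5,
):
    print(estimator, "->", estimate_text_tokens(prompt, estimator=estimator))
```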
Just open an issue and describe it, and we'll check it as soon as possible. You can also send an email to [email protected].
- Please complete the issue template
1- Zheng, Lianmin, et al. "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." International Conference on Learning Representations (ICLR) 2024 Spotlight.
2- Zhao, Wenting, et al. "WildChat: 1M ChatGPT Interaction Logs in the Wild." International Conference on Learning Representations (ICLR) 2024 Spotlight.
Give a ⭐️ if this project helped you!
If you like our project, and we hope that you do, please consider supporting us. This project is not, and never will be, run for profit; any support simply helps us keep doing what we do ;-).