
ToCount: Lightweight Token Estimator



Overview

ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.
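
As a quick illustration of the token-budgeting use case, here is a minimal sketch built on the estimate_text_tokens API shown in the Usage section below; the 3,000-token budget is an arbitrary example value, not a library default.

from tocount import estimate_text_tokens, TextEstimator

PROMPT_BUDGET = 3000  # arbitrary example budget, not a ToCount constant

def fits_budget(prompt: str, budget: int = PROMPT_BUDGET) -> bool:
    # Estimate the token count with the rule-based universal estimator
    # and check it against the allowed budget.
    estimated = estimate_text_tokens(prompt, estimator=TextEstimator.RULE_BASED.UNIVERSAL)
    return estimated <= budget

print(fits_budget("How are you?"))  # True (a short prompt is well under budget)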


Installation

PyPI
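
The exact command is not shown on this page; assuming the package is published on PyPI under the name tocount (as the PyPI badge suggests), installation should be:

pip install tocount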

Source code
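
Likewise an assumption, following the usual pip workflow for installing directly from this repository:

git clone https://github.com/openscilab/tocount.git
cd tocount
pip install .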

Models

Model Name            Type        MAE     MSE         R² Score
RULE_BASED.UNIVERSAL  Rule-Based  106.70  381,647.81  0.8175
RULE_BASED.GPT_4      Rule-Based  152.34  571,795.89  0.7266
RULE_BASED.GPT_3_5    Rule-Based  161.93  652,923.59  0.6878

ℹ️ The training and testing data are taken from LMSYS-Chat-1M [1] and WildChat [2].

Usage

>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
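
The other estimators from the Models table can presumably be selected the same way. The sketch below assumes TextEstimator.RULE_BASED.GPT_4 and TextEstimator.RULE_BASED.GPT_3_5 exist, matching the model names listed above; the printed counts will vary by estimator.

from tocount import estimate_text_tokens, TextEstimator

# GPT_4 and GPT_3_5 estimator names are assumed from the Models table above.
for est in (TextEstimator.RULE_BASED.UNIVERSAL,
            TextEstimator.RULE_BASED.GPT_4,
            TextEstimator.RULE_BASED.GPT_3_5):
    print(est, estimate_text_tokens("How are you?", estimator=est))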

Issues & bug reports

Just file an issue and describe it. We'll check it ASAP! Or send an email to [email protected].

  • Please complete the issue template

References

1- Zheng, Lianmin, et al. "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." International Conference on Learning Representations (ICLR), 2024, Spotlight.
2- Zhao, Wenting, et al. "WildChat: 1M ChatGPT Interaction Logs in the Wild." International Conference on Learning Representations (ICLR), 2024, Spotlight.

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project (and we hope you do), please consider supporting us. This project is not, and never will be, run for profit; we need the money only so we can keep doing what we do ;-).

ToCount Donation
