
ToCount: Lightweight Token Estimator



Overview

ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.
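
As a quick illustration of the token-budgeting use case, here is a minimal sketch built on the estimate_text_tokens API shown in the Usage section below; the 3,000-token budget is an arbitrary example value, not a library default.

from tocount import estimate_text_tokens, TextEstimator

PROMPT_BUDGET = 3000  # arbitrary example budget, not a ToCount constant

def fits_budget(prompt: str, budget: int = PROMPT_BUDGET) -> bool:
    # Estimate the token count with the rule-based universal estimator
    # and check it against the allowed budget.
    estimated = estimate_text_tokens(prompt, estimator=TextEstimator.RULE_BASED.UNIVERSAL)
    return estimated <= budget

print(fits_budget("How are you?"))  # True (a short prompt is well under budget)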


Installation

PyPI
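
The exact command is not shown on this page; assuming the package is published on PyPI under the name tocount (as the PyPI badge suggests), installation should be:

pip install tocount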

Source code
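
Likewise an assumption, following the usual pip workflow for installing directly from this repository:

git clone https://github.com/openscilab/tocount.git
cd tocount
pip install .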

Models

Model Name            Type        MAE     MSE         R² Score
RULE_BASED.UNIVERSAL  Rule-Based  106.70  381,647.81  0.8175
RULE_BASED.GPT_4      Rule-Based  152.34  571,795.89  0.7266
RULE_BASED.GPT_3_5    Rule-Based  161.93  652,923.59  0.6878

ℹ️ The training and testing data are taken from LMSYS-Chat-1M [1] and WildChat [2].

Usage

>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
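
The other estimators from the Models table can presumably be selected the same way. The sketch below assumes TextEstimator.RULE_BASED.GPT_4 and TextEstimator.RULE_BASED.GPT_3_5 exist, matching the model names listed above; the printed counts will vary by estimator.

from tocount import estimate_text_tokens, TextEstimator

# GPT_4 and GPT_3_5 estimator names are assumed from the Models table above.
for est in (TextEstimator.RULE_BASED.UNIVERSAL,
            TextEstimator.RULE_BASED.GPT_4,
            TextEstimator.RULE_BASED.GPT_3_5):
    print(est, estimate_text_tokens("How are you?", estimator=est))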

Issues & bug reports

Just file an issue and describe it. We'll check it ASAP! Or send an email to [email protected].

  • Please complete the issue template

References

1- Zheng, Lianmin, et al. "LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset." International Conference on Learning Representations (ICLR), 2024, Spotlight.
2- Zhao, Wenting, et al. "WildChat: 1M ChatGPT Interaction Logs in the Wild." International Conference on Learning Representations (ICLR), 2024, Spotlight.

Show your support

Star this repo

Give a ⭐️ if this project helped you!

Donate to our project

If you like our project (and we hope you do), please consider supporting us. This project is not, and never will be, run for profit; we need the money only so we can keep doing what we do ;-).

ToCount Donation
