Everything You Ever Wanted to Know About Text Tokenization (But Were Afraid to Ask)

  • Writers: Thaís Martins, Thainara Assis, Gustavo Bernardo

  • Reviewers: João Gabriel Matos, Thaís Martins

Welcome to our Tokenization Research Project, developed by the TELUS Digital Research Hub Undergraduate Students.

This repository hosts our comprehensive study of text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extending to multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on large language models.
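To give a flavor of the subword methods discussed in the study, here is a minimal, illustrative sketch of the BPE training loop (not the repository's implementation): it repeatedly finds the most frequent adjacent symbol pair in a toy word-frequency corpus and merges it into a single symbol. The corpus and merge count are made up for illustration, and the naive `str.replace` is only safe for this small example; real tokenizers track token boundaries explicitly.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Rewrite every word, fusing occurrences of the pair into one symbol."""
    merged = " ".join(pair)       # e.g. "e s"
    fused = "".join(pair)         # e.g. "es"
    return {word.replace(merged, fused): freq for word, freq in words.items()}

# Toy corpus: each word is a space-separated sequence of characters.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    words = merge_pair(best, words)
    print(f"merge {step + 1}: {best}")
```

After a few merges, frequent fragments such as "est" emerge as single symbols — the core idea behind how BPE balances vocabulary size against sequence length.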

For the complete study and detailed analysis, see tokenization.md. You can also explore the interactive version by uploading tokenization.ipynb to a notebook environment.

Summary

  1. Introduction
  2. Word-Level Tokenization
  3. Character-Level Tokenization
  4. Subword-Level Tokenization
  5. Why Models Struggle With Letters and Spelling
  6. Tokenization of Different Languages
  7. Tokenization of Mathematics
  8. Tokenization of Programming Languages
  9. A Brief Look at Tokenization and How It Leads to Vectorization

Contact

If you have any questions or would like to get in touch, feel free to contact us:
