- Writers: Thaís Martins, Thainara Assis, Gustavo Bernardo
- Reviewers: João Gabriel Matos, Thaís Martins
Welcome to our Tokenization Research Project, developed by undergraduate students at the TELUS Digital Research Hub.
This repository hosts our comprehensive study on text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to discussions on multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on large language models.
For the complete study and detailed analysis, see tokenization.md. You can also explore the interactive version by uploading tokenization.ipynb.
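As a quick taste of the comparison the study makes, here is a minimal, illustrative sketch contrasting word-level and character-level splitting with a single BPE-style merge step (a toy example on a made-up three-word corpus; the full algorithms are covered in tokenization.md):

```python
from collections import Counter

text = "low lower lowest"

# Word-level: split on whitespace; every distinct word is a token.
word_tokens = text.split()  # ['low', 'lower', 'lowest']

# Character-level: every character is a token.
char_tokens = list(text)

# Toy subword step (BPE-style): start from characters per word,
# then merge the most frequent adjacent pair across the corpus.
words = [list(w) for w in text.split()]
pairs = Counter()
for w in words:
    for a, b in zip(w, w[1:]):
        pairs[(a, b)] += 1
best = max(pairs, key=pairs.get)  # most frequent adjacent pair

merged = []
for w in words:
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == best:
            out.append(w[i] + w[i + 1])  # fuse the winning pair
            i += 2
        else:
            out.append(w[i])
            i += 1
    merged.append(out)

print(best)    # ('l', 'o')
print(merged)  # [['lo', 'w'], ['lo', 'w', 'e', 'r'], ['lo', 'w', 'e', 's', 't']]
```

Repeating the merge step grows the vocabulary toward frequent substrings such as "low", which is the core idea behind the subword methods (BPE, WordPiece, Unigram) examined in the study.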
- Introduction
- Word-Level Tokenization
- Character-Level Tokenization
- Subword-Level Tokenization
- Why Models Struggle With Letters and Spelling
- Tokenization of Different Languages
- Tokenization of Mathematics
- Tokenization of Programming Languages
- A Brief Look at Tokenization and How It Leads to Vectorization
If you have any questions or would like to get in touch, feel free to contact us:
- Thaís Martins — thais_martins@usp.br
- Thainara Assis — thainaraassisgoulart@usp.br
- Gustavo Bernardo — gustavo.bernardo@usp.br
- João Gabriel Matos — gabrielmattos@usp.br