- Writers: Thaís Martins, Thainara Assis, Gustavo Bernardo
- Reviewers: João Gabriel Matos, Thaís Martins
Welcome to our Tokenization Research Project, developed by undergraduate students at the TELUS Digital Research Hub.
This repository hosts our comprehensive study on text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to discussions on multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on large language models.
For the complete study and detailed analysis, see tokenization.md. You can also explore the interactive version by uploading tokenization.ipynb.
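As a quick taste of the comparison the study makes, here is a minimal, illustrative sketch contrasting word-level and character-level splitting with a single BPE-style merge step (a toy example on a made-up three-word corpus; the full algorithms are covered in tokenization.md):

```python
from collections import Counter

text = "low lower lowest"

# Word-level: split on whitespace; every distinct word is a token.
word_tokens = text.split()  # ['low', 'lower', 'lowest']

# Character-level: every character is a token.
char_tokens = list(text)

# Toy subword step (BPE-style): start from characters per word,
# then merge the most frequent adjacent pair across the corpus.
words = [list(w) for w in text.split()]
pairs = Counter()
for w in words:
    for a, b in zip(w, w[1:]):
        pairs[(a, b)] += 1
best = max(pairs, key=pairs.get)  # most frequent adjacent pair

merged = []
for w in words:
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == best:
            out.append(w[i] + w[i + 1])  # fuse the winning pair
            i += 2
        else:
            out.append(w[i])
            i += 1
    merged.append(out)

print(best)    # ('l', 'o')
print(merged)  # [['lo', 'w'], ['lo', 'w', 'e', 'r'], ['lo', 'w', 'e', 's', 't']]
```

Repeating the merge step grows the vocabulary toward frequent substrings such as "low", which is the core idea behind the subword methods (BPE, WordPiece, Unigram) examined in the study.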
- Introduction
- Word-Level Tokenization
- Character-Level Tokenization
- Subword-Level Tokenization
- Why Models Struggle With Letters and Spelling
- Tokenization of Different Languages
- Tokenization of Mathematics
- Tokenization of Programming Languages
- A Brief Look at Tokenization and How It Leads to Vectorization
If you have any questions or would like to get in touch, feel free to contact us:
- Thaís Martins — thais_martins@usp.br
- Thainara Assis — thainaraassisgoulart@usp.br
- Gustavo Bernardo — gustavo.bernardo@usp.br
- João Gabriel Matos — gabrielmattos@usp.br