Everything You Ever Wanted to Know About Text Tokenization (But Were Afraid to Ask)

  • Writers: Thaís Martins, Thainara Assis, Gustavo Bernardo

  • Reviewers: João Gabriel Matos, Thaís Martins

Welcome to our Tokenization Research Project, developed by the TELUS Digital Research Hub Undergraduate Students.

This repository hosts our comprehensive study on text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to discussions on multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on large language models.
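To make the subword-level idea concrete, here is a minimal, simplified sketch of the classic BPE training loop (not the repository's own code): starting from character-level symbols, it repeatedly counts adjacent symbol pairs across a toy word-frequency corpus and merges the most frequent pair. The corpus, the `</w>` end-of-word marker, and the function names are illustrative assumptions, following the standard textbook example.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus,
    # weighted by each word's frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    # Replace every occurrence of the pair with a single merged symbol.
    # (Simplified: a space-based string replace is enough for this toy corpus.)
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in words.items()}

# Toy corpus (hypothetical): words split into characters, with an
# end-of-word marker, mapped to their frequencies.
corpus = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges = []
for _ in range(3):
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    merges.append(best)

print(merges)
# → [('e', 's'), ('es', 't'), ('est', '</w>')]
```

After three merges, the learned subword `est</w>` covers the shared suffix of "newest" and "widest", which is exactly how BPE balances vocabulary size against sequence length.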

For the complete study and detailed analysis, see tokenization.md. You can also explore the interactive version by uploading tokenization.ipynb.

Summary

  1. Introduction
  2. Word-Level Tokenization
  3. Character-Level Tokenization
  4. Subword-Level Tokenization
  5. Why Models Struggle With Letters and Spelling
  6. Tokenization of Different Languages
  7. Tokenization of Mathematics
  8. Tokenization of Programming Languages
  9. A Brief Look at Tokenization and How It Leads to Vectorization

Contact

If you have any questions or would like to get in touch, feel free to contact us: