Everything You Ever Wanted to Know About Text Tokenization (But Were Afraid to Ask)

  • Writers: Thaís Martins, Thainara Assis, Gustavo Bernardo

  • Reviewers: João Gabriel Matos, Thaís Martins

Welcome to our Tokenization Research Project, developed by the TELUS Digital Research Hub Undergraduate Students.

This repository hosts our comprehensive study of text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extending to multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on large language models.
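To give a flavor of the subword methods discussed in the study, here is a minimal, illustrative sketch of the BPE training loop (not the repository's implementation): it repeatedly finds the most frequent adjacent symbol pair in a toy word-frequency corpus and merges it into a single symbol. The corpus and merge count are made up for illustration, and the naive `str.replace` is only safe for this small example; real tokenizers track token boundaries explicitly.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Rewrite every word, fusing occurrences of the pair into one symbol."""
    merged = " ".join(pair)       # e.g. "e s"
    fused = "".join(pair)         # e.g. "es"
    return {word.replace(merged, fused): freq for word, freq in words.items()}

# Toy corpus: each word is a space-separated sequence of characters.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    words = merge_pair(best, words)
    print(f"merge {step + 1}: {best}")
```

After a few merges, frequent fragments such as "est" emerge as single symbols — the core idea behind how BPE balances vocabulary size against sequence length.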

For the complete study and detailed analysis, see tokenization.md. You can also explore the interactive version by uploading tokenization.ipynb to a notebook environment.

Summary

  1. Introduction
  2. Word-Level Tokenization
  3. Character-Level Tokenization
  4. Subword-Level Tokenization
  5. Why Models Struggle With Letters and Spelling
  6. Tokenization of Different Languages
  7. Tokenization of Mathematics
  8. Tokenization of Programming Languages
  9. A Brief Look at Tokenization and How It Leads to Vectorization

Contact

If you have any questions or would like to get in touch, feel free to contact us:
