Lucas-Coussy/Tokenizer-and-Embedding

Tokenizer and Embedding Project

Overview

This project focuses on transforming raw text into numerical representations and learning word embeddings to capture semantic relationships between words. It consists of two main components:

  1. Text Encoding (Encode_text class)

    • Handles text preprocessing, tokenization, and one-hot encoding.
    • Key functions: fit_tokenizer (builds vocabulary), tokenizer (converts sentences to tokens and indices), one_hot_encoding (generates one-hot vectors).
  2. Word Embedding (Embedding class)

    • Implements a CBOW (Continuous Bag of Words) model to learn word embeddings.
    • Key functions: Embedding_text (trains embeddings), generate_context_target (creates context-target pairs), Forward/update_weight (forward pass and gradient updates), display_embedding_space (visualizes learned embeddings in 1D, 2D, or 3D).
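The data-preparation steps described above (building a vocabulary, mapping a sentence to indices, one-hot encoding, and forming context-target pairs) can be sketched as follows. This is a minimal illustration in plain Python, not the repository's code: the function names `fit_vocab`, `tokenize`, `one_hot`, and `context_target_pairs`, the regex-based word splitting, and the choice to skip unknown words are all assumptions made for the example.

```python
import re

# Hypothetical word pattern: lowercase letters, digits, and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")

def fit_vocab(corpus):
    """Map each distinct lowercased word to an index, in order of first appearance."""
    vocab = {}
    for word in WORD_RE.findall(corpus.lower()):
        vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(sentence, vocab):
    """Return the tokens of a sentence and the indices of those found in the vocabulary."""
    tokens = WORD_RE.findall(sentence.lower())
    indices = [vocab[t] for t in tokens if t in vocab]  # assumption: unknown words are skipped
    return tokens, indices

def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def context_target_pairs(indices, window=1):
    """For each position, pair the surrounding `window` indices on each side with the centre index."""
    pairs = []
    for i, target in enumerate(indices):
        context = indices[max(0, i - window):i] + indices[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

vocab = fit_vocab("The cat sat on the mat.")
tokens, indices = tokenize("The cat sat", vocab)
pairs = context_target_pairs(indices, window=1)
```

Here `window` counts words on each side of the target; whether the project's `len_context` parameter has the same meaning is not stated in this README.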

Purpose

  • Convert raw text into numerical form suitable for machine learning.
  • Learn meaningful vector representations for words.
  • Enable visualization of relationships between words in an embedding space.

Example Workflow

  1. Fit the tokenizer on a text corpus: encoder.fit_tokenizer(corpus)
  2. Tokenize a sentence: encoder.tokenizer("Example sentence")
  3. Train embeddings: embedding.Embedding_text(text, len_context=5)
  4. Visualize the learned embedding space: embedding.display_embedding_space(Embedding_dim=2)
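To make the training step of this workflow more concrete, here is a rough sketch of a CBOW forward pass and weight update in plain Python. It assumes the standard CBOW formulation (average the context embeddings, apply a softmax over the vocabulary, update with the cross-entropy gradient); the repository's `Forward` and `update_weight` methods may differ. The name `train_cbow`, its parameters, and the toy context-target pairs are hypothetical.

```python
import math
import random

def train_cbow(pairs, vocab_size, dim=2, lr=0.1, epochs=200, seed=0):
    """Train a tiny CBOW model with plain SGD; returns (embeddings, per-epoch mean loss)."""
    rng = random.Random(seed)
    # w_in[v] is the embedding of word v; w_out projects the hidden vector back to the vocabulary.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(dim)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for context, target in pairs:
            # Forward pass: hidden vector is the mean of the context embeddings.
            h = [sum(w_in[c][d] for c in context) / len(context) for d in range(dim)]
            scores = [sum(h[d] * w_out[d][v] for d in range(dim)) for v in range(vocab_size)]
            top = max(scores)  # subtract the max for a numerically stable softmax
            exps = [math.exp(s - top) for s in scores]
            norm = sum(exps)
            probs = [e / norm for e in exps]
            total -= math.log(probs[target])
            # Backward pass: d(loss)/d(scores) = probs - one_hot(target).
            err = probs[:]
            err[target] -= 1.0
            # Gradient w.r.t. the hidden vector, computed before w_out is modified.
            grad_h = [sum(w_out[d][v] * err[v] for v in range(vocab_size)) for d in range(dim)]
            for d in range(dim):
                for v in range(vocab_size):
                    w_out[d][v] -= lr * h[d] * err[v]
            # Each context word receives an equal share of the hidden-vector gradient.
            for c in context:
                for d in range(dim):
                    w_in[c][d] -= lr * grad_h[d] / len(context)
        losses.append(total / len(pairs))
    return w_in, losses

# Toy (context indices, target index) pairs over a 3-word vocabulary.
toy_pairs = [([1], 0), ([0, 2], 1), ([1], 2)]
embeddings, losses = train_cbow(toy_pairs, vocab_size=3, dim=2)
```

With `dim=2`, each row of `embeddings` is a 2D point that could be plotted directly, which is one possible reading of what `display_embedding_space(Embedding_dim=2)` shows.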

Embedding representation

The images below all show the same embedding space, plotted in one, two, and three dimensions.

1D Embedding Space

[Figure: Embedding Space 1D]

2D Embedding Space

[Figure: Embedding Space 2D]

3D Embedding Space

[Figure: Embedding Space 3D]

About

Python classes that implement a basic tokenizer and a Continuous Bag of Words (CBOW) model for learning word embeddings.