This project focuses on transforming raw text into numerical representations and learning word embeddings to capture semantic relationships between words. It consists of two main components:
- **Text Encoding (`Encode_text` class)** - Handles text preprocessing, tokenization, and one-hot encoding.
  - Key functions: `fit_tokenizer` (builds the vocabulary), `tokenizer` (converts sentences to tokens and indices), `one_hot_encoding` (generates one-hot vectors).
- **Word Embedding (`Embedding` class)** - Implements a CBOW (Continuous Bag of Words) model to learn word embeddings.
  - Key functions: `Embedding_text` (trains embeddings), `generate_context_target` (creates context-target pairs), `Forward`/`update_weight` (forward pass and gradient updates), `display_embedding_space` (visualizes learned embeddings in 1D, 2D, or 3D).
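The encoding component above could be sketched roughly as follows. This is a hypothetical minimal implementation, not the project's actual code: the method names mirror the list above, but the internals (lowercasing, whitespace splitting, skipping unknown words) are assumptions.

```python
import numpy as np

class SimpleEncoder:
    """Hypothetical sketch of an Encode_text-style encoder."""

    def __init__(self):
        self.word_index = {}  # word -> integer index

    def fit_tokenizer(self, corpus):
        # Build the vocabulary: one integer index per unique word.
        for sentence in corpus:
            for word in sentence.lower().split():
                if word not in self.word_index:
                    self.word_index[word] = len(self.word_index)

    def tokenizer(self, sentence):
        # Convert a sentence to (tokens, indices); unknown words are skipped.
        tokens = [w for w in sentence.lower().split() if w in self.word_index]
        return tokens, [self.word_index[w] for w in tokens]

    def one_hot_encoding(self, word):
        # One-hot vector: all zeros except a 1 at the word's index.
        vec = np.zeros(len(self.word_index))
        vec[self.word_index[word]] = 1.0
        return vec
```

The one-hot vectors produced this way are what the CBOW model consumes as input and target during training.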
- Convert raw text into numerical form suitable for machine learning.
- Learn meaningful vector representations for words.
- Enable visualization of relationships between words in an embedding space.
- Fit the tokenizer on a text corpus: `encoder.fit_tokenizer(corpus)`
- Tokenize a sentence: `encoder.tokenizer("Example sentence")`
- Train embeddings: `embedding.Embedding_text(text, len_context=5)`
- Visualize the learned embedding space: `embedding.display_embedding_space(Embedding_dim=2)`
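Internally, CBOW training pairs each word with its surrounding context and predicts the word from the averaged context embeddings. A rough, hypothetical sketch of that flow (the `window` parameter and weight names are assumptions; the project's `generate_context_target` and `Forward` may differ):

```python
import numpy as np

def generate_context_target(indices, window=2):
    # Pair each word (target) with the word indices within `window`
    # positions on either side (context).
    pairs = []
    for i, target in enumerate(indices):
        context = indices[max(0, i - window):i] + indices[i + 1:i + window + 1]
        if context:
            pairs.append((context, target))
    return pairs

rng = np.random.default_rng(0)
vocab, dim = 5, 2
W_in = rng.normal(size=(vocab, dim))   # input->hidden weights (the embeddings)
W_out = rng.normal(size=(dim, vocab))  # hidden->output weights

def forward(context, W_in, W_out):
    # CBOW forward pass: average the context embeddings, project to
    # vocabulary scores, and normalize with softmax.
    h = W_in[context].mean(axis=0)
    scores = h @ W_out
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

After training, the rows of `W_in` serve as the learned word vectors; reducing them to 1, 2, or 3 dimensions is what the visualization step plots.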
All the images below show the same embedding space, represented in different dimensions.


