farisd16/word2vec-numpy

word2vec from scratch (NumPy)

word2vec implemented in pure NumPy, using skip-gram with negative sampling. Trained on text8 (a cleaned Wikipedia corpus). Developed for a JetBrains internship application.

Setup

Create and activate a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Configuration

Edit config.py:

EMBEDDING_DIMENSION = 100
WINDOW_SIZE = 3
K = 3
LEARNING_RATE = 0.01
EPOCHS = 1
DATA_FRACTION = 0.1

Notes:

  • DATA_FRACTION: fraction of the corpus used for training (lower it for faster training, but keep the same value when testing embeddings trained with it)
  • WINDOW_SIZE: context radius on each side of the center word
  • K: number of negative samples per positive pair
  • EPOCHS: number of passes through the corpus
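To illustrate how WINDOW_SIZE and K shape the training data, here is a minimal sketch of skip-gram pair generation with negative sampling (the function name, signature, and uniform negative sampling are assumptions for illustration, not the repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def training_pairs(tokens, window_size=3, k=3, vocab_size=10):
    """Yield (center, context, negatives) triples for skip-gram with
    negative sampling. Hypothetical helper, not this repo's actual code."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # skip the center word itself
            # k negative word indices drawn uniformly (word2vec proper
            # uses a unigram^0.75 distribution)
            negatives = rng.integers(0, vocab_size, size=k)
            yield center, tokens[j], negatives

tokens = [4, 1, 7, 2]  # word indices after vocabulary lookup
pairs = list(training_pairs(tokens, window_size=1, k=3, vocab_size=10))
# with radius 1, each token pairs with its immediate neighbours: 6 positives
```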

Train

Run training:

python train.py

What happens:

  • dataset.py downloads/extracts data/text8 if missing
  • training runs with a tqdm progress bar per epoch
  • embeddings are saved to output/embeddings.npy
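The SGNS update that each training step performs can be sketched as below; variable names and the plain-SGD form are assumptions for illustration, not necessarily train.py's exact code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W_in, W_out, center, context, negatives, lr=0.01):
    """One skip-gram negative-sampling SGD step (sketch).
    W_in, W_out: (vocab_size, dim) input/output embedding matrices."""
    v = W_in[center]                          # center word vector
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids))               # 1 for the true context,
    labels[0] = 1.0                           # 0 for negatives
    u = W_out[ids]                            # (1+k, dim) output vectors
    scores = sigmoid(u @ v)                   # predicted probabilities
    g = scores - labels                       # log-loss gradient per pair
    grad_v = g @ u                            # gradient w.r.t. center vector
    W_out[ids] -= lr * np.outer(g, v)         # update output vectors
    W_in[center] -= lr * grad_v               # update input vector
```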

Evaluate: Nearest Neighbors

Query similar words:

python test.py king
python test.py science --top-k 20
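A nearest-neighbor query like this reduces to cosine similarity over the saved embedding matrix. A minimal sketch, assuming test.py maps words to row indices of output/embeddings.npy (the function below is illustrative, not the repo's code):

```python
import numpy as np

def nearest(word_idx, E, top_k=5):
    """Return indices of the top_k rows of embedding matrix E most
    cosine-similar to row word_idx, excluding the query itself."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[word_idx]   # cosine similarity to every word
    order = np.argsort(-sims)          # most similar first
    return [i for i in order if i != word_idx][:top_k]
```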

Evaluate: Vector Arithmetic (Analogy)

Classic analogy:

python test.py --analogy king man woman

This computes:

  • embedding("king") - embedding("man") + embedding("woman")

and returns top candidates by cosine similarity.
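That computation can be sketched as follows; the vocab dictionary mapping words to embedding rows is an assumption about test.py's internals, and excluding the three query words from the candidates is standard word2vec practice:

```python
import numpy as np

def analogy(a, b, c, E, vocab, top_k=5):
    """Return words closest (by cosine similarity) to E[a] - E[b] + E[c],
    excluding the query words themselves. Sketch, not the repo's code."""
    q = E[vocab[a]] - E[vocab[b]] + E[vocab[c]]
    q = q / np.linalg.norm(q)
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ q                      # cosine similarity to every word
    exclude = {vocab[a], vocab[b], vocab[c]}
    order = np.argsort(-sims)              # most similar first
    words = {i: w for w, i in vocab.items()}
    return [words[i] for i in order if i not in exclude][:top_k]
```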
