word2vec in pure NumPy, using skip-gram with negative sampling. Trained on text8 (cleaned Wikipedia corpus). Developed for a JetBrains internship application.
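Skip-gram with negative sampling trains two embedding matrices by treating each (center, context) pair as a positive example and `K` randomly drawn words as negatives. A minimal NumPy sketch of one SGD update, with toy shapes and a hypothetical `sgns_step` helper (not this project's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes for illustration, not the project's configured values.
vocab_size, dim, k = 50, 8, 3
W_in = rng.normal(0, 0.01, (vocab_size, dim))   # center-word embeddings
W_out = rng.normal(0, 0.01, (vocab_size, dim))  # context-word embeddings

def sgns_step(center, context, negatives, lr=0.01):
    """One SGD step of skip-gram with negative sampling."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v = W_in[center]
    # Positive pair: push sigmoid(v . u_context) toward 1.
    # Negative samples: push sigmoid(v . u_neg) toward 0.
    ids = np.concatenate(([context], negatives))
    labels = np.zeros(len(ids))
    labels[0] = 1.0
    u = W_out[ids]                   # (k+1, dim)
    scores = sigmoid(u @ v)          # (k+1,)
    g = scores - labels              # gradient of the logistic loss w.r.t. scores
    W_out[ids] -= lr * np.outer(g, v)
    W_in[center] -= lr * (g @ u)

sgns_step(center=1, context=2, negatives=rng.integers(0, vocab_size, k))
```

Updating only the rows indexed by `ids` (rather than the full matrices) is what keeps each step cheap: the cost per pair is O(K · dim), independent of vocabulary size.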
Create and activate a virtual environment:

```shell
python3 -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Edit `config.py`:
```python
EMBEDDING_DIMENSION = 100
WINDOW_SIZE = 3
K = 3
LEARNING_RATE = 0.01
EPOCHS = 1
DATA_FRACTION = 0.1
```

Notes:

- `DATA_FRACTION`: fraction of the corpus used for training (smaller values train faster; keep the same value when testing the resulting embeddings)
- `WINDOW_SIZE`: context radius on each side of the center word
- `K`: number of negative samples per positive pair
- `EPOCHS`: passes through the corpus
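To illustrate how `WINDOW_SIZE` and `K` shape the training data, here is a toy sketch of skip-gram pair generation. It samples negatives uniformly for simplicity; classic word2vec draws them from a unigram distribution raised to the 0.75 power, and this project's `dataset.py` may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
WINDOW_SIZE, K = 3, 3      # same meaning as the config values above

tokens = [0, 1, 2, 3, 4, 5, 6]  # toy corpus of word ids

pairs = []
for i, center in enumerate(tokens):
    # Context window: up to WINDOW_SIZE words on each side of the center word.
    lo, hi = max(0, i - WINDOW_SIZE), min(len(tokens), i + WINDOW_SIZE + 1)
    for j in range(lo, hi):
        if j != i:
            # K negatives per positive pair (uniform here, for illustration only).
            negatives = rng.integers(0, len(tokens), K)
            pairs.append((center, tokens[j], negatives))
```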
Run training:

```shell
python train.py
```

What happens:

- `dataset.py` downloads and extracts `data/text8` if it is missing
- training runs with a tqdm progress bar per epoch
- embeddings are saved to `output/embeddings.npy`
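The saved file is a standard `.npy` array, so it can presumably be loaded back with `np.load`. A round-trip sketch (the `(vocab_size, EMBEDDING_DIMENSION)` shape and `float32` dtype are assumptions about the save format, not confirmed from the code):

```python
import os
import tempfile

import numpy as np

# Assumed layout: one row per vocabulary word, EMBEDDING_DIMENSION columns.
emb = np.random.default_rng(0).normal(size=(1000, 100)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, emb)       # the same call train.py presumably makes
loaded = np.load(path)   # dtype and shape survive the round trip
```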
Query similar words:

```shell
python test.py king
python test.py science --top-k 20
```

Classic analogy:

```shell
python test.py --analogy king man woman
```

This computes

```
embedding("king") - embedding("man") + embedding("woman")
```

and returns the top candidates by cosine similarity.
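A toy sketch of that computation, with a hypothetical `analogy` helper and a made-up vocabulary (not `test.py` itself); excluding the three query words from the ranking is a common convention and an assumption here:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))  # toy embedding matrix, one row per word
vocab = {w: i for i, w in enumerate(
    ["king", "man", "woman", "queen", "a", "b", "c", "d", "e", "f"])}

def analogy(a, b, c, top_k=3):
    # Query vector: emb[a] - emb[b] + emb[c].
    q = emb[vocab[a]] - emb[vocab[b]] + emb[vocab[c]]
    # Cosine similarity = dot product of unit-normalized vectors.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ (q / np.linalg.norm(q))
    words = list(vocab)
    # Exclude the query words themselves from the candidates.
    order = [i for i in np.argsort(-sims) if words[i] not in (a, b, c)]
    return [words[i] for i in order[:top_k]]

result = analogy("king", "man", "woman")
```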