This project trains a custom Word2Vec model on a Godfather / Mafia-themed text corpus and provides tools to:
preprocess raw text
train a Word2Vec model
visualize embeddings (2D + 3D PCA)
track experiments using MLflow
log artifacts (model, visualizations)
load + test model interactively
run analysis scripts
The repository keeps preprocessing, training, analysis, and app code in separate directories:
Mafia-Word2Vec/
│
├── analyze/
│ ├── visualize_pca.py # PCA 2D/3D visualization
│
├── data/
│ ├── raw/ # Original text files
│ ├── processed/ # Cleaned / tokenized text
│ ├── models/ # Saved Word2Vec model
│
├── src/
│ ├── config.py # Paths and constants
│ ├── preprocess.py # Clean → tokenize → save text
│ ├── train.py # Train + log model to MLflow
│ ├── test_model.py # Manual testing of learned vectors
│
├── app/
│ ├── fastapi_app.py
│ ├── streamlit_app.py
│
├── notebooks/
│ ├── data_preprocessing.ipynb
│ ├── model_training.ipynb
│
├── .github/workflows/
│ ├── ci.yml
│
├── requirements.txt
├── .flake8
├── pyproject.toml
├── .dvcignore
├── .dockerignore
├── .gitignore
├── README.md
✔ Word2Vec Training
Train a custom embedding model using gensim.
✔ MLflow Integration
Track:
hyperparameters
metrics
artifacts (saved model + PCA visualizations)
✔ PCA Visualization
2D Matplotlib plot
3D Plotly interactive plot (HTML)
✔ Config-based Paths
All file paths are controlled from src/config.py.
✔ Modular Pipeline
Each stage can be executed independently:
preprocess
train
analyze
test
Clone the repo
git clone https://github.com/ashiq-km/Mafia-Word2vec-.git
cd Mafia-Word2vec-
Install dependencies
pip install -r requirements.txt
Place your raw .txt file inside:
data/raw/
Then run:
python -m src.preprocess
This generates:
data/processed/godfather_cleaned.txt
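The clean → tokenize → save flow might look roughly like the sketch below. This is an illustration under assumptions, not the contents of src/preprocess.py: the function names, the lowercase-and-strip-punctuation rule, and the one-sentence-per-line output format are all hypothetical.

```python
import re
from pathlib import Path


def clean_and_tokenize(text: str) -> list[list[str]]:
    """Split text into sentences, then into lowercase word tokens."""
    sentences = re.split(r"[.!?]+", text)
    tokenized = []
    for sentence in sentences:
        # Keep only runs of letters and apostrophes; everything else is dropped.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if tokens:
            tokenized.append(tokens)
    return tokenized


def preprocess_file(raw_path: Path, processed_path: Path) -> int:
    """Write one space-separated sentence per line and return the sentence count."""
    text = raw_path.read_text(encoding="utf-8")
    sentences = clean_and_tokenize(text)
    processed_path.parent.mkdir(parents=True, exist_ok=True)
    processed_path.write_text(
        "\n".join(" ".join(tokens) for tokens in sentences) + "\n",
        encoding="utf-8",
    )
    return len(sentences)
```

Writing one tokenized sentence per line keeps the processed file easy to stream back into a training step with a plain `line.split()`.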
Start MLflow UI:
mlflow ui
Then in a new terminal:
python -m src.train
This will:
✔ train the model
✔ save godfather_w2v.model
✔ log hyperparameters
✔ log metrics (vocab size)
✔ generate a 3D PCA plot
✔ upload artifacts to MLflow
Artifacts stored at:
mlruns/<experiment_id>/<run_id>/artifacts/
Run PCA script independently:
python -m analyze.visualize_pca
It produces:
analyze/pca_visual.html
A fully interactive 3D scatter plot.
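For intuition about the projection behind the plot, here is a minimal PCA sketch using NumPy's SVD. This is a swapped-in technique for illustration; the README does not say which PCA implementation analyze/visualize_pca.py uses, and the function name `pca_project` and the random toy vectors are hypothetical.

```python
import numpy as np


def pca_project(vectors: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_vectors = rng.normal(size=(50, 100))  # stand-in for model.wv.vectors
    coords = pca_project(toy_vectors)
    print(coords.shape)  # (50, 3)
```

With a trained gensim 4.x model, `model.wv.vectors` supplies the input matrix and `model.wv.index_to_key` the matching word labels; the three output columns can then serve as x, y, and z in a 3D scatter plot.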
python -m src.test_model
Or load interactively:
from gensim.models import Word2Vec
model = Word2Vec.load("data/models/godfather_w2v.model")
model.wv.most_similar("godfather")
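gensim's `most_similar` ranks vocabulary words by cosine similarity to the query vector. The sketch below shows that ranking on toy data; the helper name `cosine_neighbors` and the 2-D vectors are invented for illustration and do not come from the trained model.

```python
import numpy as np


def cosine_neighbors(
    query: str, vectors: dict[str, np.ndarray], topn: int = 3
) -> list[tuple[str, float]]:
    """Rank all other words by cosine similarity to the query word."""
    q = vectors[query]
    scores = []
    for word, vec in vectors.items():
        if word == query:
            continue
        cosine = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        scores.append((word, cosine))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-D vectors, invented for illustration only.
toy = {
    "godfather": np.array([1.0, 0.0]),
    "don": np.array([0.9, 0.1]),
    "cannoli": np.array([0.0, 1.0]),
}
```

Calling `cosine_neighbors("godfather", toy)` puts "don" ahead of "cannoli", because its vector points in nearly the same direction as the query.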
All paths are defined centrally in src/config.py:
BASE_DIR
RAW_DATA_FILE
PROCESSED_DATA_FILE
MODEL_FILE
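A config module exposing these four names could be as small as the sketch below. It is an assumption about the layout, not a copy of src/config.py: the raw file name is hypothetical, and `BASE_DIR` is taken from the current working directory on the assumption that commands run from the repository root.

```python
from pathlib import Path

# Assumes commands are run from the repository root, as in `python -m src.train`.
BASE_DIR = Path.cwd()

# Hypothetical raw file name; any .txt file placed in data/raw/ would do.
RAW_DATA_FILE = BASE_DIR / "data" / "raw" / "godfather.txt"
PROCESSED_DATA_FILE = BASE_DIR / "data" / "processed" / "godfather_cleaned.txt"
MODEL_FILE = BASE_DIR / "data" / "models" / "godfather_w2v.model"
```

Other modules would then import the constants, for example `from src.config import MODEL_FILE`, rather than hard-coding paths.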
Build the image:
docker build -t mafia-w2v .
Run the container:
docker run -it mafia-w2v
Pull requests are welcome. For major changes, please open an issue first to discuss what you’d like to change.
