✨ Mafia Word2Vec – Custom NLP Embeddings Project

This project trains a custom Word2Vec model on a Mafia-themed text corpus (The Godfather) and provides tools to:

preprocess raw text

train a Word2Vec model

visualize embeddings (2D + 3D PCA)

track experiments using MLflow

log artifacts (model, visualizations)

load + test model interactively

run analysis scripts

The repository keeps preprocessing, training, analysis, and application code in separate directories, with all file paths defined in one place (src/config.py).


Mafia-Word2Vec/
│
├── analyze/
│   ├── visualize_pca.py       # PCA 2D/3D visualization
│
├── data/
│   ├── raw/                   # Original text files
│   ├── processed/             # Cleaned / tokenized text
│   ├── models/                # Saved Word2Vec model
│
├── src/
│   ├── config.py              # Paths and constants
│   ├── preprocess.py          # Clean → tokenize → save text
│   ├── train.py               # Train + log model to MLflow
│   ├── test_model.py          # Manual testing of learned vectors
│
├── app/
│    ├── fastapi_app.py
│    ├── streamlit_app.py
│
├── notebooks/
│    ├── data_preprocessing.ipynb
│    ├── model_training.ipynb
│
├── .github/workflows/
│    ├── ci.yml
│
├── requirements.txt
├── .flake8
├── pyproject.toml
├── .dvcignore
├── .dockerignore
├── .gitignore
├── README.md

🚀 Features

✔ Word2Vec Training

Train a custom embedding model using gensim.

✔ MLflow Integration

Track:

hyperparameters

metrics

artifacts (saved model + PCA visualizations)

✔ PCA Visualization

2D Matplotlib plot

3D Plotly interactive plot (HTML)

✔ Config-based Paths

All file paths are controlled from src/config.py.

✔ Modular Pipeline

Each stage can be executed independently:

preprocess

train

analyze

test

🛠️ Installation

Clone the repo

git clone https://github.com/ashiq-km/Mafia-Word2vec-.git
cd Mafia-Word2vec-

Install dependencies

pip install -r requirements.txt

📦 1. Preprocess the Raw Text

Place your raw .txt file inside:

data/raw/

Then run:

python -m src.preprocess

This generates:

data/processed/godfather_cleaned.txt
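As a rough illustration of the clean → tokenize → save step, the sketch below lowercases the text, keeps alphabetic tokens, and writes one tokenized sentence per line. It is not the code in src/preprocess.py: the function names, the regex rules, and the naive sentence split are all assumptions made for this example.

```python
import re
from pathlib import Path


def clean_and_tokenize(text: str) -> list[list[str]]:
    """Split raw text into sentences, then into lowercase word tokens."""
    # Naive sentence split on ., ! or ? (an assumption, not the repo's rule).
    sentences = re.split(r"[.!?]+", text)
    tokenized = []
    for sentence in sentences:
        # Keep runs of letters, allowing one inner apostrophe (e.g. "can't").
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
        if tokens:
            tokenized.append(tokens)
    return tokenized


def preprocess_file(raw_path: Path, out_path: Path) -> int:
    """Write one space-joined sentence per line; return the sentence count."""
    sentences = clean_and_tokenize(raw_path.read_text(encoding="utf-8"))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        "\n".join(" ".join(s) for s in sentences) + "\n", encoding="utf-8"
    )
    return len(sentences)
```

One sentence per line is a convenient format because each line can later be split on whitespace and passed to gensim as a list of tokens.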

🤖 2. Train Word2Vec + Log to MLflow

Start MLflow UI:

mlflow ui

Then in a new terminal:

python -m src.train

This will:

✔ train the model

✔ save godfather_w2v.model

✔ log hyperparameters

✔ log metrics (vocab size)

✔ generate a 3D PCA plot

✔ upload artifacts to MLflow

Artifacts are stored at:

mlruns/<experiment_id>/<run_id>/artifacts/
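The training steps listed above could look roughly like the sketch below, which trains a gensim Word2Vec model and records parameters, the vocabulary size, and the saved model file in an MLflow run. This is an assumption about the general shape, not the contents of src/train.py: the function names and every hyperparameter value are illustrative, and the PCA plot step is left out.

```python
def default_params() -> dict:
    """Hyperparameters for the sketch (values are illustrative assumptions)."""
    # sg=1 selects skip-gram; sg=0 would select CBOW.
    return {"vector_size": 100, "window": 5, "min_count": 2, "sg": 1, "epochs": 10}


def load_sentences(path: str) -> list[list[str]]:
    """Read one whitespace-tokenized sentence per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.split() for line in fh if line.strip()]


def train_and_log(corpus_path: str, model_path: str, params: dict) -> int:
    """Train Word2Vec, save it, and record the run in MLflow; return vocab size."""
    # Third-party imports are kept local so the helpers above work without them.
    import mlflow
    from gensim.models import Word2Vec

    sentences = load_sentences(corpus_path)
    with mlflow.start_run():
        mlflow.log_params(params)
        model = Word2Vec(sentences=sentences, **params)
        model.save(model_path)
        vocab_size = len(model.wv.key_to_index)
        mlflow.log_metric("vocab_size", vocab_size)
        mlflow.log_artifact(model_path)
    return vocab_size
```

A call such as `train_and_log("data/processed/godfather_cleaned.txt", "data/models/godfather_w2v.model", default_params())` would tie the sketch to the paths used in this README.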

📊 3. Visualize Word Embeddings

Run the PCA script independently:

python -m analyze.visualize_pca

It produces:

analyze/pca_visual.html

A fully interactive 3D scatter plot.
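For readers curious how word vectors end up as 3D points, here is a minimal sketch of the projection and plotting steps. It computes PCA directly with a NumPy SVD as a stand-in (analyze/visualize_pca.py may well use a different PCA implementation), and the function names are hypothetical.

```python
import numpy as np


def pca_project(vectors: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def save_3d_plot(words: list[str], coords: np.ndarray, out_html: str) -> None:
    """Write an interactive 3D scatter plot of the projected words to HTML."""
    import plotly.express as px  # local import: only needed for plotting

    fig = px.scatter_3d(x=coords[:, 0], y=coords[:, 1], z=coords[:, 2], text=words)
    fig.write_html(out_html)
```

With a loaded model, `words = model.wv.index_to_key[:200]` and `model.wv[words]` give the first 200 vocabulary entries and their vectors, which can be passed to `pca_project` and then to `save_3d_plot`.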

🧪 4. Test the Model

python -m src.test_model

Or load interactively:

from gensim.models import Word2Vec
model = Word2Vec.load("data/models/godfather_w2v.model")
model.wv.most_similar("godfather")
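On a small corpus, many words fall below the minimum count and never enter the vocabulary, and gensim's `most_similar` raises a `KeyError` for such words. The helper below is a small sketch (the name `similar_words` is hypothetical, not part of this repository) that returns an empty list instead.

```python
def similar_words(wv, word: str, topn: int = 5) -> list[tuple[str, float]]:
    """Return the nearest neighbours of `word`, or [] if it is out of vocabulary."""
    # `wv` is expected to be a gensim KeyedVectors object, e.g. model.wv.
    if word not in wv.key_to_index:
        return []
    return wv.most_similar(word, topn=topn)
```

For example, `similar_words(model.wv, "godfather")` behaves like the call above when the word is known.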

📁 Configuration (src/config.py)

All paths are centrally stored:

BASE_DIR

RAW_DATA_FILE

PROCESSED_DATA_FILE

MODEL_FILE
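A config module with these four constants might look like the sketch below. The processed and model file names come from the paths mentioned in this README; the raw file name and the `DATA_DIR` helper are assumptions added for illustration, and the actual src/config.py may differ.

```python
from pathlib import Path

# Repository root, assuming this file lives at <root>/src/config.py.
BASE_DIR = Path(__file__).resolve().parent.parent

# Helper constant introduced for this sketch only.
DATA_DIR = BASE_DIR / "data"

# Hypothetical raw file name; use whatever you placed in data/raw/.
RAW_DATA_FILE = DATA_DIR / "raw" / "godfather.txt"
PROCESSED_DATA_FILE = DATA_DIR / "processed" / "godfather_cleaned.txt"
MODEL_FILE = DATA_DIR / "models" / "godfather_w2v.model"
```

Deriving everything from `BASE_DIR` keeps the paths valid regardless of the directory a script is launched from.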

🐳 Docker Support (optional)

Build the image:

docker build -t mafia-w2v .

Run the container:

docker run -it mafia-w2v

🤝 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you’d like to change.
