This project provides a modular FastAPI-based machine learning service that performs unsupervised clustering on retail data (Northwind-style) for products, suppliers, and customers using the DBSCAN algorithm. It includes a complete pipeline for data collection, preprocessing, model training, visualization, and API-based interaction.
- Automated clustering using DBSCAN for multiple entities:
  - Products (Problem 2)
  - Suppliers (Problem 3)
  - Customers by country (Problem 4)
- Dynamic preprocessing pipeline for feature scaling and data cleaning
- Automatic parameter optimization via silhouette score and elbow method (KneeLocator)
- Visualization support for clusters and k-distance plots
- FastAPI endpoints for model training and CSV download
- Persistent model storage with joblib
- Extensible, modular design that is easy to adapt to other datasets or clustering methods
- Backend: FastAPI (Python 3.10+)
- Machine Learning: scikit-learn (DBSCAN, StandardScaler, silhouette analysis)
- Data Handling: pandas, SQLAlchemy, PostgreSQL
- Visualization: Matplotlib, kneed
- Environment & Tools: joblib, python-dotenv, Docker-ready structure
```
.
├── app.py            # FastAPI application with endpoints (/train, /download)
├── database.py       # PostgreSQL connection and data collection via SQLAlchemy
├── db_connection.py  # Environment-based DB config (dotenv)
├── preprocessing.py  # Data preprocessing and feature engineering
├── training.py       # DBSCAN training and parameter optimization
├── visualization.py  # Cluster and eps visualization
├── models/           # Trained model storage (pkl files)
├── outputs/          # Cluster results and generated plots
└── .env              # Database credentials
```
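As an illustration of how `db_connection.py` might assemble the connection settings from the `.env` file, here is a minimal sketch. The function name `build_db_url` and the default values are assumptions for the example, not the repo's actual code; the variable names match the `.env` keys shown in the Installation section.

```python
# Hypothetical sketch of db_connection.py: build a SQLAlchemy-style
# PostgreSQL URL from environment variables loaded via python-dotenv.
import os

def build_db_url() -> str:
    """Assemble a postgresql:// URL from environment variables.

    Defaults below are illustrative only; in the real project they
    would come from the .env file loaded with python-dotenv.
    """
    user = os.environ.get("DB_USER", "postgres")
    password = os.environ.get("DB_PASSWORD", "")
    host = os.environ.get("DB_HOST", "localhost")
    port = os.environ.get("DB_PORT", "5432")
    name = os.environ.get("DB_NAME", "northwind")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```

The resulting URL would then typically be passed to `sqlalchemy.create_engine()` in `database.py`.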
Train and cluster entities using DBSCAN.
- `POST /train/problem_2` → Product clustering
- `POST /train/problem_3` → Supplier clustering
- `POST /train/problem_4` → Customer-country clustering
Each endpoint returns a success message and the generated CSV path.
- `GET /download/problem_2`
- `GET /download/problem_3`
- `GET /download/problem_4`
Each route provides a downloadable CSV file of the clustered output.
- Data Collection – Fetches tables from a PostgreSQL database using SQLAlchemy.
- Preprocessing – Handles missing values, normalization, and feature scaling via StandardScaler.
- Training – Runs DBSCAN with optimized parameters (epsilon and min_samples) determined via silhouette score and knee detection.
- Visualization – Saves k-distance and cluster distribution plots under the outputs/ directory.
- API Interaction – Uses FastAPI to trigger clustering and download results.
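The training step above can be sketched end to end. Note the hedges: the project uses kneed's `KneeLocator` for knee detection, while this self-contained sketch approximates the knee with a simple second-difference heuristic so it needs only scikit-learn and NumPy, and the blob data stands in for the real Northwind features.

```python
# Sketch of the eps/min_samples search: k-distance curve, knee detection,
# DBSCAN fit, and silhouette check. Data and heuristics are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

min_samples = 5  # a common starting point; the real pipeline may tune this

# k-distance curve: distance to the min_samples-th nearest neighbour, sorted
dists, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

# Crude knee: index of the largest second difference.
# KneeLocator from the kneed package does this more robustly.
knee_idx = int(np.argmax(np.diff(k_dist, 2))) + 1
eps = float(k_dist[knee_idx])

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

# Silhouette is computed on non-noise points only (label -1 is noise).
mask = labels != -1
score = (
    silhouette_score(X[mask], labels[mask])
    if len(set(labels[mask])) > 1
    else -1.0
)
```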
```
curl -X POST http://127.0.0.1:8000/train/problem_2
```

Response:

```
{
  "message": "Problem 2 model trained successfully.",
  "file": "outputs/dbscan_clustered_products_problem_2.csv"
}
```

To download the resulting CSV:

```
curl -O http://127.0.0.1:8000/download/problem_2
```

During training, the pipeline automatically generates the following plots:
- `*_eps_plot.png` → elbow curve used to pick the optimal epsilon
- `*_clusters_plot.png` → visual cluster separation for the selected entity
All visualizations are saved under the outputs/ folder.
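The k-distance ("eps") plot itself takes only a few lines of Matplotlib. This sketch uses synthetic distances and an illustrative file name rather than the repo's actual `visualization.py` code.

```python
# Sketch of saving a k-distance plot under outputs/, as visualization.py might.
import os

import matplotlib
matplotlib.use("Agg")  # headless backend: works without a display
import matplotlib.pyplot as plt
import numpy as np

os.makedirs("outputs", exist_ok=True)

# Stand-in for the sorted k-th nearest-neighbour distances.
k_dist = np.sort(np.random.default_rng(0).gamma(2.0, 0.1, size=200))

plt.figure()
plt.plot(k_dist)
plt.xlabel("points sorted by distance")
plt.ylabel("k-th nearest neighbour distance")
plt.title("k-distance (eps) curve")

out_path = os.path.join("outputs", "example_eps_plot.png")
plt.savefig(out_path)
plt.close()
```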
```
git clone https://github.com/yourusername/retail-segmentation-api.git
cd retail-segmentation-api
pip install -r requirements.txt
```

Create a `.env` file with your database credentials:
```
DB_USER=your_username
DB_PASSWORD=your_password
DB_HOST=localhost
DB_PORT=5432
DB_NAME=northwind
```
```
uvicorn app:app --reload
```

Access the interactive docs at: 👉 http://127.0.0.1:8000/docs
```
fastapi
uvicorn
pandas
numpy
scikit-learn
sqlalchemy
matplotlib
kneed
python-dotenv
joblib
psycopg2
```