Drift Detection Monitoring System - Prometheus, Grafana, AlertManager, and Slack

This project is a practical, end-to-end MLOps example that detects data and concept drift, exports drift metrics to Prometheus, visualizes them in Grafana, and sends alerts through Alertmanager to Slack.

GOAL

Model Service (FastAPI): Makes predictions, detects drift, and exposes a "/metrics" endpoint.
Prometheus: Scrapes and stores metrics, and evaluates alert rules.
Grafana: Visualizes metrics on an interactive dashboard.
Real-time Drift Detection: Statistical methods: the Kolmogorov-Smirnov (KS) test for numerical data and the Population Stability Index (PSI) for categorical data.
Alertmanager: Sends alerts to Slack when drift is detected.
Slack: Messaging platform for receiving alerts (on phone, laptop, etc.) when drift exceeds threshold.
Docker Compose: For easy deployment of Docker Containers for the Services.

Project Structure

Create the necessary files and directories in the project root directory:

mkdir -p drift-monitoring/{prometheus,grafana/provisioning/{datasources,dashboards},model-service,alertmanager}
cd drift-monitoring

Model Service (Python/FastAPI)

app.py: Main API with endpoints
drift_detector.py: KS Test & PSI algorithms
data_generator.py: Synthetic data simulation
Real-time drift detection
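
The two statistics named above can be sketched in a few lines. This is a minimal, standard-library illustration and not the contents of drift_detector.py: the function names, the eps smoothing value, and the sample data are assumptions made for the example.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    max_gap = 0.0
    for x in set(ref) | set(cur):
        # Fraction of each sample that is <= x
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def psi(reference, current, eps=1e-6):
    """Population Stability Index over the categories seen in either sample."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        # eps (an assumed value) keeps the logarithm finite when a category
        # is missing from one of the samples
        p_ref = max(ref_counts[category] / len(reference), eps)
        p_cur = max(cur_counts[category] / len(current), eps)
        total += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return total


# Hypothetical data: a numeric feature shifted by 50, and a category mix change
baseline = list(range(100))
shifted = [x + 50 for x in baseline]
print("KS, no drift:", ks_statistic(baseline, baseline))
print("KS, shifted: ", ks_statistic(baseline, shifted))
print("PSI:         ", psi(["a", "b", "a", "b"], ["a", "a", "a", "b"]))
```

Both functions return 0 when the two samples have the same distribution and grow as the samples diverge, which is what makes them usable as a drift score compared against a threshold.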

Model-Service API Endpoints

Model Service:

GET / - Service status
POST /predict - Make prediction
GET /drift/status - Current drift status
POST /simulate/drift - Simulate drift for testing
GET /metrics - Prometheus metrics
GET /health - Health check
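
To make the /metrics endpoint less abstract, the sketch below renders a gauge in the Prometheus text exposition format using only the standard library. It is an illustration of the wire format, not the service's actual code (which may well use a client library). The metric name model_drift_score and the method label appear in the queries later in this README; the feature label, the feature names, and the help text are assumptions.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are
    assumed to need no escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical feature names and scores
body = render_gauge(
    "model_drift_score",
    "Drift score per feature and method",
    [
        ({"feature": "age", "method": "ks_test"}, 0.12),
        ({"feature": "country", "method": "psi"}, 0.05),
    ],
)
print(body)
```

Prometheus scrapes this plain-text body on each scrape interval and stores every labelled line as a separate time series.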

Create the .env and docker-compose.yml files in the project root:

touch .env docker-compose.yml

Build and Start the Services

# Build and start all services
docker-compose up -d --build

# Wait for all services to be ready (30 seconds)
sleep 30

# List the containers and their status
docker-compose ps

# Check logs
docker-compose logs -f

# Check logs of individual service (model-service, prometheus, grafana, alertmanager)
docker-compose logs -f model-service

# Stop and remove the containers, networks, and volumes
docker-compose down -v

# Remove everything including images
docker-compose down -v --rmi all

Verify Service Health

# Check model service
curl http://localhost:8000/health

# Check Prometheus
curl http://localhost:9090/-/healthy

# Check Grafana
curl http://localhost:3000/api/health
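
As an alternative to a fixed sleep followed by manual curl calls, the health endpoints above can be polled until they all respond. The following is a sketch under assumptions: the retry count and delay are arbitrary, and an HTTP 200 is taken to mean "healthy" for all three services.

```python
import time
import urllib.error
import urllib.request

# URLs taken from the curl commands above
HEALTH_URLS = {
    "model-service": "http://localhost:8000/health",
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
}


def is_healthy(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(urls, check=is_healthy, retries=15, delay=2.0, sleep=time.sleep):
    """Poll until every service passes or retries run out.

    Returns the sorted names of the services that are still failing.
    """
    pending = dict(urls)
    for attempt in range(retries):
        # Keep only the services that still fail their check
        pending = {name: url for name, url in pending.items() if not check(url)}
        if not pending:
            break
        if attempt < retries - 1:
            sleep(delay)
    return sorted(pending)
```

Calling wait_until_healthy(HEALTH_URLS) after docker-compose up returns an empty list once all three services respond, or the names of the ones that never came up.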

Access the Services

# Model Service API
http://localhost:8000

# Prometheus
http://localhost:9090

# Grafana (username: admin, password: admin)
http://localhost:3000

# Alertmanager
http://localhost:9093

Setup Slack

Install Slack, create an account, and sign in. On Linux systems with snap, the desktop app can be installed with:

sudo snap install slack --classic

Create a Slack Incoming Webhook

In your Slack workspace, click File > Settings & Administration > Manage apps.
In the Search bar on the top right, search and open "Incoming Webhooks".
Click "Add to Slack".
Scroll down to "Post to Channel" and select the channel where you want to post alerts (or Create a New Channel).
Click "Add Incoming Webhooks Integration".
Copy & Save the Webhook URL, e.g.: https://hooks.slack.com/services/T09NYD7D30R/B09NHJYA5GE/u2he99h3f79h23hy9rK

Configure Grafana to Use the Slack Webhook

Use your own webhook URL in both of the following places.
Add it to the .env file as SLACK_WEBHOOK_URL, e.g.: SLACK_WEBHOOK_URL=https://hooks.slack.com/services/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uZj
Add it to alertmanager.yml as slack_api_url, e.g.: slack_api_url: 'https://hooks.slack.com/services/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Zj'

Generate Normal Traffic

Generate normal (non-drifted) traffic against the model service with the "test_normal.sh" script, which makes a series of predictions.

# Make the shell script executable and run it
chmod +x test_normal.sh

./test_normal.sh

Verify Metrics in Prometheus

On the Prometheus page, try the following queries:

model_drift_score
model_drift_score{method="ks_test"}
model_drift_score{method="psi"}

# Lists alerts that are currently firing (empty until an alert rule has triggered)
ALERTS{alertstate="firing"}

# Lists the DataDriftDetected alert while it is pending or firing
ALERTS{alertname="DataDriftDetected"}
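
The same queries can be run outside the web UI through the Prometheus HTTP API (GET /api/v1/query). The sketch below is one way to do that from Python with the standard library. The helper names are hypothetical, and the base URL assumes the default port used in this README.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build an instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample's value is [unix_timestamp, "value-as-string"]
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(expr, base_url=PROMETHEUS_URL):
    """Run an instant query against a running Prometheus server."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as response:
        return parse_instant_vector(json.load(response))
```

With the stack running, query('model_drift_score{method="psi"}') returns one (labels, value) pair per matching series.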

Verify on Grafana Dashboard as well.

Grafana Dashboard Includes the following

Drift Scores by Feature: Bar gauge showing current drift
Drift Score Over Time: Time series of drift evolution
Prediction Rate: Predictions per second
Drift Alerts: Count of alerts in the last hour
Total Predictions: Cumulative count of predictions
Feature Means: Statistical tracking
Feature Std Deviations: Variance monitoring
PSI Scores: Alternative drift metric
Prediction Latency: Performance monitoring

Alerting

Alerts are triggered when:

Data Drift Detected: Drift score > 0.3 for 1 minute
Critical Data Drift: Drift score > 0.5 for 30 seconds
Prediction Distribution Shift: Rate drops significantly
Model Service Down: Service unreachable for 1 minute
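
The "for 1 minute" and "for 30 seconds" conditions mean a threshold breach must persist before an alert fires: Prometheus keeps the alert pending while the condition holds and resets it if the condition clears. The sketch below simulates that behaviour in plain Python. It is an illustration only (the real rules live in the Prometheus rules file), and the 15-second evaluation interval and the sample scores are assumptions.

```python
def alert_states(samples, threshold, hold_seconds):
    """Classify each evaluation as inactive, pending, or firing.

    samples is a time-ordered list of (timestamp_seconds, drift_score).
    """
    states = []
    breach_start = None
    for timestamp, score in samples:
        if score > threshold:
            if breach_start is None:
                breach_start = timestamp  # first evaluation above the threshold
            held = timestamp - breach_start
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            breach_start = None  # the condition cleared, so the timer resets
            states.append("inactive")
    return states


# Hypothetical scores, evaluated every 15 seconds
scores = [0.10, 0.35, 0.40, 0.20, 0.35, 0.36, 0.40, 0.38, 0.37]
samples = [(i * 15, s) for i, s in enumerate(scores)]

# "Data Drift Detected": drift score > 0.3 for 1 minute
for (t, s), state in zip(samples, alert_states(samples, 0.3, 60)):
    print(f"t={t:>3}s score={s:.2f} {state}")
```

In this run the short breach at 15 to 30 seconds never fires because the score drops back below 0.3, while the second breach fires only once it has lasted a full minute.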

Alerts are sent to Slack with:

Alert name and severity
Drift score and threshold
Feature name
Recommended actions
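
In this project Alertmanager delivers the Slack messages, but it can be useful to test the webhook by hand. The sketch below builds a message with the fields listed above and posts it as the JSON body that Slack incoming webhooks accept. The function names, message layout, and example values are assumptions; the webhook URL is read from the SLACK_WEBHOOK_URL variable defined in .env.

```python
import json
import os
import urllib.request


def build_slack_payload(alert_name, severity, feature, score, threshold, action):
    """Build the JSON body for a Slack incoming webhook."""
    text = (
        f"[{severity.upper()}] {alert_name}\n"
        f"Feature: {feature}\n"
        f"Drift score: {score:.2f} (threshold {threshold:.2f})\n"
        f"Recommended action: {action}"
    )
    return {"text": text}


def send_to_slack(payload, webhook_url=None):
    """POST the payload to the webhook and return the HTTP status code."""
    url = webhook_url or os.environ["SLACK_WEBHOOK_URL"]
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Hypothetical alert values for a manual test message
payload = build_slack_payload(
    "DataDriftDetected", "warning", "age", 0.42, 0.30, "Review recent input data"
)
print(json.dumps(payload, indent=2))
```

Passing the payload to send_to_slack(payload) with SLACK_WEBHOOK_URL set posts the message to the channel chosen when the webhook was created.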

Full Cleanup

To do a total cleanup of everything, use the cleanup.sh script:

chmod +x cleanup.sh
./cleanup.sh

Please LIKE, COMMENT, and SUBSCRIBE !!!

iQuantC/ML_Drift-Detection_Monitoring_System
