This is a practical, end-to-end MLOps project that detects data and concept drift, exports drift metrics to Prometheus, visualizes them in Grafana, and sends alerts through Alertmanager to Slack.
- Model Service (FastAPI): Makes predictions, detects drift, and exposes a "/metrics" endpoint.
- Prometheus: Collects and stores metrics, and evaluates alert rules.
- Grafana: Visualizes metrics on interactive dashboards.
- Real-time Drift Detection: Statistical methods: the Kolmogorov-Smirnov (KS) test for numerical data and the Population Stability Index (PSI) for categorical data.
- Alertmanager: Sends alerts to Slack when drift is detected.
- Slack: Messaging platform for receiving alerts (on phone, laptop, etc.) when drift exceeds threshold.
- Docker Compose: For easy deployment of Docker Containers for the Services.
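The two statistical checks named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's drift_detector.py: the function names `ks_statistic` and `psi` and the `eps` smoothing value are hypothetical, and a real service would more likely call a library routine such as `scipy.stats.ks_2samp` for the KS test.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def psi(reference, current, eps=1e-6):
    """PSI over category labels: sum((cur% - ref%) * ln(cur% / ref%))."""
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        # eps (an assumed value) avoids log(0) when a category is missing on one side.
        p_ref = max(ref_counts[category] / len(reference), eps)
        p_cur = max(cur_counts[category] / len(current), eps)
        total += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return total


if __name__ == "__main__":
    print("KS :", ks_statistic([1.0, 2.0, 3.0, 4.0], [2.5, 3.5, 4.5, 5.5]))
    print("PSI:", psi(["a"] * 9 + ["b"], ["a"] * 5 + ["b"] * 5))
```

Both functions return 0 when the reference and current samples have the same distribution and grow as the distributions diverge, which is what makes them usable as a drift score compared against a threshold.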
Create the necessary files and directories in the project root directory:
mkdir -p drift-monitoring/{prometheus,grafana/provisioning/{datasources,dashboards},model-service,alertmanager}
cd drift-monitoring
- app.py: Main API with endpoints
- drift_detector.py: KS Test & PSI algorithms
- data_generator.py: Synthetic data simulation
- Real-time drift detection
Model Service endpoints:
- GET / - Service status
- POST /predict - Make prediction
- GET /drift/status - Current drift status
- POST /simulate/drift - Simulate drift for testing
- GET /metrics - Prometheus metrics
- GET /health - Health check
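To make the GET /metrics endpoint more concrete, here is a rough sketch of the Prometheus text exposition format that such an endpoint returns. It is an illustration only: the helper `render_drift_metrics` and the `feature` label are assumptions, and the actual service would normally let a client library such as `prometheus_client` produce this output.

```python
def render_drift_metrics(scores):
    """Render drift scores in the Prometheus text exposition format.

    `scores` maps (feature, method) tuples to a numeric drift score.
    Label values are assumed to be simple identifiers, so no escaping is done.
    """
    lines = [
        "# HELP model_drift_score Drift score per feature and detection method",
        "# TYPE model_drift_score gauge",
    ]
    for (feature, method), value in sorted(scores.items()):
        lines.append(
            f'model_drift_score{{feature="{feature}",method="{method}"}} {value}'
        )
    # The exposition format expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_drift_metrics({("age", "ks_test"): 0.12, ("city", "psi"): 0.4}))
```

Prometheus scrapes this text on each scrape interval, which is what lets a query filter on a label such as `method="ks_test"`.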
touch .env docker-compose.yml
# Build and start all services
docker-compose up -d --build
# Wait for all services to be ready (30 seconds)
sleep 30
# Check running docker-compose processes
docker-compose ps
# Check logs
docker-compose logs -f
# Check logs of individual service (model-service, prometheus, grafana, alertmanager)
docker-compose logs -f model-service
# Stop the services and remove their containers, networks, and volumes
docker-compose down -v
# Remove everything, including images
docker-compose down -v --rmi all
# Check model service
curl http://localhost:8000/health
# Check Prometheus
curl http://localhost:9090/-/healthy
# Check Grafana
curl http://localhost:3000/api/health
# Model Service API
http://localhost:8000
# Prometheus
http://localhost:9090
# Grafana (username: admin, password: admin)
http://localhost:3000
# Alertmanager
http://localhost:9093
Install Slack, create an account, and sign in.
sudo snap install slack --classic
- Go to your Slack workspace. Click File > Settings & Administration > Manage apps.
- In the Search bar on the top right, search and open "Incoming Webhooks".
- Click "Add to Slack".
- Scroll down to "Post to Channel" and select the channel where you want to post alerts (or Create a New Channel).
- Click "Add Incoming Webhooks Integration".
- Copy and save the Webhook URL, e.g.: https://hooks.slack.com/services/T09NYD7D30R/B09NHJYA5GE/u2he99h3f79h23hy9rK
- Add SLACK_WEBHOOK_URL=<your Webhook URL> to the .env file.
- Set slack_api_url: '<your Webhook URL>' in alertmanager.yml, using the same URL.
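Before wiring the webhook into Alertmanager, it can help to confirm that the URL works on its own. The sketch below posts a plain text message to the webhook using only the Python standard library. The helper name `build_request` and the message text are illustrative assumptions; the script reads the URL from the SLACK_WEBHOOK_URL environment variable and sends nothing if it is unset.

```python
import json
import os
import urllib.request


def build_request(webhook_url, text):
    """Build a POST request carrying a simple Slack message payload."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        print("SLACK_WEBHOOK_URL is not set; nothing sent")
    else:
        request = build_request(url, "Test message from drift-monitoring")
        with urllib.request.urlopen(request, timeout=10) as response:
            # Print the HTTP status and body returned by the webhook.
            print(response.status, response.read().decode("utf-8"))
```

If the message shows up in the selected channel, the same URL should work in the .env file and in alertmanager.yml.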
Next, generate normal traffic against the ML model by running the "test_normal.sh" script, which makes some predictions.
# Make the shell script executable and run it
chmod +x test_normal.sh
./test_normal.sh
On the Prometheus page, try the following queries:
model_drift_score
model_drift_score{method="ks_test"}
model_drift_score{method="psi"}
# Should show firing alerts
ALERTS{alertstate="firing"}
# Should show specific drift alerts
ALERTS{alertname="DataDriftDetected"}
Verify the same on the Grafana dashboard, which has the following panels:
- Drift Scores by Features: Bar gauge showing current drift
- Drift Score Over Time: Time series of drift evolution
- Prediction Rate: Predictions per second
- Drift Alerts: Count of alerts in the last hour
- Total Predictions: Cumulative count of predictions
- Feature Means: Statistical tracking
- Feature Std Deviations: Variance monitoring
- PSI Scores: Alternative drift metric
- Prediction Latency: Performance monitoring
Alerts are triggered when:
- Data Drift Detected: Drift score > 0.3 for 1 minute
- Critical Data Drift: Drift score > 0.5 for 30 seconds
- Prediction Distribution Shift: Rate drops significantly
- Model Service Down: Service unreachable for 1 minute
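The "for 1 minute" and "for 30 seconds" conditions mean an alert fires only after the score has stayed above the threshold continuously for that long. A brief dip resets the timer. The sketch below imitates that behavior in Python as a simplified model of a Prometheus rule with a `for` clause. The function name `alert_states` and the sample data are assumptions, and the real evaluation happens inside Prometheus at its rule evaluation interval.

```python
def alert_states(samples, threshold, hold_seconds):
    """Classify each sample as 'inactive', 'pending', or 'firing'.

    `samples` is a time-ordered list of (timestamp_seconds, drift_score).
    """
    states = []
    breach_start = None  # timestamp at which the current breach began
    for timestamp, score in samples:
        if score > threshold:
            if breach_start is None:
                breach_start = timestamp
            held_for = timestamp - breach_start
            states.append("firing" if held_for >= hold_seconds else "pending")
        else:
            # Any sample at or below the threshold resets the timer.
            breach_start = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical scores sampled every 15 seconds.
    scores = [0.2, 0.35, 0.4, 0.1, 0.4, 0.4, 0.4, 0.4, 0.4]
    samples = [(i * 15, s) for i, s in enumerate(scores)]
    for (ts, score), state in zip(samples, alert_states(samples, 0.3, 60)):
        print(ts, score, state)
```

The hold period trades detection speed for fewer false alarms, which is presumably why the critical rule pairs a higher threshold with a shorter hold time.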
Alerts are sent to Slack with:
- Alert name and severity
- Drift score and threshold
- Feature name
- Recommended actions
To clean up everything, run the cleanup.sh script:
chmod +x cleanup.sh
./cleanup.sh