Commit 469e655

Updated README for observability
1 parent a047e55 commit 469e655

1 file changed: +62 -21 lines changed

README.md

Lines changed: 62 additions & 21 deletions
@@ -29,9 +29,12 @@ Software that reduces inference costs through better batching, caching, and rout
2929
- **Cost Reduction**: Minimize inference costs through efficient resource use
3030

3131
### Observability Tools for AI Systems
32-
Monitoring, debugging, and cost tracking for LLM applications and training runs:
32+
Enterprise-grade monitoring, debugging, and cost tracking for LLM applications and training runs:
33+
- **Prometheus Integration**: Production-ready metrics export with 20+ GPU and cost metrics
34+
- **Grafana Dashboards**: Pre-built visual analytics for GPU clusters and cost optimization
35+
- **Real-time Alerting**: Automatic threshold monitoring and notification system
36+
- **Cost Tracking**: Detailed tracking of GPU hours, tokens, and operational costs with live dashboards
3337
- **Comprehensive Metrics**: Counters, gauges, and histograms for all operations
34-
- **Cost Tracking**: Detailed tracking of GPU hours, tokens, and operational costs
3538
- **Distributed Tracing**: Full request tracing across distributed systems
3639
- **Debug Utilities**: Multi-level logging with performance analysis
3740

@@ -270,6 +273,8 @@ scheduler.SubmitGPUWorkload(workload)
270273
|-----------|---------|--------|
271274
| GPU Scheduling | Optimized utilization | Up to 40% reduction in GPU idle time |
272275
| Real-time Metrics | Live GPU monitoring | Real-time utilization, temperature, power tracking |
276+
| **Prometheus Integration** | **Enterprise monitoring** | **Production-ready metrics export and alerting** |
277+
| **Grafana Dashboards** | **Visual analytics** | **Pre-built dashboards for GPU clusters and cost tracking** |
273278
| GPU Analytics | Performance insights | Efficiency scoring, trend analysis, cost optimization |
274279
| Kubernetes Integration | Native K8s scheduling | Seamless integration with existing clusters |
275280
| Request Batching | Improved throughput | 3-5x increase in requests/second |
@@ -296,40 +301,75 @@ agentaflow-sro-community/
296301

297302
## 🔧 Monitoring & Observability
298303

299-
AgentaFlow provides comprehensive monitoring through Prometheus/Grafana integration:
304+
AgentaFlow provides **enterprise-grade monitoring** through comprehensive Prometheus/Grafana integration with production-ready dashboards and alerting.
300305

301-
### Quick Start Monitoring
306+
### 🚀 Quick Start Monitoring
302307

303-
Run the Prometheus/Grafana demo:
308+
Run the complete Prometheus/Grafana integration demo:
304309

305310
```bash
306311
cd examples/demo/prometheus-grafana
307312
go run main.go
308313
```
309314

310-
Access monitoring:
311-
- **Prometheus Metrics**: http://localhost:8080/metrics
312-
- **Grafana Dashboard**: Deploy with `kubectl apply -f examples/k8s/monitoring/`
315+
This starts:
316+
- **Prometheus metrics server** on http://localhost:8080/metrics
317+
- **Real-time GPU monitoring** with automatic export
318+
- **Cost tracking** with live calculations
319+
- **Performance analytics** and efficiency scoring
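
To make the `/metrics` endpoint less of a black box, here is a minimal sketch of how gauge values can be served in the Prometheus text exposition format using only the Go standard library. The `gauge` type, `renderGauges`, and `metricsHandler` are hypothetical names invented for illustration; the demo itself may well use the official Prometheus Go client instead, and the sample value is made up.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// gauge is a hypothetical in-memory sample, not an AgentaFlow type.
type gauge struct {
	Name  string
	Help  string
	Value float64
}

// renderGauges formats samples in the Prometheus text exposition format:
// a HELP line, a TYPE line, then "name value" for each gauge.
func renderGauges(gs []gauge) string {
	var b strings.Builder
	for _, g := range gs {
		fmt.Fprintf(&b, "# HELP %s %s\n", g.Name, g.Help)
		fmt.Fprintf(&b, "# TYPE %s gauge\n", g.Name)
		fmt.Fprintf(&b, "%s %g\n", g.Name, g.Value)
	}
	return b.String()
}

// metricsHandler returns an HTTP handler that serves the rendered gauges.
func metricsHandler(gs []gauge) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderGauges(gs))
	}
}

func main() {
	// Illustrative value only; a real exporter would read it from the GPU.
	samples := []gauge{
		{Name: "agentaflow_gpu_utilization_percent", Help: "Real-time GPU utilization", Value: 73.5},
	}
	// Print what a scrape would return instead of blocking on a server.
	fmt.Print(renderGauges(samples))
}
```

To actually serve it, one could register `metricsHandler(samples)` on `/metrics` with `http.Handle` and call `http.ListenAndServe(":8080", nil)`; the port mirrors the demo URL above.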
313320

314-
### Available Metrics
321+
### 📊 Production Deployment
315322

316-
- **GPU Metrics**: Utilization, temperature, memory, power consumption
317-
- **Cost Tracking**: Real-time cost calculation with cloud pricing integration
318-
- **Workload Metrics**: Job scheduling, queue depth, completion rates
319-
- **System Health**: Component status, alerts, and performance indicators
320-
321-
### Kubernetes Deployment
323+
Deploy the full monitoring stack to Kubernetes:
322324

323325
```bash
324-
# Deploy monitoring stack
326+
# Deploy Prometheus and Grafana
325327
kubectl apply -f examples/k8s/monitoring/prometheus.yaml
326328
kubectl apply -f examples/k8s/monitoring/grafana.yaml
327329

328-
# Access Grafana (admin/agentaflow123)
330+
# Access Grafana dashboard (demo login: admin/agentaflow123; change it before any real deployment)
329331
kubectl port-forward svc/grafana-service 3000:3000 -n agentaflow-monitoring
332+
333+
# View Prometheus metrics
334+
kubectl port-forward svc/prometheus-service 9090:9090 -n agentaflow-monitoring
330335
```
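
Once a metrics endpoint is reachable, a quick sanity check is to scrape it and pick out the AgentaFlow series. The sketch below parses a scrape body that is assumed to use the Prometheus text format; `parseSamples` is a hypothetical helper, and it deliberately handles only simple `name value` lines (a labeled series keeps its `{...}` suffix as part of the key).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples returns every sample whose name starts with prefix.
// Comment lines (# HELP, # TYPE) and unparsable lines are skipped.
func parseSamples(body, prefix string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || !strings.HasPrefix(fields[0], prefix) {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue // not a numeric sample; ignore it
		}
		out[fields[0]] = v
	}
	return out
}

func main() {
	// Made-up scrape body for illustration; real output will differ.
	body := "# TYPE agentaflow_workloads_pending gauge\n" +
		"agentaflow_workloads_pending 4\n" +
		"go_goroutines 12\n"
	for name, value := range parseSamples(body, "agentaflow_") {
		fmt.Println(name, value)
	}
}
```

In practice the body would come from an `http.Get` against the metrics URL, read with `io.ReadAll`, rather than from a string literal.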
331336

332-
For complete monitoring setup, see [examples/demo/PROMETHEUS_GRAFANA_DEMO.md](examples/demo/PROMETHEUS_GRAFANA_DEMO.md)
337+
### 🎯 Available Metrics & Dashboards
338+
339+
**GPU Performance Metrics:**
340+
- `agentaflow_gpu_utilization_percent` - Real-time GPU utilization
341+
- `agentaflow_gpu_memory_used_bytes` - Memory consumption tracking
342+
- `agentaflow_gpu_temperature_celsius` - Thermal monitoring
343+
- `agentaflow_gpu_power_draw_watts` - Power consumption tracking
344+
- `agentaflow_gpu_fan_speed_percent` - Cooling system status
345+
346+
**Cost & Efficiency Analytics:**
347+
- `agentaflow_cost_total_dollars` - Real-time cost tracking
348+
- `agentaflow_gpu_efficiency_score` - Efficiency scoring (0-100)
349+
- `agentaflow_gpu_idle_time_percent` - Resource waste tracking
350+
- `agentaflow_cost_per_hour` - Live hourly cost calculation
351+
352+
**Workload & Scheduling Metrics:**
353+
- `agentaflow_workloads_pending` - Job queue depth
354+
- `agentaflow_workloads_completed_total` - Completion tracking
355+
- `agentaflow_scheduler_decisions_total` - Scheduling decisions
356+
- `agentaflow_gpu_assignments_total` - Resource assignments
357+
358+
**System Health & Alerts:**
359+
- Component status monitoring
360+
- Automatic threshold alerts
361+
- Performance trend analysis
362+
- Resource utilization forecasting
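
The README does not describe how threshold alerts are evaluated, so the following is only a sketch of the general idea: compare the latest value of each metric against a configured limit. The `rule` type, the `evaluate` function, and the limits (85 and 50) are assumptions chosen for illustration, not AgentaFlow defaults.

```go
package main

import "fmt"

// rule is a hypothetical threshold rule: alert when Metric exceeds Max.
type rule struct {
	Metric string
	Max    float64
}

// evaluate returns one message per rule whose metric is above its limit.
// Metrics missing from samples are skipped rather than treated as zero.
func evaluate(samples map[string]float64, rules []rule) []string {
	var alerts []string
	for _, r := range rules {
		v, ok := samples[r.Metric]
		if !ok {
			continue
		}
		if v > r.Max {
			alerts = append(alerts, fmt.Sprintf("%s=%g exceeds %g", r.Metric, v, r.Max))
		}
	}
	return alerts
}

func main() {
	// Illustrative readings; real values would come from a scrape.
	samples := map[string]float64{
		"agentaflow_gpu_temperature_celsius": 91,
		"agentaflow_gpu_idle_time_percent":   12,
	}
	rules := []rule{
		{Metric: "agentaflow_gpu_temperature_celsius", Max: 85},
		{Metric: "agentaflow_gpu_idle_time_percent", Max: 50},
	}
	for _, a := range evaluate(samples, rules) {
		fmt.Println(a)
	}
}
```

In a Prometheus deployment this kind of check usually lives in alerting rules on the server rather than in application code; the Go version is just the same comparison spelled out.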
363+
364+
### 📈 Pre-built Grafana Dashboards
365+
366+
The integration includes production-ready dashboards:
367+
- **GPU Cluster Overview** - Multi-node GPU monitoring
368+
- **Cost Analysis Dashboard** - Real-time cost tracking and forecasting
369+
- **Performance Analytics** - Efficiency scoring and optimization insights
370+
- **Alert Management** - Threshold monitoring and notifications
371+
372+
For the complete setup guide and advanced configuration, see [examples/demo/PROMETHEUS_GRAFANA_DEMO.md](examples/demo/PROMETHEUS_GRAFANA_DEMO.md)
333373

334374
## 📖 Documentation
335375

@@ -370,9 +410,10 @@ Contributions are welcome! This is a community edition focused on providing acce
370410

371411
- ✅ Kubernetes integration for GPU scheduling
372412
- ✅ Real-time GPU metrics collection
373-
- Prometheus/Grafana integration
374-
- Web dashboard for monitoring
375-
- OpenTelemetry support for tracing
413+
- ✅ **Prometheus/Grafana integration** - Complete monitoring stack with dashboards
414+
- ✅ **Production-ready observability** - Enterprise-grade metrics export and visualization
415+
- 🔄 Web dashboard for monitoring
416+
- 🔄 OpenTelemetry support for tracing
376417

377418
## 🚀 Enterprise Edition (Coming Soon)
378419
