| 1 | +# AgentaFlow Prometheus/Grafana Integration Demo |
| 2 | + |
| 3 | +This demo shows the Prometheus and Grafana integration for AgentaFlow SRO Community Edition: it exports GPU infrastructure metrics to Prometheus and visualizes them in Grafana. |
| 4 | + |
| 5 | +## 🎯 Overview |
| 6 | + |
| 7 | +The demo covers: |
| 8 | +- Real-time GPU metrics collection and export to Prometheus |
| 9 | +- Comprehensive Grafana dashboards for visualization |
| 10 | +- Kubernetes-native monitoring stack deployment |
| 11 | +- Cost tracking and efficiency analytics |
| 12 | +- Alert management and health monitoring |
| 13 | + |
| 14 | +## 🚀 Quick Start |
| 15 | + |
| 16 | +### 1. Run the Demo Application |
| 17 | + |
| 18 | +```bash |
| 19 | +# From the project root |
| 20 | +cd examples/demo/prometheus-grafana |
| 21 | +go run main.go |
| 22 | +``` |
| 23 | + |
| 24 | +The demo will start and display: |
| 25 | +``` |
| 26 | +🚀 AgentaFlow Prometheus/Grafana Integration Demo |
| 27 | +=============================================== |
| 28 | +📊 Registering Prometheus metrics... |
| 29 | +🔧 Starting services... |
| 30 | +🌐 Starting Prometheus metrics server on :8080... |
| 31 | +✅ All services started successfully! |
| 32 | + |
| 33 | +🎯 Integration Points: |
| 34 | + • Prometheus metrics: http://localhost:8080/metrics |
| 35 | + • Health endpoint: http://localhost:8080/health |
| 36 | + |
| 37 | +📊 Available Metrics: |
| 38 | + • agentaflow_gpu_utilization_percent |
| 39 | + • agentaflow_gpu_temperature_celsius |
| 40 | + • agentaflow_gpu_memory_used_bytes |
| 41 | + • agentaflow_gpu_health_status |
| 42 | + • agentaflow_workloads_pending |
| 43 | + • agentaflow_cost_total_dollars |
| 44 | + • agentaflow_gpu_efficiency_score |
| 45 | +``` |
| 46 | + |
| 47 | +### 2. Deploy Monitoring Stack (Kubernetes) |
| 48 | + |
| 49 | +Deploy Prometheus: |
| 50 | +```bash |
| 51 | +kubectl apply -f ../k8s/monitoring/prometheus.yaml |
| 52 | +``` |
| 53 | + |
| 54 | +Deploy Grafana: |
| 55 | +```bash |
| 56 | +kubectl apply -f ../k8s/monitoring/grafana.yaml |
| 57 | +``` |
| 58 | + |
| 59 | +### 3. Access Grafana Dashboard |
| 60 | + |
| 61 | +Port-forward Grafana service: |
| 62 | +```bash |
| 63 | +kubectl port-forward svc/grafana-service 3000:3000 -n agentaflow-monitoring |
| 64 | +``` |
| 65 | + |
| 66 | +Open http://localhost:3000 in your browser: |
| 67 | +- **Username**: `admin` |
| 68 | +- **Password**: `agentaflow123` |
| 69 | + |
| 70 | +### 4. Import Dashboard |
| 71 | + |
| 72 | +1. Go to **Dashboards** > **Import** |
| 73 | +2. Upload `../monitoring/grafana-dashboard.json` |
| 74 | +3. Select Prometheus data source |
| 75 | +4. Click **Import** |
| 76 | + |
| 77 | +## 📊 Metrics Overview |
| 78 | + |
| 79 | +### GPU Metrics |
| 80 | +- **Utilization**: `agentaflow_gpu_utilization_percent` |
| 81 | +- **Temperature**: `agentaflow_gpu_temperature_celsius` |
| 82 | +- **Memory**: `agentaflow_gpu_memory_used_bytes`, `agentaflow_gpu_memory_total_bytes` |
| 83 | +- **Power**: `agentaflow_gpu_power_draw_watts`, `agentaflow_gpu_power_limit_watts` |
| 84 | +- **Clock Speeds**: `agentaflow_gpu_clock_graphics_mhz`, `agentaflow_gpu_clock_memory_mhz` |
| 85 | +- **Health**: `agentaflow_gpu_health_status` |
| 86 | +- **Efficiency**: `agentaflow_gpu_efficiency_score` |
| 87 | + |
| 88 | +### Workload Metrics |
| 89 | +- **Pending Jobs**: `agentaflow_workloads_pending` |
| 90 | +- **Running Jobs**: `agentaflow_workloads_running` |
| 91 | +- **Completed Jobs**: `agentaflow_workloads_completed` |
| 92 | +- **Scheduling Duration**: `agentaflow_scheduling_duration_seconds` |
| 93 | +- **Allocation Efficiency**: `agentaflow_gpu_allocation_efficiency` |
| 94 | + |
| 95 | +### Cost Metrics |
| 96 | +- **Total Cost**: `agentaflow_cost_total_dollars` |
| 97 | +- **Hourly Rates**: `agentaflow_cost_per_hour_dollars` |
| 98 | +- **GPU Hours**: `agentaflow_gpu_hours_consumed` |
| 99 | +- **Monthly Estimates**: `agentaflow_estimated_monthly_cost_dollars` |
| 100 | + |
| 101 | +### System Metrics |
| 102 | +- **Cluster Utilization**: `agentaflow_cluster_utilization_percent` |
| 103 | +- **GPU Availability**: `agentaflow_gpus_available`, `agentaflow_gpus_total` |
| 104 | +- **Component Health**: `agentaflow_component_health_status` |
| 105 | +- **Uptime**: `agentaflow_system_uptime_seconds` |
| 106 | +- **Active Alerts**: `agentaflow_active_alerts` |
| 107 | + |
| 108 | +## 🔧 Configuration |
| 109 | + |
| 110 | +### Prometheus Configuration |
| 111 | +The demo uses these key settings: |
| 112 | +```go |
| 113 | +prometheusConfig := observability.PrometheusConfig{ |
| 114 | + MetricsPrefix: "agentaflow", |
| 115 | + EnabledMetrics: map[string]bool{ |
| 116 | + "gpu_metrics": true, |
| 117 | + "scheduling_metrics": true, |
| 118 | + "serving_metrics": true, |
| 119 | + "cost_metrics": true, |
| 120 | + "system_metrics": true, |
| 121 | + }, |
| 122 | + MetricLabels: map[string]string{ |
| 123 | + "instance": "demo", |
| 124 | + "version": "community", |
| 125 | + }, |
| 126 | +} |
| 127 | +``` |
| 128 | + |
| 129 | +### Alert Thresholds |
| 130 | +```go |
| 131 | +customThresholds := observability.GPUAlertThresholds{ |
| 132 | + HighTemperature: 70.0, |
| 133 | + CriticalTemperature: 85.0, |
| 134 | + HighMemoryUsage: 80.0, |
| 135 | + CriticalMemoryUsage: 95.0, |
| 136 | + HighPowerUsage: 85.0, |
| 137 | + CriticalPowerUsage: 95.0, |
| 138 | + LowUtilization: 15.0, |
| 139 | + HighUtilization: 90.0, |
| 140 | +} |
| 141 | +``` |
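To illustrate how a warning/critical threshold pair like the ones above might be applied, here is a minimal sketch. The `Thresholds` type, the `classifyTemperature` function, and the severity labels are hypothetical stand-ins for illustration; they are not the `observability` package's API, and the actual comparison rules may differ.

```go
package main

import "fmt"

// Thresholds is a hypothetical local type that mirrors only the two
// temperature fields of the configuration shown above.
type Thresholds struct {
	HighTemperature     float64
	CriticalTemperature float64
}

// classifyTemperature returns a severity label for a reading in °C.
// The labels and the ">=" comparison are assumptions made for this sketch.
func classifyTemperature(celsius float64, th Thresholds) string {
	switch {
	case celsius >= th.CriticalTemperature:
		return "critical"
	case celsius >= th.HighTemperature:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	th := Thresholds{HighTemperature: 70.0, CriticalTemperature: 85.0}
	for _, reading := range []float64{55, 72, 90} {
		fmt.Printf("%.0f°C -> %s\n", reading, classifyTemperature(reading, th))
	}
}
```

The same two-level pattern would apply to the memory and power thresholds, which also come in high/critical pairs.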
| 142 | + |
| 143 | +### Cost Configuration |
| 144 | +```go |
| 145 | +awsCostConfig := observability.GPUCostConfiguration{ |
| 146 | + CostPerHour: map[string]float64{ |
| 147 | + "a100": 3.06, // Placeholder rate; AWS offers p4d only as p4d.24xlarge (8x A100) |
| 148 | + "v100": 3.06, // AWS p3.2xlarge |
| 149 | + "t4": 0.526, // AWS g4dn.xlarge |
| 150 | + "rtx": 1.20, // Custom RTX pricing |
| 151 | + "generic": 1.50, // Default |
| 152 | + }, |
| 153 | + UseUtilizationFactor: true, |
| 154 | + MinUtilizationFactor: 0.15, |
| 155 | + IdleCostReduction: 0.20, |
| 156 | + CloudProvider: "aws", |
| 157 | + Region: "us-west-2", |
| 158 | + SpotInstanceDiscount: 0.60, |
| 159 | +} |
| 160 | +``` |
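The configuration above does not spell out how the utilization factor enters the cost calculation. The sketch below shows one plausible interpretation, stated as an assumption: the hourly rate is scaled by utilization, with `MinUtilizationFactor` acting as a floor so that an idle GPU is never billed at zero. The `hourlyRate` and `effectiveCost` helpers are hypothetical and not part of the `observability` package.

```go
package main

import "fmt"

// hourlyRate looks up the per-hour price for a GPU type and falls back to
// the "generic" entry for unknown types. The fallback is an assumption.
func hourlyRate(rates map[string]float64, gpuType string) float64 {
	if r, ok := rates[gpuType]; ok {
		return r
	}
	return rates["generic"]
}

// effectiveCost estimates the cost of one GPU over `hours` at the given
// utilization (0.0 to 1.0). Utilization is clamped to [minFactor, 1.0];
// this clamping rule is an assumed reading of MinUtilizationFactor.
func effectiveCost(rate, hours, utilization, minFactor float64) float64 {
	factor := utilization
	if factor < minFactor {
		factor = minFactor
	}
	if factor > 1 {
		factor = 1
	}
	return rate * hours * factor
}

func main() {
	rates := map[string]float64{"v100": 3.06, "t4": 0.526, "generic": 1.50}
	rate := hourlyRate(rates, "t4")
	// A nearly idle GPU (5% utilization) is billed at the 15% floor.
	fmt.Printf("t4, 10h at 5%% utilization: $%.3f\n", effectiveCost(rate, 10, 0.05, 0.15))
}
```

Settings such as `IdleCostReduction` and `SpotInstanceDiscount` are deliberately left out here, since their exact semantics are not documented in this README.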
| 161 | + |
| 162 | +## 📈 Dashboard Panels |
| 163 | + |
| 164 | +The Grafana dashboard includes eight panels: |
| 165 | + |
| 166 | +1. **GPU Utilization** - Real-time utilization across all GPUs |
| 167 | +2. **Temperature Monitoring** - Temperature trends with thresholds |
| 168 | +3. **Memory Usage** - Memory utilization and availability |
| 169 | +4. **Power Consumption** - Power draw vs limits |
| 170 | +5. **Workload Distribution** - Job scheduling and distribution |
| 171 | +6. **Cost Tracking** - Real-time cost analysis |
| 172 | +7. **System Efficiency** - Performance and efficiency metrics |
| 173 | +8. **Health Status** - Overall system health indicators |
| 174 | + |
| 175 | +## 🎮 Interactive Demo Features |
| 176 | + |
| 177 | +### Real-time Metrics Generation |
| 178 | +The demo generates realistic metrics that simulate: |
| 179 | +- **GPU utilization patterns**: Waves simulating training and inference workloads |
| 180 | +- **Temperature correlation**: Temperature increases with utilization |
| 181 | +- **Memory patterns**: Dynamic memory allocation and release |
| 182 | +- **Cost calculations**: Real-time cost tracking with utilization factors |
| 183 | +- **Alert scenarios**: Periodic alerts to demonstrate monitoring |
| 184 | + |
| 185 | +### Endpoints Available |
| 186 | +- **Metrics**: http://localhost:8080/metrics - Prometheus format metrics |
| 187 | +- **Health**: http://localhost:8080/health - Service health check |
| 188 | +- **Debug**: Live debugging output is printed to the terminal (not an HTTP endpoint) |
| 189 | + |
| 190 | +### Metric Patterns |
| 191 | +- **Utilization**: Sine wave pattern (20-80%) simulating workload cycles |
| 192 | +- **Temperature**: Correlated with utilization (35-80°C) |
| 193 | +- **Memory**: Independent allocation patterns per GPU |
| 194 | +- **Workloads**: Periodic job completion and queuing |
| 195 | +- **Costs**: Dynamic cost calculation with AWS pricing |
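The utilization and temperature patterns listed above can be sketched as follows. This is an illustration of the described ranges, not the demo's actual generator: the function names, the wave period, and the linear utilization-to-temperature mapping are all assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// simulatedUtilization returns a utilization percentage in [20, 80] that
// follows a sine wave with the given period (centre 50, amplitude 30).
func simulatedUtilization(elapsedSeconds, periodSeconds float64) float64 {
	return 50 + 30*math.Sin(2*math.Pi*elapsedSeconds/periodSeconds)
}

// simulatedTemperature maps utilization in [20, 80] linearly onto
// [35, 80] °C. The linear correlation is an assumption for this sketch.
func simulatedTemperature(utilization float64) float64 {
	return 35 + (utilization-20)/60*45
}

func main() {
	const period = 60.0 // hypothetical wave period in seconds
	for _, t := range []float64{0, 15, 30, 45} {
		u := simulatedUtilization(t, period)
		fmt.Printf("t=%2.0fs utilization=%.1f%% temperature=%.1f°C\n",
			t, u, simulatedTemperature(u))
	}
}
```

A real generator would typically add per-GPU phase offsets and some noise so that the panels do not show identical curves for every device.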
| 196 | + |
| 197 | +## 🔍 Troubleshooting |
| 198 | + |
| 199 | +### Common Issues |
| 200 | + |
| 201 | +**Metrics not appearing in Prometheus** |
| 202 | +```bash |
| 203 | +# Check if demo is running |
| 204 | +curl http://localhost:8080/metrics |
| 205 | + |
| 206 | +# Check Prometheus logs for config or scrape errors (replace xxx with the actual pod suffix) |
| 207 | +kubectl logs -n agentaflow-monitoring prometheus-deployment-xxx |
| 208 | +``` |
| 209 | + |
| 210 | +**Grafana dashboard not loading** |
| 211 | +```bash |
| 212 | +# Check Grafana logs |
| 213 | +kubectl logs -n agentaflow-monitoring grafana-deployment-xxx |
| 214 | + |
| 215 | +# Verify data source connection |
| 216 | +# Go to Configuration > Data Sources in Grafana UI |
| 217 | +``` |
| 218 | + |
| 219 | +**Kubernetes deployment issues** |
| 220 | +```bash |
| 221 | +# Check namespace |
| 222 | +kubectl get namespace agentaflow-monitoring |
| 223 | + |
| 224 | +# Check all resources |
| 225 | +kubectl get all -n agentaflow-monitoring |
| 226 | + |
| 227 | +# Check ConfigMaps |
| 228 | +kubectl get configmaps -n agentaflow-monitoring |
| 229 | +``` |
| 230 | + |
| 231 | +### Debug Mode |
| 232 | +Enable verbose logging: |
| 233 | +```go |
| 234 | +// Add to main function. Assumes log is a leveled logger such as logrus; the standard library log package has no SetLevel. |
| 235 | +log.SetLevel(log.DebugLevel) |
| 236 | +``` |
| 237 | + |
| 238 | +## 🏗️ Architecture |
| 239 | + |
| 240 | +``` |
| 241 | +┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ |
| 242 | +│ AgentaFlow │───▶│ Prometheus │───▶│ Grafana │ |
| 243 | +│ GPU Metrics │ │ Exporter │ │ Dashboard │ |
| 244 | +│ Collector │ │ (:8080/metrics) │ │ (localhost:3000)│ |
| 245 | +└─────────────────┘ └──────────────────┘ └─────────────────┘ |
| 246 | + │ │ │ |
| 247 | + ▼ ▼ ▼ |
| 248 | +┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ |
| 249 | +│ GPU Integration │ │ Metrics Storage │ │ Visualization │ |
| 250 | +│ Service │ │ & Alerting │ │ & Analysis │ |
| 251 | +└─────────────────┘ └──────────────────┘ └─────────────────┘ |
| 252 | +``` |
| 253 | + |
| 254 | +## 🧪 Testing |
| 255 | + |
| 256 | +Run integration tests: |
| 257 | +```bash |
| 258 | +# Test Prometheus metrics endpoint |
| 259 | +curl -s http://localhost:8080/metrics | grep agentaflow |
| 260 | + |
| 261 | +# Test health endpoint |
| 262 | +curl -s http://localhost:8080/health |
| 263 | + |
| 264 | +# Test Grafana API |
| 265 | +curl -s -u admin:agentaflow123 http://localhost:3000/api/health |
| 266 | +``` |
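The `grep` check above can also be automated. The sketch below scans a Prometheus text-format response and reports which expected metric names are absent. `missingMetrics` is a hypothetical helper, and the sample body (including the `gpu_id` label) is made up for illustration; in practice the body would be fetched from http://localhost:8080/metrics with `net/http`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics scans Prometheus text-format output and returns the names
// in want that never appear as a sample line. Comment lines (# HELP, # TYPE)
// and blank lines are skipped.
func missingMetrics(body string, want []string) []string {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		seen[line[:end]] = true
	}
	var missing []string
	for _, name := range want {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Hypothetical sample response; label names and values are illustrative.
	sample := `# HELP agentaflow_gpu_utilization_percent GPU utilization
# TYPE agentaflow_gpu_utilization_percent gauge
agentaflow_gpu_utilization_percent{gpu_id="0"} 42.5
agentaflow_cost_total_dollars 1.25
`
	want := []string{
		"agentaflow_gpu_utilization_percent",
		"agentaflow_cost_total_dollars",
		"agentaflow_workloads_pending",
	}
	fmt.Println("missing:", missingMetrics(sample, want))
}
```

A check like this fits naturally into a CI smoke test that runs after the demo starts.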
| 267 | + |
| 268 | +## 📚 Next Steps |
| 269 | + |
| 270 | +1. **Production Setup**: Adapt configurations for production environments |
| 271 | +2. **Custom Dashboards**: Create additional dashboards for specific use cases |
| 272 | +3. **Alert Rules**: Implement custom Prometheus alerting rules |
| 273 | +4. **Scaling**: Configure for multi-cluster monitoring |
| 274 | +5. **Integration**: Connect with existing monitoring infrastructure |
| 275 | + |
| 276 | +For production deployment, see the main [README.md](../../README.md) for complete setup instructions. |