Commit 3c42d24

Merge pull request #14 from Finoptimize/prometheus-grafana
Adding observability
2 parents 167f671 + 071d555 commit 3c42d24

12 files changed

+2809
-11
lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -36,4 +36,5 @@ go.work.sum
 examples/gpu_scheduling
 examples/model_serving
 examples/observability
-PROJECT.md
+PROJECT.md
+agentaflow-sro-community.code-workspace

README.md

Lines changed: 78 additions & 2 deletions
@@ -149,6 +149,44 @@ fmt.Printf("GPU efficiency: %.1f%% idle time, %.3f power efficiency\n",
     efficiency["idle_time_percent"], efficiency["avg_power_efficiency"])
 ```
 
+### Prometheus/Grafana Integration
+
+```go
+import "github.com/Finoptimize/agentaflow-sro-community/pkg/observability"
+
+// Create Prometheus exporter
+prometheusConfig := observability.PrometheusConfig{
+    MetricsPrefix: "agentaflow",
+    EnabledMetrics: map[string]bool{
+        "gpu_metrics":        true,
+        "scheduling_metrics": true,
+        "serving_metrics":    true,
+        "cost_metrics":       true,
+        "system_metrics":     true,
+    },
+}
+exporter := observability.NewPrometheusExporter(monitoringService, prometheusConfig)
+
+// Register GPU, cost, and scheduling metrics for export
+exporter.RegisterGPUMetrics()
+exporter.RegisterCostMetrics()
+exporter.RegisterSchedulingMetrics()
+
+// Start metrics server for Prometheus scraping
+go exporter.StartMetricsServer(8080)
+
+// Enable GPU integration with Prometheus export
+integration.SetPrometheusExporter(exporter)
+integration.EnablePrometheusExport(true)
+
+// Metrics available at http://localhost:8080/metrics
+// - agentaflow_gpu_utilization_percent
+// - agentaflow_gpu_temperature_celsius
+// - agentaflow_gpu_memory_used_bytes
+// - agentaflow_cost_total_dollars
+// - agentaflow_workloads_pending
+```
+
 ### Advanced GPU Analytics
 
 ```go
@@ -240,7 +278,7 @@ scheduler.SubmitGPUWorkload(workload)
 
 ## 🏗️ Architecture
 
-```
+```bash
 agentaflow-sro-community/
 ├── pkg/
 │   ├── gpu/                # GPU orchestration and scheduling
@@ -252,9 +290,47 @@ agentaflow-sro-community/
 │   └── k8s-gpu-scheduler/  # Kubernetes GPU scheduler
 └── examples/
     ├── k8s/                # Kubernetes deployment examples
-    └── ...                 # Other usage examples
+    ├── monitoring/         # Grafana dashboards and configs
+    └── demo/               # Demo applications
 ```
 
+## 🔧 Monitoring & Observability
+
+AgentaFlow provides comprehensive monitoring through Prometheus/Grafana integration:
+
+### Quick Start Monitoring
+
+Run the Prometheus/Grafana demo:
+
+```bash
+cd examples/demo/prometheus-grafana
+go run main.go
+```
+
+Access monitoring:
+- **Prometheus Metrics**: http://localhost:8080/metrics
+- **Grafana Dashboard**: Deploy with `kubectl apply -f examples/k8s/monitoring/`
+
+### Available Metrics
+
+- **GPU Metrics**: Utilization, temperature, memory, power consumption
+- **Cost Tracking**: Real-time cost calculation with cloud pricing integration
+- **Workload Metrics**: Job scheduling, queue depth, completion rates
+- **System Health**: Component status, alerts, and performance indicators
+
+### Kubernetes Deployment
+
+```bash
+# Deploy monitoring stack
+kubectl apply -f examples/k8s/monitoring/prometheus.yaml
+kubectl apply -f examples/k8s/monitoring/grafana.yaml
+
+# Access Grafana (admin/agentaflow123)
+kubectl port-forward svc/grafana-service 3000:3000 -n agentaflow-monitoring
+```
+
+For complete monitoring setup, see [examples/demo/PROMETHEUS_GRAFANA_DEMO.md](examples/demo/PROMETHEUS_GRAFANA_DEMO.md)
+
 ## 📖 Documentation
 
 For detailed documentation, see [DOCUMENTATION.md](DOCUMENTATION.md)
Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
# AgentaFlow Prometheus/Grafana Integration Demo

This demo showcases the complete Prometheus and Grafana integration for AgentaFlow SRO Community Edition, providing enterprise-grade monitoring and visualization for GPU infrastructure.

## 🎯 Overview

The demo covers:
- Real-time GPU metrics collection and export to Prometheus
- Comprehensive Grafana dashboards for visualization
- Kubernetes-native monitoring stack deployment
- Cost tracking and efficiency analytics
- Alert management and health monitoring

## 🚀 Quick Start

### 1. Run the Demo Application

```bash
# From the project root
cd examples/demo/prometheus-grafana
go run main.go
```

The demo will start and display:
```
🚀 AgentaFlow Prometheus/Grafana Integration Demo
===============================================
📊 Registering Prometheus metrics...
🔧 Starting services...
🌐 Starting Prometheus metrics server on :8080...
✅ All services started successfully!

🎯 Integration Points:
• Prometheus metrics: http://localhost:8080/metrics
• Health endpoint: http://localhost:8080/health

📊 Available Metrics:
• agentaflow_gpu_utilization_percent
• agentaflow_gpu_temperature_celsius
• agentaflow_gpu_memory_used_bytes
• agentaflow_gpu_health_status
• agentaflow_workloads_pending
• agentaflow_cost_total_dollars
• agentaflow_gpu_efficiency_score
```

### 2. Deploy Monitoring Stack (Kubernetes)

Deploy Prometheus (run from `examples/demo`; the manifests live in `examples/k8s/monitoring/`):
```bash
kubectl apply -f ../k8s/monitoring/prometheus.yaml
```

Deploy Grafana:
```bash
kubectl apply -f ../k8s/monitoring/grafana.yaml
```

### 3. Access Grafana Dashboard

Port-forward the Grafana service:
```bash
kubectl port-forward svc/grafana-service 3000:3000 -n agentaflow-monitoring
```

Open http://localhost:3000 in your browser and log in:
- **Username**: `admin`
- **Password**: `agentaflow123`

### 4. Import Dashboard

1. Go to **Dashboards** > **Import**
2. Upload `../monitoring/grafana-dashboard.json` (path relative to `examples/demo`)
3. Select the Prometheus data source
4. Click **Import**

## 📊 Metrics Overview

### GPU Metrics
- **Utilization**: `agentaflow_gpu_utilization_percent`
- **Temperature**: `agentaflow_gpu_temperature_celsius`
- **Memory**: `agentaflow_gpu_memory_used_bytes`, `agentaflow_gpu_memory_total_bytes`
- **Power**: `agentaflow_gpu_power_draw_watts`, `agentaflow_gpu_power_limit_watts`
- **Clock Speeds**: `agentaflow_gpu_clock_graphics_mhz`, `agentaflow_gpu_clock_memory_mhz`
- **Health**: `agentaflow_gpu_health_status`
- **Efficiency**: `agentaflow_gpu_efficiency_score`

### Workload Metrics
- **Pending Jobs**: `agentaflow_workloads_pending`
- **Running Jobs**: `agentaflow_workloads_running`
- **Completed Jobs**: `agentaflow_workloads_completed`
- **Scheduling Duration**: `agentaflow_scheduling_duration_seconds`
- **Allocation Efficiency**: `agentaflow_gpu_allocation_efficiency`

### Cost Metrics
- **Total Cost**: `agentaflow_cost_total_dollars`
- **Hourly Rates**: `agentaflow_cost_per_hour_dollars`
- **GPU Hours**: `agentaflow_gpu_hours_consumed`
- **Monthly Estimates**: `agentaflow_estimated_monthly_cost_dollars`

### System Metrics
- **Cluster Utilization**: `agentaflow_cluster_utilization_percent`
- **GPU Availability**: `agentaflow_gpus_available`, `agentaflow_gpus_total`
- **Component Health**: `agentaflow_component_health_status`
- **Uptime**: `agentaflow_system_uptime_seconds`
- **Active Alerts**: `agentaflow_active_alerts`
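
To make the metric names above more concrete, here is a minimal sketch of how one gauge sample could look in the Prometheus text exposition format that a `/metrics` endpoint serves. It uses only the Go standard library and is not the AgentaFlow exporter's implementation: `formatGauge`, the help text, and the `gpu_id` label are hypothetical, chosen for illustration. Label values are quoted with Go's `%q`, which is adequate for simple ASCII values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatGauge renders one gauge sample in the Prometheus text exposition
// format. Hypothetical helper for illustration; not part of AgentaFlow.
func formatGauge(name, help string, labels map[string]string, value float64) string {
	// Sort label names so the output is deterministic.
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}

	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	if len(pairs) > 0 {
		fmt.Fprintf(&b, "%s{%s} %g\n", name, strings.Join(pairs, ","), value)
	} else {
		fmt.Fprintf(&b, "%s %g\n", name, value)
	}
	return b.String()
}

func main() {
	// "gpu_id" is an assumed label name; "instance" mirrors the demo config.
	fmt.Print(formatGauge(
		"agentaflow_gpu_utilization_percent",
		"GPU utilization in percent.",
		map[string]string{"gpu_id": "gpu-0", "instance": "demo"},
		72.5,
	))
}
```

Comparing the printed sample with the real output of `curl http://localhost:8080/metrics` is a quick way to see which labels the exporter actually attaches.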

## 🔧 Configuration

### Prometheus Configuration
The demo uses these key settings:
```go
prometheusConfig := observability.PrometheusConfig{
    MetricsPrefix: "agentaflow",
    EnabledMetrics: map[string]bool{
        "gpu_metrics":        true,
        "scheduling_metrics": true,
        "serving_metrics":    true,
        "cost_metrics":       true,
        "system_metrics":     true,
    },
    MetricLabels: map[string]string{
        "instance": "demo",
        "version":  "community",
    },
}
```

### Alert Thresholds
```go
customThresholds := observability.GPUAlertThresholds{
    HighTemperature:     70.0,
    CriticalTemperature: 85.0,
    HighMemoryUsage:     80.0,
    CriticalMemoryUsage: 95.0,
    HighPowerUsage:      85.0,
    CriticalPowerUsage:  95.0,
    LowUtilization:      15.0,
    HighUtilization:     90.0,
}
```
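
As an illustration of how a high/critical threshold pair might be applied, the sketch below classifies a temperature reading. It is an assumption about the semantics, not the library's alerting code: the local `thresholds` struct only mirrors two field names from the literal above, and the `>=` comparison and the severity names (`ok`, `warning`, `critical`) are made up for the example.

```go
package main

import "fmt"

// thresholds mirrors two fields of the GPUAlertThresholds literal.
// It is a local stand-in, not the observability package type.
type thresholds struct {
	HighTemperature     float64
	CriticalTemperature float64
}

// temperatureSeverity checks the critical threshold first so that a reading
// above both limits is reported at the higher severity.
func temperatureSeverity(t thresholds, celsius float64) string {
	switch {
	case celsius >= t.CriticalTemperature:
		return "critical"
	case celsius >= t.HighTemperature:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	t := thresholds{HighTemperature: 70.0, CriticalTemperature: 85.0}
	for _, c := range []float64{55, 72, 90} {
		fmt.Printf("%.0f°C -> %s\n", c, temperatureSeverity(t, c))
	}
}
```

The same two-level check would apply to the memory and power pairs if they follow the same convention.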

### Cost Configuration
```go
awsCostConfig := observability.GPUCostConfiguration{
    CostPerHour: map[string]float64{
        "a100":    3.06,  // Placeholder A100 rate; AWS offers p4d only as p4d.24xlarge (8x A100)
        "v100":    3.06,  // AWS p3.2xlarge
        "t4":      0.526, // AWS g4dn.xlarge
        "rtx":     1.20,  // Custom RTX pricing
        "generic": 1.50,  // Default
    },
    UseUtilizationFactor: true,
    MinUtilizationFactor: 0.15,
    IdleCostReduction:    0.20,
    CloudProvider:        "aws",
    Region:               "us-west-2",
    SpotInstanceDiscount: 0.60,
}
```
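
The source does not spell out how these fields combine, so the following is only one plausible reading, sketched under stated assumptions: the hourly rate is scaled by utilization, with `MinUtilizationFactor` acting as a floor, and unknown GPU types fall back to the `generic` rate. `costConfig` and `estimateCost` are hypothetical names, and `IdleCostReduction` and `SpotInstanceDiscount` are deliberately left out because their exact meaning is not documented here.

```go
package main

import (
	"fmt"
	"math"
)

// costConfig mirrors three fields of the GPUCostConfiguration literal.
// It is a local stand-in, not the observability package type.
type costConfig struct {
	CostPerHour          map[string]float64
	UseUtilizationFactor bool
	MinUtilizationFactor float64
}

// estimateCost returns rate * hours * factor, where factor is the utilization
// (0..1) floored at MinUtilizationFactor. This formula is an assumption.
func estimateCost(cfg costConfig, gpuType string, hours, utilization float64) float64 {
	rate, ok := cfg.CostPerHour[gpuType]
	if !ok {
		rate = cfg.CostPerHour["generic"] // assumed fallback
	}
	factor := 1.0
	if cfg.UseUtilizationFactor {
		factor = math.Max(utilization, cfg.MinUtilizationFactor)
	}
	return rate * hours * factor
}

func main() {
	cfg := costConfig{
		CostPerHour:          map[string]float64{"t4": 0.526, "generic": 1.50},
		UseUtilizationFactor: true,
		MinUtilizationFactor: 0.15,
	}
	fmt.Printf("busy t4, 10h: $%.3f\n", estimateCost(cfg, "t4", 10, 0.50))
	fmt.Printf("idle t4, 10h: $%.3f\n", estimateCost(cfg, "t4", 10, 0.05))
}
```

Under this reading a nearly idle GPU is still billed at 15% of its rate, which keeps idle capacity visible in cost reports instead of rounding it to zero.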

## 📈 Dashboard Panels

The Grafana dashboard includes eight panels:

1. **GPU Utilization** - Real-time utilization across all GPUs
2. **Temperature Monitoring** - Temperature trends with thresholds
3. **Memory Usage** - Memory utilization and availability
4. **Power Consumption** - Power draw vs limits
5. **Workload Distribution** - Job scheduling and distribution
6. **Cost Tracking** - Real-time cost analysis
7. **System Efficiency** - Performance and efficiency metrics
8. **Health Status** - Overall system health indicators

## 🎮 Interactive Demo Features

### Real-time Metrics Generation
The demo generates realistic metrics that simulate:
- **GPU utilization patterns**: Waves simulating training and inference workloads
- **Temperature correlation**: Temperature increases with utilization
- **Memory patterns**: Dynamic memory allocation and release
- **Cost calculations**: Real-time cost tracking with utilization factors
- **Alert scenarios**: Periodic alerts to demonstrate monitoring

### Endpoints Available
- **Metrics**: http://localhost:8080/metrics - Prometheus format metrics
- **Health**: http://localhost:8080/health - Service health check
- **Debug**: Live debugging output in terminal
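
For readers who want to see what a health endpoint of this kind involves, here is a minimal standard-library sketch. The JSON payload shape (`{"status":"ok"}`) and the names `healthHandler` and `probeHealth` are assumptions; the demo's actual `/health` response may differ. The handler is exercised in-process with `httptest`, so no port is opened.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthHandler reports a static status payload.
// The payload shape is an assumption, not the demo's documented response.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}

// probeHealth routes a GET /health request through a mux and returns the
// recorded status code and body.
func probeHealth() (int, string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", healthHandler)

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	mux.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probeHealth()
	fmt.Printf("%d %s", code, body)
}
```

Serving the same mux with `http.ListenAndServe(":8080", mux)` would expose it on the port the demo uses.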

### Metric Patterns
- **Utilization**: Sine wave pattern (20-80%) simulating workload cycles
- **Temperature**: Correlated with utilization (35-80°C)
- **Memory**: Independent allocation patterns per GPU
- **Workloads**: Periodic job completion and queuing
- **Costs**: Dynamic cost calculation with AWS pricing
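
The first two patterns can be sketched in a few lines. This is an illustration under assumptions, not the demo's generator: it takes a sine wave centred on 50% with an amplitude of 30 points to cover the 20-80% range, and maps utilization linearly onto 35-80°C. The function names and the 60-second period are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// simulatedUtilization returns a value in [20, 80] that follows a sine wave
// with the given period. The period is an assumed parameter.
func simulatedUtilization(elapsedSec, periodSec float64) float64 {
	return 50 + 30*math.Sin(2*math.Pi*elapsedSec/periodSec)
}

// simulatedTemperature maps utilization in [20, 80] linearly onto [35, 80] °C,
// so temperature rises and falls with load.
func simulatedTemperature(utilization float64) float64 {
	return 35 + (utilization-20)/60*45
}

func main() {
	for _, t := range []float64{0, 15, 30, 45} {
		u := simulatedUtilization(t, 60)
		fmt.Printf("t=%2.0fs utilization=%.1f%% temperature=%.1f°C\n",
			t, u, simulatedTemperature(u))
	}
}
```

A linear mapping is the simplest way to get correlated series; a real GPU would lag behind load changes, which could be approximated by smoothing the temperature over time.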

## 🔍 Troubleshooting

### Common Issues

**Metrics not appearing in Prometheus**
```bash
# Check if demo is running
curl http://localhost:8080/metrics

# Check the Prometheus logs for scrape or config errors
# (replace xxx with the pod suffix shown by `kubectl get pods`)
kubectl logs -n agentaflow-monitoring prometheus-deployment-xxx
```

**Grafana dashboard not loading**
```bash
# Check Grafana logs (replace xxx with the pod suffix)
kubectl logs -n agentaflow-monitoring grafana-deployment-xxx

# Verify data source connection
# Go to Configuration > Data Sources in Grafana UI
```

**Kubernetes deployment issues**
```bash
# Check namespace
kubectl get namespace agentaflow-monitoring

# Check all resources
kubectl get all -n agentaflow-monitoring

# Check ConfigMaps
kubectl get configmaps -n agentaflow-monitoring
```

### Debug Mode
Enable verbose logging:
```go
// Add to main function. log.SetLevel and log.DebugLevel assume a leveled
// logger such as logrus imported as log; the standard library log package
// has no SetLevel.
log.SetLevel(log.DebugLevel)
```

## 🏗️ Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   AgentaFlow    │───▶│   Prometheus     │───▶│     Grafana     │
│   GPU Metrics   │    │    Exporter      │    │    Dashboard    │
│    Collector    │    │ (:8080/metrics)  │    │ (localhost:3000)│
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                      │
         ▼                       ▼                      ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ GPU Integration │    │ Metrics Storage  │    │  Visualization  │
│     Service     │    │    & Alerting    │    │   & Analysis    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

## 🧪 Testing

Run smoke tests against the running demo and the port-forwarded Grafana:
```bash
# Test Prometheus metrics endpoint
curl -s http://localhost:8080/metrics | grep agentaflow

# Test health endpoint
curl -s http://localhost:8080/health

# Test Grafana API
curl -s -u admin:agentaflow123 http://localhost:3000/api/health
```
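
The `curl | grep` check can also be expressed in Go, which is handy inside an automated test. The sketch below is a hypothetical helper, not part of the repository: `metricNames` scans text in the Prometheus exposition format and returns the distinct metric names that start with a prefix. It runs against an inline sample here; in practice the input would be the body of a GET request to `http://localhost:8080/metrics`.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// metricNames returns the sorted, de-duplicated metric names in an exposition
// payload that start with prefix. Comment lines (# HELP, # TYPE) are skipped.
func metricNames(exposition, prefix string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the label block or at the first space.
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		if strings.HasPrefix(name, prefix) {
			seen[name] = true
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Inline sample; the gpu_id label is an assumed name.
	sample := "# TYPE agentaflow_gpu_utilization_percent gauge\n" +
		"agentaflow_gpu_utilization_percent{gpu_id=\"gpu-0\"} 72.5\n" +
		"agentaflow_gpu_utilization_percent{gpu_id=\"gpu-1\"} 41.0\n" +
		"agentaflow_workloads_pending 3\n" +
		"go_goroutines 12\n"
	fmt.Println(metricNames(sample, "agentaflow_"))
}
```

Asserting on the returned names catches a missing metric family, which a bare `grep agentaflow` would not flag as long as any one metric is present.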

## 📚 Next Steps

1. **Production Setup**: Adapt configurations for production environments
2. **Custom Dashboards**: Create additional dashboards for specific use cases
3. **Alert Rules**: Implement custom Prometheus alerting rules
4. **Scaling**: Configure for multi-cluster monitoring
5. **Integration**: Connect with existing monitoring infrastructure

For production deployment, see the setup instructions in the main [README.md](../../README.md).
