Commit a68bf54

Merge pull request #21 from Finoptimize/web-demo
Adding Web Dashboard demo
2 parents f2b6e6a + 9a725f6 commit a68bf54

File tree

16 files changed

+3108
-72
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,3 +38,4 @@ examples/model_serving
3838
examples/observability
3939
PROJECT.md
4040
agentaflow-sro-community.code-workspace
41+
docs/screenshots/README.md

README.md

Lines changed: 52 additions & 2 deletions
@@ -45,7 +45,46 @@ Enterprise-grade monitoring, debugging, and cost tracking for LLM applications a
4545
- **Distributed Tracing**: Full request tracing across distributed systems
4646
- **Debug Utilities**: Multi-level logging with performance analysis
4747

48-
## 📦 Installation
48+
## Screenshots
49+
50+
### 🌐 Web Dashboard Interface
51+
52+
Our production-ready web dashboard provides real-time GPU monitoring with a modern, professional interface:
53+
54+
![Web Dashboard Overview](docs/screenshots/dashboard-overview.png)
55+
*Real-time GPU monitoring dashboard with live metrics, charts, and system overview*
56+
57+
### 📊 Real-time Performance Charts
58+
59+
Interactive Chart.js visualizations show GPU performance trends and cost analytics:
60+
61+
![Performance Charts](docs/screenshots/performance-charts.png)
62+
*GPU utilization and temperature tracking with live cost breakdown analytics*
63+
64+
### 🎯 GPU Metrics Grid
65+
66+
Comprehensive GPU monitoring with individual card status and real-time alerts:
67+
68+
![GPU Metrics Grid](docs/screenshots/gpu-metrics-grid.png)
69+
*Individual GPU monitoring cards showing utilization, temperature, memory usage, and health status*
70+
71+
### 🚨 Alert Management System
72+
73+
Real-time alert system with WebSocket notifications and threshold monitoring:
74+
75+
![Alert Management](docs/screenshots/alert-system.png)
76+
*Live alert feed with temperature warnings, utilization alerts, and memory notifications*
77+
78+
### 📈 System Analytics
79+
80+
Advanced analytics showing efficiency scores, cost tracking, and performance insights:
81+
82+
![System Analytics](docs/screenshots/system-analytics.png)
83+
*System-wide metrics including efficiency scoring, cost per hour, and resource optimization*
84+
85+
> **Demo Ready**: All screenshots show the dashboard running on a local laptop without requiring NVIDIA hardware - perfect for demonstrations and development!
86+
87+
## 📦 Installation
4988

5089
```bash
5190
go get github.com/Finoptimize/agentaflow-sro-community
@@ -306,7 +345,18 @@ agentaflow-sro-community/
306345
└── demo/ # Demo applications
307346
```
308347

309-
## 🔧 Monitoring & Observability
348+
## Taking Screenshots
349+
350+
To add actual screenshots to this README:
351+
352+
1. Start the demo: `go run examples/demo/web-dashboard/main.go`
353+
2. Open a browser at `http://localhost:8090`
354+
3. Take high-resolution screenshots and save them in `docs/screenshots/`
355+
4. Use the filenames referenced above (dashboard-overview.png, etc.)
356+
357+
For detailed screenshot guidelines, see [docs/screenshots/README.md](docs/screenshots/README.md)
358+
359+
## 🔧 Monitoring & Observability
310360

311361
AgentaFlow provides **enterprise-grade monitoring** through comprehensive Prometheus/Grafana integration with production-ready dashboards and alerting.
312362

docs/screenshots/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
Screenshots for demo

docs/screenshots/alert-system.png

83.2 KB
344 KB
168 KB
274 KB
93.2 KB
Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
1+
# AgentaFlow Web Dashboard Demo - Complete Setup
2+
3+
## 🎯 Overview
4+
5+
This comprehensive demo showcases a **production-ready GPU monitoring dashboard** that can run on **any laptop** without requiring actual NVIDIA GPUs. The demo simulates realistic GPU workloads and provides a complete monitoring experience identical to what you'd see in a production environment.
6+
7+
## ✨ Key Features Demonstrated
8+
9+
### 🖥️ **Modern Web Dashboard**
10+
- **Responsive design** that works on desktop, tablet, and mobile
11+
- **Real-time charts** using Chart.js with WebSocket updates every 2 seconds
12+
- **Interactive GPU cards** showing utilization, temperature, and memory usage
13+
- **System overview metrics** with efficiency scoring
14+
- **Dark theme** optimized for monitoring environments
15+
16+
### 📊 **Realistic GPU Simulation**
17+
- **4 different GPU types**: RTX 4090, RTX 4080, RTX 4070 Ti, Tesla V100
18+
- **Realistic specifications**: 8GB to 32GB memory, 50W to 400W power consumption
19+
- **Dynamic workload patterns**: Idle → Light Inference → Training → Heavy Inference → Batch Processing
20+
- **Temperature modeling**: Thermal throttling and fan speed curves
21+
- **Memory management**: Realistic allocation patterns
22+
23+
### 💰 **Cost Tracking & Analytics**
24+
- **Multi-operation tracking**: Training, inference, model serving, batch processing
25+
- **Real-time cost calculation** with different rates per operation type
26+
- **Cost forecasting** and optimization recommendations
27+
- **GPU hour tracking** with utilization-based pricing
28+
29+
### 🚨 **Alert Management**
30+
- **Real-time alerts** for temperature (>80°C), utilization (>95%), memory (>90%)
31+
- **WebSocket notifications** broadcast to all connected clients
32+
- **Alert history** and management interface
33+
- **Browser notifications** (when permitted)
34+
35+
### 📈 **Performance Analytics**
36+
- **Trend analysis** for utilization, temperature, and costs
37+
- **Efficiency scoring** based on multiple factors
38+
- **System health monitoring** with comprehensive metrics
39+
- **Historical data** visualization
40+
41+
## 🚀 Running the Demo
42+
43+
### 1. Start the Demo
44+
```bash
cd examples/demo/web-dashboard
go run main.go
```
48+
49+
### 2. Access the Dashboard
50+
- **Web Dashboard**: http://localhost:8090
51+
- **Prometheus Metrics**: http://localhost:8080/metrics
52+
53+
### 3. Explore the Features
54+
- Watch real-time GPU metrics update every 2-3 seconds
55+
- Observe automatic workload pattern changes every 45 seconds
56+
- Check for temperature and utilization alerts
57+
- Monitor cost accumulation over time
58+
- Test WebSocket connectivity (connection status in top-right)
59+
60+
## 🎮 Demo Highlights
61+
62+
### **Simulated Hardware**
63+
```
📊 GPU Fleet:
• gpu-0: NVIDIA GeForce RTX 4090 (24GB VRAM, ~350W)
• gpu-1: NVIDIA GeForce RTX 4080 (16GB VRAM, ~320W)
• gpu-2: NVIDIA GeForce RTX 4070 Ti (12GB VRAM, ~285W)
• gpu-3: NVIDIA Tesla V100 (32GB VRAM, ~300W)
```
70+
71+
### **Workload Patterns**
72+
- **Idle**: 0-15% utilization, minimal memory usage
73+
- **Light Inference**: 20-45% utilization, 30-55% memory
74+
- **Training**: 70-98% utilization, 75-95% memory, high temperature
75+
- **Heavy Inference**: 45-75% utilization, 50-70% memory
76+
- **Batch Processing**: 85-100% utilization, 80-98% memory
77+
78+
### **Alert Triggers**
79+
```
🔥 Temperature Alerts: > 80°C (Critical)
⚡ High Utilization: > 95% (Warning)
💾 Memory Usage: > 90% (Warning)
```
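
As an illustration of how these triggers might be evaluated, here is a minimal sketch. The `Metrics` and `Alert` types and the `checkAlerts` function are hypothetical names, not taken from the demo source; only the thresholds and severities come from the table.

```go
package main

import "fmt"

// Metrics is a hypothetical per-GPU snapshot; field names are illustrative.
type Metrics struct {
	GPUID          string
	TemperatureC   float64
	UtilizationPct float64
	MemoryPct      float64
}

// Alert is a hypothetical alert record.
type Alert struct {
	GPUID    string
	Severity string
	Message  string
}

// Thresholds as listed in the trigger table.
const (
	maxTemperatureC   = 80.0
	maxUtilizationPct = 95.0
	maxMemoryPct      = 90.0
)

// checkAlerts returns one alert per threshold that m strictly exceeds.
func checkAlerts(m Metrics) []Alert {
	var alerts []Alert
	if m.TemperatureC > maxTemperatureC {
		alerts = append(alerts, Alert{m.GPUID, "critical",
			fmt.Sprintf("temperature %.1f°C exceeds %.0f°C", m.TemperatureC, maxTemperatureC)})
	}
	if m.UtilizationPct > maxUtilizationPct {
		alerts = append(alerts, Alert{m.GPUID, "warning",
			fmt.Sprintf("utilization %.1f%% exceeds %.0f%%", m.UtilizationPct, maxUtilizationPct)})
	}
	if m.MemoryPct > maxMemoryPct {
		alerts = append(alerts, Alert{m.GPUID, "warning",
			fmt.Sprintf("memory usage %.1f%% exceeds %.0f%%", m.MemoryPct, maxMemoryPct)})
	}
	return alerts
}

func main() {
	m := Metrics{GPUID: "gpu-0", TemperatureC: 83.5, UtilizationPct: 97, MemoryPct: 72}
	for _, a := range checkAlerts(m) {
		fmt.Printf("[%s] %s: %s\n", a.Severity, a.GPUID, a.Message)
	}
}
```

The comparisons are strict, so a reading of exactly 80°C does not fire, matching the `>` in the table.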
84+
85+
### **Cost Structure**
86+
```
💰 Operation Costs (per GPU hour):
• Training: $2.50/hour
• Inference: $1.80/hour
• Model Serving: $2.00/hour
• Batch Processing: $2.20/hour
```
93+
94+
## 🔧 Technical Implementation
95+
96+
### **Architecture**
97+
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Web Dashboard  │◄───┤    WebSocket     │◄───┤ Mock Collector  │
│   (Port 8090)   │    │    Real-time     │    │    (4 GPUs)     │
└─────────────────┘    │     Updates      │    └─────────────────┘
                       └──────────────────┘              │
┌─────────────────┐    ┌──────────────────┐              │
│   Prometheus    │◄───┤    Monitoring    │◄─────────────┘
│   (Port 8080)   │    │     Service      │
└─────────────────┘    └──────────────────┘
```
108+
109+
### **Core Components**
110+
1. **MockMetricsCollector**: Generates realistic GPU metrics without hardware
111+
2. **WebDashboard**: Modern HTML5 interface with Chart.js and WebSockets
112+
3. **MonitoringService**: Cost tracking and system health monitoring
113+
4. **PrometheusExporter**: Standard metrics export for observability stack
114+
115+
### **API Endpoints**
116+
```
GET /                     # Main dashboard interface
GET /ws                   # WebSocket for real-time updates
GET /health               # Health check
GET /api/v1/metrics       # Complete metrics data
GET /api/v1/system/stats  # System statistics
GET /api/v1/gpus          # GPU list and status
GET /api/v1/alerts        # Active alerts
GET /api/v1/costs         # Cost information
GET /api/v1/performance   # Performance analytics
```
127+
128+
## 🌟 Production Readiness Features
129+
130+
### **Scalability**
131+
- **Multi-GPU support**: Easily scales to dozens of GPUs
132+
- **WebSocket management**: Handles multiple concurrent dashboard connections
133+
- **Memory efficient**: Circular buffers with configurable history limits
134+
- **Background processing**: Non-blocking metrics collection
135+
136+
### **Reliability**
137+
- **Graceful shutdown**: Proper cleanup of resources
138+
- **Error handling**: Comprehensive error management
139+
- **Connection recovery**: Automatic WebSocket reconnection
140+
- **Health monitoring**: Self-monitoring and status reporting
141+
142+
### **Integration Ready**
143+
- **Prometheus compatibility**: Standard metrics export
144+
- **REST API**: Complete programmatic access
145+
- **WebSocket API**: Real-time event streaming
146+
- **CORS support**: Cross-origin resource sharing enabled
147+
148+
### **Security Considerations**
149+
- **Input validation**: All API inputs validated
150+
- **Rate limiting**: WebSocket connection limits
151+
- **Origin checking**: Configurable origin validation
152+
- **Logging**: Comprehensive request logging
153+
154+
## 📱 Dashboard Interface Guide
155+
156+
### **System Overview Cards**
157+
- **Total GPUs**: Count of available GPUs
158+
- **Active GPUs**: Number of GPUs with >5% utilization
159+
- **Average Utilization**: Fleet-wide average utilization
160+
- **Efficiency Score**: 0-100 system efficiency rating
161+
- **Total Power**: Aggregate power consumption
162+
- **Memory Usage**: System-wide memory utilization
163+
164+
### **GPU Status Cards**
165+
Each GPU displays:
166+
- **Name and Model**: GPU identification
167+
- **Status Badge**: idle/active/warning/critical
168+
- **Utilization Bar**: Real-time utilization percentage
169+
- **Memory Bar**: Used/Total memory with percentage
170+
- **Temperature Bar**: Current temperature with color coding
171+
172+
### **Performance Charts**
173+
- **GPU Performance**: Line chart showing utilization and temperature trends
174+
- **Cost Analytics**: Doughnut chart breaking down cost categories
175+
- **Time Range Selector**: 1H/6H/24H data views
176+
177+
### **Alerts Panel**
178+
- **Real-time alerts** with severity levels
179+
- **Alert details** including source and timestamp
180+
- **One-click resolution** for alert management
181+
- **Alert counter** in the header
182+
183+
## 🔬 Demo Scenarios
184+
185+
### **Scenario 1: Normal Operations**
186+
- Monitor steady-state workloads
187+
- Observe utilization patterns
188+
- Track cost accumulation
189+
- System efficiency monitoring
190+
191+
### **Scenario 2: High Load Training**
192+
- Watch training workload trigger
193+
- Observe temperature increases
194+
- Monitor memory allocation
195+
- See power consumption rise
196+
197+
### **Scenario 3: Alert Management**
198+
- Wait for temperature >80°C alert
199+
- Observe real-time dashboard notification
200+
- Check alert in alerts panel
201+
- Monitor system response
202+
203+
### **Scenario 4: API Integration**
204+
```bash
# System status
curl http://localhost:8090/api/v1/system/stats

# GPU details
curl http://localhost:8090/api/v1/gpus

# Current alerts
curl http://localhost:8090/api/v1/alerts

# Cost information
curl http://localhost:8090/api/v1/costs
```
217+
218+
## 🛠️ Customization Options
219+
220+
### **GPU Configuration**
221+
Modify `numGPUs` in `main.go` to simulate different cluster sizes:
222+
```go
numGPUs := 8 // Simulate 8 GPUs instead of 4
```
225+
226+
### **Update Intervals**
227+
Adjust refresh rates in `dashboardConfig`:
228+
```go
RefreshInterval: 1000, // 1 second updates
```
231+
232+
### **Alert Thresholds**
233+
Modify alert triggers in the callback function:
234+
```go
if metrics.Temperature > 75 { // Lower temperature threshold
	// Generate alert
}
```
239+
240+
### **Cost Rates**
241+
Update cost calculations in the cost tracking goroutine:
242+
```go
cost = gpuHours * 3.50 // Higher training rate
```
245+
246+
## 🎯 Production Deployment Considerations
247+
248+
### **Infrastructure Requirements**
249+
- **CPU**: 2+ cores (4+ recommended for high-throughput)
250+
- **Memory**: 4GB+ RAM (8GB+ for large clusters)
251+
- **Network**: Low latency for WebSocket performance
252+
- **Storage**: Minimal (metrics stored in memory)
253+
254+
### **Scaling Guidelines**
255+
- **Up to 50 GPUs**: Single instance handles easily
256+
- **50-200 GPUs**: Consider connection pooling
257+
- **200+ GPUs**: Implement horizontal scaling
258+
259+
### **Production Enhancements**
260+
- **Authentication**: Add user authentication and authorization
261+
- **TLS/SSL**: Enable HTTPS for production security
262+
- **Database**: Persist historical data to database
263+
- **Caching**: Implement Redis for session management
264+
- **Load Balancing**: Use nginx/HAProxy for multiple instances
265+
266+
## 🌟 Value Proposition
267+
268+
### **For Development Teams**
269+
- **No Hardware Dependencies**: Test monitoring without expensive GPUs
270+
- **Realistic Simulation**: Production-like behavior patterns
271+
- **API Testing**: Complete REST and WebSocket APIs
272+
- **Integration Ready**: Prometheus and standard metrics
273+
274+
### **For Demos & Sales**
275+
- **Impressive Visuals**: Modern, professional dashboard
276+
- **Real-time Updates**: Engaging live demonstrations
277+
- **Comprehensive Features**: Full monitoring stack showcase
278+
- **Easy Setup**: Runs anywhere, no prerequisites
279+
280+
### **For Production Planning**
281+
- **Architecture Preview**: Exact production interface
282+
- **Performance Baseline**: Understanding of metrics and costs
283+
- **Alert Testing**: Comprehensive alerting system
284+
- **Capacity Planning**: Resource usage patterns
285+
286+
## 🚀 Next Steps
287+
288+
1. **Explore the Dashboard**: Spend 10-15 minutes with the live interface
289+
2. **Test API Endpoints**: Use curl or Postman to explore the APIs
290+
3. **Monitor Patterns**: Watch workload changes and alert generation
291+
4. **Check Prometheus**: View metrics at http://localhost:8080/metrics
292+
5. **Customize Settings**: Modify GPU count, thresholds, or update rates
293+
294+
## 📞 Support & Documentation
295+
296+
- **GitHub Repository**: https://github.com/Finoptimize/agentaflow-sro-community
297+
- **API Documentation**: Available at `/api/v1/*` endpoints
298+
- **WebSocket Protocol**: Connect to `/ws` for real-time events
299+
- **Prometheus Metrics**: Standard exposition at `/metrics`
300+
301+
---
302+
303+
**🎉 Congratulations!** You now have a comprehensive GPU monitoring dashboard demo that showcases enterprise-grade monitoring capabilities without requiring any specialized hardware. The demo provides a complete preview of what AgentaFlow SRO Community Edition offers for production GPU infrastructure monitoring.
