# AgentaFlow Web Dashboard Demo - Complete Setup

## 🎯 Overview

This demo showcases a **production-ready GPU monitoring dashboard** that runs on **any laptop** without an NVIDIA GPU. It simulates realistic GPU workloads and presents the same dashboard, alerts, and APIs you would use against real hardware.

## ✨ Key Features Demonstrated

### 🖥️ **Modern Web Dashboard**
- **Responsive design** that works on desktop, tablet, and mobile
- **Real-time charts** using Chart.js with WebSocket updates every 2 seconds
- **Interactive GPU cards** showing utilization, temperature, and memory usage
- **System overview metrics** with efficiency scoring
- **Dark theme** optimized for monitoring environments

### 📊 **Realistic GPU Simulation**
- **4 different GPU types**: RTX 4090, RTX 4080, RTX 4070 Ti, Tesla V100
- **Realistic specifications**: 8GB to 32GB memory, 50W to 400W power consumption
- **Dynamic workload patterns**: Idle → Light Inference → Training → Heavy Inference → Batch Processing
- **Temperature modeling**: Thermal throttling and fan speed curves
- **Memory management**: Realistic allocation patterns

### 💰 **Cost Tracking & Analytics**
- **Multi-operation tracking**: Training, inference, model serving, batch processing
- **Real-time cost calculation** with different rates per operation type
- **Cost forecasting** and optimization recommendations
- **GPU hour tracking** with utilization-based pricing

### 🚨 **Alert Management**
- **Real-time alerts** for temperature (>80°C), utilization (>95%), memory (>90%)
- **WebSocket notifications** broadcast to all connected clients
- **Alert history** and management interface
- **Browser notifications** (when permitted)

### 📈 **Performance Analytics**
- **Trend analysis** for utilization, temperature, and costs
- **Efficiency scoring** based on multiple factors
- **System health monitoring** with comprehensive metrics
- **Historical data** visualization

## 🚀 Running the Demo

### 1. Start the Demo
```bash
cd examples/demo/web-dashboard
go run main.go
```

### 2. Access the Dashboard
- **Web Dashboard**: http://localhost:8090
- **Prometheus Metrics**: http://localhost:8080/metrics

### 3. Explore the Features
- Watch real-time GPU metrics update every 2-3 seconds
- Observe automatic workload pattern changes every 45 seconds
- Check for temperature and utilization alerts
- Monitor cost accumulation over time
- Test WebSocket connectivity (connection status in top-right)

## 🎮 Demo Highlights

### **Simulated Hardware**
```
📊 GPU Fleet:
  • gpu-0: NVIDIA GeForce RTX 4090 (24GB VRAM, ~350W)
  • gpu-1: NVIDIA GeForce RTX 4080 (16GB VRAM, ~320W)
  • gpu-2: NVIDIA GeForce RTX 4070 Ti (12GB VRAM, ~285W)
  • gpu-3: NVIDIA Tesla V100 (32GB VRAM, ~300W)
```

### **Workload Patterns**
- **Idle**: 0-15% utilization, minimal memory usage
- **Light Inference**: 20-45% utilization, 30-55% memory
- **Training**: 70-98% utilization, 75-95% memory, high temperature
- **Heavy Inference**: 45-75% utilization, 50-70% memory
- **Batch Processing**: 85-100% utilization, 80-98% memory

### **Alert Triggers**
```
🔥 Temperature Alerts: > 80°C (Critical)
⚡ High Utilization: > 95% (Warning)
💾 Memory Usage: > 90% (Warning)
```

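As a rough sketch of how these three thresholds could be evaluated for one GPU sample: the `gpuMetrics` and `alert` types and the `evaluate` function below are hypothetical names, not taken from the demo source. Only the threshold values and severities come from the table above.

```go
package main

import "fmt"

// gpuMetrics is a hypothetical snapshot of one GPU; field names are assumptions.
type gpuMetrics struct {
	ID             string
	TemperatureC   float64
	UtilizationPct float64
	MemoryPct      float64
}

// alert is a hypothetical alert record.
type alert struct {
	GPU      string
	Severity string // "critical" or "warning"
	Message  string
}

// evaluate applies the documented thresholds. Comparisons are strict, so a
// reading of exactly 80°C, 95%, or 90% does not raise an alert.
func evaluate(m gpuMetrics) []alert {
	var alerts []alert
	if m.TemperatureC > 80 {
		alerts = append(alerts, alert{GPU: m.ID, Severity: "critical",
			Message: fmt.Sprintf("temperature %.1f°C exceeds 80°C", m.TemperatureC)})
	}
	if m.UtilizationPct > 95 {
		alerts = append(alerts, alert{GPU: m.ID, Severity: "warning",
			Message: fmt.Sprintf("utilization %.1f%% exceeds 95%%", m.UtilizationPct)})
	}
	if m.MemoryPct > 90 {
		alerts = append(alerts, alert{GPU: m.ID, Severity: "warning",
			Message: fmt.Sprintf("memory usage %.1f%% exceeds 90%%", m.MemoryPct)})
	}
	return alerts
}

func main() {
	m := gpuMetrics{ID: "gpu-0", TemperatureC: 83.5, UtilizationPct: 97, MemoryPct: 72}
	for _, a := range evaluate(m) {
		fmt.Printf("[%s] %s: %s\n", a.Severity, a.GPU, a.Message)
	}
}
```
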
### **Cost Structure**
```
💰 Operation Costs (per GPU hour):
  • Training: $2.50/hour
  • Inference: $1.80/hour
  • Model Serving: $2.00/hour
  • Batch Processing: $2.20/hour
```

## 🔧 Technical Implementation

### **Architecture**
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Web Dashboard  │◄───┤    WebSocket     │◄───┤ Mock Collector  │
│   (Port 8090)   │    │    Real-time     │    │    (4 GPUs)     │
└─────────────────┘    │     Updates      │    └─────────────────┘
                       └──────────────────┘              │
┌─────────────────┐    ┌──────────────────┐              │
│   Prometheus    │◄───┤    Monitoring    │◄─────────────┘
│   (Port 8080)   │    │     Service      │
└─────────────────┘    └──────────────────┘
```

### **Core Components**
1. **MockMetricsCollector**: Generates realistic GPU metrics without hardware
2. **WebDashboard**: Modern HTML5 interface with Chart.js and WebSockets
3. **MonitoringService**: Cost tracking and system health monitoring
4. **PrometheusExporter**: Standard metrics export for observability stack

### **API Endpoints**
```
GET  /                      # Main dashboard interface
GET  /ws                    # WebSocket for real-time updates
GET  /health                # Health check
GET  /api/v1/metrics        # Complete metrics data
GET  /api/v1/system/stats   # System statistics
GET  /api/v1/gpus           # GPU list and status
GET  /api/v1/alerts         # Active alerts
GET  /api/v1/costs          # Cost information
GET  /api/v1/performance    # Performance analytics
```

## 🌟 Production Readiness Features

### **Scalability**
- **Multi-GPU support**: Easily scales to dozens of GPUs
- **WebSocket management**: Handles multiple concurrent dashboard connections
- **Memory efficient**: Circular buffers with configurable history limits
- **Background processing**: Non-blocking metrics collection

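To illustrate the circular-buffer idea mentioned above, here is a minimal fixed-capacity history buffer that overwrites its oldest sample once full. The `ring` type and its methods are assumptions for this sketch, and it is not safe for concurrent use without a mutex.

```go
package main

import "fmt"

// ring keeps the most recent len(buf) samples and overwrites the oldest one.
// Hypothetical type; the limit passed to newRing must be positive.
type ring struct {
	buf   []float64
	start int // index of the oldest sample
	size  int // number of valid samples
}

func newRing(limit int) *ring {
	return &ring{buf: make([]float64, limit)}
}

// push appends v, evicting the oldest sample when the buffer is full.
func (r *ring) push(v float64) {
	if r.size < len(r.buf) {
		r.buf[(r.start+r.size)%len(r.buf)] = v
		r.size++
		return
	}
	r.buf[r.start] = v
	r.start = (r.start + 1) % len(r.buf)
}

// snapshot returns the stored samples ordered from oldest to newest.
func (r *ring) snapshot() []float64 {
	out := make([]float64, 0, r.size)
	for i := 0; i < r.size; i++ {
		out = append(out, r.buf[(r.start+i)%len(r.buf)])
	}
	return out
}

func main() {
	history := newRing(3) // keep only the last 3 samples
	for _, v := range []float64{10, 20, 30, 40, 50} {
		history.push(v)
	}
	fmt.Println(history.snapshot()) // prints [30 40 50]
}
```

Memory use stays constant regardless of how long the demo runs, which is the property the history limit is meant to guarantee.
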
### **Reliability**
- **Graceful shutdown**: Proper cleanup of resources
- **Error handling**: Comprehensive error management
- **Connection recovery**: Automatic WebSocket reconnection
- **Health monitoring**: Self-monitoring and status reporting

### **Integration Ready**
- **Prometheus compatibility**: Standard metrics export
- **REST API**: Complete programmatic access
- **WebSocket API**: Real-time event streaming
- **CORS support**: Cross-origin resource sharing enabled

### **Security Considerations**
- **Input validation**: All API inputs validated
- **Rate limiting**: WebSocket connection limits
- **Origin checking**: Configurable origin validation
- **Logging**: Comprehensive request logging

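Origin validation can be as simple as comparing the request's `Origin` header against an allowlist. The following is a minimal sketch under that assumption; `originAllowed` and the allowlist contents are hypothetical and do not reflect the demo's actual configuration.

```go
package main

import (
	"fmt"
	"net/url"
)

// originAllowed reports whether the Origin header value matches an entry in
// the allowlist. Entries use the form "scheme://host[:port]" and are compared
// exactly, so "http://localhost:8090" and "http://127.0.0.1:8090" differ.
func originAllowed(origin string, allowed map[string]bool) bool {
	u, err := url.Parse(origin)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return false // missing or malformed Origin header
	}
	return allowed[u.Scheme+"://"+u.Host]
}

func main() {
	// Hypothetical allowlist: only the dashboard's own origin.
	allowed := map[string]bool{"http://localhost:8090": true}

	for _, o := range []string{"http://localhost:8090", "http://evil.example", ""} {
		fmt.Printf("%q allowed: %v\n", o, originAllowed(o, allowed))
	}
}
```

If the WebSocket endpoint is built on a library with an origin hook (for example the `CheckOrigin` field of the gorilla/websocket `Upgrader`, which is an assumption about this project), a function like this could be called from that hook with the request's `Origin` header.
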
## 📱 Dashboard Interface Guide

### **System Overview Cards**
- **Total GPUs**: Count of available GPUs
- **Active GPUs**: Number of GPUs with >5% utilization
- **Average Utilization**: Fleet-wide average utilization
- **Efficiency Score**: 0-100 system efficiency rating
- **Total Power**: Aggregate power consumption
- **Memory Usage**: System-wide memory utilization

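Most of these cards are simple aggregates over the per-GPU samples. The sketch below shows one way to compute them, assuming hypothetical `gpu` and `overview` types. The efficiency score is omitted because its formula is not described here.

```go
package main

import "fmt"

// gpu is a hypothetical per-GPU sample; field names are assumptions.
type gpu struct {
	UtilizationPct float64
	PowerW         float64
	MemUsedGB      float64
	MemTotalGB     float64
}

// overview holds the values shown on the system overview cards.
type overview struct {
	TotalGPUs      int
	ActiveGPUs     int     // GPUs above 5% utilization
	AvgUtilPct     float64 // fleet-wide mean utilization
	TotalPowerW    float64
	MemoryUsagePct float64 // used memory / total memory across the fleet
}

// summarize aggregates per-GPU samples into the overview values.
func summarize(gpus []gpu) overview {
	o := overview{TotalGPUs: len(gpus)}
	if len(gpus) == 0 {
		return o
	}
	var utilSum, memUsed, memTotal float64
	for _, g := range gpus {
		if g.UtilizationPct > 5 {
			o.ActiveGPUs++
		}
		utilSum += g.UtilizationPct
		o.TotalPowerW += g.PowerW
		memUsed += g.MemUsedGB
		memTotal += g.MemTotalGB
	}
	o.AvgUtilPct = utilSum / float64(len(gpus))
	if memTotal > 0 {
		o.MemoryUsagePct = 100 * memUsed / memTotal
	}
	return o
}

func main() {
	fleet := []gpu{
		{UtilizationPct: 92, PowerW: 340, MemUsedGB: 20, MemTotalGB: 24},
		{UtilizationPct: 3, PowerW: 45, MemUsedGB: 1, MemTotalGB: 16},
		{UtilizationPct: 40, PowerW: 180, MemUsedGB: 6, MemTotalGB: 12},
		{UtilizationPct: 75, PowerW: 260, MemUsedGB: 25, MemTotalGB: 32},
	}
	fmt.Printf("%+v\n", summarize(fleet))
}
```
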
### **GPU Status Cards**
Each GPU displays:
- **Name and Model**: GPU identification
- **Status Badge**: idle/active/warning/critical
- **Utilization Bar**: Real-time utilization percentage
- **Memory Bar**: Used/Total memory with percentage
- **Temperature Bar**: Current temperature with color coding

### **Performance Charts**
- **GPU Performance**: Line chart showing utilization and temperature trends
- **Cost Analytics**: Doughnut chart breaking down cost categories
- **Time Range Selector**: 1H/6H/24H data views

### **Alerts Panel**
- **Real-time alerts** with severity levels
- **Alert details** including source and timestamp
- **One-click resolution** for alert management
- **Alert counter** in the header

## 🔬 Demo Scenarios

### **Scenario 1: Normal Operations**
- Monitor steady-state workloads
- Observe utilization patterns
- Track cost accumulation
- Check the system efficiency score

### **Scenario 2: High Load Training**
- Watch a training workload start
- Observe temperature increases
- Monitor memory allocation
- See power consumption rise

### **Scenario 3: Alert Management**
- Wait for a temperature alert (>80°C)
- Observe the real-time dashboard notification
- Check the alert in the alerts panel
- Monitor the system response

### **Scenario 4: API Integration**
```bash
# System status
curl http://localhost:8090/api/v1/system/stats

# GPU details
curl http://localhost:8090/api/v1/gpus

# Current alerts
curl http://localhost:8090/api/v1/alerts

# Cost information
curl http://localhost:8090/api/v1/costs
```

## 🛠️ Customization Options

### **GPU Configuration**
Change `numGPUs` in `main.go` to simulate a different cluster size:
```go
numGPUs := 8 // Simulate 8 GPUs instead of 4
```

### **Update Intervals**
Adjust the refresh rate in `dashboardConfig`:
```go
RefreshInterval: 1000, // milliseconds; 1000 gives 1-second updates
```

### **Alert Thresholds**
Change the alert triggers in the callback function:
```go
if metrics.Temperature > 75 { // Lower the temperature threshold from 80°C to 75°C
	// Generate alert
}
```

### **Cost Rates**
Update the cost calculation in the cost tracking goroutine:
```go
cost = gpuHours * 3.50 // Higher training rate (was $2.50 per GPU hour)
```

## 🎯 Production Deployment Considerations

### **Infrastructure Requirements**
- **CPU**: 2+ cores (4+ recommended for high throughput)
- **Memory**: 4GB+ RAM (8GB+ for large clusters)
- **Network**: Low latency for WebSocket performance
- **Storage**: Minimal (metrics are stored in memory)

### **Scaling Guidelines**
- **Up to 50 GPUs**: A single instance is sufficient
- **50-200 GPUs**: Consider connection pooling
- **200+ GPUs**: Implement horizontal scaling

### **Production Enhancements**
- **Authentication**: Add user authentication and authorization
- **TLS/SSL**: Enable HTTPS for production security
- **Database**: Persist historical data to a database
- **Caching**: Use Redis for session management
- **Load Balancing**: Use nginx/HAProxy for multiple instances

## 🌟 Value Proposition

### **For Development Teams**
- **No Hardware Dependencies**: Test monitoring without expensive GPUs
- **Realistic Simulation**: Production-like behavior patterns
- **API Testing**: Complete REST and WebSocket APIs
- **Integration Ready**: Prometheus and standard metrics

### **For Demos & Sales**
- **Impressive Visuals**: Modern, professional dashboard
- **Real-time Updates**: Engaging live demonstrations
- **Comprehensive Features**: Full monitoring stack showcase
- **Easy Setup**: Runs anywhere Go is installed, with no GPU or driver prerequisites

### **For Production Planning**
- **Architecture Preview**: Exact production interface
- **Performance Baseline**: Understanding of metrics and costs
- **Alert Testing**: Comprehensive alerting system
- **Capacity Planning**: Resource usage patterns

## 🚀 Next Steps

1. **Explore the Dashboard**: Spend 10-15 minutes with the live interface
2. **Test API Endpoints**: Use curl or Postman to explore the APIs
3. **Monitor Patterns**: Watch workload changes and alert generation
4. **Check Prometheus**: View metrics at http://localhost:8080/metrics
5. **Customize Settings**: Modify GPU count, thresholds, or update rates

## 📞 Support & Documentation

- **GitHub Repository**: https://github.com/Finoptimize/agentaflow-sro-community
- **API Endpoints**: REST API under `/api/v1/*` on port 8090
- **WebSocket Protocol**: Connect to `/ws` on port 8090 for real-time events
- **Prometheus Metrics**: Standard exposition at `/metrics` on port 8080

---

**🎉 Congratulations!** You now have a GPU monitoring dashboard demo that showcases enterprise-grade monitoring capabilities without any specialized hardware. It previews what AgentaFlow SRO Community Edition offers for production GPU infrastructure monitoring.