Commit 1fc05b8

Merge branch 'main' into readme
2 parents 8450ceb + a68bf54 commit 1fc05b8

28 files changed

+6758
-52
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -38,3 +38,4 @@ examples/model_serving
3838
examples/observability
3939
PROJECT.md
4040
agentaflow-sro-community.code-workspace
41+
docs/screenshots/README.md

CLAUDE.md

Lines changed: 265 additions & 39 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 104 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -1,9 +1,10 @@
11
# AgentaFlow SRO Community Edition
22

3-
**AI Infrastructure Tooling & Optimization Platform**
3+
## AI Infrastructure Tooling & Optimization Platform
44

5-
### Author: DeWitt Gibson (@dewitt4)
6-
**Repository**: https://github.com/Finoptimize/agentaflow-sro-community
5+
**Author**: DeWitt Gibson (@dewitt4)
6+
7+
**Repository**: <https://github.com/Finoptimize/agentaflow-sro-community>
78

89

910
Deploy and manage AI infrastructure more efficiently with tools for GPU orchestration, model serving optimization, and comprehensive observability.
@@ -14,22 +15,28 @@ Deploy and manage AI infrastructure more efficiently with tools for GPU orchestr
1415
## 🚀 Features
1516

1617
### GPU Orchestration & Scheduling
18+
1719
Tools that optimize GPU utilization across workloads, reducing waste:
20+
1821
- **Smart Scheduling**: Multiple strategies (least-utilized, best-fit, priority, round-robin)
1922
- **Kubernetes Integration**: Native Kubernetes GPU scheduling with Custom Resource Definitions
2023
- **Resource Optimization**: Reduce GPU idle time by up to 40%
2124
- **Workload Management**: Efficient queuing and distribution across GPU clusters
2225
- **Real-time Monitoring**: Track utilization, memory, temperature, and power
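
As an illustration of the least-utilized strategy named above, here is a minimal sketch. The `GPU` type, `PickLeastUtilized`, and `ErrNoCapacity` are hypothetical names invented for this example and are not the project's actual API; the sketch assumes a workload only needs a free-memory check before placement.

```go
package main

import (
	"errors"
	"fmt"
)

// GPU is a hypothetical, simplified view of a device (not the project's type).
type GPU struct {
	ID          string
	Utilization float64 // percent, 0-100
	FreeMemMB   int
}

// ErrNoCapacity is returned when no GPU has enough free memory.
var ErrNoCapacity = errors.New("no GPU with enough free memory")

// PickLeastUtilized returns the GPU with the lowest utilization that still
// has at least needMemMB of free memory. Ties keep the earliest entry.
func PickLeastUtilized(gpus []GPU, needMemMB int) (GPU, error) {
	best := -1
	for i, g := range gpus {
		if g.FreeMemMB < needMemMB {
			continue // cannot fit the workload at all
		}
		if best == -1 || g.Utilization < gpus[best].Utilization {
			best = i
		}
	}
	if best == -1 {
		return GPU{}, ErrNoCapacity
	}
	return gpus[best], nil
}

func main() {
	gpus := []GPU{
		{ID: "gpu-0", Utilization: 85, FreeMemMB: 4096},
		{ID: "gpu-1", Utilization: 20, FreeMemMB: 2048},
		{ID: "gpu-2", Utilization: 35, FreeMemMB: 16384},
	}
	g, err := PickLeastUtilized(gpus, 8192)
	if err != nil {
		fmt.Println("schedule failed:", err)
		return
	}
	fmt.Println("selected", g.ID)
}
```

In this sample only `gpu-2` has 8192 MB free, so it is selected even though `gpu-1` is less busy. A best-fit strategy would instead prefer the candidate whose free memory is closest to the request.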
2326

2427
### AI Model Serving Optimization
28+
2529
Software that reduces inference costs through better batching, caching, and routing:
30+
2631
- **Request Batching**: Improve throughput by 3-5x with intelligent batching
2732
- **Smart Caching**: Reduce latency by up to 50% with TTL-based caching
2833
- **Load Balancing**: Multiple routing strategies for optimal distribution
2934
- **Cost Reduction**: Minimize inference costs through efficient resource use
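
To make the TTL-based caching idea concrete, the following is a rough sketch using only the Go standard library. `TTLCache`, `NewTTLCache`, `Set`, and `Get` are hypothetical names chosen for illustration and may not match the project's serving package; expired entries are evicted lazily on read, which is one of several possible designs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry pairs a cached value with the time at which it stops being valid.
type entry struct {
	value     string
	expiresAt time.Time
}

// TTLCache is a hypothetical minimal cache for inference responses.
type TTLCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]entry
}

// NewTTLCache creates a cache whose entries live for ttl after being set.
func NewTTLCache(ttl time.Duration) *TTLCache {
	return &TTLCache{ttl: ttl, items: make(map[string]entry)}
}

// Set stores value under key and stamps it with an expiry time.
func (c *TTLCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

// Get returns the cached value if present and not yet expired.
func (c *TTLCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !time.Now().Before(e.expiresAt) {
		delete(c.items, key) // lazily evict the expired entry
		return "", false
	}
	return e.value, true
}

func main() {
	cache := NewTTLCache(time.Minute)
	cache.Set("prompt:hello", "cached response")
	if v, ok := cache.Get("prompt:hello"); ok {
		fmt.Println("cache hit:", v)
	}
}
```

A repeated request within the TTL is answered from memory instead of reaching the model, which is where a latency saving would come from. The actual gain depends on the hit rate of the workload.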
3035

3136
### Observability Tools for AI Systems
37+
3238
Enterprise-grade monitoring, debugging, and cost tracking for LLM applications and training runs:
39+
3340
- **Prometheus Integration**: Production-ready metrics export with 20+ GPU and cost metrics
3441
- **Grafana Dashboards**: Pre-built visual analytics for GPU clusters and cost optimization
3542
- **Real-time Alerting**: Automatic threshold monitoring and notification system
@@ -38,7 +45,46 @@ Enterprise-grade monitoring, debugging, and cost tracking for LLM applications a
3845
- **Distributed Tracing**: Full request tracing across distributed systems
3946
- **Debug Utilities**: Multi-level logging with performance analysis
4047

41-
## 📦 Installation
48+
## Screenshots
49+
50+
### 🌐 Web Dashboard Interface
51+
52+
Our production-ready web dashboard provides real-time GPU monitoring with a modern, professional interface:
53+
54+
![Web Dashboard Overview](docs/screenshots/dashboard-overview.png)
55+
*Real-time GPU monitoring dashboard with live metrics, charts, and system overview*
56+
57+
### 📊 Real-time Performance Charts
58+
59+
Interactive Chart.js visualizations show GPU performance trends and cost analytics:
60+
61+
![Performance Charts](docs/screenshots/performance-charts.png)
62+
*GPU utilization and temperature tracking with live cost breakdown analytics*
63+
64+
### 🎯 GPU Metrics Grid
65+
66+
Comprehensive GPU monitoring with individual card status and real-time alerts:
67+
68+
![GPU Metrics Grid](docs/screenshots/gpu-metrics-grid.png)
69+
*Individual GPU monitoring cards showing utilization, temperature, memory usage, and health status*
70+
71+
### 🚨 Alert Management System
72+
73+
Real-time alert system with WebSocket notifications and threshold monitoring:
74+
75+
![Alert Management](docs/screenshots/alert-system.png)
76+
*Live alert feed with temperature warnings, utilization alerts, and memory notifications*
77+
78+
### 📈 System Analytics
79+
80+
Advanced analytics showing efficiency scores, cost tracking, and performance insights:
81+
82+
![System Analytics](docs/screenshots/system-analytics.png)
83+
*System-wide metrics including efficiency scoring, cost per hour, and resource optimization*
84+
85+
> **Demo Ready**: All screenshots show the dashboard running on a local laptop without requiring NVIDIA hardware - perfect for demonstrations and development!
86+
87+
## 📦 Installation
4288

4389
```bash
4490
go get github.com/Finoptimize/agentaflow-sro-community
@@ -299,7 +345,18 @@ agentaflow-sro-community/
299345
└── demo/ # Demo applications
300346
```
301347

302-
## 🔧 Monitoring & Observability
348+
## Taking Screenshots
349+
350+
To add actual screenshots to this README:
351+
352+
1. Start the demo: `go run examples/demo/web-dashboard/main.go`
353+
2. Open a browser at `http://localhost:9000`
354+
3. Take high-resolution screenshots and save them in `docs/screenshots/`
355+
4. Use the filenames referenced above (dashboard-overview.png, etc.)
356+
357+
For detailed screenshot guidelines, see [docs/screenshots/README.md](docs/screenshots/README.md)
358+
359+
## 🔧 Monitoring & Observability
303360

304361
AgentaFlow provides **enterprise-grade monitoring** through comprehensive Prometheus/Grafana integration with production-ready dashboards and alerting.
305362

@@ -313,7 +370,8 @@ go run main.go
313370
```
314371

315372
This starts:
316-
- **Prometheus metrics server** on http://localhost:8080/metrics
373+
374+
- **Prometheus metrics server** on <http://localhost:8080/metrics>
317375
- **Real-time GPU monitoring** with automatic export
318376
- **Cost tracking** with live calculations
319377
- **Performance analytics** and efficiency scoring
@@ -337,25 +395,29 @@ kubectl port-forward svc/prometheus-service 9090:9090 -n agentaflow-monitoring
337395
### 🎯 Available Metrics & Dashboards
338396

339397
**GPU Performance Metrics:**
398+
340399
- `agentaflow_gpu_utilization_percent` - Real-time GPU utilization
341400
- `agentaflow_gpu_memory_used_bytes` - Memory consumption tracking
342401
- `agentaflow_gpu_temperature_celsius` - Thermal monitoring
343402
- `agentaflow_gpu_power_draw_watts` - Power consumption tracking
344403
- `agentaflow_gpu_fan_speed_percent` - Cooling system status
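
For readers unfamiliar with what a scrape of these gauges looks like, here is a small sketch that renders two of the metric names above in the Prometheus text exposition format, using only the standard library. The `GPUSample` type, the `RenderMetrics` function, and the `gpu_id` label are assumptions made for this example; the project's exporter may use different labels and a Prometheus client library instead.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GPUSample is a hypothetical snapshot of one device.
type GPUSample struct {
	ID          string
	Utilization float64 // percent
	TempCelsius float64
}

// writeGauge emits HELP and TYPE lines followed by one sample line per GPU.
// %q quoting is adequate for plain ASCII IDs such as "gpu-0".
func writeGauge(b *strings.Builder, name, help string, samples []GPUSample, pick func(GPUSample) float64) {
	fmt.Fprintf(b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(b, "# TYPE %s gauge\n", name)
	for _, s := range samples {
		v := strconv.FormatFloat(pick(s), 'f', -1, 64)
		fmt.Fprintf(b, "%s{gpu_id=%q} %s\n", name, s.ID, v)
	}
}

// RenderMetrics returns the text a metrics endpoint could serve for samples.
func RenderMetrics(samples []GPUSample) string {
	var b strings.Builder
	writeGauge(&b, "agentaflow_gpu_utilization_percent", "Real-time GPU utilization",
		samples, func(s GPUSample) float64 { return s.Utilization })
	writeGauge(&b, "agentaflow_gpu_temperature_celsius", "Thermal monitoring",
		samples, func(s GPUSample) float64 { return s.TempCelsius })
	return b.String()
}

func main() {
	fmt.Print(RenderMetrics([]GPUSample{
		{ID: "gpu-0", Utilization: 72.5, TempCelsius: 65},
	}))
}
```

Each sample line has the shape `metric_name{label="value"} number`, so a query such as `avg(agentaflow_gpu_utilization_percent)` could aggregate across devices once Prometheus has scraped the endpoint.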
345404

346405
**Cost & Efficiency Analytics:**
406+
347407
- `agentaflow_cost_total_dollars` - Real-time cost tracking
348408
- `agentaflow_gpu_efficiency_score` - Efficiency scoring (0-100)
349409
- `agentaflow_gpu_idle_time_percent` - Resource waste tracking
350410
- `agentaflow_cost_per_hour` - Live hourly cost calculation
351411

352412
**Workload & Scheduling Metrics:**
413+
353414
- `agentaflow_workloads_pending` - Job queue depth
354415
- `agentaflow_workloads_completed_total` - Completion tracking
355416
- `agentaflow_scheduler_decisions_total` - Scheduling decisions
356417
- `agentaflow_gpu_assignments_total` - Resource assignments
357418

358419
**System Health & Alerts:**
420+
359421
- Component status monitoring
360422
- Automatic threshold alerts
361423
- Performance trend analysis
@@ -364,18 +426,51 @@ kubectl port-forward svc/prometheus-service 9090:9090 -n agentaflow-monitoring
364426
### 📈 Pre-built Grafana Dashboards
365427

366428
The integration includes production-ready dashboards:
429+
367430
- **GPU Cluster Overview** - Multi-node GPU monitoring
368431
- **Cost Analysis Dashboard** - Real-time cost tracking and forecasting
369432
- **Performance Analytics** - Efficiency scoring and optimization insights
370433
- **Alert Management** - Threshold monitoring and notifications
371434

372435
For the complete setup guide and advanced configuration, see [examples/demo/PROMETHEUS_GRAFANA_DEMO.md](examples/demo/PROMETHEUS_GRAFANA_DEMO.md)
373436

437+
## 🌐 Interactive Web Dashboard
438+
439+
AgentaFlow now includes a **production-ready web dashboard** for real-time GPU monitoring and system analytics.
440+
441+
### 🚀 Quick Start Web Dashboard
442+
443+
```bash
444+
cd examples/demo/web-dashboard
445+
go run main.go
446+
```
447+
448+
**Access the dashboard**: <http://localhost:8090>
449+
450+
### ✨ Dashboard Features
451+
452+
- **📊 Real-time Monitoring**: Live GPU metrics with WebSocket updates
453+
- **📈 Interactive Charts**: GPU utilization, temperature, and cost analytics
454+
- **🎯 System Overview**: Total GPUs, efficiency scoring, and cost tracking
455+
- **🚨 Alert Management**: Real-time notifications and one-click resolution
456+
- **📱 Responsive Design**: Optimized for desktop, tablet, and mobile
457+
- **🔌 API Integration**: REST endpoints for custom integrations
458+
459+
### 🎯 Use Cases
460+
461+
- **Data Center Operations** - Real-time cluster monitoring
462+
- **Cost Management** - Live cost tracking and optimization
463+
- **Performance Analysis** - Identify bottlenecks and inefficiencies
464+
- **Alert Management** - Proactive issue detection and resolution
465+
466+
For detailed dashboard documentation, see [examples/demo/web-dashboard/README.md](examples/demo/web-dashboard/README.md)
467+
374468
## 📖 Documentation
375469

376470
For detailed documentation, see [DOCUMENTATION.md](DOCUMENTATION.md)
377471

378472
Topics covered:
473+
379474
- Detailed API reference
380475
- Scheduling strategies
381476
- Performance optimization
@@ -412,8 +507,8 @@ Contributions are welcome! This is a community edition focused on providing acce
412507
- ✅ Real-time GPU metrics collection
413508
- **Prometheus/Grafana integration** - Complete monitoring stack with dashboards
414509
- **Production-ready observability** - Enterprise-grade metrics export and visualization
415-
- 🔄 Web dashboard for monitoring
416-
- 🔄 OpenTelemetry support for tracing
510+
- **Web dashboard for monitoring** - Interactive real-time web interface with charts and alerts
511+
- **OpenTelemetry distributed tracing** - Complete tracing integration with Jaeger/OTLP support
417512

418513
## 🚀 Enterprise Edition (Coming Soon)
419514

@@ -438,4 +533,4 @@ For questions, issues, or contributions, please open an issue on GitHub.
438533

439534
---
440535

441-
**Built with ❤️ by FinOptimize for AgentaFlow**
536+
Built with ❤️ by FinOptimize for AgentaFlow
