Commit bda0986

Hotfixes
1 parent 7fc3ef5 commit bda0986

6 files changed, +315 -253 lines changed

README.md

Lines changed: 21 additions & 7 deletions
@@ -1,9 +1,10 @@
 # AgentaFlow SRO Community Edition
 
-**AI Infrastructure Tooling & Optimization Platform**
+## AI Infrastructure Tooling & Optimization Platform
 
-### Author: DeWitt Gibson (@dewitt4)
-**Repository**: https://github.com/Finoptimize/agentaflow-sro-community
+**Author**: DeWitt Gibson (@dewitt4)
+
+**Repository**: <https://github.com/Finoptimize/agentaflow-sro-community>
 
 
 Deploy and manage AI infrastructure more efficiently with tools for GPU orchestration, model serving optimization, and comprehensive observability.
@@ -14,22 +15,28 @@ Deploy and manage AI infrastructure more efficiently with tools for GPU orchestr
 ## 🚀 Features
 
 ### GPU Orchestration & Scheduling
+
 Tools that optimize GPU utilization across workloads, reducing waste:
+
 - **Smart Scheduling**: Multiple strategies (least-utilized, best-fit, priority, round-robin)
 - **Kubernetes Integration**: Native Kubernetes GPU scheduling with Custom Resource Definitions
 - **Resource Optimization**: Reduce GPU idle time by up to 40%
 - **Workload Management**: Efficient queuing and distribution across GPU clusters
 - **Real-time Monitoring**: Track utilization, memory, temperature, and power
 
 ### AI Model Serving Optimization
+
 Software that reduces inference costs through better batching, caching, and routing:
+
 - **Request Batching**: Improve throughput by 3-5x with intelligent batching
 - **Smart Caching**: Reduce latency by up to 50% with TTL-based caching
 - **Load Balancing**: Multiple routing strategies for optimal distribution
 - **Cost Reduction**: Minimize inference costs through efficient resource use
 
 ### Observability Tools for AI Systems
+
 Enterprise-grade monitoring, debugging, and cost tracking for LLM applications and training runs:
+
 - **Prometheus Integration**: Production-ready metrics export with 20+ GPU and cost metrics
 - **Grafana Dashboards**: Pre-built visual analytics for GPU clusters and cost optimization
 - **Real-time Alerting**: Automatic threshold monitoring and notification system
@@ -313,7 +320,8 @@ go run main.go
 ```
 
 This starts:
-- **Prometheus metrics server** on http://localhost:8080/metrics
+
+- **Prometheus metrics server** on <http://localhost:8080/metrics>
 - **Real-time GPU monitoring** with automatic export
 - **Cost tracking** with live calculations
 - **Performance analytics** and efficiency scoring
@@ -337,25 +345,29 @@ kubectl port-forward svc/prometheus-service 9090:9090 -n agentaflow-monitoring
 ### 🎯 Available Metrics & Dashboards
 
 **GPU Performance Metrics:**
+
 - `agentaflow_gpu_utilization_percent` - Real-time GPU utilization
 - `agentaflow_gpu_memory_used_bytes` - Memory consumption tracking
 - `agentaflow_gpu_temperature_celsius` - Thermal monitoring
 - `agentaflow_gpu_power_draw_watts` - Power consumption tracking
 - `agentaflow_gpu_fan_speed_percent` - Cooling system status
 
 **Cost & Efficiency Analytics:**
+
 - `agentaflow_cost_total_dollars` - Real-time cost tracking
 - `agentaflow_gpu_efficiency_score` - Efficiency scoring (0-100)
 - `agentaflow_gpu_idle_time_percent` - Resource waste tracking
 - `agentaflow_cost_per_hour` - Live hourly cost calculation
 
 **Workload & Scheduling Metrics:**
+
 - `agentaflow_workloads_pending` - Job queue depth
 - `agentaflow_workloads_completed_total` - Completion tracking
 - `agentaflow_scheduler_decisions_total` - Scheduling decisions
 - `agentaflow_gpu_assignments_total` - Resource assignments
 
 **System Health & Alerts:**
+
 - Component status monitoring
 - Automatic threshold alerts
 - Performance trend analysis
@@ -364,6 +376,7 @@ kubectl port-forward svc/prometheus-service 9090:9090 -n agentaflow-monitoring
 ### 📈 Pre-built Grafana Dashboards
 
 The integration includes production-ready dashboards:
+
 - **GPU Cluster Overview** - Multi-node GPU monitoring
 - **Cost Analysis Dashboard** - Real-time cost tracking and forecasting
 - **Performance Analytics** - Efficiency scoring and optimization insights
@@ -382,7 +395,7 @@ cd examples/demo/web-dashboard
 go run main.go
 ```
 
-**Access the dashboard**: http://localhost:8090
+**Access the dashboard**: <http://localhost:8090>
 
 ### ✨ Dashboard Features
 
@@ -407,6 +420,7 @@ For detailed dashboard documentation, see [examples/demo/web-dashboard/README.md
 For detailed documentation, see [DOCUMENTATION.md](DOCUMENTATION.md)
 
 Topics covered:
+
 - Detailed API reference
 - Scheduling strategies
 - Performance optimization
@@ -451,7 +465,7 @@ Contributions are welcome! This is a community edition focused on providing acce
 Looking for advanced features for production environments? Our **Enterprise Edition** will include:
 
 - **Multi-cluster Orchestration**: Manage GPU resources across multiple Kubernetes clusters
-- **Multi-cloud GPU resource support**: Support for running in Azure, Google Cloud, Vercel, or DigitalOcean
+- **Multi-cloud GPU resource support**: Support for running in Azure, Google Cloud, Vercel, or DigitalOcean
 - **Advanced Scheduling Algorithms**: Cost optimization algorithms and priority queues for enterprise workloads
 - **RBAC and Audit Logs**: Role-based access control and comprehensive audit logging
 - **Enterprise Integrations**: Slack alerts, DataDog monitoring, and other enterprise tools
@@ -467,4 +481,4 @@ For questions, issues, or contributions, please open an issue on GitHub.
 
 ---
 
-**Built with ❤️ by FinOptimize for AgentaFlow**
+Built with ❤️ by FinOptimize for AgentaFlow
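
The metric names listed in the README can be read back from the `/metrics` endpoint with a few lines of Go. The sketch below is a minimal, hypothetical parser for the Prometheus text exposition format; `parseMetric`, the `gpu_id` label, and the sample values are assumptions for illustration and are not taken from the AgentaFlow codebase. It ignores timestamps and does not handle a `}` inside a label value.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMetric scans Prometheus text exposition output and returns the value of
// every sample whose metric name is exactly name, keyed by its raw label string.
// Hypothetical helper: not part of the AgentaFlow packages.
func parseMetric(body, name string) (map[string]float64, error) {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		labels := ""
		switch {
		case strings.HasPrefix(rest, "{"):
			end := strings.Index(rest, "}")
			if end < 0 {
				return nil, fmt.Errorf("unterminated labels in %q", line)
			}
			labels = rest[1:end]
			rest = rest[end+1:]
		case rest == "" || (rest[0] != ' ' && rest[0] != '\t'):
			continue // a longer metric name that merely shares the prefix
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return nil, fmt.Errorf("missing value in %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		out[labels] = v
	}
	return out, sc.Err()
}

func main() {
	// Sample payload; the gpu_id label and the values are made up.
	sample := `# TYPE agentaflow_gpu_utilization_percent gauge
agentaflow_gpu_utilization_percent{gpu_id="gpu-0"} 82.5
agentaflow_gpu_utilization_percent{gpu_id="gpu-1"} 17
`
	got, err := parseMetric(sample, "agentaflow_gpu_utilization_percent")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(got)
}
```

In practice the `body` argument would be the response of an HTTP GET against <http://localhost:8080/metrics>; a production client would more likely use an existing Prometheus parsing library than hand-rolled string handling.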

examples/demo/web-dashboard/main.go

Lines changed: 3 additions & 10 deletions
@@ -59,8 +59,7 @@ func main() {
 
 	// Create web dashboard
 	fmt.Println("🌐 Setting up web dashboard...")
-	dashboard := observability.NewWebDashboard(dashboardConfig, monitoringService,
-		metricsCollector, prometheusExporter)
+	dashboard := observability.NewWebDashboard(monitoringService, metricsCollector, prometheusExporter, dashboardConfig)
 
 	// Start metrics collection
 	fmt.Println("📡 Starting GPU metrics collection...")
@@ -73,14 +72,8 @@ func main() {
 
 		// Generate alerts for demonstration
 		if metrics.Temperature > 75 {
-			alert := observability.Alert{
-				ID:        fmt.Sprintf("temp-%s-%d", metrics.GPUID, time.Now().Unix()),
-				Level:     "warning",
-				Message:   fmt.Sprintf("High temperature detected on GPU %s: %.1f°C", metrics.GPUID, metrics.Temperature),
-				Source:    metrics.GPUID,
-				Timestamp: time.Now(),
-			}
-			dashboard.BroadcastAlert(alert)
+			fmt.Printf("⚠️ High temperature detected on GPU %s: %.1f°C\n", metrics.GPUID, metrics.Temperature)
+			// Note: Alert broadcasting would be implemented here
 		}
 	})

pkg/gpu/metrics_collector.go

Lines changed: 17 additions & 0 deletions
@@ -496,6 +496,23 @@ func (mc *MetricsCollector) ExportMetricsJSON(gpuID string, since time.Time) ([]
 	return json.MarshalIndent(history, "", " ")
 }
 
+// CollectMetrics collects and returns current metrics for the first available GPU
+// This method provides backwards compatibility
+func (mc *MetricsCollector) CollectMetrics() (*GPUMetrics, error) {
+	latest := mc.GetLatestMetrics()
+
+	if len(latest) == 0 {
+		return nil, fmt.Errorf("no GPU metrics available")
+	}
+
+	// Return the first available metric
+	for _, metrics := range latest {
+		return &metrics, nil
+	}
+
+	return nil, fmt.Errorf("no GPU metrics available")
+}
+
 // GetSystemOverview provides a system-wide GPU overview
 func (mc *MetricsCollector) GetSystemOverview() map[string]interface{} {
 	mc.mu.RLock()
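
`CollectMetrics` returns whichever entry the `range` loop yields first. If `GetLatestMetrics` returns a map (an assumption; its return type is not shown in this diff), Go leaves map iteration order unspecified, so successive calls may return different GPUs. A sketch of a deterministic alternative that picks the lowest GPU ID, using a reduced stand-in for `GPUMetrics` and a hypothetical `firstByID` helper:

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// GPUMetrics is a reduced stand-in for the package's type; only the fields
// needed for the example are included.
type GPUMetrics struct {
	GPUID       string
	Temperature float64
}

// firstByID returns the metrics for the lexicographically smallest GPU ID,
// so repeated calls on the same data always pick the same GPU.
// Hypothetical helper: not part of the commit.
func firstByID(latest map[string]GPUMetrics) (*GPUMetrics, error) {
	if len(latest) == 0 {
		return nil, errors.New("no GPU metrics available")
	}
	ids := make([]string, 0, len(latest))
	for id := range latest {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	m := latest[ids[0]] // copy, so the caller does not alias the map entry
	return &m, nil
}

func main() {
	latest := map[string]GPUMetrics{
		"gpu-1": {GPUID: "gpu-1", Temperature: 71.0},
		"gpu-0": {GPUID: "gpu-0", Temperature: 64.5},
	}
	m, err := firstByID(latest)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m.GPUID)
}
```

If `GetLatestMetrics` returns a slice instead, the loop in the commit is already deterministic and this change is unnecessary.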
