
Commit 0d0a73c

Feature/redesign (#53)
## Summary

- Redesigned the dashboard with a sidebar-based layout, replacing the old flat single-page structure
- Split monolithic `styles.css` into modular CSS: `tokens.css`, `layout.css`, `components.css`
- Added primary/secondary metric hierarchy with visual weight for the detail view
- Added chart drawer for click-to-expand correlation analysis between metrics
- Enhanced single-GPU overview on the All page

1 parent 74abcc7 commit 0d0a73c

17 files changed: +3990 -3519 lines changed


README.md

Lines changed: 33 additions & 17 deletions
@@ -2,7 +2,7 @@
 
 # GPU Hot
 
-Real-time NVIDIA GPU monitoring dashboard. Web-based, no SSH required.
+Real-time NVIDIA GPU monitoring dashboard. Lightweight, web-based, and self-hosted.
 
 [![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
 [![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com/)
@@ -76,7 +76,7 @@ NODE_URLS=http://host:1312... # Comma-separated node URLs (required for hub mod
 
 **Backend (`core/config.py`):**
 ```python
-UPDATE_INTERVAL = 0.5 # Polling interval
+UPDATE_INTERVAL = 0.5 # Polling interval in seconds
 PORT = 1312 # Server port
 ```
 
@@ -87,42 +87,58 @@ PORT = 1312 # Server port
 ### HTTP
 ```bash
 GET / # Dashboard
-GET /api/gpu-data # JSON metrics
+GET /api/gpu-data # JSON metrics snapshot
+GET /api/version # Version and update info
 ```
 
 ### WebSocket
 ```javascript
-socket.on('gpu_data', (data) => {
-  // Updates every 0.5s (configurable)
-  // Contains: data.gpus, data.processes, data.system
-});
+const ws = new WebSocket('ws://localhost:1312/socket.io/');
+
+ws.onmessage = (event) => {
+  const data = JSON.parse(event.data);
+  // data.gpus — per-GPU metrics
+  // data.processes — active GPU processes
+  // data.system — host CPU, RAM, swap, disk, network
+};
 ```
+
 ---
 
 ## Project Structure
 
-```bash
+```
 gpu-hot/
-├── app.py # Flask + WebSocket server
+├── app.py # FastAPI server + routes
+├── version.py # Version info
 ├── core/
 │   ├── config.py # Configuration
 │   ├── monitor.py # NVML GPU monitoring
 │   ├── handlers.py # WebSocket handlers
-│   ├── routes.py # HTTP routes
+│   ├── hub.py # Multi-node hub aggregator
+│   ├── hub_handlers.py # Hub WebSocket handlers
+│   ├── nvidia_smi_fallback.py # nvidia-smi fallback for older GPUs
 │   └── metrics/
 │       ├── collector.py # Metrics collection
 │       └── utils.py # Metric utilities
 ├── static/
+│   ├── css/
+│   │   ├── tokens.css # Design tokens (colors, spacing)
+│   │   ├── layout.css # Page layout (sidebar, main)
+│   │   └── components.css # UI components (cards, charts)
 │   ├── js/
-│   │   ├── charts.js # Chart configs
-│   │   ├── gpu-cards.js # UI components
-│   │   ├── socket-handlers.js # WebSocket + rendering
-│   │   ├── ui.js # View management
-│   │   └── app.js # Init
-│   └── css/styles.css
+│   │   ├── chart-config.js # Chart.js configurations
+│   │   ├── chart-manager.js # Chart data + lifecycle
+│   │   ├── chart-drawer.js # Correlation drawer
+│   │   ├── gpu-cards.js # GPU card rendering
+│   │   ├── socket-handlers.js # WebSocket + batched rendering
+│   │   ├── ui.js # Sidebar navigation
+│   │   └── app.js # Init + version check
+│   └── favicon.svg
 ├── templates/index.html
 ├── Dockerfile
-└── docker-compose.yml
+├── docker-compose.yml
+└── requirements.txt
 ```
 
 ---
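
The README changes document a `GET /api/gpu-data` snapshot endpoint and three top-level payload sections (`gpus`, `processes`, `system`). As a rough illustration of how a Python client might consume that snapshot, here is a minimal sketch. It assumes a node is reachable at `http://localhost:1312` (the `PORT` shown in the README) and that the HTTP snapshot carries the same three top-level keys as the WebSocket message; the helper names `fetch_snapshot` and `split_snapshot` are hypothetical and not part of the project.

```python
import json
from urllib.request import urlopen

# Assumption: a GPU Hot node listens here; 1312 is the PORT from the README.
BASE_URL = "http://localhost:1312"


def split_snapshot(payload):
    """Return the (gpus, processes, system) sections of a snapshot dict.

    A missing section comes back as None, because the exact shape of each
    section is not specified in the README.
    """
    return (
        payload.get("gpus"),
        payload.get("processes"),
        payload.get("system"),
    )


def fetch_snapshot(base_url=BASE_URL, timeout=5.0):
    """GET /api/gpu-data and decode the JSON body into a dict."""
    with urlopen(f"{base_url}/api/gpu-data", timeout=timeout) as resp:
        return json.load(resp)
```

Against a running node, `gpus, processes, system = split_snapshot(fetch_snapshot())` would separate the sections; keeping the parsing apart from the network call makes it possible to exercise the parsing with a hand-written dict.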

core/handlers.py

Lines changed: 49 additions & 1 deletion
@@ -55,11 +55,59 @@ async def monitor_loop(monitor, connections):
     monitor.get_processes()
 )
 
+# Core system metrics
+vmem = psutil.virtual_memory()
 system_info = {
     'cpu_percent': psutil.cpu_percent(percpu=False),
-    'memory_percent': psutil.virtual_memory().percent,
+    'memory_percent': vmem.percent,
+    'memory_total_gb': round(vmem.total / (1024 ** 3), 2),
+    'memory_used_gb': round(vmem.used / (1024 ** 3), 2),
+    'memory_available_gb': round(vmem.available / (1024 ** 3), 2),
+    'cpu_count': psutil.cpu_count(),
     'timestamp': datetime.now().isoformat()
 }
+
+# Swap memory
+try:
+    swap = psutil.swap_memory()
+    system_info['swap_percent'] = swap.percent
+except Exception:
+    pass
+
+# CPU frequency
+try:
+    freq = psutil.cpu_freq()
+    if freq:
+        system_info['cpu_freq_current'] = round(freq.current, 0)
+        system_info['cpu_freq_max'] = round(freq.max, 0)
+except Exception:
+    pass
+
+# Load average (Linux/Mac only)
+try:
+    load = psutil.getloadavg()
+    system_info['load_avg_1'] = round(load[0], 2)
+    system_info['load_avg_5'] = round(load[1], 2)
+    system_info['load_avg_15'] = round(load[2], 2)
+except (AttributeError, OSError):
+    pass
+
+# Network I/O (cumulative bytes — frontend computes rate)
+try:
+    net = psutil.net_io_counters()
+    system_info['net_bytes_sent'] = net.bytes_sent
+    system_info['net_bytes_recv'] = net.bytes_recv
+except Exception:
+    pass
+
+# Disk I/O (cumulative bytes — frontend computes rate)
+try:
+    disk = psutil.disk_io_counters()
+    if disk:
+        system_info['disk_read_bytes'] = disk.read_bytes
+        system_info['disk_write_bytes'] = disk.write_bytes
+except Exception:
+    pass
 
 data = {
     'mode': config.MODE,
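
The comments in this hunk note that the network and disk values are cumulative byte counters and that the rate is computed on the frontend. The frontend code is not shown here, so the following is only a sketch of that kind of computation, written in Python for consistency with the backend. The function names `bytes_per_second` and `io_rates` are hypothetical, and returning `None` for a counter that moved backwards is an assumed policy, not something the commit specifies.

```python
def bytes_per_second(prev_bytes, curr_bytes, elapsed_seconds):
    """Convert two cumulative byte counters into an average rate.

    Returns None when no rate can be derived: a non-positive elapsed time,
    or a counter that went backwards (for example after a reboot or wrap).
    """
    if elapsed_seconds <= 0:
        return None
    delta = curr_bytes - prev_bytes
    if delta < 0:
        return None
    return delta / elapsed_seconds


def io_rates(prev, curr, elapsed_seconds):
    """Compute a rate for each cumulative key present in both samples."""
    keys = (
        "net_bytes_sent",
        "net_bytes_recv",
        "disk_read_bytes",
        "disk_write_bytes",
    )
    return {
        key: bytes_per_second(prev[key], curr[key], elapsed_seconds)
        for key in keys
        if key in prev and key in curr
    }
```

Here `prev` and `curr` stand for two consecutive `system` payloads and `elapsed_seconds` for the time between them (0.5 s at the default `UPDATE_INTERVAL`). Keys that the backend omitted, because the corresponding `psutil` call failed, are skipped rather than reported as zero.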

core/metrics/collector.py

Lines changed: 2 additions & 3 deletions
@@ -193,10 +193,9 @@ def _add_fan_speeds(self, handle, data):
 
 def _add_throttling(self, handle, data):
     if throttle := safe_get(pynvml.nvmlDeviceGetCurrentClocksThrottleReasons, handle):
+        # Only report genuinely alarming throttle reasons.
+        # GPU Idle, App Settings, and SW Power Cap are normal operating conditions.
         throttle_map = [
-            (pynvml.nvmlClocksThrottleReasonGpuIdle, 'GPU Idle'),
-            (pynvml.nvmlClocksThrottleReasonApplicationsClocksSetting, 'App Settings'),
-            (pynvml.nvmlClocksThrottleReasonSwPowerCap, 'SW Power Cap'),
             (pynvml.nvmlClocksThrottleReasonHwSlowdown, 'HW Slowdown'),
             (pynvml.nvmlClocksThrottleReasonSwThermalSlowdown, 'SW Thermal'),
             (pynvml.nvmlClocksThrottleReasonHwThermalSlowdown, 'HW Thermal'),
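
The hunk ends before the code that consumes `throttle_map`, so how the map is applied is not visible here. A plausible reading is that each entry's bit is tested against the bitmask returned by NVML; the sketch below illustrates that idea under stated assumptions. The bit values are local stand-ins so the example runs without `pynvml` installed, and are assumed to mirror the corresponding `nvmlClocksThrottleReason*` constants. The name `active_throttle_reasons` is hypothetical, and the real map may contain entries beyond the three shown in the diff.

```python
# Stand-in bit values so the sketch runs without pynvml installed.
# Assumption: these mirror the nvmlClocksThrottleReason* constants; real code
# should reference the pynvml constants directly, as the diff does.
HW_SLOWDOWN = 0x08
SW_THERMAL_SLOWDOWN = 0x20
HW_THERMAL_SLOWDOWN = 0x40

# Only the three entries visible in the diff; the actual list may be longer.
THROTTLE_MAP = [
    (HW_SLOWDOWN, "HW Slowdown"),
    (SW_THERMAL_SLOWDOWN, "SW Thermal"),
    (HW_THERMAL_SLOWDOWN, "HW Thermal"),
]


def active_throttle_reasons(bitmask):
    """Return the labels whose bit is set in the throttle-reason bitmask."""
    return [label for bit, label in THROTTLE_MAP if bitmask & bit]
```

With the idle, application-clock, and software power-cap entries removed from the map, a bitmask carrying only those bits yields an empty list, which matches the intent stated in the new comments: routine operating conditions are no longer reported as throttling.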
