<div align="center">

# GPU Hot

Real-time NVIDIA GPU monitoring dashboard. Web-based, no SSH required.

[![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
[![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com/)

</div>

---

## Usage

Monitor a single machine or an entire cluster with the same Docker image.

**Single machine:**
```bash
docker run -d --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest
```

**Multiple machines:**
```bash
# On each GPU server
docker run -d --gpus all -p 1312:1312 -e NODE_NAME=$(hostname) ghcr.io/psalias2006/gpu-hot:latest

# On a hub machine (no GPU required)
docker run -d -p 1312:1312 -e GPU_HOT_MODE=hub -e NODE_URLS=http://server1:1312,http://server2:1312,http://server3:1312 ghcr.io/psalias2006/gpu-hot:latest
```

Open `http://localhost:1312`

**Older GPUs:** Add `-e NVIDIA_SMI=true` if metrics don't appear.

**From source:**
```bash
git clone https://github.com/psalias2006/gpu-hot
cd gpu-hot
docker-compose up --build
```

**Requirements:** Docker + [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

---

## Features

- Real-time metrics with sub-second updates
- Automatic multi-GPU detection
- Process monitoring (PID, memory usage)
- Historical charts (utilization, temperature, power, clocks)
- System metrics (CPU, RAM)
- Scales from 1 to 100+ GPUs across multiple nodes

**Metrics:** Utilization, temperature, memory, power draw, fan speed, clock speeds, PCIe generation and width, P-State, throttle status, encoder/decoder sessions

---

## Configuration

**Environment variables:**
```bash
NVIDIA_VISIBLE_DEVICES=0,1     # Specific GPUs (default: all)
NVIDIA_SMI=true                # Force nvidia-smi mode for older GPUs
GPU_HOT_MODE=hub               # Set to 'hub' for multi-node aggregation (default: single node)
NODE_NAME=gpu-server-1         # Node display name (default: hostname)
NODE_URLS=http://host:1312,... # Comma-separated node URLs (required for hub mode)
```

**Backend (`core/config.py`):**
```python
UPDATE_INTERVAL = 0.5  # Polling interval in seconds
PORT = 1312            # Web server port
```

---

## API

### REST

`GET /api/gpu-data` returns the current metrics as JSON.
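
If you want to script against the REST endpoint, a minimal polling client could look like the sketch below. It only assumes the endpoint returns JSON; the field names shown in the comment (`gpus`, `processes`, `system`) mirror the WebSocket payload and may differ for the REST response.

```python
import json
import time
import urllib.request

URL = "http://localhost:1312/api/gpu-data"  # point at a node or the hub

while True:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        snapshot = json.load(resp)
    # Top-level keys are expected to mirror the WebSocket payload
    # (gpus / processes / system); print them to explore the actual shape.
    print(time.strftime("%H:%M:%S"), sorted(snapshot.keys()))
    time.sleep(2)
```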

### WebSocket
```javascript
socket.on('gpu_data', (data) => {
  // Updates every 0.5s (configurable)
  // Contains: data.gpus, data.processes, data.system
});
```
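
The same stream can be consumed outside the browser. Below is a minimal sketch using the `python-socketio` client; it assumes the server speaks standard Socket.IO on the default path with no custom namespace, which may need adjusting.

```python
# pip install "python-socketio[client]"
import socketio

sio = socketio.Client()

@sio.on("gpu_data")
def on_gpu_data(data):
    # Same payload the dashboard receives: data["gpus"], data["processes"], data["system"]
    gpus = data.get("gpus", {})
    print(f"update received: {len(gpus)} GPU(s)")

sio.connect("http://localhost:1312")
sio.wait()  # block and keep receiving events
```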

---

## Project Structure

---

## Troubleshooting

**No GPUs detected:**
```bash
nvidia-smi    # Verify drivers work
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi    # Test Docker GPU access
```

**Hub can't connect to nodes:**
```bash
curl http://node-ip:1312/api/gpu-data    # Test connectivity from the hub
sudo ufw allow 1312/tcp                  # Open the port if a firewall (here ufw) is blocking it
```

**Performance issues:** Increase `UPDATE_INTERVAL` in `core/config.py`.

---

## Contributing

PRs welcome. Open an issue first for major changes.

## License

MIT - see [LICENSE](LICENSE)