
Commit ba4b218

[WIP] Hub / Agent Mode support (#28)
* Test
* Multi-Node works
* Done
* test commit
* Migrates from HTTP to WebSocket for real-time updates
  Simplifies the user interface
  Refactors Docker usage and internal logic
* update the README.md and the page
* still under test
* still under test
* still under test
* still under test
* still under test

---------

Co-authored-by: Panos <>
1 parent e2fcf7f commit ba4b218

File tree

17 files changed (+1011 / -231 lines)


.dockerignore

Lines changed: 9 additions & 67 deletions
@@ -1,76 +1,18 @@
-# Version control
-.git
-.gitignore
-.gitattributes
-
-# Python
-__pycache__
+__pycache__/
 *.pyc
 *.pyo
 *.pyd
 .Python
 *.so
 *.egg
-*.egg-info
-dist
-build
-
-# Virtual Environment
-venv/
-env/
-ENV/
-.venv/
-
-# IDE
-.vscode/
-.idea/
-*.swp
-*.swo
-*~
-.DS_Store
-
-# Documentation
-README.md
-LICENSE
+*.egg-info/
+dist/
+build/
+.git/
+.gitignore
 *.md
+!README.md
 docs/
-
-# Docker
-Dockerfile
-docker-compose.yml
-docker-compose.override.yml
-.dockerignore
-
-# CI/CD
-.github/
-.gitlab-ci.yml
-.travis.yml
-Jenkinsfile
-
-# Testing
-tests/
-test/
-*.test.py
-.pytest_cache/
-.coverage
-htmlcov/
-.tox/
-
-# Logs
-*.log
-logs/
-
-# Temporary files
-tmp/
-temp/
-*.tmp
-*.temp
-
-# OS files
+*.png
+LICENSE
 .DS_Store
-Thumbs.db
-
-# Environment files
-.env
-.env.local
-.env.*.local

README.md

Lines changed: 45 additions & 140 deletions
@@ -1,9 +1,8 @@
 <div align="center">
 
 # GPU Hot
-### **Real-time NVIDIA GPU Monitoring Dashboard**
 
-Monitor NVIDIA GPUs from any browser. No SSH, no configuration – just start and view in real-time.
+Real-time NVIDIA GPU monitoring dashboard. Web-based, no SSH required.
 
 [![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
 [![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com/)
@@ -14,108 +13,69 @@ Monitor NVIDIA GPUs from any browser. No SSH, no configuration – just start an
 
 </div>
 
+---
 
-## Quick Start
+## Usage
 
-### Docker (recommended)
+Monitor a single machine or an entire cluster with the same Docker image.
 
+**Single machine:**
 ```bash
-docker run -d --name gpu-hot --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest
+docker run -d --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest
 ```
 
-**Force nvidia-smi mode (for older GPUs):**
+**Multiple machines:**
 ```bash
-docker run -d --name gpu-hot --gpus all -p 1312:1312 -e NVIDIA_SMI=true ghcr.io/psalias2006/gpu-hot:latest
+# On each GPU server
+docker run -d --gpus all -p 1312:1312 -e NODE_NAME=$(hostname) ghcr.io/psalias2006/gpu-hot:latest
+
+# On a hub machine (no GPU required)
+docker run -d -p 1312:1312 -e GPU_HOT_MODE=hub -e NODE_URLS=http://server1:1312,http://server2:1312,http://server3:1312 ghcr.io/psalias2006/gpu-hot:latest
 ```
 
 Open `http://localhost:1312`
 
-### From source
+**Older GPUs:** Add `-e NVIDIA_SMI=true` if metrics don't appear.
 
+**From source:**
 ```bash
 git clone https://github.com/psalias2006/gpu-hot
 cd gpu-hot
 docker-compose up --build
 ```
 
-### Local dev
-
-```bash
-pip install -r requirements.txt
-python app.py
-```
-
-**Requirements:** Docker + NVIDIA Container Toolkit ([install guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html))
+**Requirements:** Docker + [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
 
 ---
 
 ## Features
 
-**Sub-Second Updates:**
-- **Lightning-fast refresh rates**
-- Historical data tracking
-- WebSocket real-time streaming
-
-**Charts:**
-- Utilization, Temperature, Memory, Power
-- Fan Speed, Clock Speeds, Power Efficiency
-
-**Monitoring:**
-- Multi-GPU detection
-- Process tracking (PID, memory usage)
-- System CPU/RAM
-- WebSocket real-time updates
-
-**Metrics:**
-- GPU & Memory Utilization (%)
-- Temperature (GPU core, memory)
-- Memory (used/free/total)
-- Power draw & limits
-- Fan Speed (%)
-- Clock Speeds (graphics, SM, memory, video)
-- PCIe Gen & width
-- Performance State (P-State)
-- Compute Mode
-- Encoder/Decoder sessions
-- Throttle status
+- Real-time metrics (sub-second)
+- Automatic multi-GPU detection
+- Process monitoring (PID, memory usage)
+- Historical charts (utilization, temperature, power, clocks)
+- System metrics (CPU, RAM)
+- Scale from 1 to 100+ GPUs
+
+**Metrics:** Utilization, temperature, memory, power draw, fan speed, clock speeds, PCIe info, P-State, throttle status, encoder/decoder sessions
 
 ---
 
 ## Configuration
 
-Optional. Edit `core/config.py`:
-
-```python
-UPDATE_INTERVAL = 0.5 # NVML polling interval (fast)
-NVIDIA_SMI_INTERVAL = 2.0 # nvidia-smi polling interval (slower to reduce overhead)
-PORT = 1312 # Web server port
-DEBUG = False
-```
-
-Environment variables:
+**Environment variables:**
 ```bash
 NVIDIA_VISIBLE_DEVICES=0,1 # Specific GPUs (default: all)
-NVIDIA_SMI=true # Force nvidia-smi mode for all GPUs
-```
-
-**nvidia-smi Fallback:**
-- Automatically detects GPUs that don't support NVML utilization metrics
-- Falls back to nvidia-smi for those GPUs
-- Compatible with older GPUs (Quadro P1000, Tesla, etc.)
-
-**Force nvidia-smi for all GPUs:**
-- Docker: `docker run -e NVIDIA_SMI=true ...`
-- Config: Set `NVIDIA_SMI = True` in `core/config.py`
-
-Frontend tuning in `static/js/socket-handlers.js`:
-```javascript
-DOM_UPDATE_INTERVAL = 1000 // Text updates frequency (ms)
-SCROLL_PAUSE_DURATION = 100 // Scroll optimization (ms)
+NVIDIA_SMI=true # Force nvidia-smi mode for older GPUs
+GPU_HOT_MODE=hub # Set to 'hub' for multi-node aggregation (default: single node)
+NODE_NAME=gpu-server-1 # Node display name (default: hostname)
+NODE_URLS=http://host:1312... # Comma-separated node URLs (required for hub mode)
 ```
 
-Chart history in `static/js/charts.js`:
-```javascript
-if (data.labels.length > 120) // Data points to keep
+**Backend (`core/config.py`):**
+```python
+UPDATE_INTERVAL = 0.5 # Polling interval
+PORT = 1312 # Server port
 ```
 
 ---
@@ -131,41 +91,10 @@ GET /api/gpu-data # JSON metrics
 ### WebSocket
 ```javascript
 socket.on('gpu_data', (data) => {
-// Updates every 0.5s
-// data.gpus, data.processes, data.system
+// Updates every 0.5s (configurable)
+// Contains: data.gpus, data.processes, data.system
 });
 ```
-
----
-
-## Extending
-
-Add new metrics:
-
-**Backend (`core/metrics/collector.py`):**
-```python
-# Add NVML query
-value = pynvml.nvmlDeviceGetYourMetric(handle)
-gpu_data['your_metric'] = value
-```
-
-**Frontend (`static/js/gpu-cards.js`):**
-```javascript
-// Add to card template
-<div class="metric-value" id="your-metric-${gpuId}">
-${gpuInfo.your_metric}
-</div>
-
-// Add to update function
-if (yourMetricEl) yourMetricEl.textContent = gpuInfo.your_metric;
-```
-
-**Chart (optional):**
-```javascript
-// static/js/charts.js
-chartConfigs.yourMetric = { type: 'line', ... };
-```
-
 ---
 
 ## Project Structure
@@ -196,53 +125,29 @@ gpu-hot/
 
 ---
 
-## Performance
-
-Frontend uses `requestAnimationFrame` batching to minimize reflows. Scroll detection pauses DOM updates during scrolling.
-
-For heavy workloads or many GPUs, increase update intervals in `core/config.py`.
-
----
-
 ## Troubleshooting
 
-**GPU not detected:**
+**No GPUs detected:**
 ```bash
-# Verify drivers
-nvidia-smi
-
-# Test Docker GPU access
-docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
-
-# Restart Docker
-sudo systemctl restart docker
+nvidia-smi # Verify drivers work
+docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi # Test Docker GPU access
 ```
 
-**Performance issues:**
-- Increase `UPDATE_INTERVAL` in `core/config.py`
-- Reduce chart history in `static/js/charts.js`
-- Check browser console for errors
-
-**Debug mode:**
-```python
-# core/config.py
-DEBUG = True
+**Hub can't connect to nodes:**
+```bash
+curl http://node-ip:1312/api/gpu-data # Test connectivity
+sudo ufw allow 1312/tcp # Check firewall
 ```
 
+**Performance issues:** Increase `UPDATE_INTERVAL` in `core/config.py`
+
 ---
 
 ## Contributing
 
-PRs welcome. For major changes, open an issue first.
+PRs welcome. Open an issue for major changes.
 
 ## License
 
 MIT - see [LICENSE](LICENSE)
-
----
-
-<div align="center">
-
-[Report Bug](https://github.com/psalias2006/gpu-hot/issues)[Request Feature](https://github.com/psalias2006/gpu-hot/issues)
-
-</div>
+
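The new README documents both a `GET /api/gpu-data` HTTP endpoint and a WebSocket stream. As a quick way to script against a deployment like the one described above, the sketch below polls that endpoint on each node; it is not part of this commit, the node URLs are placeholders, and the response is only assumed to expose the top-level `gpus` / `processes` / `system` keys the README mentions.

```python
# Hypothetical check script (not part of this commit): poll the
# GET /api/gpu-data endpoint documented in the README diff above on
# each node a hub would aggregate. Node URLs are placeholders, and the
# response schema beyond the top-level keys is an assumption.
import requests

NODE_URLS = ["http://server1:1312", "http://server2:1312"]  # placeholders

for url in NODE_URLS:
    try:
        resp = requests.get(f"{url}/api/gpu-data", timeout=5)
        resp.raise_for_status()
        data = resp.json()
        # The README mentions data.gpus, data.processes, data.system
        print(f"{url}: keys = {sorted(data)}")
    except requests.RequestException as exc:
        print(f"{url}: unreachable ({exc})")
```

Per the commit message, the live dashboard itself now streams over WebSocket rather than HTTP polling; this endpoint is simply the easiest thing to probe from a script, and the README's troubleshooting section uses the same URL with `curl` for hub connectivity checks.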
