
Commit 1dd4c8d
adding the readme file
1 parent f2e6a47
1 file changed: +177 −0 lines changed


docs/healthcheck/readme.md

# GPU Health Check & Pre-Check Blueprint

This blueprint provides a **pre-check workflow** for GPU health validation before running production or research workloads. The focus is on delivering a **diagnostic tool** that can run in both single-node and multi-node environments, ensuring that your infrastructure is ready for demanding experiments.

The workflow includes:

- **Data types** as input (`fp8`, `fp16`, `fp32`, `fp64`)
- **Custom Functions** for GPU diagnostics
- **GPU-Burn** for stress testing
- **Results** collected in JSON files (and optionally PDF reports)

By following this blueprint, you can identify and localize issues such as thermal throttling, power irregularities, or GPU instability before they impact your main workloads.

---

## 1. Architecture Overview

Below is a simplified overview:

<img width="888" alt="Screenshot 2025-03-13 101052" src="https://github.com/user-attachments/assets/723a8861-388c-4585-b53f-778c2d5c73d6" />

### Key Points

- **Data Types**: You can specify one of several floating-point precisions (`fp8`, `fp16`, `fp32`, `fp64`).
- **Custom Functions**: Diagnostic functions that measure performance metrics such as throughput, memory bandwidth, etc.
- **Single-Node vs. Multi-Node**: Tests can be run on a single machine or scaled to multiple machines.
- **GPU-Burn**: A specialized stress-testing tool for pushing GPUs to their maximum performance limits.
- **Results**: Output is aggregated into JSON files (and optionally PDFs) for analysis.

---

## 2. Health Check Blueprint

This blueprint aims to give you confidence that your GPUs are healthy. The key checks include (a minimal sketch follows the list):

1. **Compute Throughput**
   - Dense matrix multiplications or arithmetic operations stress the GPU cores.
   - Ensures sustained performance without degradation.

2. **Memory Bandwidth**
   - Reading/writing large chunks of data (e.g., `torch.rand()`) tests memory throughput.
   - Verifies the memory subsystem operates at expected speeds.

3. **Temperature & Thermal Stability**
   - Uses commands like `nvidia-smi` to monitor temperature.
   - Checks for throttling under load.

4. **Power Consumption**
   - Monitors power draw (e.g., `nvidia-smi --query-gpu=power.draw --format=csv`).
   - Identifies irregular or excessive power usage.

5. **GPU Utilization**
   - Ensures GPU cores (including Tensor Cores) are fully engaged during tests.
   - Confirms no unexpected idle time.

6. **Error Detection**
   - Checks for hardware errors or CUDA-related issues.
   - Asserts numerical correctness to ensure no silent failures.

7. **Multi-GPU Testing**
   - Validates multi-GPU or multi-node setups.
   - Ensures the entire environment is consistent and stable.

8. **Mixed Precision Testing**
   - Uses AMP for `fp8` or `fp16` operations (e.g., `torch.cuda.amp.autocast()`).
   - Confirms performance and compatibility with mixed-precision ops.
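For checks 1, 3, and 4 above, a minimal PyTorch sketch of what such diagnostics might look like is shown below. The function names, matrix size, and sampling approach are illustrative assumptions, not the blueprint's actual implementation:

```python
# Sketch only: dense-matmul throughput plus a temperature/power sample via nvidia-smi.
import json
import subprocess
import time

import torch


def measure_matmul_tflops(size: int = 4096, iters: int = 20, dtype=torch.float16) -> float:
    """Time repeated dense matrix multiplications and return sustained TFLOP/s."""
    a = torch.rand(size, size, device="cuda", dtype=dtype)
    b = torch.rand(size, size, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    flops = 2 * size**3 * iters  # one multiply-add pair per inner-product element
    return flops / elapsed / 1e12


def sample_temperature_and_power() -> list:
    """Query temperature (C) and power draw (W) for every visible GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    samples = []
    for line in out.strip().splitlines():
        idx, temp, power = (field.strip() for field in line.split(","))
        samples.append({"gpu": int(idx), "temp_c": float(temp), "power_w": float(power)})
    return samples


if __name__ == "__main__":
    report = {
        "tflops_fp16": measure_matmul_tflops(),
        "sensors": sample_temperature_and_power(),
    }
    print(json.dumps(report, indent=2))
```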
---

## 3. Data Types and How They Work

- `fp8`, `fp16`: Lower precision can offer speedups but requires checks for numerical stability.
- `fp32` (single precision): Standard for most deep learning tasks; tests confirm typical GPU operations.
- `fp64` (double precision): Used in HPC/scientific workloads; verifies performance and accuracy at high precision.

Depending on the dtype you select, the script either runs a set of Custom Functions or launches GPU-Burn to push the hardware to its limits. The results are saved in JSON for analysis.
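A minimal sketch of that dispatch, assuming (as the Usage section below suggests) that `fp8`/`fp16` runs the custom PyTorch diagnostics while `float32`/`float64` launches GPU-Burn; every name here is an illustrative placeholder, not the blueprint's actual code:

```python
# Sketch only: dtype-based dispatch between custom diagnostics and GPU-Burn,
# with results written to a JSON file for later analysis.
import json
from pathlib import Path


def run_custom_functions(dtype: str) -> dict:
    # Placeholder: real diagnostics would run matmul, bandwidth, AMP checks, etc.
    return {"mode": "custom_functions", "dtype": dtype}


def run_gpu_burn(dtype: str) -> dict:
    # Placeholder: a real wrapper would invoke the gpu_burn binary and parse its log.
    return {"mode": "gpu_burn", "dtype": dtype}


def run_healthcheck(dtype: str, output_dir: str = "testing_results") -> Path:
    if dtype in ("float8", "float16"):
        results = run_custom_functions(dtype)
    elif dtype in ("float32", "float64"):
        results = run_gpu_burn(dtype)
    else:
        raise ValueError(f"unsupported dtype: {dtype}")

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"healthcheck_{dtype}.json"
    out_file.write_text(json.dumps(results, indent=2))
    return out_file
```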
---

## 4. Custom Functions

These Python-based diagnostic functions systematically measure:

- Throughput (matrix multiplies, convolution stubs, etc.)
- Memory bandwidth (large tensor reads/writes)
- Temperature (via `nvidia-smi` or other sensors)
- Power usage
- GPU utilization
- Error detection (assert checks, error logs)
- Multi-GPU orchestration (parallel usage correctness)
- Mixed precision compatibility (AMP in PyTorch)

They can run on a single node or multiple nodes, with each run producing structured JSON output.
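Two more functions in the same spirit, sketched under the same caveat (illustrative names, sizes, and tolerances, not the shipped code): an on-device memory-bandwidth measurement and a mixed-precision sanity check built on `torch.cuda.amp.autocast()`:

```python
# Sketch only: memory bandwidth via large on-device copies, plus an AMP correctness check.
import time

import torch


def measure_memory_bandwidth_gbs(n_elems: int = 256 * 1024 * 1024, iters: int = 10) -> float:
    """Copy a large fp32 tensor on-device and report effective GB/s."""
    src = torch.rand(n_elems, device="cuda")
    dst = torch.empty_like(src)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    bytes_moved = 2 * src.element_size() * n_elems * iters  # each copy reads and writes
    return bytes_moved / elapsed / 1e9


def check_mixed_precision(size: int = 2048, tol: float = 1e-2) -> bool:
    """Run a matmul under autocast and compare against an fp32 reference."""
    a = torch.rand(size, size, device="cuda")
    b = torch.rand(size, size, device="cuda")
    with torch.cuda.amp.autocast():
        low_precision = torch.matmul(a, b).float()
    reference = torch.matmul(a, b)
    return torch.allclose(low_precision, reference, rtol=tol, atol=tol)
```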
---

## 5. GPU-Burn

[GPU-Burn](https://github.com/wilicc/gpu-burn) is a stress-testing tool designed to push GPUs to their maximum performance limits. It is typically used to:

- Validate hardware stability
- Identify potential overheating or faulty components
- Confirm GPUs can handle extreme workloads without errors or throttling

When you run GPU-Burn in float32 or float64 mode, its output can be captured in a log file, then parsed into JSON or PDF summaries for reporting.
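One hedged way to wrap this: run the binary for a fixed duration, capture its log, and record a pass/fail flag in JSON. This sketch assumes the `gpu_burn` binary is on `PATH`, that `-d` selects double-precision mode, and that faulty devices are flagged with the word `FAULTY` in the summary, as in recent releases of the tool:

```python
# Sketch only: invoke GPU-Burn, capture its log, and summarize the outcome as JSON.
import json
import subprocess


def run_gpu_burn(duration_s: int = 120, use_float64: bool = False) -> dict:
    cmd = ["gpu_burn"]
    if use_float64:
        cmd.append("-d")          # double-precision (float64) stress test
    cmd.append(str(duration_s))   # run time in seconds

    proc = subprocess.run(cmd, capture_output=True, text=True)
    log = proc.stdout + proc.stderr

    summary = {
        "command": " ".join(cmd),
        "return_code": proc.returncode,
        "faulty_gpus_detected": "FAULTY" in log,
        "raw_log": log,
    }
    with open("gpu_burn_results.json", "w") as fh:
        json.dump(summary, fh, indent=2)
    return summary
```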
---

## 6. Usage

1. **Clone the Blueprint & Install Dependencies**

   ```bash
   git clone <repo_url>
   cd <repo_name>
   docker build -t gpu-healthcheck .
   ```

2. **Run the Pre-Check**

   - Single Node Example (fp16):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float16 --expected_gpus A10:2,A100:0,H100:0
     ```

   - GPU-Burn Stress Test (float32):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float32 --expected_gpus A10:2,A100:0,H100:0
     ```

3. **Examine Results**

   - JSON output is located in the `results/` directory.
   - PDF summaries will also be generated.
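If you want to inspect the JSON output programmatically, something like the following works; the exact file names under `results/` depend on your run, so the glob here is only a convenience, not a documented layout:

```python
# List and pretty-print every JSON result file produced by the container run.
import json
from pathlib import Path

for result_file in sorted(Path("results").glob("*.json")):
    print(f"== {result_file.name} ==")
    print(json.dumps(json.loads(result_file.read_text()), indent=2))
```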
---

## 7. Implementing it into OCI AI Blueprints

This is an example of a JSON file that can be used to deploy this health check into OCI AI Blueprints:

```json
{
  "recipe_id": "healthcheck",
  "recipe_mode": "job",
  "deployment_name": "healthcheck",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
  "recipe_node_shape": "VM.GPU.A10.2",
  "output_object_storage": [
    {
      "bucket_name": "healthcheck2",
      "mount_location": "/healthcheck_results",
      "volume_size_in_gbs": 20
    }
  ],
  "recipe_container_command_args": [
    "--dtype", "float16", "--output_dir", "/healthcheck_results", "--expected_gpus", "A10:2,A100:0,H100:0"
  ],
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 1000,
  "recipe_use_shared_node_pool": true
}
```
---

## 8. Contact

For questions or additional information, open an issue in this blueprint or contact the maintainers directly.
