# GPU Health Check & Pre-Check Blueprint

This blueprint provides a pre-check workflow for validating GPU health before running production or research workloads. The focus is on delivering a **diagnostic tool** that runs on both single-node and multi-node environments, ensuring your infrastructure is ready for demanding experiments.

The workflow includes:
- **Data types** as input (`fp8`, `fp16`, `fp32`, `fp64`)
- **Custom Functions** for GPU diagnostics
- **GPU-Burn** for stress testing
- **Results** collected in JSON files (and optionally PDF reports)

By following this blueprint, you can identify and localize issues such as thermal throttling, power irregularities, or GPU instability before they impact your main workloads.

---

## 1. Architecture Overview

Below is a simplified overview:

<img width="888" alt="Architecture overview" src="https://github.com/user-attachments/assets/723a8861-388c-4585-b53f-778c2d5c73d6" />

### Key Points

- **Data Types**: You can specify one of several floating-point precisions (`fp8`, `fp16`, `fp32`, `fp64`).
- **Custom Functions**: Diagnostic functions that measure performance metrics such as throughput, memory bandwidth, etc.
- **Single-Node vs. Multi-Node**: Tests can run on a single machine or scale to multiple machines.
- **GPU-Burn**: A specialized stress-testing tool for pushing GPUs to their maximum performance limits.
- **Results**: Output is aggregated into JSON files (and optionally PDFs) for analysis.

---

## 2. Health Check Blueprint

This blueprint aims to give you confidence that your GPUs are healthy. The key checks include:

1. **Compute Throughput**
   - Dense matrix multiplications and arithmetic operations stress the GPU cores.
   - Ensures sustained performance without degradation.

2. **Memory Bandwidth**
   - Reading/writing large chunks of data (e.g., via `torch.rand()`) tests memory throughput.
   - Verifies the memory subsystem operates at expected speeds.

3. **Temperature & Thermal Stability**
   - Uses commands like `nvidia-smi` to monitor temperature.
   - Checks for throttling under load.

4. **Power Consumption**
   - Monitors power draw (e.g., `nvidia-smi --query-gpu=power.draw --format=csv`).
   - Identifies irregular or excessive power usage.

5. **GPU Utilization**
   - Ensures GPU cores (including Tensor Cores) are fully engaged during tests.
   - Confirms no unexpected idle time.

6. **Error Detection**
   - Checks for hardware errors or CUDA-related issues.
   - Asserts numerical correctness to ensure no silent failures.

7. **Multi-GPU Testing**
   - Validates multi-GPU and multi-node setups.
   - Ensures the entire environment is consistent and stable.

8. **Mixed Precision Testing**
   - Uses AMP for fp16 (and, where supported, fp8) operations (e.g., `torch.cuda.amp.autocast()`).
   - Confirms performance and compatibility with mixed-precision ops.
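The temperature and power checks above reduce to sampling `nvidia-smi` and comparing readings against limits. A minimal sketch of that pattern follows; the 85 °C and 300 W thresholds are illustrative assumptions, not values defined by this blueprint:

```python
import subprocess

def query_gpu_stats():
    """Query per-GPU temperature and power draw via nvidia-smi (CSV, no header/units)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(out)

def parse_gpu_stats(csv_text, temp_limit_c=85.0, power_limit_w=300.0):
    """Parse the CSV output and flag GPUs exceeding the (illustrative) limits."""
    results = []
    for line in csv_text.strip().splitlines():
        index, temp, power = [field.strip() for field in line.split(",")]
        results.append({
            "gpu": int(index),
            "temperature_c": float(temp),
            "power_w": float(power),
            "ok": float(temp) <= temp_limit_c and float(power) <= power_limit_w,
        })
    return results

# Parsing a captured sample (two GPUs; GPU 1 is running hot):
sample = "0, 62, 148.31\n1, 91, 152.07"
print(parse_gpu_stats(sample))
```

Keeping the parsing separate from the `subprocess` call makes the threshold logic easy to test without a GPU present.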

---

## 3. Data Types and How They Work

- `fp8`, `fp16`: Lower precision can offer speedups but requires checks for numerical stability.
- `fp32` (single precision): The standard for most deep learning tasks; tests confirm typical GPU operations.
- `fp64` (double precision): Used in HPC/scientific workloads; verifies performance and accuracy at high precision.

Depending on the dtype you select, the script either runs a set of Custom Functions or launches GPU-Burn to push the hardware to its limits. The results are saved in JSON for analysis.
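The dtype-based dispatch can be sketched as follows. The mapping is inferred from the usage examples later in this document (fp16 runs the custom functions, fp32/fp64 run GPU-Burn), and the function name is an illustrative placeholder, not the blueprint's actual entry point:

```python
def select_test_path(dtype):
    """Route a requested dtype to the appropriate test suite.

    Lower precisions (fp8/fp16) exercise the custom diagnostic functions,
    while fp32/fp64 launch GPU-Burn, matching the workflow described above.
    """
    dtype = dtype.lower()
    if dtype in ("fp8", "float8", "fp16", "float16"):
        return "custom_functions"
    if dtype in ("fp32", "float32", "fp64", "float64"):
        return "gpu_burn"
    raise ValueError(f"Unsupported dtype: {dtype}")

print(select_test_path("float16"))  # custom_functions
print(select_test_path("fp64"))    # gpu_burn
```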

---

## 4. Custom Functions

These Python-based diagnostic functions systematically measure:

- Throughput (matrix multiplies, convolution stubs, etc.)
- Memory bandwidth (large tensor reads/writes)
- Temperature (via `nvidia-smi` or other sensors)
- Power usage
- GPU utilization
- Error detection (assert checks, error logs)
- Multi-GPU orchestration (parallel usage correctness)
- Mixed precision compatibility (AMP in PyTorch)

They can run on a single node or multiple nodes, with each run producing structured JSON output.
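For the throughput measurement, a common approach is to time a dense matmul and convert the elapsed time into TFLOPS. A minimal, framework-agnostic sketch of the arithmetic (a real diagnostic would time a warmed-up `torch.matmul` on the device):

```python
def matmul_tflops(n, elapsed_s, iterations=1):
    """TFLOPS for an n x n @ n x n matmul: ~2*n^3 floating-point ops per iteration."""
    flops = 2 * n ** 3 * iterations
    return flops / elapsed_s / 1e12

# e.g. an 8192^2 matmul completing in 10 ms:
print(round(matmul_tflops(8192, 0.010), 1))
```

Comparing the computed TFLOPS against the GPU's datasheet peak for the selected dtype gives a simple pass/fail signal for check 1.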

---

## 5. GPU-Burn

[GPU-Burn](https://github.com/wilicc/gpu-burn) is a stress-testing tool designed to push GPUs to their maximum performance limits. It is typically used to:

- Validate hardware stability
- Identify potential overheating or faulty components
- Confirm GPUs can handle extreme workloads without errors or throttling

When you run GPU-Burn in float32 or float64 mode, its output can be captured in a log file and then parsed into JSON or PDF summaries for reporting.
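Turning a captured GPU-Burn log into a JSON summary can be sketched as below. The `GPU N: OK`/`FAULTY` line format is an assumption based on typical GPU-Burn output; adjust the pattern to match your captured log:

```python
import json
import re

def summarize_gpu_burn_log(log_text):
    """Extract per-GPU pass/fail status from a GPU-Burn log into a summary dict."""
    pattern = re.compile(r"GPU (\d+): (OK|FAULTY)")
    gpus = {int(m.group(1)): m.group(2) == "OK" for m in pattern.finditer(log_text)}
    return {
        "gpus_tested": len(gpus),
        "all_ok": bool(gpus) and all(gpus.values()),
        "per_gpu": gpus,
    }

sample_log = "Tested 2 GPUs:\n  GPU 0: OK\n  GPU 1: OK\n"
print(json.dumps(summarize_gpu_burn_log(sample_log)))
```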

---

## 6. Usage

1. Clone the Blueprint & Install Dependencies

   ```bash
   git clone <repo_url>
   cd <repo_name>
   docker build -t gpu-healthcheck .
   ```

2. Run the Pre-Check
   - Single Node Example (fp16):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float16 --expected_gpus A10:2,A100:0,H100:0
     ```

   - GPU-Burn Stress Test (float32):

     ```bash
     docker run --gpus all -it -v $(pwd)/results:/app/testing_results gpu-healthcheck --dtype float32 --expected_gpus A10:2,A100:0,H100:0
     ```

3. Examine Results
   - JSON output is located in the `results/` directory.
   - PDF summaries are also generated.
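The `--expected_gpus` flag takes a comma-separated `NAME:COUNT` list. A small parser sketch for that format (the flag's exact semantics are assumed from the examples above):

```python
def parse_expected_gpus(spec):
    """Parse 'A10:2,A100:0,H100:0' into a {model: expected_count} dict."""
    expected = {}
    for entry in spec.split(","):
        name, count = entry.split(":")
        expected[name.strip()] = int(count)
    return expected

print(parse_expected_gpus("A10:2,A100:0,H100:0"))
```

The zero counts let one flag value assert both which GPU models should be present and which should not.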

---

## 7. Implementing it into OCI AI Blueprints

Below is an example of a JSON file that can be used to deploy this blueprint into OCI AI Blueprints:

```json
{
  "recipe_id": "healthcheck",
  "recipe_mode": "job",
  "deployment_name": "healthcheck",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
  "recipe_node_shape": "VM.GPU.A10.2",
  "output_object_storage": [
    {
      "bucket_name": "healthcheck2",
      "mount_location": "/healthcheck_results",
      "volume_size_in_gbs": 20
    }
  ],
  "recipe_container_command_args": [
    "--dtype", "float16", "--output_dir", "/healthcheck_results", "--expected_gpus", "A10:2,A100:0,H100:0"
  ],
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 1000,
  "recipe_use_shared_node_pool": true
}
```
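Before submitting the deployment, it can help to sanity-check the JSON locally. A minimal sketch; the required-key list here is inferred from the example above, not an official OCI AI Blueprints schema:

```python
import json

# Assumed minimal key set, based on the example recipe above (not an official schema).
REQUIRED_KEYS = {
    "recipe_id", "recipe_mode", "deployment_name",
    "recipe_image_uri", "recipe_node_shape", "recipe_replica_count",
}

def validate_recipe(config_text):
    """Return a sorted list of missing required keys (empty means it looks deployable)."""
    config = json.loads(config_text)
    return sorted(REQUIRED_KEYS - config.keys())

minimal = '{"recipe_id": "healthcheck", "recipe_mode": "job"}'
print(validate_recipe(minimal))
```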
---

## 8. Contact

For questions or additional information, open an issue in this blueprint's repository or contact the maintainers directly.