Commit 98cb200
feat(gpu): Enhance custom image support, caching, and configuration
This PR significantly refactors the GPU initialization action to improve support for custom image builds, enhance robustness, and update documentation.
**Key Changes:**
1. **Custom Image Building (`invocation-type=custom-images`):**
* The script now detects the `invocation-type=custom-images` metadata.
* When detected, Hadoop/Spark configurations are deferred to the first boot of a cluster instance created from the custom image. This is managed by a new systemd service, `dataproc-gpu-config.service`.
* This prevents issues where configurations are applied too early in the image build process.
2. **GCS Caching and Performance:**
* The README now extensively details the GCS caching mechanism for downloaded artifacts (drivers, CUDA) and compiled components (kernel modules, NCCL).
* Highlights the significant time savings on subsequent runs after the cache is warmed.
* Warns about potentially long first-run times (up to 150 mins on small instances) if components need to be built from source. Recommends pre-warming the cache on a larger instance.
* Notes the security benefit of using cached artifacts, reducing the need for build tools on cluster nodes.
3. **Hash Validation:**
* Added SHA256 hash verification for downloaded NVIDIA driver and CUDA `.run` files to ensure integrity.
4. **Documentation (`gpu/README.md`):**
* Fully revamped to reflect the script changes.
* Updated default CUDA versions and tested configurations.
* Clearer `gcloud` examples.
* New section on custom image usage.
* Updated metadata parameters list.
* Improved Secure Boot and troubleshooting sections.
* Clarified GPU agent metric reporting.
5. **Script Enhancements (`gpu/install_gpu_driver.sh`):**
* Refactored configuration logic into functions called conditionally.
* Improved GPG key fetching behind a proxy.
* Adjusted Conda paths for Dataproc 2.3+.
* More robust `kernel-devel` fetching on Rocky Linux.
* Better `DATAPROC_IMAGE_VERSION` detection.
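
The hash validation in item 3 can be pictured with a minimal sketch like the one below. It is not the code in `gpu/install_gpu_driver.sh`: the function names `verify_sha256` and `fetch_verified` are hypothetical, and the sketch assumes GNU `sha256sum` and `curl` are available on the node. An existing local copy is reused only if its digest still matches, which is roughly the property that makes cached artifacts safe to trust.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical and not taken
# from install_gpu_driver.sh.
set -euo pipefail

# verify_sha256 FILE EXPECTED_HEX
# Returns 0 when FILE's SHA256 digest equals EXPECTED_HEX, 1 otherwise.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "${file}" | awk '{print $1}')"
  if [[ "${actual}" != "${expected}" ]]; then
    echo "SHA256 mismatch for ${file}: expected ${expected}, got ${actual}" >&2
    return 1
  fi
}

# fetch_verified URL DEST EXPECTED_HEX
# Reuses DEST if it already matches the expected digest; otherwise downloads
# URL to DEST, verifies it, and deletes the file if verification fails.
fetch_verified() {
  local url="$1" dest="$2" expected="$3"
  if [[ -f "${dest}" ]] && verify_sha256 "${dest}" "${expected}" 2>/dev/null; then
    return 0
  fi
  curl -fsSL --retry 3 -o "${dest}" "${url}"
  if ! verify_sha256 "${dest}" "${expected}"; then
    rm -f "${dest}"
    return 1
  fi
}
```

A caller would pass the download URL, a destination path, and the expected digest for the `.run` file, and abort the installation on a non-zero return.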
**Purpose:**
These changes make the GPU initialization action more flexible for use in custom image pipelines, improve the reliability of installations, and provide users with better guidance on performance and security implications.

1 parent 2eb939b, commit 98cb200
4 files changed (+1001, -379 lines) under cloudbuild and gpu.