
Commit 40552a7

This pull request significantly refactors the `install_gpu_driver.sh` script, primarily to **resolve the failure of Spark and Hadoop configuration during custom image creation, as detailed in [GitHub Issue #1303](#1303)**.
The core problem was that the script attempted to modify configuration files (such as `spark-defaults.conf`) before `bdutil` had created them during image customization. This PR implements the proposed solution by deferring those configuration steps until the instance's first boot.

* **Deferred configuration for custom images:**
  * The script now detects whether it is running in a custom-image build context by checking the `invocation-type` metadata attribute; the result is stored in the `IS_CUSTOM_IMAGE_BUILD` variable.
  * When `IS_CUSTOM_IMAGE_BUILD` is true, the critical Hadoop and Spark configuration steps are no longer executed immediately. Instead, a new systemd service (`dataproc-gpu-config.service`) is generated and enabled.
  * That service runs a newly created script (`/usr/local/sbin/apply-dataproc-gpu-config.sh`) on the instance's first boot. The generated script contains all of the Hadoop/Spark/GPU configuration logic, moved into a `run_hadoop_spark_config` function.
  * This deferral mechanism **explicitly solves issue #1303** by ensuring that configurations are applied only after the Dataproc environment, including the necessary configuration files, has been fully initialized.
* **Script structure for deferred execution:**
  * The `main` function has been refactored. It still orchestrates the installation of drivers and core components as before; for the Hadoop/Spark configuration, however, it either executes `apply-dataproc-gpu-config.sh` directly (when not building a custom image) or enables the systemd service to run it on first boot.
  * The `create_deferred_config_files` function generates the systemd service unit and the `apply-dataproc-gpu-config.sh` script. The generated script is carefully constructed to include all the helper functions and variables it needs from the main `install_gpu_driver.sh` script, so it can run independently.
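The detect-then-defer pattern described above can be sketched roughly as follows. The variable and service names come from the commit text, but the metadata query is an assumption about how `invocation-type` would be read on a GCE instance, and the environment-variable fallback exists only so the sketch runs outside of GCE:

```shell
#!/usr/bin/env bash
# Sketch of the custom-image-build detection and deferral (illustrative,
# not the script's actual code).

get_metadata_attribute() {
  local attr="$1" fallback="${1//-/_}"
  if [[ -n "${ON_GCE:-}" ]]; then
    # On a real instance, query the GCE metadata server (assumed form):
    curl -fs -H 'Metadata-Flavor: Google' \
      "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${attr}"
  else
    # Local fallback: read a same-named shell variable (hyphens -> underscores).
    printf '%s' "${!fallback:-}"
  fi
}

IS_CUSTOM_IMAGE_BUILD="false"
if [[ "$(get_metadata_attribute invocation-type)" == *custom-image* ]]; then
  IS_CUSTOM_IMAGE_BUILD="true"
fi

if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
  # Defer: generate and enable a first-boot service instead of configuring now.
  echo "would enable dataproc-gpu-config.service for first boot"
else
  # Not an image build: apply the Hadoop/Spark/GPU configuration immediately.
  echo "would run /usr/local/sbin/apply-dataproc-gpu-config.sh now"
fi
```

The key design point is that the same generated script serves both paths: run immediately on normal clusters, or via the systemd unit on first boot of a custom image.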
* **Re-evaluation of environment in the deferred script:** `apply-dataproc-gpu-config.sh` re-evaluates critical environment variables such as `ROLE`, `SPARK_VERSION`, `gpu_count`, and `IS_MIG_ENABLED` at the time of its execution (first boot) to ensure accuracy.
* **CUDA and driver updates:**
  * Added support for Dataproc image version "2.3", defaulting to CUDA version "12.6.3".
  * Improved robustness in `install_build_dependencies` on Rocky Linux, with fallbacks for kernel package downloads.
* **Error handling and robustness:**
  * Several commands, such as `gsutil rm`, `pip cache purge`, and `wget` in `Workspace_mig_scripts`, now have improved error handling or are wrapped in `execute_with_retries`.
  * Benign errors from `du` commands during cleanup are suppressed.
  * Zeroing of free disk space is more robust and conditional on custom image builds.
* **Configuration and installation improvements:**
  * `conda_root_path` is now set dynamically based on `DATAPROC_IMAGE_VERSION`.
  * Corrected GPG key handling for the NVIDIA Container Toolkit repository on Debian systems.
  * Ensures `python3-venv` is installed for the GPU agent on newer Debian-based images.
  * Streamlined several configuration functions by removing redundant GPU-count checks.
  * RAPIDS properties are now added to `spark-defaults.conf` idempotently.
  * `check_secure_boot` now handles the case where `mokutil` is not present and gives a clearer error for missing signing material.
  * The script entry point and preparation steps (`prepare_to_install`) are more clearly defined.

By implementing a deferred configuration mechanism for custom image builds, this pull request directly addresses and **resolves the core problem outlined in GitHub issue #1303**, ensuring that GPU-related Hadoop and Spark configurations are applied reliably.
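The idempotent handling of `spark-defaults.conf` can be sketched like this. `add_spark_property` is a hypothetical helper written for illustration, not a function from the script; the property key shown is only an example:

```shell
#!/usr/bin/env bash
# Sketch of idempotent property insertion into spark-defaults.conf.
# SPARK_DEFAULTS is overridable so the sketch can be exercised on a temp file.
SPARK_DEFAULTS="${SPARK_DEFAULTS:-/etc/spark/conf/spark-defaults.conf}"

add_spark_property() {
  local key="$1" value="$2"
  # Append only if the key is not already present, so re-running the
  # configuration (e.g. on a rebuilt image or repeated first boot) does
  # not accumulate duplicate lines.
  if ! grep -q "^${key}[[:space:]=]" "${SPARK_DEFAULTS}" 2>/dev/null; then
    echo "${key}=${value}" >> "${SPARK_DEFAULTS}"
  fi
}
```

Guarding each append with a `grep` for the existing key is what makes the configuration safe to run both at image build time and again at first boot.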
1 parent f50cbfc commit 40552a7

File tree

1 file changed (+418, −80 lines)

