Commit 40552a7
The core problem was that the script attempted to modify configuration files (such as `spark-defaults.conf`) before `bdutil` had created them during the image customization process. This PR implements the proposed solution by deferring those configuration steps until the instance's first boot.
* **Deferred Configuration for Custom Images:**
* The script now detects whether it is running in a custom image build context by checking the `invocation-type` metadata attribute; the result is stored in the `IS_CUSTOM_IMAGE_BUILD` variable.
* When `IS_CUSTOM_IMAGE_BUILD` is true, critical Hadoop and Spark configuration steps are no longer executed immediately. Instead, a new systemd service (`dataproc-gpu-config.service`) is generated and enabled (a sketch of this pattern follows this list).
* This service is responsible for running a newly created script (`/usr/local/sbin/apply-dataproc-gpu-config.sh`) on the instance's first boot. This generated script now contains all the necessary logic for Hadoop/Spark/GPU configuration (moved into a `run_hadoop_spark_config` function).
* This deferral mechanism **explicitly solves issue #1303** by ensuring that configurations are applied only after the Dataproc environment, including necessary configuration files, has been fully initialized.
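A minimal sketch of the detection-and-deferral pattern described above. The attribute name `invocation-type`, the variable `IS_CUSTOM_IMAGE_BUILD`, and the service/script names come from this description; the metadata value compared against, the helper `get_metadata_attribute`, and the unit body are assumptions, not the PR's exact code.

```bash
#!/usr/bin/env bash
# Sketch only: detect the custom-image build context and defer configuration.
# The compared value "custom-images" and the unit contents are illustrative.

readonly DEFERRED_SCRIPT=/usr/local/sbin/apply-dataproc-gpu-config.sh

function get_metadata_attribute() {
  local -r name="$1"
  curl -fsS -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${name}" || echo ""
}

IS_CUSTOM_IMAGE_BUILD="false"
if [[ "$(get_metadata_attribute invocation-type)" == "custom-images" ]]; then
  IS_CUSTOM_IMAGE_BUILD="true"
fi

if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
  # Generate a oneshot unit that applies the deferred configuration on first boot.
  cat > /etc/systemd/system/dataproc-gpu-config.service <<EOF
[Unit]
Description=Apply deferred Dataproc GPU/Hadoop/Spark configuration
After=network-online.target

[Service]
Type=oneshot
ExecStart=/bin/bash ${DEFERRED_SCRIPT}
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
  systemctl daemon-reload
  systemctl enable dataproc-gpu-config.service
fi
```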
* **Script Structure for Deferred Execution:**
* The `main` function has been refactored. It still orchestrates the installation of drivers and core components as before, but for Hadoop/Spark configuration it now either runs the `apply-dataproc-gpu-config.sh` script directly (when not building a custom image) or enables the systemd service so the script runs on first boot (see the dispatch sketch after this list).
* The `create_deferred_config_files` function is responsible for generating the systemd service unit and the `apply-dataproc-gpu-config.sh` script. This script is carefully constructed to include all necessary helper functions and variables from the main `install_gpu_driver.sh` script to run independently.
* **Re-evaluation of Environment in Deferred Script:** The deferred script (`apply-dataproc-gpu-config.sh`) re-evaluates critical environment variables such as `ROLE`, `SPARK_VERSION`, `gpu_count`, and `IS_MIG_ENABLED` at execution time (first boot), so their values reflect the booted instance rather than the image-build VM.
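A hedged sketch of that dispatch and of the first-boot re-evaluation. `main`, `create_deferred_config_files`, `ROLE`, and `gpu_count` appear in this description; the stub bodies, the placeholder `install_drivers_and_core_components`, and the `dataproc-role` attribute lookup are assumptions.

```bash
# Placeholders for functions defined elsewhere in the real script.
function install_drivers_and_core_components() { :; }   # hypothetical name
function create_deferred_config_files()        { :; }   # named in the description

function main() {
  install_drivers_and_core_components   # existing driver/CUDA installation steps
  create_deferred_config_files          # writes the systemd unit and the deferred script

  if [[ "${IS_CUSTOM_IMAGE_BUILD}" == "true" ]]; then
    systemctl enable dataproc-gpu-config.service       # apply configuration on first boot
  else
    bash /usr/local/sbin/apply-dataproc-gpu-config.sh  # apply configuration immediately
  fi
}

# Inside the generated apply-dataproc-gpu-config.sh (illustrative): re-read values at
# run time so they describe the booted instance, not the image-build VM.
ROLE="$(curl -fsS -H 'Metadata-Flavor: Google' \
  http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-role)"
gpu_count="$(lspci | grep -ci nvidia || true)"
```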
* **CUDA and Driver Updates:**
* Added support for Dataproc image version "2.3", defaulting to CUDA version "12.6.3" (see the sketch after this list).
* Improved robustness in `install_build_dependencies` for Rocky Linux with fallbacks for kernel package downloads.
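A minimal sketch of the version-to-default mapping, assuming a `case` over `DATAPROC_IMAGE_VERSION`; only the "2.3" → "12.6.3" pairing comes from this description, and the fallback branch is a placeholder.

```bash
# Only the 2.3 -> 12.6.3 mapping is from the PR; the fallback value is a placeholder.
case "${DATAPROC_IMAGE_VERSION}" in
  "2.3") DEFAULT_CUDA_VERSION="12.6.3" ;;
  *)     DEFAULT_CUDA_VERSION="${DEFAULT_CUDA_VERSION:-12.4.1}" ;;
esac
CUDA_VERSION="${CUDA_VERSION:-${DEFAULT_CUDA_VERSION}}"   # allow an explicit override
```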
* **Error Handling and Robustness:**
* Several commands, such as `gsutil rm`, `pip cache purge`, and the `wget` calls in `fetch_mig_scripts`, now have improved error handling or are wrapped in `execute_with_retries` (a stand-in sketch follows this list).
* Suppressed benign errors from `du` commands during cleanup.
* Zeroing of free disk space is now more robust and conditional on custom image builds.
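A sketch of the retry-wrapping pattern described above. `execute_with_retries` is named in this description, but this three-attempt loop and the example invocations are assumed stand-ins for the script's actual behavior.

```bash
# Assumed stand-in for execute_with_retries: retry a command a few times before failing.
function execute_with_retries() {
  local -r cmd="$*"
  for ((attempt = 0; attempt < 3; attempt++)); do
    if eval "${cmd}"; then return 0; fi
    sleep 5
  done
  echo "command failed after retries: ${cmd}" >&2
  return 1
}

# Example invocations (the bucket path is hypothetical):
execute_with_retries gsutil rm -f "gs://example-bucket/tmp-object"
du -sh /var/cache/* 2>/dev/null || true   # benign errors from du are suppressed
```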
* **Configuration and Installation Improvements:**
* Dynamically sets `conda_root_path` based on `DATAPROC_IMAGE_VERSION`.
* Corrected GPG key handling for the NVIDIA Container Toolkit repository on Debian systems.
* Ensures `python3-venv` is installed for the GPU agent on newer Debian-based images.
* Streamlined several configuration functions by removing redundant GPU count checks.
* Ensures RAPIDS properties are added to `spark-defaults.conf` idempotently (see the sketch after this list).
* The `check_secure_boot` function now handles cases where `mokutil` might not be present and provides a clearer error for missing signing material.
* The script entry point and preparation steps (`prepare_to_install`) are more clearly defined.
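A sketch of the idempotent-append pattern for `spark-defaults.conf`; the conf path, the helper name, and the example RAPIDS property are assumptions, and the real script's logic may differ.

```bash
SPARK_DEFAULTS_CONF=/etc/spark/conf/spark-defaults.conf   # assumed path

# Append a property only if its key is not already present (idempotent on re-runs).
function set_spark_property_once() {
  local -r key="$1" value="$2"
  grep -q "^${key}[[:space:]=]" "${SPARK_DEFAULTS_CONF}" \
    || echo "${key}=${value}" >> "${SPARK_DEFAULTS_CONF}"
}

set_spark_property_once "spark.plugins" "com.nvidia.spark.SQLPlugin"   # example property
```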
By implementing a deferred configuration mechanism for custom image builds, this pull request directly addresses and **resolves the core problem outlined in GitHub issue #1303**, ensuring that GPU-related Hadoop and Spark configurations are applied reliably.