Commit 17b1f6e

[spark-rapids.sh] Refactor NVIDIA driver installation for Rocky Linux 8 to use run file (#1359)
* Refactor NVIDIA driver installation for Rocky Linux 8

  Updated the installation process for the NVIDIA GPU driver on Rocky Linux 8. The script now installs kernel development packages directly and downloads the CUDA installer run file, executing it in silent mode. The installer file is removed post-installation to clean up. This change simplifies the installation steps and ensures the correct driver version is used.

* feat: Enable spark-rapids on Dataproc 2.1 Rocky Linux 8

  This commit integrates changes to enable the spark-rapids initialization action on Dataproc 2.1-rocky8 images.

  - Updates the NVIDIA driver installation process in `spark-rapids.sh` for Rocky Linux:
    - Uses `curl` with retry and fail-fast options for downloading the CUDA installer.
    - Executes the NVIDIA installer with `--silent --driver --toolkit --no-opengl-libs` flags and wraps it in `execute_with_retries`.
  - Modifies `test_spark_rapids.py` to enable tests for Rocky Linux on Dataproc 2.1 and below, while keeping them skipped for 2.2+ (Rocky 9).

  This resolves the installation issues on Rocky 8. Further work is required to support Rocky 9 (Dataproc 2.2).

---------

Co-authored-by: C.J. Collier <[email protected]>
1 parent 2eb939b commit 17b1f6e

2 files changed, +19 -16 lines changed

spark-rapids/spark-rapids.sh

Lines changed: 13 additions & 10 deletions
@@ -501,17 +501,20 @@ function install_nvidia_gpu_driver() {
 
   elif is_rocky ; then
 
-    # Ensure the Correct Kernel Development Packages are Installed
-    execute_with_retries "dnf -y -q update --exclude=systemd*,kernel*"
-    execute_with_retries "dnf -y -q install pciutils kernel-devel gcc"
+    # Install kernel development packages
+    execute_with_retries "dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)"
 
-    readonly NVIDIA_ROCKY_REPO_URL="${NVIDIA_REPO_URL}/cuda-${shortname}.repo"
-    execute_with_retries "dnf config-manager --add-repo ${NVIDIA_ROCKY_REPO_URL}"
-    execute_with_retries "dnf clean all"
-    configure_dkms_certs
-    execute_with_retries "dnf -y -q module install nvidia-driver:latest-dkms"
-    clear_dkms_key
-    execute_with_retries "dnf -y -q install cuda-toolkit"
+    # Download the CUDA installer run file
+    curl -fsSL --retry-connrefused --retry 3 --retry-max-time 30 -o driver.run \
+      "https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION}/local_installers/cuda_${CUDA_VERSION}_${NVIDIA_DRIVER_VERSION}_linux.run"
+
+    # Run the installer in silent mode
+    execute_with_retries "bash driver.run --silent --driver --toolkit --no-opengl-libs"
+
+    # Remove the installer file after installation to clean up
+    rm driver.run
+
+    # Load the NVIDIA kernel module
     modprobe nvidia
 
   else
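
As a rough illustration of the download and install steps in the hunk above, the sketch below shows how the runfile URL is assembled from the two version variables, plus one possible shape for a retry wrapper. `cuda_runfile_url` and `retry_cmd` are hypothetical helpers introduced here for illustration. The actual `execute_with_retries` is defined elsewhere in `spark-rapids.sh` and is not shown in this diff, so its real behavior may differ; `RETRY_DELAY_SECONDS` is likewise an assumed knob.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and are not part of spark-rapids.sh.

# Assemble the runfile URL the same way the diff does, from a CUDA version
# and a driver version (stand-ins for CUDA_VERSION and NVIDIA_DRIVER_VERSION).
cuda_runfile_url() {
  local cuda_version="$1"
  local driver_version="$2"
  printf 'https://developer.download.nvidia.com/compute/cuda/%s/local_installers/cuda_%s_%s_linux.run\n' \
    "${cuda_version}" "${cuda_version}" "${driver_version}"
}

# Run a command string at least once and at most max_attempts times,
# pausing between failed attempts.
# Assumed behavior: the real execute_with_retries may differ.
retry_cmd() {
  local max_attempts="$1"
  local cmd="$2"
  local attempt=1
  while :; do
    if eval "${cmd}"; then
      return 0
    fi
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      return 1
    fi
    # RETRY_DELAY_SECONDS is an assumed knob, defaulting to 5 seconds.
    sleep "${RETRY_DELAY_SECONDS:-5}"
    attempt=$((attempt + 1))
  done
}
```

For example, `cuda_runfile_url 12.4.1 550.54.15` prints the URL for that illustrative version pair, and `retry_cmd 3 "bash driver.run --silent --driver --toolkit --no-opengl-libs"` would rerun the installer up to three times before giving up.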

spark-rapids/test_spark_rapids.py

Lines changed: 6 additions & 6 deletions
@@ -58,8 +58,8 @@ def verify_spark_job_sql(self):
                             ("STANDARD", ["w-0"], GPU_T4))
   def test_spark_rapids(self, configuration, machine_suffixes, accelerator):
 
-    if self.getImageOs() == "rocky":
-      self.skipTest("Not supported for Rocky OS")
+    if self.getImageVersion() > pkg_resources.parse_version("2.0") and self.getImageOs() == "rocky":
+      self.skipTest("Not supported for Rocky 9")
 
     if self.getImageVersion() <= pkg_resources.parse_version("2.0"):
       self.skipTest("Not supported in 2.0 and earlier images")
@@ -88,8 +88,8 @@ def test_spark_rapids(self, configuration, machine_suffixes, accelerator):
                             ("STANDARD", ["w-0"], GPU_T4))
   def test_spark_rapids_sql(self, configuration, machine_suffixes, accelerator):
 
-    if self.getImageOs() == "rocky":
-      self.skipTest("Not supported for Rocky OS")
+    if self.getImageVersion() > pkg_resources.parse_version("2.0") and self.getImageOs() == "rocky":
+      self.skipTest("Not supported for Rocky 9")
 
     if self.getImageVersion() <= pkg_resources.parse_version("2.0"):
       self.skipTest("Not supported in 2.0 and earlier images")
@@ -118,8 +118,8 @@ def test_spark_rapids_sql(self, configuration, machine_suffixes, accelerator):
   def test_non_default_cuda_versions(self, configuration, machine_suffixes,
                                      accelerator, cuda_version, driver_version):
 
-    if self.getImageOs() == "rocky":
-      self.skipTest("Not supported for Rocky OS")
+    if self.getImageVersion() > pkg_resources.parse_version("2.0") and self.getImageOs() == "rocky":
+      self.skipTest("Not supported for Rocky 9")
 
     if self.getImageVersion() <= pkg_resources.parse_version("2.0"):
       self.skipTest("Not supported in 2.0 and earlier images")
