
Conversation

@SurajAralihalli
Contributor

Dataproc ML images (2.3-ml-ubuntu) come with NVIDIA drivers and Spark RAPIDS preinstalled by default, so the init script should only update the RAPIDS JAR and skip all other setup steps.

This PR:

  • Introduces `install_gpu_xgboost` and a new helper, `check_spark_rapids_jar`, to detect existing RAPIDS JARs.
  • Introduces `remove_spark_rapids_jar` to clean up existing JARs before installation; it matches the JAR name with a wildcard so any preinstalled version can be removed.
  • Updates the main function to ensure the requested RAPIDS version is installed, replacing any existing JAR and skipping NVIDIA driver installation when the image already ships the Spark RAPIDS JAR (as ML images do).
  • Improves the overall structure and readability of the script and ends the main function with a trailing newline.
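
For reference, a minimal sketch of what the two JAR helpers could look like. The function names come from this PR and the JAR path matches the check quoted in the review below, but the Maven repository URL and the SPARK_RAPIDS_VERSION variable are illustrative assumptions, not the script's actual values:

  # Sketch only -- see spark-rapids.sh for the real implementations.
  function remove_spark_rapids_jar() {
    # Wildcard removes whichever RAPIDS plugin version the image shipped with
    rm -f /usr/lib/spark/jars/rapids-4-spark_*.jar
  }

  function install_spark_rapids() {
    # SPARK_RAPIDS_VERSION and the repo URL below are assumptions for illustration
    local rapids_repo='https://repo1.maven.org/maven2/com/nvidia'
    wget -nv --timeout=30 --tries=5 --retry-connrefused \
      "${rapids_repo}/rapids-4-spark_2.12/${SPARK_RAPIDS_VERSION}/rapids-4-spark_2.12-${SPARK_RAPIDS_VERSION}.jar" \
      -P /usr/lib/spark/jars/
  }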
@gemini-code-assist
Contributor

Summary of Changes

Hello @SurajAralihalli, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the Spark RAPIDS initialization script to intelligently handle Dataproc 2.3 ML images. By introducing checks for existing Spark RAPIDS installations and providing mechanisms to update them, the script ensures that the correct RAPIDS version is deployed efficiently, avoiding unnecessary driver installations on environments where they are already present. This streamlines the setup process for Spark RAPIDS on Dataproc ML clusters.

Highlights

  • Function Refactoring: The original install_spark_rapids function has been refactored. The logic for installing GPU XGBoost has been moved into a new function named install_gpu_xgboost, while install_spark_rapids now exclusively handles the Spark RAPIDS JAR installation.
  • JAR Detection and Removal Utilities: Two new utility functions, check_spark_rapids_jar and remove_spark_rapids_jar, have been introduced. These functions allow the script to detect if a Spark RAPIDS JAR is already present and to remove existing JARs, respectively.
  • Optimized Main Function Logic: The main function now includes conditional logic to optimize installation on Dataproc ML images. If a Spark RAPIDS JAR is detected (indicating a pre-installed ML image), the script will remove the existing JAR and install the specified version, skipping redundant NVIDIA driver installations. Otherwise, it proceeds with a full installation including GPU drivers and both Spark RAPIDS and XGBoost JARs.
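
For reviewers trying this out, the invocation below shows roughly how the init action would be exercised on an ML image. The cluster name, region, bucket path, accelerator type, and the rapids-runtime metadata key are illustrative assumptions; check the repository README for the canonical invocation:

  # Hypothetical invocation -- names, paths, and metadata key are assumptions
  gcloud dataproc clusters create gpu-ml-cluster \
    --region=us-central1 \
    --image-version=2.3-ml-ubuntu \
    --worker-accelerator=type=nvidia-tesla-t4,count=1 \
    --initialization-actions=gs://my-bucket/spark-rapids/spark-rapids.sh \
    --metadata=rapids-runtime=SPARK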


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@SurajAralihalli
Contributor Author

@cjac @jayadeep-jayaraman for reviews


@gemini-code-assist bot left a comment


Code Review

This pull request refactors the Spark RAPIDS installation script to support Dataproc ML images, which come with Spark RAPIDS pre-installed. The approach of checking for an existing JAR and only updating it is sound. I've identified a high-severity issue where YARN services are not restarted after the JAR is updated on ML images, which would prevent the changes from taking effect. I've also included a medium-severity suggestion to improve the robustness of the script.

Comment on lines 831 to 865
function main() {
  # If the RAPIDS Spark RAPIDS JAR is already installed (common on ML images), replace it with the requested version
  # ML images by default have Spark RAPIDS and GPU drivers installed
  if check_spark_rapids_jar; then
    # This ensures the cluster always uses the desired RAPIDS version, even if a default is present
    remove_spark_rapids_jar
    install_spark_rapids
    echo "RAPIDS Spark RAPIDS JAR replaced successfully"
  else
    # Install GPU drivers and setup SPARK RAPIDS JAR for non-ML images
    if is_debian && [[ $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
      remove_old_backports
    fi
    check_os_and_secure_boot
    setup_gpu_yarn
    if [[ "${RUNTIME}" == "SPARK" ]]; then
      install_spark_rapids
      install_gpu_xgboost
      configure_spark
      echo "RAPIDS initialized with Spark runtime"
    else
      echo "Unsupported RAPIDS Runtime: ${RUNTIME}"
      exit 1
    fi

    for svc in resourcemanager nodemanager; do
      if [[ $(systemctl show hadoop-yarn-${svc}.service -p SubState --value) == 'running' ]]; then
        systemctl restart hadoop-yarn-${svc}.service
      fi
    done
    if is_debian || is_ubuntu ; then
      apt-get clean
    fi
  fi
}
Severity: high

The YARN services (resourcemanager, nodemanager) need to be restarted for Spark to pick up the newly installed/updated RAPIDS JAR. In the current implementation, this restart only happens in the else block (for non-ML images). When a JAR is updated on an ML image (the if block), the services are not restarted, which means the changes won't take effect until a manual restart. The service restart loop should be moved outside the if/else block to ensure it runs in both scenarios.

function main() {
  # If the RAPIDS Spark RAPIDS JAR is already installed (common on ML images), replace it with the requested version
  # ML images by default have Spark RAPIDS and GPU drivers installed
  if check_spark_rapids_jar; then
    # This ensures the cluster always uses the desired RAPIDS version, even if a default is present
    remove_spark_rapids_jar
    install_spark_rapids
    echo "RAPIDS Spark RAPIDS JAR replaced successfully"
  else
    # Install GPU drivers and setup SPARK RAPIDS JAR for non-ML images
    if is_debian && [[ $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
      remove_old_backports
    fi
    check_os_and_secure_boot
    setup_gpu_yarn
    if [[ "${RUNTIME}" == "SPARK" ]]; then
      install_spark_rapids
      install_gpu_xgboost
      configure_spark
      echo "RAPIDS initialized with Spark runtime"
    else
      echo "Unsupported RAPIDS Runtime: ${RUNTIME}"
      exit 1
    fi

    if is_debian || is_ubuntu ; then
      apt-get clean
    fi
  fi

  for svc in resourcemanager nodemanager; do
    if [[ $(systemctl show hadoop-yarn-${svc}.service -p SubState --value) == 'running' ]]; then
      systemctl restart hadoop-yarn-${svc}.service
    fi
  done
}
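
Separately from the suggestion, a quick way to confirm on a cluster node that the replacement took effect; both commands already appear in this script, and the expected SubState is 'running':

  # Show which RAPIDS plugin JAR is now on the Spark classpath
  ls -l /usr/lib/spark/jars/rapids-4-spark_*.jar
  # Confirm the restarted YARN services came back up
  systemctl show hadoop-yarn-resourcemanager.service -p SubState --value
  systemctl show hadoop-yarn-nodemanager.service -p SubState --value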

Comment on lines +284 to +292
local jars_found
jars_found=$(ls /usr/lib/spark/jars/rapids-4-spark_*.jar 2>/dev/null | wc -l)
if [[ $jars_found -gt 0 ]]; then
  echo "RAPIDS Spark plugin JAR found"
  return 0
else
  echo "RAPIDS Spark plugin JAR not found"
  return 1
fi

Severity: medium

Using ls | wc -l to check for file existence is not fully robust. For instance, it can fail in unexpected ways if file names contain newlines (though unlikely here) and it suppresses all errors from ls. A more robust and idiomatic way to check if any files match a glob pattern in bash is to use compgen -G.

Suggested change

-  local jars_found
-  jars_found=$(ls /usr/lib/spark/jars/rapids-4-spark_*.jar 2>/dev/null | wc -l)
-  if [[ $jars_found -gt 0 ]]; then
-    echo "RAPIDS Spark plugin JAR found"
-    return 0
-  else
-    echo "RAPIDS Spark plugin JAR not found"
-    return 1
-  fi
+  if compgen -G "/usr/lib/spark/jars/rapids-4-spark_*.jar" > /dev/null; then
+    echo "RAPIDS Spark plugin JAR found"
+    return 0
+  else
+    echo "RAPIDS Spark plugin JAR not found"
+    return 1
+  fi
