
Conversation

@SurajAralihalli
Contributor

Dataproc ML images (2.3-ml-ubuntu) come with NVIDIA drivers and Spark RAPIDS preinstalled by default, so the init script should only update the RAPIDS JAR and skip all other setup steps.

This PR:

  • Introduces `install_gpu_xgboost` and a new helper, `check_spark_rapids_jar`, to detect existing RAPIDS JARs.
  • Introduces `remove_spark_rapids_jar` to clean up existing JARs before installation; it matches the JAR name with a wildcard so any preinstalled version can be removed.
  • Updates the main function to ensure the requested RAPIDS version is installed, replacing any existing JAR and skipping NVIDIA driver installation when the image already ships the Spark RAPIDS JAR (as ML images do).
  • Improves the overall structure and readability of the script and ends the main function with a trailing newline.
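
For reference, a minimal sketch of what the two JAR helpers could look like. The function names come from this PR and the JAR path matches the check quoted in the review below, but the Maven repository URL and the SPARK_RAPIDS_VERSION variable are illustrative assumptions, not the script's actual values:

  # Sketch only -- see spark-rapids.sh for the real implementations.
  function remove_spark_rapids_jar() {
    # Wildcard removes whichever RAPIDS plugin version the image shipped with
    rm -f /usr/lib/spark/jars/rapids-4-spark_*.jar
  }

  function install_spark_rapids() {
    # SPARK_RAPIDS_VERSION and the repo URL below are assumptions for illustration
    local rapids_repo='https://repo1.maven.org/maven2/com/nvidia'
    wget -nv --timeout=30 --tries=5 --retry-connrefused \
      "${rapids_repo}/rapids-4-spark_2.12/${SPARK_RAPIDS_VERSION}/rapids-4-spark_2.12-${SPARK_RAPIDS_VERSION}.jar" \
      -P /usr/lib/spark/jars/
  }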
@gemini-code-assist
Contributor

Summary of Changes

Hello @SurajAralihalli, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the Spark RAPIDS initialization script to intelligently handle Dataproc 2.3 ML images. By introducing checks for existing Spark RAPIDS installations and providing mechanisms to update them, the script ensures that the correct RAPIDS version is deployed efficiently, avoiding unnecessary driver installations on environments where they are already present. This streamlines the setup process for Spark RAPIDS on Dataproc ML clusters.

Highlights

  • Function Refactoring: The original install_spark_rapids function has been refactored. The logic for installing GPU XGBoost has been moved into a new function named install_gpu_xgboost, while install_spark_rapids now exclusively handles the Spark RAPIDS JAR installation.
  • JAR Detection and Removal Utilities: Two new utility functions, check_spark_rapids_jar and remove_spark_rapids_jar, have been introduced. These functions allow the script to detect if a Spark RAPIDS JAR is already present and to remove existing JARs, respectively.
  • Optimized Main Function Logic: The main function now includes conditional logic to optimize installation on Dataproc ML images. If a Spark RAPIDS JAR is detected (indicating a pre-installed ML image), the script will remove the existing JAR and install the specified version, skipping redundant NVIDIA driver installations. Otherwise, it proceeds with a full installation including GPU drivers and both Spark RAPIDS and XGBoost JARs.
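
For reviewers trying this out, the invocation below shows roughly how the init action would be exercised on an ML image. The cluster name, region, bucket path, accelerator type, and the rapids-runtime metadata key are illustrative assumptions; check the repository README for the canonical invocation:

  # Hypothetical invocation -- names, paths, and metadata key are assumptions
  gcloud dataproc clusters create gpu-ml-cluster \
    --region=us-central1 \
    --image-version=2.3-ml-ubuntu \
    --worker-accelerator=type=nvidia-tesla-t4,count=1 \
    --initialization-actions=gs://my-bucket/spark-rapids/spark-rapids.sh \
    --metadata=rapids-runtime=SPARK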


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@SurajAralihalli
Contributor Author

@cjac @jayadeep-jayaraman for reviews


@gemini-code-assist bot left a comment


Code Review

This pull request refactors the Spark RAPIDS installation script to support Dataproc ML images, which come with Spark RAPIDS pre-installed. The approach of checking for an existing JAR and only updating it is sound. I've identified a high-severity issue where YARN services are not restarted after the JAR is updated on ML images, which would prevent the changes from taking effect. I've also included a medium-severity suggestion to improve the robustness of the script.

Comment on lines 831 to 865
function main() {
  # If the RAPIDS Spark RAPIDS JAR is already installed (common on ML images), replace it with the requested version
  # ML images by default have Spark RAPIDS and GPU drivers installed
  if check_spark_rapids_jar; then
    # This ensures the cluster always uses the desired RAPIDS version, even if a default is present
    remove_spark_rapids_jar
    install_spark_rapids
    echo "RAPIDS Spark RAPIDS JAR replaced successfully"
  else
    # Install GPU drivers and setup SPARK RAPIDS JAR for non-ML images
    if is_debian && [[ $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
      remove_old_backports
    fi
    check_os_and_secure_boot
    setup_gpu_yarn
    if [[ "${RUNTIME}" == "SPARK" ]]; then
      install_spark_rapids
      install_gpu_xgboost
      configure_spark
      echo "RAPIDS initialized with Spark runtime"
    else
      echo "Unsupported RAPIDS Runtime: ${RUNTIME}"
      exit 1
    fi

    for svc in resourcemanager nodemanager; do
      if [[ $(systemctl show hadoop-yarn-${svc}.service -p SubState --value) == 'running' ]]; then
        systemctl restart hadoop-yarn-${svc}.service
      fi
    done
    if is_debian || is_ubuntu ; then
      apt-get clean
    fi
  fi
}
Severity: high

The YARN services (resourcemanager, nodemanager) need to be restarted for Spark to pick up the newly installed/updated RAPIDS JAR. In the current implementation, this restart only happens in the else block (for non-ML images). When a JAR is updated on an ML image (the if block), the services are not restarted, which means the changes won't take effect until a manual restart. The service restart loop should be moved outside the if/else block to ensure it runs in both scenarios.

function main() {
  # If the RAPIDS Spark RAPIDS JAR is already installed (common on ML images), replace it with the requested version
  # ML images by default have Spark RAPIDS and GPU drivers installed
  if check_spark_rapids_jar; then
    # This ensures the cluster always uses the desired RAPIDS version, even if a default is present
    remove_spark_rapids_jar
    install_spark_rapids
    echo "RAPIDS Spark RAPIDS JAR replaced successfully"
  else
    # Install GPU drivers and setup SPARK RAPIDS JAR for non-ML images
    if is_debian && [[ $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
      remove_old_backports
    fi
    check_os_and_secure_boot
    setup_gpu_yarn
    if [[ "${RUNTIME}" == "SPARK" ]]; then
      install_spark_rapids
      install_gpu_xgboost
      configure_spark
      echo "RAPIDS initialized with Spark runtime"
    else
      echo "Unsupported RAPIDS Runtime: ${RUNTIME}"
      exit 1
    fi

    if is_debian || is_ubuntu ; then
      apt-get clean
    fi
  fi

  for svc in resourcemanager nodemanager; do
    if [[ $(systemctl show hadoop-yarn-${svc}.service -p SubState --value) == 'running' ]]; then
      systemctl restart hadoop-yarn-${svc}.service
    fi
  done
}
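
Separately from the suggestion, a quick way to confirm on a cluster node that the replacement took effect; both commands already appear in this script, and the expected SubState is 'running':

  # Show which RAPIDS plugin JAR is now on the Spark classpath
  ls -l /usr/lib/spark/jars/rapids-4-spark_*.jar
  # Confirm the restarted YARN services came back up
  systemctl show hadoop-yarn-resourcemanager.service -p SubState --value
  systemctl show hadoop-yarn-nodemanager.service -p SubState --value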

Comment on lines +284 to +292
local jars_found
jars_found=$(ls /usr/lib/spark/jars/rapids-4-spark_*.jar 2>/dev/null | wc -l)
if [[ $jars_found -gt 0 ]]; then
  echo "RAPIDS Spark plugin JAR found"
  return 0
else
  echo "RAPIDS Spark plugin JAR not found"
  return 1
fi

Severity: medium

Using ls | wc -l to check for file existence is not fully robust. For instance, it can fail in unexpected ways if file names contain newlines (though unlikely here) and it suppresses all errors from ls. A more robust and idiomatic way to check if any files match a glob pattern in bash is to use compgen -G.

Suggested change

-  local jars_found
-  jars_found=$(ls /usr/lib/spark/jars/rapids-4-spark_*.jar 2>/dev/null | wc -l)
-  if [[ $jars_found -gt 0 ]]; then
-    echo "RAPIDS Spark plugin JAR found"
-    return 0
-  else
-    echo "RAPIDS Spark plugin JAR not found"
-    return 1
-  fi
+  if compgen -G "/usr/lib/spark/jars/rapids-4-spark_*.jar" > /dev/null; then
+    echo "RAPIDS Spark plugin JAR found"
+    return 0
+  else
+    echo "RAPIDS Spark plugin JAR not found"
+    return 1
+  fi
