[spark-rapids] Add init script support for updating SPARK RAPIDS jar on Dataproc 2.3 ML #1372

base: main
Conversation
- Introduced `install_gpu_xgboost` and added a new function to check for existing RAPIDS JARs.
- Introduced `remove_spark_rapids_jar` to clean up existing JARs before installation.
- Updated the main function to ensure the correct RAPIDS version is installed, replacing any existing JARs as necessary.
- Improved overall structure and readability of the script.

- Modified the `remove_spark_rapids_jar` function to use a wildcard for matching RAPIDS JAR files, allowing more flexible removal of existing JARs (a sketch of such a helper follows below).
- Ensured the main function is properly terminated with a newline for better script formatting.
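For readers without the diff open, here is a minimal sketch of what a wildcard-based `remove_spark_rapids_jar` could look like. This is an assumption modeled on the JAR path and pattern used elsewhere in this PR, not the PR's exact code:

```bash
function remove_spark_rapids_jar() {
  # Hypothetical sketch: remove any preinstalled RAPIDS plugin JARs so the
  # requested version can be installed cleanly. The path and wildcard mirror
  # the pattern used by check_spark_rapids_jar in this PR.
  local jar
  for jar in /usr/lib/spark/jars/rapids-4-spark_*.jar; do
    [[ -e "${jar}" ]] || continue  # the glob matched nothing; skip
    echo "Removing existing RAPIDS JAR: ${jar}"
    rm -f "${jar}"
  done
}
```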
Summary of Changes

This pull request enhances the Spark RAPIDS initialization script to handle Dataproc 2.3 ML images. By introducing checks for existing Spark RAPIDS installations and a mechanism to update them, the script ensures the correct RAPIDS version is deployed while avoiding unnecessary driver installation on environments where drivers are already present. This streamlines Spark RAPIDS setup on Dataproc ML clusters.
@cjac @jayadeep-jayaraman for reviews
Code Review
This pull request refactors the Spark RAPIDS installation script to support Dataproc ML images, which come with Spark RAPIDS pre-installed. The approach of checking for an existing JAR and only updating it is sound. I've identified a high-severity issue where YARN services are not restarted after the JAR is updated on ML images, which would prevent the changes from taking effect. I've also included a medium-severity suggestion to improve the robustness of the script.
The updated `main` function:

```bash
function main() {
  # If the Spark RAPIDS JAR is already installed (common on ML images),
  # replace it with the requested version.
  # ML images by default have Spark RAPIDS and GPU drivers installed.
  if check_spark_rapids_jar; then
    # This ensures the cluster always uses the desired RAPIDS version,
    # even if a default is present.
    remove_spark_rapids_jar
    install_spark_rapids
    echo "RAPIDS Spark RAPIDS JAR replaced successfully"
  else
    # Install GPU drivers and set up the Spark RAPIDS JAR for non-ML images.
    if is_debian && [[ $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
      remove_old_backports
    fi
    check_os_and_secure_boot
    setup_gpu_yarn
    if [[ "${RUNTIME}" == "SPARK" ]]; then
      install_spark_rapids
      install_gpu_xgboost
      configure_spark
      echo "RAPIDS initialized with Spark runtime"
    else
      echo "Unsupported RAPIDS Runtime: ${RUNTIME}"
      exit 1
    fi
    for svc in resourcemanager nodemanager; do
      if [[ $(systemctl show hadoop-yarn-${svc}.service -p SubState --value) == 'running' ]]; then
        systemctl restart hadoop-yarn-${svc}.service
      fi
    done
    if is_debian || is_ubuntu ; then
      apt-get clean
    fi
  fi
}
```
The YARN services (resourcemanager, nodemanager) need to be restarted for Spark to pick up the newly installed/updated RAPIDS JAR. In the current implementation, this restart only happens in the else block (for non-ML images). When a JAR is updated on an ML image (the if block), the services are not restarted, which means the changes won't take effect until a manual restart. The service restart loop should be moved outside the if/else block to ensure it runs in both scenarios.
The suggested `main`, with the restart loop moved outside the if/else:

```bash
function main() {
  # If the Spark RAPIDS JAR is already installed (common on ML images),
  # replace it with the requested version.
  # ML images by default have Spark RAPIDS and GPU drivers installed.
  if check_spark_rapids_jar; then
    # This ensures the cluster always uses the desired RAPIDS version,
    # even if a default is present.
    remove_spark_rapids_jar
    install_spark_rapids
    echo "RAPIDS Spark RAPIDS JAR replaced successfully"
  else
    # Install GPU drivers and set up the Spark RAPIDS JAR for non-ML images.
    if is_debian && [[ $(echo "${DATAPROC_IMAGE_VERSION} <= 2.1" | bc -l) == 1 ]]; then
      remove_old_backports
    fi
    check_os_and_secure_boot
    setup_gpu_yarn
    if [[ "${RUNTIME}" == "SPARK" ]]; then
      install_spark_rapids
      install_gpu_xgboost
      configure_spark
      echo "RAPIDS initialized with Spark runtime"
    else
      echo "Unsupported RAPIDS Runtime: ${RUNTIME}"
      exit 1
    fi
    if is_debian || is_ubuntu ; then
      apt-get clean
    fi
  fi
  # Restart YARN in both paths so the new JAR takes effect.
  for svc in resourcemanager nodemanager; do
    if [[ $(systemctl show hadoop-yarn-${svc}.service -p SubState --value) == 'running' ]]; then
      systemctl restart hadoop-yarn-${svc}.service
    fi
  done
}
```

The JAR-detection logic the next comment refers to:

```bash
local jars_found
jars_found=$(ls /usr/lib/spark/jars/rapids-4-spark_*.jar 2>/dev/null | wc -l)
if [[ $jars_found -gt 0 ]]; then
  echo "RAPIDS Spark plugin JAR found"
  return 0
else
  echo "RAPIDS Spark plugin JAR not found"
  return 1
fi
```
Using `ls | wc -l` to check for file existence is not fully robust. For instance, it can fail in unexpected ways if file names contain newlines (though unlikely here), and it suppresses all errors from `ls`. A more robust and idiomatic way to check whether any files match a glob pattern in bash is to use `compgen -G`.
Suggested change:

```bash
if compgen -G "/usr/lib/spark/jars/rapids-4-spark_*.jar" > /dev/null; then
  echo "RAPIDS Spark plugin JAR found"
  return 0
else
  echo "RAPIDS Spark plugin JAR not found"
  return 1
fi
```
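Unlike the `ls | wc -l` approach, `compgen -G` uses only the shell's own glob expansion and exit status: it returns zero when at least one path matches the pattern and non-zero otherwise, so no external command output needs to be parsed.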
Dataproc ML images (`2.3-ml-ubuntu`) come with NVIDIA drivers and Spark RAPIDS preinstalled by default, so the init script should only update the RAPIDS JAR and skip all other setup steps.

This PR:

- Introduced `install_gpu_xgboost` and added a new function to check for existing RAPIDS JARs.
- Introduced `remove_spark_rapids_jar` to clean up existing JARs before installation.
- Updated the `main` function to ensure the correct RAPIDS version is installed, skipping NVIDIA driver installation when the instance already includes the Spark RAPIDS JAR (applies to ML images; see the usage sketch below).
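For reviewers trying this out, cluster creation could look roughly like the following. The cluster name, region, and bucket path are illustrative assumptions, not taken from this PR:

```bash
# Hypothetical usage: create a Dataproc 2.3 ML cluster pointing at the
# updated init script. On this image the script detects the preinstalled
# RAPIDS JAR and swaps it for the requested version instead of running
# the full GPU driver setup.
gcloud dataproc clusters create rapids-ml-cluster \
  --region us-central1 \
  --image-version 2.3-ml-ubuntu \
  --initialization-actions gs://my-bucket/spark-rapids/spark-rapids.sh
```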