test: Add test that executes a colab notebook that uses spark connect #151
Conversation
Summary of Changes

Hello @jinnthehuman, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new end-to-end integration test that verifies Spark Connect operates correctly within Colab notebooks. The test uses the Notebook service client to execute a predefined Colab notebook, which installs the necessary Spark Connect components and runs a demo. This exercises the entire workflow, from notebook invocation to successful Spark job completion, in a real-world scenario.
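For orientation, a minimal sketch of the overall flow such a test follows is below. It assumes the Vertex AI NotebookService API from google-cloud-aiplatform and illustrative resource values; the actual client, fields, and IDs used by this PR may differ.

```python
# Sketch only: API and field names assumed from google-cloud-aiplatform's
# NotebookService (Colab Enterprise executions); not necessarily this PR's code.
import os
import uuid

from google.cloud import aiplatform_v1

project = os.environ["GOOGLE_CLOUD_PROJECT"]
region = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")  # assumed variable name
parent = f"projects/{project}/locations/{region}"

client = aiplatform_v1.NotebookServiceClient(
    client_options={"api_endpoint": f"{region}-aiplatform.googleapis.com"}
)

job = aiplatform_v1.NotebookExecutionJob(
    display_name=f"spark-connect-e2e-notebook-test-{uuid.uuid4().hex}",
    gcs_notebook_source=aiplatform_v1.NotebookExecutionJob.GcsNotebookSource(
        uri="gs://e2e-testing-bucket/input/notebooks/spark_connect_e2e_notebook_test.ipynb"
    ),
    gcs_output_uri="gs://e2e-testing-bucket/output",  # hypothetical output location
    notebook_runtime_template_resource_name=(
        f"{parent}/notebookRuntimeTemplates/6409629422399258624"
    ),
)

# create_notebook_execution_job returns a long-running operation; .result()
# blocks until the job resource exists, after which its state can be polled.
operation = client.create_notebook_execution_job(
    parent=parent, notebook_execution_job=job
)
created = operation.result()
```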
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds an integration test for executing a Colab notebook using Spark Connect. My review found a critical issue where the test incorrectly checks for a failure state instead of a success state. I've also identified a significant inefficiency in how the test result is verified, which could lead to flaky tests. Additionally, I've included several recommendations to improve the code's maintainability and adherence to Python best practices, such as removing hardcoded values, using a standard logging mechanism, and fixing minor style issues.
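For instance, the verification the review asks for might look like the following sketch: poll for a terminal state with a timeout, log progress, and assert on the success state rather than the failure state. The client method and state names are assumed from the Vertex AI NotebookService API and may not match the test's actual client.

```python
import logging
import time

logger = logging.getLogger(__name__)

def wait_for_success(client, job_name, timeout_s=1800, poll_s=30):
    """Poll the execution job until it reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Assumed method name; substitute the call the test actually uses.
        job = client.get_notebook_execution_job(name=job_name)
        state = job.job_state.name
        logger.info("Execution %s is in state %s", job_name, state)
        if state in ("JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
            # Assert the success state, not the failure state.
            assert state == "JOB_STATE_SUCCEEDED", f"Execution ended in {state}"
            return
        time.sleep(poll_s)  # avoid a tight polling loop, which makes tests flaky
    raise TimeoutError(f"Execution {job_name} did not finish within {timeout_s}s")
```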
REPOSITORY_ID = "97193e1e-c5d1-4ce8-bc6f-cf206c701624"
TEMPLATE_ID = "6409629422399258624"
The test contains several hardcoded values like REPOSITORY_ID and TEMPLATE_ID (and GCS URIs on lines 70 and 74). This makes the test less flexible and harder to maintain or reuse in different environments. Consider moving these values to environment variables, similar to how GOOGLE_CLOUD_PROJECT is handled.
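A sketch of that approach, with hypothetical environment-variable names (only GOOGLE_CLOUD_PROJECT is known to be used by the test):

```python
import os

# Hypothetical variable names; the current hardcoded values become fallbacks.
REPOSITORY_ID = os.environ.get(
    "E2E_REPOSITORY_ID", "97193e1e-c5d1-4ce8-bc6f-cf206c701624"
)
TEMPLATE_ID = os.environ.get("E2E_TEMPLATE_ID", "6409629422399258624")
INPUT_NOTEBOOK_URI = os.environ.get(
    "E2E_INPUT_NOTEBOOK_URI",
    "gs://e2e-testing-bucket/input/notebooks/spark_connect_e2e_notebook_test.ipynb",
)
```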
test_parent = f"projects/{test_project}/locations/{test_region}"
test_execution_display_name = f"spark-connect-e2e-notebook-test-{uuid.uuid4().hex}"
print(f"Starting notebook execution job with display name: {test_execution_display_name}")
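The review summary above also recommends a standard logging mechanism over print; a minimal substitution for the print call might be:

```python
import logging

logger = logging.getLogger(__name__)

# test_execution_display_name is the variable defined in the excerpt above.
logger.info(
    "Starting notebook execution job with display name: %s",
    test_execution_display_name,
)
```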
Can we address the Gemini comments and run the integration test?
Merged commit 79909dc into GoogleCloudDataproc:jameshonglee-e2e-test.
We use the Notebook service client to execute Colab notebooks as a way to test Spark Connect end-to-end functionality in notebooks. The notebook used for testing installs the latest version of dataproc-spark-connect-python and runs a demo Spark notebook from BQ Studio.
Uses the google.com:hadoop-cloud-dev project.
Uses the notebook uploaded to GCS: gs://e2e-testing-bucket/input/notebooks/spark_connect_e2e_notebook_test.ipynb
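Conceptually, the notebook under test does something like the sketch below. The demo's actual contents aren't shown in this PR; the session API is assumed from the dataproc-spark-connect package, and the workload here is a stand-in.

```python
# First notebook cell (sketch): install the latest client release.
# %pip install --upgrade dataproc-spark-connect

from google.cloud.dataproc_spark_connect import DataprocSparkSession

# Create (or attach to) a Dataproc-backed Spark Connect session.
spark = DataprocSparkSession.builder.getOrCreate()

# Stand-in for the BQ Studio demo notebook's Spark workload.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
assert df.count() == 2
spark.stop()
```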