
Conversation

@jinnthehuman

We use the Notebook service client to execute Colab notebooks as a way to test Spark Connect end-to-end functionality in notebooks. The notebook used for testing installs the latest version of dataproc-spark-connect-python and runs a demo Spark notebook that BQ Studio provides.

  • Uses the google.com:hadoop-cloud-dev project
  • Uses the notebook uploaded to GCS: gs://e2e-testing-bucket/input/notebooks/spark_connect_e2e_notebook_test.ipynb
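For reviewers who want a mental model of the flow, here is a minimal, hypothetical sketch of triggering such a notebook execution with the Vertex AI aiplatform_v1.NotebookServiceClient. It is not the PR's actual test code; the region, output bucket, runtime template ID, and environment-variable names are illustrative assumptions.

```python
# Hedged sketch: triggering a Colab/notebook execution with the Vertex AI
# Notebook service client. Values marked "placeholder" are assumptions,
# not necessarily what the test in this PR uses.
import os
import uuid

from google.cloud import aiplatform_v1

project = os.environ.get("GOOGLE_CLOUD_PROJECT", "google.com:hadoop-cloud-dev")
region = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")  # placeholder region
parent = f"projects/{project}/locations/{region}"

client = aiplatform_v1.NotebookServiceClient(
    client_options={"api_endpoint": f"{region}-aiplatform.googleapis.com"}
)

job = aiplatform_v1.NotebookExecutionJob(
    display_name=f"spark-connect-e2e-notebook-test-{uuid.uuid4().hex}",
    gcs_notebook_source=aiplatform_v1.NotebookExecutionJob.GcsNotebookSource(
        uri="gs://e2e-testing-bucket/input/notebooks/spark_connect_e2e_notebook_test.ipynb"
    ),
    gcs_output_uri="gs://e2e-testing-bucket/output",  # placeholder output prefix
    # An execution environment (e.g. a notebook runtime template) is typically
    # required; the template ID below is a placeholder.
    notebook_runtime_template_resource_name=(
        f"{parent}/notebookRuntimeTemplates/1234567890"
    ),
)

operation = client.create_notebook_execution_job(
    parent=parent, notebook_execution_job=job
)
created_job = operation.result()  # wait for the job resource to be created
print(created_job.name)
```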

@gemini-code-assist
Contributor

Summary of Changes

Hello @jinnthehuman, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new end-to-end integration test aimed at verifying the seamless operation of Spark Connect within Colab notebooks. The test orchestrates the execution of a predefined Colab notebook using the Notebook service client, which then installs necessary Spark Connect components and runs a demo. This ensures that the entire workflow, from notebook invocation to successful Spark job completion, functions correctly in a real-world scenario.

Highlights

  • New E2E Test: Introduced a new integration test (test_colab_execution.py) specifically designed to validate Spark Connect functionality within Colab notebooks.
  • Notebook Execution via Service Client: The test leverages the aiplatform_v1.NotebookServiceClient to programmatically trigger and monitor the execution of a Colab notebook.
  • Spark Connect Validation: The executed notebook is configured to install the dataproc-spark-connect-python library and run a demo Spark notebook, ensuring end-to-end functionality.
  • Dynamic Resource Configuration: The test dynamically retrieves project, region, and service account details from environment variables, enhancing flexibility and reusability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request adds an integration test for executing a Colab notebook using Spark Connect. My review found a critical issue where the test incorrectly checks for a failure state instead of a success state. I've also identified a significant inefficiency in how the test result is verified, which could lead to flaky tests. Additionally, I've included several recommendations to improve the code's maintainability and adherence to Python best practices, such as removing hardcoded values, using a standard logging mechanism, and fixing minor style issues.
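To make the first two points concrete, here is one hedged way the verification could look: poll the execution job until it reaches a terminal state, then assert on the success state rather than a failure state. The helper name, polling interval, and timeout are illustrative; the job_state field and JobState enum values are from the aiplatform_v1 API as I understand it, not quoted from the PR.

```python
# Hedged sketch of verifying the notebook execution outcome.
# Poll until a terminal state, then assert success (not failure).
import time

from google.cloud import aiplatform_v1
from google.cloud.aiplatform_v1 import JobState

TERMINAL_STATES = {
    JobState.JOB_STATE_SUCCEEDED,
    JobState.JOB_STATE_FAILED,
    JobState.JOB_STATE_CANCELLED,
}


def wait_for_notebook_execution(client, job_name, timeout_s=1800, poll_s=30):
    """Polls the notebook execution job until it finishes or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = client.get_notebook_execution_job(name=job_name)
        if job.job_state in TERMINAL_STATES:
            return job
        time.sleep(poll_s)
    raise TimeoutError(f"Notebook execution {job_name} did not finish in time")


# In the test, assert the positive outcome:
# job = wait_for_notebook_execution(client, created_job.name)
# assert job.job_state == JobState.JOB_STATE_SUCCEEDED
```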

Comment on lines +20 to +21
REPOSITORY_ID = "97193e1e-c5d1-4ce8-bc6f-cf206c701624"
TEMPLATE_ID = "6409629422399258624"

Severity: medium

The test contains several hardcoded values like REPOSITORY_ID and TEMPLATE_ID (and GCS URIs on lines 70 and 74). This makes the test less flexible and harder to maintain or reuse in different environments. Consider moving these values to environment variables, similar to how GOOGLE_CLOUD_PROJECT is handled.
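A hedged sketch of that suggestion, reading the IDs from environment variables with the current values as fallbacks; the environment-variable names are made up for illustration, not mandated by the review.

```python
# Hedged sketch: environment-specific IDs read from environment variables
# instead of being hardcoded. Variable names are illustrative.
import os

REPOSITORY_ID = os.environ.get(
    "NOTEBOOK_REPOSITORY_ID", "97193e1e-c5d1-4ce8-bc6f-cf206c701624"
)
TEMPLATE_ID = os.environ.get("NOTEBOOK_RUNTIME_TEMPLATE_ID", "6409629422399258624")
INPUT_NOTEBOOK_URI = os.environ.get(
    "E2E_INPUT_NOTEBOOK_URI",
    "gs://e2e-testing-bucket/input/notebooks/spark_connect_e2e_notebook_test.ipynb",
)
```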

test_parent = f"projects/{test_project}/locations/{test_region}"
test_execution_display_name = f"spark-connect-e2e-notebook-test-{uuid.uuid4().hex}"

print(f"Starting notebook execution job with display name: {test_execution_display_name}")

Severity: medium

The test uses print() for logging progress. While acceptable for debugging, it's better practice to use Python's logging module for tests. This provides more control over log levels and output formatting, and integrates well with testing frameworks like pytest (e.g., using the caplog fixture).
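A hedged sketch of the suggested change, with a module-level logger replacing print(); the display-name construction is repeated here only so the snippet stands alone.

```python
# Hedged sketch: use the logging module instead of print() in the test.
import logging
import uuid

logger = logging.getLogger(__name__)

test_execution_display_name = f"spark-connect-e2e-notebook-test-{uuid.uuid4().hex}"
logger.info(
    "Starting notebook execution job with display name: %s",
    test_execution_display_name,
)
```

With pytest, the emitted records can then be inspected via the caplog fixture or surfaced on the console with --log-cli-level=INFO.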

@medb changed the title from "Add test that executes a colab notebook that uses spark connect" to "test: Add test that executes a colab notebook that uses spark connect" on Oct 15, 2025
@fangyh20
Member

Can we address the Gemini comments and run the integration test?

@jinnthehuman changed the base branch from main to jameshonglee-e2e-test on October 16, 2025 at 23:39
@jinnthehuman merged commit 79909dc into GoogleCloudDataproc:jameshonglee-e2e-test on Oct 16, 2025
1 check passed