Conversation
@dborowitz dborowitz commented Jan 5, 2026

The initial goal of this module is to shadow the built-in %pip magic to support installing from gs:// URLs in addition to local files and https:// URLs. (Other magic implementations may be added later.)

In the initial commit, we support only the following constructions:

  • %load_ext google.cloud.dataproc_magics
  • %pip install gs://bucket/mypackage.whl
  • %pip install -r gs://bucket/myrequirements.txt

Recursively handling gs:// URLs inside requirements files is not yet supported.
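
A minimal sketch of the shadowing approach, for orientation (illustrative only, not the code under review; the class name and the inline download helper are hypothetical, and the real module factors the download logic into a separate utility):

    import os
    import re
    import shlex
    import subprocess
    import sys
    import tempfile

    from google.cloud import storage
    from IPython.core.magic import Magics, line_magic, magics_class

    @magics_class
    class DataprocMagics(Magics):
        @line_magic
        def pip(self, line: str) -> None:
            if "gs://" not in line:
                # No GCS URLs: behave like the built-in %pip magic.
                subprocess.run([sys.executable, "-m", "pip", *shlex.split(line)])
                return
            client = storage.Client()
            with tempfile.TemporaryDirectory() as tmp_dir:
                def fetch(match: re.Match) -> str:
                    # Download gs://bucket/path and hand pip the local path instead.
                    bucket_name, blob_name = match.group(1).split("/", 1)
                    local_path = os.path.join(tmp_dir, os.path.basename(blob_name))
                    client.bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
                    return local_path

                rewritten = re.sub(r"gs://(\S+)", fetch, line)
                subprocess.run([sys.executable, "-m", "pip", *shlex.split(rewritten)])

    def load_ipython_extension(ipython):
        # Registering the Magics subclass shadows the built-in %pip line magic.
        ipython.register_magics(DataprocMagics)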

Background

Dataproc customers want to be able to install from custom requirements files and/or wheels on private storage without the overhead of setting up Google Artifact Registry or similar. Doing so from within a Dataproc cluster or Serverless Spark runtime is easiest if the storage is GCS; generally speaking, customers are already familiar with managing private buckets for this purpose.

In an ideal world, this functionality would come straight from pip, for example via an extension mechanism that supports gs:// URLs directly, or more customizable authentication for HTTPS URLs (see pypa/pip#8042 and pypa/pip#4475). In that sense, putting the functionality in this repo is a stopgap; on the other hand, judging by the comments on those issues, those changes to pip are unlikely to land soon.

Open questions

These need to be resolved before this PR is merged.

  1. Is this the right repo to package it in? We've had some discussion internally, and this repo is a reasonable candidate since it doesn't require the overhead of a new repo. Given that this is something of a stopgap, we don't need to overthink it.

  2. Should we install only via an extra, like dataproc-spark-connect[magic]?

  3. How should integration tests be implemented? The current implementation includes three rather different alternatives; we should pick one during review and delete the others.

    • test_magics_with_ipython.py: Use the IPython API to start a kernel; use the wire protocol to extract cell output.
      • Pro: kernel manager API is conceptually the right fit.
      • Con: need to handle quite a few bits of the wire protocol, e.g. content types, execution states, etc.
    • test_magics.py: Use jupyter_kernel_test, which also relies on the kernel manager API, but provides a helper function hiding the protocol details we would otherwise implement manually.
      • Pro: protocol is abstracted away
      • Con: library is designed for validating kernel implementations, not magic implementations; among other things, the KernelTest class has a bunch of irrelevant test methods that get run every time.
    • [My current preference] test_magics_with_papermill.py: Programmatically create ipynbs and run them with papermill (see the sketch after this list).
      • Pro: papermill doesn't expose the wire protocol, just produces cell outputs
      • Con: still need to understand the notebook file structure to generate test notebooks and parse outputs
  4. Artifact setup for integration tests. Currently, tests must be run with:

    DATAPROC_TEST_BUCKET='dborowitz-public' .venv/bin/pytest -n auto --tb=auto -v 'tests/integration/dataproc_magics/test_magics_with_papermill.py'
    

    I don't love that I had to (a) populate a test bucket with specific artifacts and (b) validate them against known safe/correct values. But the alternative of uploading new objects to a writable bucket at test runtime also seems like a pain. I'm open to alternative suggestions here.
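
To make the papermill option from question 3 concrete, here is a rough sketch of the flow (the test under review shells out to papermill; this sketch uses the Python APIs for brevity, and the gs:// path is a placeholder):

    import nbformat
    import papermill

    # Build a throwaway notebook that exercises the magic.
    nb = nbformat.v4.new_notebook()
    nb.cells = [
        nbformat.v4.new_code_cell("%load_ext google.cloud.dataproc_magics"),
        nbformat.v4.new_code_cell("%pip install gs://bucket/mypackage.whl"),
    ]
    nbformat.write(nb, "input.ipynb")

    # Execute it; papermill writes an output notebook with cell outputs filled in.
    papermill.execute_notebook("input.ipynb", "output.ipynb", kernel_name="python3")

    # Parse the output notebook and assert on pip's stream output.
    out = nbformat.read("output.ipynb", as_version=4)
    for cell in out.cells:
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                print(output["text"])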

Future work

These may be addressed in future PRs. Examples:

  • Handle gs:// URLs inside gs:// requirements files. This requires implementing some recursion.
  • Handle gs:// URLs from addArtifact for Spark Connect sessions.
  • Automagically call addArtifact on any in-scope Spark Connect sessions during %pip install.
  • Handle gs:// URLs inside local requirements files, e.g. -r requirements.txt where requirements.txt is a local file containing -r gs://bucket/requirements.txt. This opens up a whole can of worms, because we need robust command-line parsing; we can't just search and replace gs:// URLs.
  • Handle gs:// URLs inside remote (https) requirements files. This requires respecting pip's auth configurations.

(This is mostly illustrative; detailed discussion of these features should happen outside this PR.)

@dborowitz dborowitz requested review from medb and ojarjur January 5, 2026 21:56
@gemini-code-assist (Contributor)

Summary of Changes

Hello @dborowitz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement for Dataproc users by enabling the installation of Python packages directly from Google Cloud Storage (GCS) within IPython environments. By providing a custom %pip install magic, users can now seamlessly leverage GCS for hosting private or custom Python dependencies, streamlining their workflow without the need for additional artifact registries. The implementation involves a temporary download mechanism for GCS-hosted files, ensuring compatibility with standard pip operations, and is accompanied by comprehensive testing strategies to validate its functionality.

Highlights

  • New dataproc_magics Module: A new module, google.cloud.dataproc_magics, has been introduced to extend IPython functionality for Dataproc users.
  • GCS Support for %pip install: The %pip install magic command now supports installing Python packages and requirements directly from Google Cloud Storage (GCS) URLs (e.g., gs://bucket/mypackage.whl or gs://bucket/myrequirements.txt).
  • Shadowing Built-in %pip Magic: The new functionality works by shadowing the existing IPython %pip magic, intercepting GCS URLs, downloading the files to a temporary local path, and then passing the local path to the original pip command.
  • GCS Downloader Utility: A GcsDownloader utility has been added to handle the temporary downloading of files from GCS, ensuring proper cleanup after use (a sketch follows this list).
  • Multiple Integration Test Approaches: Three different integration test files (test_magics.py, test_magics_with_ipython.py, test_magics_with_papermill.py) have been added to explore various methods for testing IPython magics, with papermill being the author's current preference.
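
The download-and-cleanup pattern mentioned above might look roughly like this (a sketch under assumed names; the real GcsDownloader lives in _internal/dl.py and its interface may differ):

    import os
    import shutil
    import tempfile

    from google.cloud import storage

    class GcsDownloader:
        """Downloads gs:// objects to a temp dir and removes them on exit."""

        def __enter__(self):
            self._tmp_dir = tempfile.mkdtemp(prefix="dataproc-magics-")
            self._client = storage.Client()
            return self

        def download(self, gs_url: str) -> str:
            bucket_name, blob_name = gs_url.removeprefix("gs://").split("/", 1)
            local_path = os.path.join(self._tmp_dir, os.path.basename(blob_name))
            self._client.bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
            return local_path

        def __exit__(self, *exc_info):
            # Cleanup runs even if the pip invocation fails, so no temp files leak.
            shutil.rmtree(self._tmp_dir, ignore_errors=True)

    # Example usage inside the magic:
    # with GcsDownloader() as dl:
    #     local_whl = dl.download("gs://bucket/mypackage.whl")
    #     ...pass local_whl to pip...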


@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a new dataproc_magics module to enable %pip install from GCS URLs, which is a great feature for Dataproc users. The implementation looks solid, correctly shadowing the existing %pip magic and handling GCS downloads in a temporary directory. The code is well-structured into __init__.py for the extension hooks, _internal/magic.py for the magic logic, and _internal/dl.py for the download helper.

I've left a couple of minor suggestions for improvement:

  • In dl.py, to improve error reporting during temporary file cleanup.
  • In the papermill integration test, to enhance debuggability on test failures.

Regarding the integration testing strategy, I agree with your preference for the papermill approach (test_magics_with_papermill.py). It provides the best end-to-end validation in a realistic notebook environment while abstracting away the complexities of the Jupyter protocol, making the tests cleaner and more maintainable. I recommend proceeding with this approach and removing the other two integration test files (test_magics.py and test_magics_with_ipython.py).

Overall, this is a well-executed feature addition. Great work!

result = subprocess.run(cmd, text=True)
assert (
    result.returncode == 0
), f"Papermill execution failed with exit code {result.returncode}.\n"
Contributor

medium

For better debuggability of test failures, it would be helpful to capture and display stdout and stderr from the papermill subprocess if it fails. This will make it much easier to understand why the notebook execution failed.

Suggested change
result = subprocess.run(cmd, text=True)
assert (
    result.returncode == 0
), f"Papermill execution failed with exit code {result.returncode}.\n"
result = subprocess.run(cmd, text=True, capture_output=True)
assert (
    result.returncode == 0
), f"Papermill execution failed with exit code {result.returncode}.\nStdout:\n{result.stdout}\nStderr:\n{result.stderr}"

Author

This isn't necessary; without capturing, it streams directly to stdout/stderr. This (along with a lot of pip output) is normally suppressed by pytest, but with pytest -s you can actually see the papermill progress bar move.

Here is what I see running a single test function with -s:

Successfully installed dataproc-spark-connect-1.0.1 google-cloud-dataproc-5.23.0 grpc-google-iam-v1-0.14.3 grpcio-1.76.0 grpcio-status-1.76.0 numpy-2.4.0 pandas-2.3.3 py4j-0.10.9.9 pyarrow-22.0.0 pyspark-4.0.1 pytz-2025.2 tzdata-2025.3 websockets-15.0.1
Executing notebook with papermill: /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/input.ipynb -> /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/output.ipynb
Input Notebook:  /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/input.ipynb
Output Notebook: /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/output.ipynb
Executing:   0%|                                                                                                                                                                                                        | 0/5 [00:00<?, ?cell/s]
Executing notebook with kernel: python3
Executing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.42cell/s]
.Cleaning up temp dir /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k

@dborowitz dborowitz force-pushed the magics branch 3 times, most recently from ca0a794 to 9023bc8 on January 6, 2026 at 20:54
The initial goal of this module is to shadow the built-in `%pip` magic
to support installing from `gs://` URLs in addition to local files and
`https://` URLs. (Other magic implementations may be added later.)

In this commit, we support only the following constructions:

- %pip install gs://bucket/mypackage.whl
- %pip install -r gs://bucket/myrequirements.txt

Recursively handling `gs://` URLs inside requirements files is not yet
supported.

@magic.line_magic
def pip(self, line: str) -> None:
    if "gs://" in line:
Contributor
Do we need to check whether "install" is also present, in case there are other custom flags that may accept a GCS URL?
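
One possible shape for that check (illustrative only; should_rewrite is a hypothetical helper, not code from this PR):

    import shlex

    def should_rewrite(line: str) -> bool:
        # Only rewrite gs:// URLs when the pip subcommand is `install`;
        # everything else is passed through to pip untouched.
        args = shlex.split(line)
        return bool(args) and args[0] == "install" and "gs://" in line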
