feat: Add new dataproc_magics module providing %pip install features
#175
Conversation
Summary of Changes

Hello @dborowitz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement for Dataproc users by enabling the installation of Python packages directly from Google Cloud Storage (GCS) within IPython environments.
Code Review
This pull request introduces a new dataproc_magics module to enable %pip install from GCS URLs, which is a great feature for Dataproc users. The implementation looks solid, correctly shadowing the existing %pip magic and handling GCS downloads in a temporary directory. The code is well-structured into __init__.py for the extension hooks, _internal/magic.py for the magic logic, and _internal/dl.py for the download helper.
I've left a couple of minor suggestions for improvement:
- In `dl.py`, to improve error reporting during temporary file cleanup.
- In the papermill integration test, to enhance debuggability on test failures.
Regarding the integration testing strategy, I agree with your preference for the papermill approach (test_magics_with_papermill.py). It provides the best end-to-end validation in a realistic notebook environment while abstracting away the complexities of the Jupyter protocol, making the tests cleaner and more maintainable. I recommend proceeding with this approach and removing the other two integration test files (test_magics.py and test_magics_with_ipython.py).
Overall, this is a well-executed feature addition. Great work!
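As background on the module layout described above, the IPython extension-hook pattern used by an `__init__.py` in this role is sketched below; the class name and exact registration code are assumptions, not necessarily what this PR contains.

```python
# Sketch of the IPython extension hooks (assumed shape, not the PR's exact code).
from IPython.core.interactiveshell import InteractiveShell


def load_ipython_extension(ipython: InteractiveShell) -> None:
    """Called by `%load_ext dataproc_magics`.

    Registering the Magics subclass makes its `pip` line magic shadow the
    built-in `%pip` magic.
    """
    # Hypothetical class/module names, for illustration only.
    from dataproc_magics._internal.magic import DataprocMagics

    ipython.register_magics(DataprocMagics)
```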
    result = subprocess.run(cmd, text=True)
    assert (
        result.returncode == 0
    ), f"Papermill execution failed with exit code {result.returncode}.\n"
For better debuggability of test failures, it would be helpful to capture and display stdout and stderr from the papermill subprocess if it fails. This will make it much easier to understand why the notebook execution failed.
Suggested change:

    result = subprocess.run(cmd, text=True, capture_output=True)
    assert (
        result.returncode == 0
    ), f"Papermill execution failed with exit code {result.returncode}.\nStdout:\n{result.stdout}\nStderr:\n{result.stderr}"
This isn't necessary; without capturing, it streams directly to stdout/stderr. This (along with a lot of pip output) is normally suppressed by pytest, but with pytest -s you can actually see the papermill progress bar move.
Here is what I see running a single test function with -s:
Successfully installed dataproc-spark-connect-1.0.1 google-cloud-dataproc-5.23.0 grpc-google-iam-v1-0.14.3 grpcio-1.76.0 grpcio-status-1.76.0 numpy-2.4.0 pandas-2.3.3 py4j-0.10.9.9 pyarrow-22.0.0 pyspark-4.0.1 pytz-2025.2 tzdata-2025.3 websockets-15.0.1
Executing notebook with papermill: /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/input.ipynb -> /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/output.ipynb
Input Notebook: /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/input.ipynb
Output Notebook: /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k/output.ipynb
Executing: 0%| | 0/5 [00:00<?, ?cell/s]
Executing notebook with kernel: python3
Executing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00, 1.42cell/s]
.Cleaning up temp dir /var/folders/sj/165_57tx00v_4m0jxcclyvrw000g2d/T/dataproc-magics-pm-jzhceu3k
Force-pushed from ca0a794 to 9023bc8.
The initial goal of this module is to shadow the built-in `%pip` magic to support installing from `gs://` URLs in addition to local files and `https://` URLs. (Other magic implementations may be added later.)

In this commit, we support only the following constructions:

- `%pip install gs://bucket/mypackage.whl`
- `%pip install -r gs://bucket/myrequirements.txt`

Recursively handling `gs://` URLs inside requirements files is not yet supported.
    @magic.line_magic
    def pip(self, line: str) -> None:
        if "gs://" in line:
Do we need to check if "install" is also present in case there are some other custom flags that may accept a GCS url?
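For illustration only (not part of this PR), a stricter guard along those lines might look like:

```python
# Hypothetical stricter check: only intercept actual `install` commands
# that reference a gs:// URL, rather than any line containing "gs://".
def _should_intercept(line: str) -> bool:
    args = line.split()
    return bool(args) and args[0] == "install" and any("gs://" in a for a in args)
```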
The initial goal of this module is to shadow the built-in `%pip` magic to support installing from `gs://` URLs in addition to local files and `https://` URLs. (Other magic implementations may be added later.)

In the initial commit, we support only the following constructions:

- `%pip install gs://bucket/mypackage.whl`
- `%pip install -r gs://bucket/myrequirements.txt`
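To make the intended behavior concrete, here is a rough sketch of one way such a shadowing magic could work, assuming it downloads `gs://` artifacts to a temporary directory and then delegates to the built-in `%pip` magic; the class and helper names are illustrative, not this PR's actual code.

```python
# Illustrative sketch only; class and helper names are assumptions.
import os
import re
import tempfile

from google.cloud import storage
from IPython.core.magic import Magics, line_magic, magics_class


def _download_to(gcs_uri: str, dest_dir: str) -> str:
    """Download gs://bucket/path to dest_dir and return the local file path."""
    bucket_name, _, blob_name = gcs_uri[len("gs://"):].partition("/")
    local_path = os.path.join(dest_dir, os.path.basename(blob_name))
    storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
    return local_path


@magics_class
class DataprocMagics(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        # Keep a handle on the built-in %pip implementation before shadowing it.
        self._builtin_pip = shell.find_line_magic("pip")

    @line_magic
    def pip(self, line: str) -> None:
        if "gs://" in line:
            # Rewrite each gs:// URL to a locally downloaded copy, then fall
            # through to the standard pip magic with the rewritten line.
            tmpdir = tempfile.mkdtemp(prefix="dataproc-magics-")
            line = re.sub(r"gs://\S+", lambda m: _download_to(m.group(0), tmpdir), line)
        self._builtin_pip(line)
```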
Recursively handling `gs://` URLs inside requirements files is not yet supported.

Background
Dataproc customers want to be able to install from custom requirements files and/or wheels on private storage without the overhead of setting up Google Artifact Registry or similar. Doing so from within a Dataproc cluster or Serverless Spark runtime is easiest if the storage is GCS; generally speaking, customers are familiar with managing private buckets for this reason.
In an ideal world, this functionality would come straight from pip, for example by an extension mechanism to support `gs://` URLs directly, or more customizable authentication to support HTTPS URLs (see pypa/pip#8042 and pypa/pip#4475). So in a sense, putting the functionality in this repo is a stopgap; on the other hand, reading the comments on those issues, these changes to pip are unlikely to land soon.

Open questions
These need to be resolved before this PR is merged.
Is this the right repo to package it in? We've had some discussion internally, and this repo is a reasonable candidate since it doesn't require the overhead of a new repo. Given that this is something of a stopgap, we don't need to overthink it.
Should we install only via an extra? Like `dataproc-spark-connect[magic]`.

How should integration tests be implemented? The current implementation has 3 rather different alternatives; we should pick one during review, and delete the others.

- `test_magics_with_ipython.py`: Use the IPython API to start a kernel; use the wire protocol to extract cell output.
- `test_magics.py`: Use jupyter_kernel_test, which also relies on the kernel manager API, but provides a helper function hiding the protocol details we would otherwise implement manually.
- `test_magics_with_papermill.py`: Programmatically create ipynbs and run them with papermill. (A sketch of this approach appears below.)

Artifact setup for integration tests. Currently, tests must be run with:
I don't love the way I had to a) populate a test bucket with specific artifacts, b) validate them against known safe/correct values. But the alternative of uploading new objects to a writable bucket at test runtime also seems like a pain. Open to alternative suggestions here.
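For concreteness, a minimal papermill-style test (the approach in `test_magics_with_papermill.py`) might look roughly like the sketch below; the notebook cells, bucket, and package names are placeholders rather than this PR's actual fixtures.

```python
# Minimal sketch of the papermill-style integration test; the notebook cells,
# bucket, and package names below are placeholders, not the PR's actual fixtures.
import os
import subprocess
import tempfile

import nbformat


def test_pip_install_from_gcs():
    nb = nbformat.v4.new_notebook()
    nb.cells = [
        nbformat.v4.new_code_cell("%load_ext dataproc_magics"),
        nbformat.v4.new_code_cell("%pip install gs://my-test-bucket/mypackage.whl"),
        nbformat.v4.new_code_cell("import mypackage"),
    ]
    with tempfile.TemporaryDirectory() as tmpdir:
        input_path = os.path.join(tmpdir, "input.ipynb")
        output_path = os.path.join(tmpdir, "output.ipynb")
        with open(input_path, "w") as f:
            nbformat.write(nb, f)
        # Execute the generated notebook end-to-end with the papermill CLI.
        cmd = ["papermill", input_path, output_path, "-k", "python3"]
        result = subprocess.run(cmd, text=True)
        assert (
            result.returncode == 0
        ), f"Papermill execution failed with exit code {result.returncode}.\n"
```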
Future work
These may be addressed in future PRs. Examples:
- `gs://` URLs inside `gs://` requirements files. This requires implementing some recursion.
- `gs://` URLs from `addArtifact` for Spark Connect sessions.
- `addArtifact` on any in-scope Spark Connect sessions during `%pip install`.
- `gs://` URLs inside local requirements files, e.g. `-r requirements.txt` where requirements.txt is a local file containing `-r gs://bucket/requirements.txt`. This opens up a whole can of worms because we need robust command-line parsing; we can't just search and replace `gs://` URLs.
- `gs://` URLs inside remote (https) requirements files. This requires respecting pip's auth configurations.

(This is mostly illustrative; detailed discussion of these features should happen outside this PR.)