-
-
Notifications
You must be signed in to change notification settings - Fork 631
feat(pip): global downloader settings #3322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(pip): global downloader settings #3322
Conversation
@dougthor42, could you please comment on how you imagine the API work for the downloader configuration? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I'm a little concerned with forcing all hubs to use the same index_url
and index_overrides
settings, as there are use cases where one hub (say a python 3.12 hub) needs to pull from a different index than another hub (eg py3.13).
But maybe I'm misunderstanding the proposal and it's still possible to set things on a per-hub basis.
Another thing I'd like to see - which admittedly is probably out of scope - is regex-based index_overrides
. For example, we have many internal packages prefixed with qh-
or pyle-
, and having to list all those out in index_overrides
is a little cumbersome.
All that said, I think that the core API of (bazel_)downloader
, index_url
, and index_overrides
is pretty reasonable.
The indexes must support Simple API as described here: | ||
<https://packaging.python.org/en/latest/specifications/simple-repository-api/> | ||
If `skip` is used as a value, then we will not use the downloader for a particular package. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting to me. Can you elaborate on what problem it's trying to solve?
I think I like the idea, but I'm concerned that it'll cause hard-to-debug issues (such as having correct auth for Bazel downloader but not for the non-Bazel downloader case, or vice-versa).
The indexes must support Simple API as described here: | ||
<https://packaging.python.org/en/latest/specifications/simple-repository-api/> | ||
If `skip` is used as a value, then we will not use the downloader for a particular package. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
If `skip` is used as a value, then we will not use the downloader for a particular package. The | |
If `skip` is used as a value, then we will not use the Bazel downloader for a particular package. The |
::: | ||
""", | ||
), | ||
"index_overrides": attr.string_dict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you thought about cases where packages of a particular version must be downloaded from index A, while others versions must be downloaded from index B?
index_overrides = {
"friendly-bard<=1.0.0,>3.0": INDEX_A,
"friendly-bard": INDEX_B,
}
Or where windows versions are on a different index from linux versions?
TBH I'm not sure how useful such a thing would be, but I could see edge cases requiring it such as when a private index isn't yet updated with the latest version of a package but users must absolutely have the latest version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the use case? Why would an index have version 1 but not version 2? Or the distributions for windows, but not linux?
The case I can think of is e.g. cuda. Torch has different index urls for different cuda etc builds, but the env marker language can't express that anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A real-life example is airlock/, which has a 2 week soak time for new releases: someone deploys Version 2 to PyPI but it will not show up in Airlock for two weeks - only Version 1 is available until the new version gets vetted and whatnot.
A made-up example is that a package was migrated from INDEX_A to INDEX_B and because $REASON, historical artifacts could not be ported. I highly doubt such a case would occur.
I can't think of any logical reason why an index would only have artifacts of a given OS.
Admittedly this is mostly just theorycrafting. Even in the Airlock case, we just tell people to wait the 2 weeks...
::: | ||
""", | ||
), | ||
"index_url": attr.string( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan on dropping extra_index_urls
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that extra index urls can come from pip args and requirements files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so extending that premise: why not also pull index_url
from the requirements.txt files too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that we have to have a default somewhere if the index is not specified. If it is specified in the lock file, then we can either:
- Use them as extra index_urls.
- Use them and ignore the default that we set.
- Use PyPI as a fallback which gets used if else fails.
At least this is for requirements.txt
, for PEP751 lock files the index is there always (I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use [
index_url
specified in the lock file] as extra index_urls
I fear that this would risk package name collisions and would be difficult to figure out the index ordering.
Use them and ignore the default that we set.
This sounds reasonable to me, but then a root module will not be able to override a child module's config (which I believe is a desired feature) without adjusting the child's requirements.txt
directly:
./MODULE.bazel # sets pip.default(downloader = "enabled", index_url = BAR)
./requirements.txt # sets --index-url FOO
./child_module/
MODULE.bazel # sets pip.default(downloader = "enabled", index_url = AAA)
requirements.txt # sets --index-url BAD
If we ignore pip.default
, the root MODULE.bazel
will pull from FOO
and the child will pull from BAD
. Right?
Use PyPI as a fallback which gets used if else fails.
If I set index_url
in any of my requirements.txt files, I do not want pypi used at all.
$ head requirements.txt
# This file was autogenerated by uv via the following command:
# uv pip compile --python-version=3.12 --output-file=requirements.txt requirements.in
--index-url https://[email protected]/REDACTED/simple
--extra-index-url https://[email protected]/REDACTED/simple
absl-py==2.3.0 \
for PEP751 lock files the index is there always (I think).
It's optional 😞
I think it's reasonable to put all index config in Bazel and completely ignore any index config set in requirements.txt or PEP751 pylock.toml
files.
Option values: | ||
* `disabled` - disable downloader. | ||
* `debug` - stop using parallel downloading for debugging cases. | ||
* `enabled` - enable downloader for all hubs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enabled
would only apply to the hub that it's listed on, right?
pip.parse(hub_name="foo", downloader="enabled", ...)
pip.parse(hub_name="bar", downloader="disabled", ...)
Oh this is the global config as part of _default_attrs
and the pypi module_extension
, not the .parse
tag class. I didn't notice that at first.
So this would be used via:
pip = use_extension(@rules_python//python/extensions:pip.bzl, "pip")
pip.default(downloader = "enabled", index_overrides = {...})
pip.parse(...)
Correct?
Would the pip.parse
still support extending/overriding index_overrides
and indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering about the index_overrides
- the root module should be able to override anything here. However, what happens if a non-root module is defining this override. To do this blindly is a little sweat-inducing because of all of the recent supply chain attacks the public registries faced. I would say that we should keep index_overrides
as a root-module-only option. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that non-root module overrides is probably a Bad Idea ™️. I'm more asking about the pip.parse overrides:
pip.default(
downloader = "enabled",
index_url = AIRLOCK_INDEX_URL,
index_overrides = {"foo": INDEX_A, "bar": INDEX_A},
)
pip.parse(
# this parse pulls from 2 indexes: AIRLOCK_INDEX_URL and INDEX_A
hub_name = "pypi",
python_version = PYTHON_CURRENT_VERSION,
requirements_lock = "//:requirements.txt",
)
pip.parse(
# this parse pulls from 3 indexes: AIRLOCK_INDEX_URL for most things,
# INDEX_A for 'bar' package, and INDEX_B for 'foo' package
experimental_index_url_overrides = {"foo": INDEX_B}, # same package, different index
hub_name = "pypi",
python_version = PYTHON_NEXT_VERSION,
requirements_lock = "//:requirements_pynext.txt",
)
Like elsewhere, this is mostly hypothetical - I personally haven't had a need to use different indexes for different (python versions | hubs | requirement lock files).
doc = """\ | ||
Option values: | ||
* `disabled` - disable downloader. | ||
* `debug` - stop using parallel downloading for debugging cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this also increase logging levels? Or is that separate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking that we can get rid of parallel downloading. Could not think of a better name for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think debug
is fine for such a case.
Having a dedicated serial
, download_serially
, sequential
, etc. value might be too restrictive.
} | AUTH_ATTRS | ||
} | AUTH_ATTRS | { | ||
# Downloader options | ||
"downloader": attr.string( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say bazel_downloader
is better. As is, it kinda sounds like it would modify downloading altogether (similar to pip.parse.download_only
).
default = "auto", | ||
doc = """\ | ||
Option values: | ||
* `disabled` - disable downloader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: mention that pip
will be used to download things instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, except instead of pip, be vague and say some external tool
::: | ||
""", | ||
), | ||
"index_overrides": attr.string_dict( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A real-life example is airlock/, which has a 2 week soak time for new releases: someone deploys Version 2 to PyPI but it will not show up in Airlock for two weeks - only Version 1 is available until the new version gets vetted and whatnot.
A made-up example is that a package was migrated from INDEX_A to INDEX_B and because $REASON, historical artifacts could not be ported. I highly doubt such a case would occur.
I can't think of any logical reason why an index would only have artifacts of a given OS.
Admittedly this is mostly just theorycrafting. Even in the Airlock case, we just tell people to wait the 2 weeks...
::: | ||
""", | ||
), | ||
"index_url": attr.string( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so extending that premise: why not also pull index_url
from the requirements.txt files too?
doc = """\ | ||
Option values: | ||
* `disabled` - disable downloader. | ||
* `debug` - stop using parallel downloading for debugging cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think debug
is fine for such a case.
Having a dedicated serial
, download_serially
, sequential
, etc. value might be too restrictive.
packages come from which source. We want to avoid issues similar to what happened in | ||
https://pytorch.org/blog/compromised-nightly-dependency/. | ||
The indexes must support Simple API as described here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are url-based requirements specifiers in scope? For example pip @ https://github.com/pypa/pip/archive/22.0.2.zip
is equivalent to
index_overrides = {
"pip": "https://github.com/pypa/pip/archive/22.0.2.zip"
}
Note: IMO this should not be valid, I'm just trying to get clarification.
Initial API proposal for a stable API for the
experimental_index_url
.TODO for this PR:
sketch in docs in how to or in the attributes.
experimental_*index*
flagsWIP
#2951