Skip to content

Conversation

aignas
Copy link
Collaborator

@aignas aignas commented Oct 5, 2025

Initial API proposal for a stable API for the experimental_index_url.

TODO for this PR:

  • Agree on the interface and flow - it should be described either as a
    sketch in docs in how to or in the attributes.
  • Implement
  • Deprecation messages for the experimental_*index* flags

WIP

#2951

@aignas
Copy link
Collaborator Author

aignas commented Oct 5, 2025

@dougthor42, could you please comment on how you imagine the API work for the downloader configuration?

Copy link
Collaborator

@dougthor42 dougthor42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm a little concerned with forcing all hubs to use the same index_url and index_overrides settings, as there are use cases where one hub (say a python 3.12 hub) needs to pull from a different index than another hub (eg py3.13).

But maybe I'm misunderstanding the proposal and it's still possible to set things on a per-hub basis.

Another thing I'd like to see - which admittedly is probably out of scope - is regex-based index_overrides. For example, we have many internal packages prefixed with qh- or pyle-, and having to list all those out in index_overrides is a little cumbersome.

All that said, I think that the core API of (bazel_)downloader, index_url, and index_overrides is pretty reasonable.

The indexes must support Simple API as described here:
<https://packaging.python.org/en/latest/specifications/simple-repository-api/>
If `skip` is used as a value, then we will not use the downloader for a particular package. The
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting to me. Can you elaborate on what problem it's trying to solve?

I think I like the idea, but I'm concerned that it'll cause hard-to-debug issues (such as having correct auth for Bazel downloader but not for the non-Bazel downloader case, or vice-versa).

The indexes must support Simple API as described here:
<https://packaging.python.org/en/latest/specifications/simple-repository-api/>
If `skip` is used as a value, then we will not use the downloader for a particular package. The
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
If `skip` is used as a value, then we will not use the downloader for a particular package. The
If `skip` is used as a value, then we will not use the Bazel downloader for a particular package. The

:::
""",
),
"index_overrides": attr.string_dict(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought about cases where packages of a particular version must be downloaded from index A, while others versions must be downloaded from index B?

index_overrides = {
    "friendly-bard<=1.0.0,>3.0": INDEX_A,
    "friendly-bard": INDEX_B,
}

Or where windows versions are on a different index from linux versions?

TBH I'm not sure how useful such a thing would be, but I could see edge cases requiring it such as when a private index isn't yet updated with the latest version of a package but users must absolutely have the latest version.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the use case? Why would an index have version 1 but not version 2? Or the distributions for windows, but not linux?

The case I can think of is e.g. cuda. Torch has different index urls for different cuda etc builds, but the env marker language can't express that anyways.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A real-life example is airlock/, which has a 2 week soak time for new releases: someone deploys Version 2 to PyPI but it will not show up in Airlock for two weeks - only Version 1 is available until the new version gets vetted and whatnot.

A made-up example is that a package was migrated from INDEX_A to INDEX_B and because $REASON, historical artifacts could not be ported. I highly doubt such a case would occur.

I can't think of any logical reason why an index would only have artifacts of a given OS.

Admittedly this is mostly just theorycrafting. Even in the Airlock case, we just tell people to wait the 2 weeks...

:::
""",
),
"index_url": attr.string(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan on dropping extra_index_urls?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that extra index urls can come from pip args and requirements files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so extending that premise: why not also pull index_url from the requirements.txt files too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that we have to have a default somewhere if the index is not specified. If it is specified in the lock file, then we can either:

  • Use them as extra index_urls.
  • Use them and ignore the default that we set.
  • Use PyPI as a fallback which gets used if else fails.

At least this is for requirements.txt, for PEP751 lock files the index is there always (I think).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use [index_url specified in the lock file] as extra index_urls

I fear that this would risk package name collisions and would be difficult to figure out the index ordering.

Use them and ignore the default that we set.

This sounds reasonable to me, but then a root module will not be able to override a child module's config (which I believe is a desired feature) without adjusting the child's requirements.txt directly:

./MODULE.bazel        # sets pip.default(downloader = "enabled", index_url = BAR)
./requirements.txt    # sets --index-url FOO
./child_module/       
    MODULE.bazel      # sets pip.default(downloader = "enabled", index_url = AAA)
    requirements.txt  # sets --index-url BAD

If we ignore pip.default, the root MODULE.bazel will pull from FOO and the child will pull from BAD. Right?

Use PyPI as a fallback which gets used if else fails.

If I set index_url in any of my requirements.txt files, I do not want pypi used at all.

$ head requirements.txt 
# This file was autogenerated by uv via the following command:
#    uv pip compile --python-version=3.12 --output-file=requirements.txt requirements.in
--index-url https://[email protected]/REDACTED/simple
--extra-index-url https://[email protected]/REDACTED/simple

absl-py==2.3.0 \

for PEP751 lock files the index is there always (I think).

It's optional 😞


I think it's reasonable to put all index config in Bazel and completely ignore any index config set in requirements.txt or PEP751 pylock.toml files.

Option values:
* `disabled` - disable downloader.
* `debug` - stop using parallel downloading for debugging cases.
* `enabled` - enable downloader for all hubs.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabled would only apply to the hub that it's listed on, right?

pip.parse(hub_name="foo", downloader="enabled", ...)
pip.parse(hub_name="bar", downloader="disabled", ...)

Oh this is the global config as part of _default_attrs and the pypi module_extension, not the .parse tag class. I didn't notice that at first.

So this would be used via:

pip = use_extension(@rules_python//python/extensions:pip.bzl, "pip")
pip.default(downloader = "enabled", index_overrides = {...})
pip.parse(...)

Correct?

Would the pip.parse still support extending/overriding index_overrides and indexes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering about the index_overrides - the root module should be able to override anything here. However, what happens if a non-root module is defining this override. To do this blindly is a little sweat-inducing because of all of the recent supply chain attacks the public registries faced. I would say that we should keep index_overrides as a root-module-only option. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that non-root module overrides is probably a Bad Idea ™️. I'm more asking about the pip.parse overrides:

pip.default(
    downloader = "enabled",
    index_url = AIRLOCK_INDEX_URL,
    index_overrides = {"foo": INDEX_A, "bar": INDEX_A},
)
pip.parse(
    # this parse pulls from 2 indexes: AIRLOCK_INDEX_URL and INDEX_A
    hub_name = "pypi",
    python_version = PYTHON_CURRENT_VERSION,
    requirements_lock = "//:requirements.txt",
)
pip.parse(
    # this parse pulls from 3 indexes: AIRLOCK_INDEX_URL for most things,
    # INDEX_A for 'bar' package, and INDEX_B for 'foo' package
    experimental_index_url_overrides = {"foo": INDEX_B},  # same package, different index
    hub_name = "pypi",
    python_version = PYTHON_NEXT_VERSION,
    requirements_lock = "//:requirements_pynext.txt",
)

Like elsewhere, this is mostly hypothetical - I personally haven't had a need to use different indexes for different (python versions | hubs | requirement lock files).

doc = """\
Option values:
* `disabled` - disable downloader.
* `debug` - stop using parallel downloading for debugging cases.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this also increase logging levels? Or is that separate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that we can get rid of parallel downloading. Could not think of a better name for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think debug is fine for such a case.

Having a dedicated serial, download_serially, sequential, etc. value might be too restrictive.

} | AUTH_ATTRS
} | AUTH_ATTRS | {
# Downloader options
"downloader": attr.string(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say bazel_downloader is better. As is, it kinda sounds like it would modify downloading altogether (similar to pip.parse.download_only).

default = "auto",
doc = """\
Option values:
* `disabled` - disable downloader.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: mention that pip will be used to download things instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, except instead of pip, be vague and say some external tool

:::
""",
),
"index_overrides": attr.string_dict(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A real-life example is airlock/, which has a 2 week soak time for new releases: someone deploys Version 2 to PyPI but it will not show up in Airlock for two weeks - only Version 1 is available until the new version gets vetted and whatnot.

A made-up example is that a package was migrated from INDEX_A to INDEX_B and because $REASON, historical artifacts could not be ported. I highly doubt such a case would occur.

I can't think of any logical reason why an index would only have artifacts of a given OS.

Admittedly this is mostly just theorycrafting. Even in the Airlock case, we just tell people to wait the 2 weeks...

:::
""",
),
"index_url": attr.string(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so extending that premise: why not also pull index_url from the requirements.txt files too?

doc = """\
Option values:
* `disabled` - disable downloader.
* `debug` - stop using parallel downloading for debugging cases.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think debug is fine for such a case.

Having a dedicated serial, download_serially, sequential, etc. value might be too restrictive.

packages come from which source. We want to avoid issues similar to what happened in
https://pytorch.org/blog/compromised-nightly-dependency/.
The indexes must support Simple API as described here:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are url-based requirements specifiers in scope? For example pip @ https://github.com/pypa/pip/archive/22.0.2.zip is equivalent to

index_overrides = {
    "pip": "https://github.com/pypa/pip/archive/22.0.2.zip"
}

Note: IMO this should not be valid, I'm just trying to get clarification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants