Skip to content

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Aug 29, 2025

Fixes #13038.

The commit f787788 by @alex performed a PATH traversal and subsequent process execution (the output of rustc --version) in order to add some more information to the User-Agent request header that pip provides to its remote HTTP requests.

f787788 makes reference to information valuable to package maintainers, which appears to be propagated to PyPI's BigQuery output: #9987 (comment) (thanks so much to Alex for explaining this to me -- I'm not sure why the commit message didn't have a link to his PR discussion).

I had previously proposed removing the rustc --version call, but that has been removed and this PR now just enables overriding the telemetry identifier, as in title.

A discussion of standardized telemetry data began to form in #5499, but failed to graduate into a full PEP. Given the very high quality of other Python packaging standards, I would like to propose that this one also be made into a full-on standard, probably somewhere around the Simple Repository API page.

This PR does two things:

  1. Introduces PIP_TELEMETRY_USER_AGENT_ID, an environment variable which completely overrides the value used to form pip's User-Agent header.
  2. Proposes at length (see docstring) how to expand telemetry functionality in a respectful and thoughtful approach, through Python packaging standards.

@alex
Copy link
Member

alex commented Aug 29, 2025

It is not accurate that commit was pushed directly to master, it was submitted as a PR and reviewed #9987 (I'm not a pip committer and couldn't have pushed to master if I wanted)

@cosmicexplorer
Copy link
Contributor Author

I'm extremely sorry. I will delete that immediately. I pinged you because I found it very confusing given the other work I've seen from you.

@cosmicexplorer
Copy link
Contributor Author

I will review those PR comments now. I'm incredibly sorry once again and thanks so much for your very patient and thoughtful response.

@cosmicexplorer cosmicexplorer marked this pull request as draft August 29, 2025 12:32
@cosmicexplorer cosmicexplorer marked this pull request as ready for review August 29, 2025 12:43
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Aug 29, 2025

Ok, I have deleted the revert and retained the changes of #9987. Thanks so much @alex for linking me there. This now just proposes a separate environment variable to override the User-Agent setting. cc @uranusjr @pradyunsg

I can't reproduce the test failure and I am confident my changes continue to produce valid JSON for User-Agent. Unsure what's going on.

@notatallshaw
Copy link
Member

@cosmicexplorer FYI I fixed a typo in your post githb.com -> github.com, I hope you don't mind, the former URL was wanting to install an extension in my browser. And I've added that it fixes #13038.

The commit f787788 by @alex performed a PATH traversal and subsequent process execution (the output of rustc --version) in order to add some more information to the User-Agent request header that pip provides to its remote HTTP requests.

f787788 makes reference to information valuable to package maintainers, which appears to be propagated to PyPI's BigQuery output: #9987 (comment) (thanks so much to Alex for explaining this to me -- I'm not sure why the commit message didn't have a link to his PR discussion).

I had previously proposed removing the rustc --version call, but that has been removed and this PR now just enables overriding the telemetry identifier, as in title.

I don't really understand why the rustc call is being singled out as a security issue, there are many places where pip is calling subprocess.run, subprocess.check_call, and subprocess.check_output, both directly and via vendored code including, presumably, to work out the glib c and musl c versions.

While I see there being other motivations for wanting to override the telemetry part of the user agent string, I'm not sure I quite follow this reasoning. I wonder if it makes sense to appropriate #13038 or make a new issue where we can discuss the reasons and design for overriding the telemetry, rather than the technical issues of the PR.

Overall, I am currently neither for or against this feature, but I will leave a quick review on an immediate technical issue I see with your code.

def user_agent_id() -> str:
return Telemetry.calculate_user_agent_id(os.environ)

CLOBBER_USER_AGENT_ENV_VAR: ClassVar[str] = "PIP_TELEMETRY_USER_AGENT_ID"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever the option is names or becomes please make this a regular pip option, e.g. --telemetry-user-agent-id, it will therefore be accessible via command line, environmental variable (as PIP_TELEMETRY_USER_AGENT_ID), and configuration files.

You can see recent PRs #13534 and #13520 for adding new pip options.

Comment on lines +124 to +127
We now introduce `PIP_TELEMETRY_USER_AGENT_ID`, which will completely overwrite the
string transmitted to remote hosts that pip communicates it. To maintain backwards
compatibility, it is disabled by default, but can be set to the empty string or any
other value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unconvinced by this design over having an option like "disable telemetry", what is the value in users being able to replace the user string with a static arbitrary string?

Comment on lines +76 to +78
FIXME: This string is currently propagated as a header into every single HTTP
request pip makes. It should really be subject to a formal PEP process
in order to ensure user consent around telemetry is respected.
Copy link
Member

@notatallshaw notatallshaw Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please propose this in https://discuss.python.org/c/packaging/14, not in a pip doc string. It's not clear to me that everyone will agree with this.

Comment on lines +92 to +112
Environment markers can therefore be viewed as a language for codebases to
communicate their build process and requirements
(https://packaging.python.org/en/latest/flow/#the-packaging-flow) to a generic
resolver. Spack specs
(https://spack.readthedocs.io/en/latest/spec_syntax.html#sec-specs)
are very similar in spirit, and arose precisely to codify the recursive
and conditional relationships across multi-language software stacks.

Extending this concept, we could consider the *user agent id* as a language for
resolvers to communicate their requirements to external services and
execution environments. For example, clients for the GNU Make Jobserver protocol use
an environment variable to indicate how the jobserver should communicate to them,
and to indicate the maximum bandwidth they can tolerate for parallel execution:
(https://www.gnu.org/software/make/manual/html_node/Job-Slots.html).

The pants build tool specifically highlights the risk of exposing proprietary
information through thoughtless telemetry (https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry).
As a result, not only do they explicitly specify the information being recorded
(https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry#what-data-is-sent),
but they additionally incorporate anonymity as an explicit design goal
(https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry#how-we-ensure-anonymity).
Copy link
Member

@notatallshaw notatallshaw Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems all quite opinionated to be in a docstring, and doesn't recognize a malicious index could extract most of the information from a user by simply observing what wheels are downloaded.

Negotiating standards around useful telemetry data for PyPI began here
(https://github.com/pypa/pip/issues/5499), but never became a full PEP. This commit
(https://github.com/pypa/pip/commit/f787788a65cf7a8af8b7ef9dc13c0681d94fff5f) added
the output of an arbitrary subprocess execution into the string pip attaches to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an "arbitrary" subprocess execution, it's an explicitly named subprocess execution.

# because some CI systems mimic a tty (e.g. Travis CI). Thus that
# method doesn't provide definitive information in either direction.
return any(name in os.environ for name in CI_ENVIRONMENT_VARIABLES)
_rustc_output_regex: ClassVar[re.Pattern[str]] = re.compile(r"^rustc ([^\s]+)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a concern about what the rustc execution could be sending to the index then I am willing to volunteer reviewing a separate PR that adds more strict validation (presumably based on rustc documentation).

But as best as I can tell this is just a regex implementation of the prior code:

if rustc_output.startswith(b"rustc "):
    # The format of `rustc --version` is:
    # `b'rustc 1.52.1 (9bc8c42bb 2021-05-09)\n'`
    # We extract just the middle (1.52.1) part
    data["rustc_version"] = rustc_output.split(b" ")[1].decode()

IMO the prior code had multiple advantages:

  1. It documents it's goal
  2. I don't have to understand a regex to understand the code
  3. It didn't require decoding the whole string (which if we don't trust the string seems better?)

I would at least want the documented goal of this back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pip.conf option to disable linehaul metadata in user-agent
3 participants