Skip to content

Conversation

agriyakhetarpal
Copy link
Member

@agriyakhetarpal agriyakhetarpal commented Feb 22, 2025

Description

This PR adds support for the SOURCE_DATE_EPOCH environment variable when creating wheels from recipes to enable reproducible builds. The implementation follows the Python example from the https://reproducible-builds.org specification.

I can't say that this change makes builds completely reproducible, as we are responsible only for these utility functions and our usage of pypa/build as a build frontend – which doesn't guarantee reproducibility at all (see pypa/build#385); that responsibility relies on the build backend, which we are agnostic to. It's a start towards more reproducibility, however.

I've modified the _make_whlfile function to set the timestamps based on SOURCE_DATE_EPOCH. When SOURCE_DATE_EPOCH is provided, all files inside the wheel will have their timestamps set to the provided date. "January 1st 1980 00:00:00 UTC" (Unix timestamp 315532800) is used as it is the minimum for ZIP file compatibility. A reproducible_filter has been added for tarfile creation that acts similarly to the "data" filter but with added SOURCE_DATE_EPOCH consideration.

The goal is to make static/shared library packages, wheels, and the tests unvendored from wheels if set as reproducible as possible.

@agriyakhetarpal agriyakhetarpal marked this pull request as ready for review February 22, 2025 19:06
Copy link
Member

@ryanking13 ryanking13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @agriyakhetarpal! I think it is a good improvement. Also adding cross-ref for the previous discussion: pyodide/pyodide#4286

Could we have a test for the new behavior?

@agriyakhetarpal
Copy link
Member Author

Thanks for the review, @ryanking13 – I have added a bunch of tests, and also manually confirmed with the integration tests that the change works towards their reproducibility. This should resolve most of the conversation in pyodide/pyodide#4286. I'll leave a comment there about it, which should help, as issues/PRs cross-linked from discussions don't show up in the UI.

I had a couple of questions about the downstream integration of this feature based on that discussion:

  • Should we apply SOURCE_DATE_EPOCH somewhere in our build process in the Pyodide repository, such as the Dockerfile? We could point to the timestamp when the Docker images are built from the Dockerfile using our manual GHA trigger. See: https://docs.docker.com/build/ci/github-actions/reproducible-builds/
  • Should we have a test downstream that checks that the pyodide-lock.json will not be modified by repeatedly rebuilding packages, or is that mostly handled by the tests here?

@hoodmane
Copy link
Member

Should we ...?

I'm not sure but both of these sound like good ideas to me. I know that there is a bit of no determinism in the binaries generated by emcc, not sure about shared libraries. Maybe we should open an issue on emscripten asking what the status of reproducible builds is.

@hoodmane
Copy link
Member

What feels to me like it would make the most sense would be to set the date based on when the latest commit occurred.

@agriyakhetarpal
Copy link
Member Author

I found emscripten-core/emscripten#7714, though from the conversation I feel that it was closed as "won't fix" for an unrelated reason. I don't want to comment there asking for an update as it is quite old, but if the premise is that if LLVM guarantees reproducibility as per https://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html, we should be fine with even level 1 determinism even if we can't have level 4 determinism right now.

One thing that could be nice is to switch the Pyodide runtime's versioning scheme to be based on Git tags. That could require a bit of effort because of all the moving parts everywhere, though (especially downstream).

@agriyakhetarpal
Copy link
Member Author

agriyakhetarpal commented Feb 24, 2025

I'll mark it as a draft for now, as I notice that the SOURCE_DATE_EPOCH isn't being respected properly when I run pyodide build-recipes, but it works well with pyodide build on an individual source directory. I'll be attempting to generate a lockfile in the integration tests, comparing with which can ensure that the checksums aren't modified when rebuilding as much as possible (it is not the case right now).

@agriyakhetarpal agriyakhetarpal marked this pull request as draft February 24, 2025 09:47
@hoodmane
Copy link
Member

We filter environment variables. Is the problem that you haven't updated that logic to let SOURCE_DATE_EPOCH through?

@agriyakhetarpal
Copy link
Member Author

Ah, I didn't know that. Could you please share a pointer where we do so in the pyodide build-recipes phase?

@agriyakhetarpal
Copy link
Member Author

I've tried to pass it to the BuildArgs dataclass for a start, but I don't know if that's the right approach.

@hoodmane
Copy link
Member

You need to add it here:
https://github.com/pyodide/pyodide-build/blob/main/pyodide_build/config.py?plain=1#L156

Any environment variable not in BUILD_KEY_TO_VAR is filtered out.

@agriyakhetarpal
Copy link
Member Author

Done, but I'm not too sure if it works – as the checksum for pydoc_data-1.0.0-py2.py3-none-any.whl still changes every time in the generated lockfile.

However, running SOURCE_DATE_EPOCH=1735689600 pyodide build integration_tests/recipes/pydoc_data/src always generates a wheel with the same checksum.

@hoodmane
Copy link
Member

hoodmane commented Feb 24, 2025

Make the recipe echo the environment variable in the build script to check if it made it through.

@agriyakhetarpal
Copy link
Member Author

Yes, it makes it through:

 26100K .......... .......... .......... .......... .......... 99% 16.5M 0s
 26150K .......... .......... .......... .......... .......... 99% 18.3M 0s
 26200K .......... .......... .......... .......... .......... 99% 14.0M 0s
 26250K .......... .......... .......... .......... .......... 99% 26.7M 0s
 26300K .......... .......... .......... .......... .......... 99% 15.7M 0s
 26350K .......... .......... .......... ...                  100% 12.2M=1.3s

2025-02-24 16:26:34 (19.6 MB/s) - ‘Python-3.12.7.tgz’ saved [27016272/27016272]

SOURCE_DATE_EPOCH: 1735689600
Successfully built /Users/agriyakhetarpal/Desktop/agriya-pyodide-build/integration_tests/recipes/pydoc_data/build/pydoc_data-1.0.0/dist/pydoc_data-1.0.0-py2.py3-none-any.whl
Unpacking wheel to                                                                                                                           
/Users/agriyakhetarpal/Desktop/agriya-pyodide-build/integration_tests/recipes/pydoc_data/build/pydoc_data-1.0.0/dist/pydoc_data-1.0.0-py2.py3
-none-any.whl                                                                                                                                
Unpacking to: /var/folders/b3/2bq1m1_50bs4c7305j8vxcqr0000gn/T/tmpek4du66r/pydoc_data-1.0.0...OK
Repacking wheel as /Users/agriyakhetarpal/Desktop/agriya-pyodide-build/integration_tests/recipes/pydoc_data/build/pydoc_data-1.0.0/dist/pydoc_data-1.0.0-py2.py3-none-any.whl...OK
[2025-02-24 16:26:35] Succeeded building package pydoc_data in 3.5 seconds.

That makes me think, maybe it is pyodide-lock that unpacks and repacks the wheels to get the metadata? I'm completely unfamiliar with its codebase at the moment, but could start reading it.

@agriyakhetarpal
Copy link
Member Author

That makes me think, maybe it is pyodide-lock that unpacks and repacks the wheels to get the metadata? I'm completely unfamiliar with its codebase at the moment, but could start reading it.

Answering my own question: yes, I think this is what is happening, as

shasum -a 256 integration_tests/recipes/pydoc_data/dist/pydoc_data-1.0.0-py2.py3-none-any.whl

returns the same output on repeated runs.

@agriyakhetarpal
Copy link
Member Author

That means we should be good to go for this PR, but we'll need to update pyodide-lock similarly, and get a new v0.1.0a9 release out for it to be used here. Thanks for the tips, @hoodmane!

@agriyakhetarpal
Copy link
Member Author

agriyakhetarpal commented Feb 24, 2025

🤔 Building NumPy repeatedly outside of the lockfile generation displays different checksums every time for its wheels1. Still, upon opening the wheel and the test archives, I see that SOURCE_DATE_EPOCH has been respected properly. We can't control Meson too much at all as a build frontend, but their motto is that "non-reproducibility is a bug": https://mesonbuild.com/Reproducible-builds.html

However, both the shasum -a 256 integration_tests/recipes/.libs/lib/libz.a and shasum -a 256 integration_tests/recipes/.libs/lib/libgeos.so commands return the same checksum for zlib and geos respectively, indicating that their builds were reproducible indeed.

So, in general, I wonder if we're checking for the right things here by trying to validate checksums.

Footnotes

  1. This is possibly because NumPy's recipe has a post: section. Unvendoring the tests is reproducibly managed, but the post-script doesn't go through such code paths at the moment, I'll investigate.

@agriyakhetarpal
Copy link
Member Author

The difference in the checksums comes from the fact that we don't build without isolation.

diff -r integration_tests/recipes/numpy/dist/numpy-2.2.3-cp312-cp312-pyodide_2024_0_wasm32/numpy/__config__.py integration_tests/recipes/numpy/dist1/numpy-2.2.3-cp312-cp312-pyodide_2024_0_wasm32/numpy/__config__.py
96c96
<             "path": r"/private/var/folders/b3/2bq1m1_50bs4c7305j8vxcqr0000gn/T/build-env-eecbpwc2/bin/python",
---
>             "path": r"/private/var/folders/b3/2bq1m1_50bs4c7305j8vxcqr0000gn/T/build-env-mtjzvpw6/bin/python",
diff -r integration_tests/recipes/numpy/dist/numpy-2.2.3-cp312-cp312-pyodide_2024_0_wasm32/numpy-2.2.3.dist-info/RECORD integration_tests/recipes/numpy/dist1/numpy-2.2.3-cp312-cp312-pyodide_2024_0_wasm32/numpy-2.2.3.dist-info/RECORD
1c1
< numpy/__config__.py,sha256=pcS5KlVZFsJDU9lAXUg_JGvcAceIBPbP_1bcQBdjZlM,5015
---
> numpy/__config__.py,sha256=J9aPth6duJdP4BjknvJiCVOg1JIQa4FInIP4SqrFoSw,5015

Subsequently, as the wheel differs, numpy-tests.tar also propagates the same effect for some reason. Note to self: here's the diff from the timestamps in the test tarballs: https://www.diffchecker.com/XSGLlhRx/

@agriyakhetarpal
Copy link
Member Author

I think I'm running into the same problem I faced in pyodide/pyodide#5031. Is there a way to add --no-build-isolation when building NumPy, or would we need to add support for that?

@agriyakhetarpal
Copy link
Member Author

TODO:

  • Reproducible builds for packages in the integration tests: pydoc_data, geos, and zlib
  • Fix NumPy reproducibility issue by allowing it to not build under isolation
    • Extend pypabuild.py and our CLI to have a non-isolated option based on pypa/build code paths?
  • Investigate the existence of post-script reproducibility issue
  • Generate lockfile in integration tests and add a small bash/jq script as a test for reproducibility of sorts
    • Investigate why the lockfile generation step (possibly in pyodide-lock) modifies the checksums for packages

@ryanking13
Copy link
Member

You need to add it here: https://github.com/pyodide/pyodide-build/blob/main/pyodide_build/config.py?plain=1#L156

Any environment variable not in BUILD_KEY_TO_VAR is filtered out.

I believe this is not true. If it does, it should be a bug. We passthrough all host env variables to the builder.

@ryanking13
Copy link
Member

ryanking13 commented Feb 24, 2025

I think I'm running into the same problem I faced in pyodide/pyodide#5031. Is there a way to add --no-build-isolation when building NumPy, or would we need to add support for that?

Thanks for the investigation. Supporting --no-build-isolation would be a reasonable approach to fix this issue. But I also think it might a bit over-engineering just to solve the reproducibility issue. I think it is something that the downstream packages should handle, not the build system.

Anyway, we've got the request to support no isolation mode a few times, so it's probably worth discussing in a separate issue.

@ryanking13
Copy link
Member

ryanking13 commented Feb 24, 2025

That makes me think, maybe it is pyodide-lock that unpacks and repacks the wheels to get the metadata? I'm completely unfamiliar with its codebase at the moment, but could start reading it.

Interesting. Probably wheel CLI to unpack and repack the wheel is changing the timestamp in the zipfile? It is not in pyodide-lock but I think it will be moved to pyodide-lock in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants