-
-
Notifications
You must be signed in to change notification settings - Fork 429
2nd attempt at Reproducible out folder contents #4642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This is getting pretty close to getting |
One random idea I had once, but never enough time to explore: Use a java.nio.File system to have a virtual paths. With the idea that:
I guess, if the |
|
@roman-mibex-2 that could work, but one big issue is we need to support passing absolute paths to third-party subprocesses as well, and those are entirely out of the JVM's control. So we do need some kind of OS/FS-level handling to make those relative paths work Using a java agent to fix paths passed to third part subprocesses won't work, because the subprocesses get a mix of files-on-disk, command-line strings, environment variable strings, any of which could have the paths embedded within them. It's impossible to look at a string or file and identify if it is whole-ly or partially composed of absolute paths, e.g. how would we look at a gzipped messagepack blob on disk and fix up the paths there? One generalization that could make your idea work is to use a FUSE filesystem to redirect the paths at the filesystem level. Bazel offers this IIRC on linux (https://bazel.build/versions/7.5.0/docs/sandboxing#sandboxfs), not sure how hard it would be to port over to Mill. Running such a FUSE filesystem across different Mac/Linux/Windows environments would also be a challenge (e.g. needing kernel drivers or sudo) |
|
Maybe this is helpful. |
|
Another idea is to |
1d3b959 to
3ba698a
Compare
e218f04 to
0c15f18
Compare
|
Rebased this on top of latest main branch, needs sbt/zinc#1638 for the reproducibility test to pass |
|
I don't think using sym-links is what we want.
I think the approach to split the problem into a) path mapping and b) paths in configuration data is superior to implicitly persisting via sym-links. I have two draft PRs (that were functional at multiple time points but need rebasing now) that demonstrate the concept and worked. #6031 is for the path mapping and #6129 for the configuration data. Since the latter is using a data structure to hold paths, there is no room for misinterpretation or malicious injection. One learning outcome was, that we don't need to use the path-mapping when spawning other processes. Or more general, we don't need the path-mapping at build-time¹, but only for serializing the cached results, since that's the only thing we want to share with other Mill processes over space and time. The sym-links try to solve something that's probably not needed at all. ¹ Some tools output we store directly need to know it. E.g. zinc for the incremental state, but there is typically a mechanism to handle that. |
We never want to do this on unstructured text. We need to be 100% certain the string to replace is the path we want to mangle. |
reviously, Zinc could produce nondeterministic analysis output because `binaryClassName` used `put(binary, className)` from concurrent `externalLibraryDependency` callbacks. The final value depended on callback scheduling, so `binaryClassName` stores one representative class per binary JAR in a concurrent map, but representative selection was non-deterministic. This PR replaces `put` with deterministic merge using `updateWith`, choosing the lexicographically smallest class name for each binary. `libraryClassName` only needs a stable representative class per binary for lookup/stamp checks; selecting a deterministic representative preserves behavior while removing scheduler dependence. This removes one source of nondeterminism in Zinc analysis serialization and downstream build cache hashes, which was blocking attempts at reproducible builds in Mill (e.g. com-lihaoyi/mill#4642)
|
All of the problems you bring up are true, but we know that symlinked builds basically work despite those problems since that's the approach used by Bazel. Bazel does suffer all the problems, but it works well enough to be useful. This is extremely messy problem space, so it's basically impossible to make it work seamlessly 100% of the time. But I'm less confident that a bespoke design will cover everything we need, v.s. Bazel's battle-tested approach with known flaws and shortcomings |
|
I simply can't hear the "Bazel does it, so it's blessed" argument anymore. All Mill tries is to do stuff better than other tools, if we see some potential room for improvement. |
|
Bazel has a lot of issues, but in many areas it is AFAIK the state of the art, and one of those areas is the space of reproducible/relocatable builds. If there are better designs we can compare it against I'm all ears, but from what I've seen the approaches taken by other tools such as SBT/Gradle/etc. are generally much messier, have much higher API surface area, and are easier for users and plugins to screw up. |
d878516 to
f6ba31d
Compare
6ccbf27 to
758de9f
Compare
5659fc5 to
a89c243
Compare
1550fc5 to
365a4b0
Compare
52df04d to
7a5de49
Compare
95b5744 to
67dd65b
Compare
6df1ea9 to
3d47776
Compare
4c94600 to
4098c0e
Compare
The biggest challenge with making
out/folder reproducible is removing absolute paths. This can be a challenge:Paths are everywhere, sometimes as
os.Paths, sometimes in JSON blobs written to disk, sometimes in strings like"-Xplugin=...".These paths will then get interpreted by in-memory libraries (e.g. Zinc), Scala subprocesses we control (e.g.
mill.testrunner.TestRunnerMain, or subprocesses we do not control (e.g.native-image).These subprocesses may run in a variety of different working directories: some in the workspace root, some in a
.destfolder, or elsewhere.The same path may be passed to different subprocesses, written in different languages, running in different working directories, and should behave the same way.
All of the above works fine if we are using absolute paths: an absolute path means the same thing regardless of how it is serialized or who it is passed to. That is the reason OS-Lib uses absolute
os.Paths by default. However, absolute paths are not reproducible: someone running code in the folder/Users/lihaoyi/millor/Users/someone-else/millwill have different absolute paths, and would thus be unable to share caches keyed on the hashes of those paths.Serializing Absolute Paths to Relative Paths
In order to make this work, we need to serialize absolute paths as relative paths. As we do not in general know where a serialized path is going to end up being used, we need to serialize absolute paths as relative paths that reference the same final destination regardless of the
cwdof the process it is passed to. We can do this as follows:Serialize all absolute paths relative to some stable root folder, e.g. anything in
os.home / "foo/bar/baz"gets serialized as"out/mill-home/foo/bar/baz", anything inTask.workspace / "foo/bar/baz"gets serialized as"out/mill-workspace/foo/bar/baz"Whenever we spawn external processes, we synthesize
out/mill-homeandout/mill-workspacesymlinks that point to their respective destinationsos.homeandTask.workspaceAny subprocess that reads in these paths and then tries to dereference
"out/mill-home/foo/bar/baz"or"out/mill-home/foo/bar/baz", regardless of language, will follow the filesystem symlink and end up reading from the right place on disk. (With the notable exception of Scala subprocesses using OS-Lib, which is stricter about deserializing relative paths as absolute paths than most platforms)This is a similar approach taken by Bazel's Symlink Sandbox, which generates local symlinks in the working directory of any subprocess that Bazel spawns.
Implementation Details
We hook into OS-Lib to control the serialization of
os.Path, allowing it to be written out as prefixed relative paths, and read back in as prefixed relative paths.We also hook into OS-Lib to create the
out/mill-homeandout/mill-workspacesymlinks every time you spawn a process, in that process' working directoryLimitations
This approach to making Mill's paths reproducible suffers from many of the same weaknesses that Bazel has. In general, the symlinks work 99% of the time, but once in a while something can go wrong:
The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by
java.lang.ProcessBuilder, which may include subprocesses spawned by third-party libraries we useAny transitive subprocesses spawned by the subprocesses that Mill creates are out of our control, and thus may not have the proper symlinks set up if the working directory differs from their parent
The symlinks are not transparent: user code will be able to see that the
mill-workspacefolder is a symlink, and some code may not behave correctly when traversing symlinksSome code demands absolute paths and provides no alternative, e.g. the
native-imagebinary for generating Graal executables.Any subprocesses using OS-Lib to deserialize the paths (e.g. our own) need to explicitly use the
os.Path(_, os.pwd)constructor to allow it to handle relative paths and resolve them from the current working directly.Notes
There are some non-path-related changes in this PR:
The Zinc incremental compiler has its own
ReadWriteMappermechanism for customizing serialization of paths, so we hook into that inZincWorkerImplWe tweak the
valueHashcomputation inGroupExecutionto take the hash of the serialized JSON, rather than of the original JVM object, since differentos.Paths with different hashes may serialize to the same relative path after the working directly has been substituted, and so we need to hash the JSON to make sure we get a stable hash