Conversation

@lihaoyi
Member

@lihaoyi lihaoyi commented Mar 4, 2025

The biggest challenge with making the out/ folder reproducible is removing absolute paths. This is hard for several reasons:

  1. Paths are everywhere, sometimes as os.Paths, sometimes in JSON blobs written to disk, sometimes in strings like "-Xplugin=...".

  2. These paths will then get interpreted by in-memory libraries (e.g. Zinc), Scala subprocesses we control (e.g. mill.testrunner.TestRunnerMain), or subprocesses we do not control (e.g. native-image).

  3. These subprocesses may run in a variety of different working directories: some in the workspace root, some in a .dest folder, or elsewhere.

  4. The same path may be passed to different subprocesses, written in different languages, running in different working directories, and should behave the same way.

All of the above works fine if we are using absolute paths: an absolute path means the same thing regardless of how it is serialized or who it is passed to. That is the reason OS-Lib uses absolute os.Paths by default. However, absolute paths are not reproducible: one person running code in /Users/lihaoyi/mill and another in /Users/someone-else/mill will have different absolute paths, and would thus be unable to share caches keyed on the hashes of those paths.

Serializing Absolute Paths to Relative Paths

In order to make this work, we need to serialize absolute paths as relative paths. As we do not in general know where a serialized path is going to end up being used, we need to serialize absolute paths as relative paths that reference the same final destination regardless of the cwd of the process it is passed to. We can do this as follows:

  1. Serialize all absolute paths relative to some stable root folder, e.g. anything in os.home / "foo/bar/baz" gets serialized as "out/mill-home/foo/bar/baz", anything in Task.workspace / "foo/bar/baz" gets serialized as "out/mill-workspace/foo/bar/baz"

  2. Whenever we spawn external processes, we synthesize out/mill-home and out/mill-workspace symlinks that point to their respective destinations os.home and Task.workspace

  3. Any subprocess that reads in these paths and then tries to dereference "out/mill-home/foo/bar/baz" or "out/mill-workspace/foo/bar/baz", regardless of language, will follow the filesystem symlink and end up reading from the right place on disk. (The notable exception is Scala subprocesses using OS-Lib, which is stricter than most platforms about deserializing relative paths as absolute paths.)
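
The prefix mapping in step 1 can be sketched in a language-neutral way. The following is a minimal Python sketch, not Mill's implementation (which hooks into OS-Lib in Scala); the function names and the `roots` list are hypothetical, and it assumes the workspace root is checked before the home root because a workspace usually lives inside the home folder.

```python
from pathlib import PurePosixPath

def serialize(path, roots):
    """Map an absolute path to a prefixed relative string; leave it as-is if no root matches."""
    p = PurePosixPath(path)
    for prefix, root in roots:
        try:
            rel = p.relative_to(root)
        except ValueError:
            continue  # path is not under this root, try the next one
        return str(PurePosixPath(prefix) / rel)
    return str(p)

def deserialize(text, roots):
    """Inverse of serialize: map a prefixed relative string back to an absolute path."""
    p = PurePosixPath(text)
    for prefix, root in roots:
        try:
            rel = p.relative_to(prefix)
        except ValueError:
            continue
        return str(PurePosixPath(root) / rel)
    return str(p)

# Hypothetical roots for two machines; the most specific root comes first.
alice = [("out/mill-workspace", "/Users/lihaoyi/mill"), ("out/mill-home", "/Users/lihaoyi")]
bob = [("out/mill-workspace", "/Users/someone-else/mill"), ("out/mill-home", "/Users/someone-else")]

a = serialize("/Users/lihaoyi/mill/foo/bar/baz", alice)
b = serialize("/Users/someone-else/mill/foo/bar/baz", bob)
```

Under these assumptions both machines produce the same serialized string, and either machine can map it back to its own absolute location.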

This is similar to the approach taken by Bazel's Symlink Sandbox, which generates local symlinks in the working directory of any subprocess that Bazel spawns.

Implementation Details

  1. We hook into OS-Lib to control the serialization of os.Path, allowing it to be written out as a prefixed relative path and read back in from a prefixed relative path.

  2. We also hook into OS-Lib to create the out/mill-home and out/mill-workspace symlinks every time you spawn a process, in that process' working directory
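
To illustrate the second hook, here is a rough Python sketch of creating the symlinks in a child's working directory just before spawning it. It is not the OS-Lib hook itself: `spawn_with_links` is a hypothetical helper, the directories are temporary stand-ins, and it assumes a platform where unprivileged symlink creation works.

```python
import os
import subprocess
import sys
import tempfile

def spawn_with_links(cmd, cwd, links):
    """Create each symlink under cwd (if missing), then run cmd with cwd as its working directory."""
    for rel, target in links.items():
        link = os.path.join(cwd, rel)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.lexists(link):
            os.symlink(target, link, target_is_directory=True)
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True)

# Stand-ins for the real workspace and for some task's working directory.
workspace = tempfile.mkdtemp()
sandbox = tempfile.mkdtemp()
os.makedirs(os.path.join(workspace, "foo"))
with open(os.path.join(workspace, "foo", "bar.txt"), "w") as f:
    f.write("hello")

# The child only ever sees the relative, prefixed path.
child = [sys.executable, "-c", "print(open('out/mill-workspace/foo/bar.txt').read())"]
result = spawn_with_links(child, sandbox, {"out/mill-workspace": workspace})
```

The child process resolves the relative path through the symlink without knowing where the workspace really lives.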

Limitations

This approach to making Mill's paths reproducible suffers from many of the same weaknesses that Bazel has. In general, the symlinks work 99% of the time, but once in a while something can go wrong:

  1. The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by java.lang.ProcessBuilder, which may include subprocesses spawned by third-party libraries we use

  2. Any transitive subprocesses spawned by the subprocesses that Mill creates are out of our control, and thus may not have the proper symlinks set up if the working directory differs from their parent

  3. The symlinks are not transparent: user code will be able to see that the mill-workspace folder is a symlink, and some code may not behave correctly when traversing symlinks

  4. Some code demands absolute paths and provides no alternative, e.g. the native-image binary for generating Graal executables.

  5. Any subprocesses using OS-Lib to deserialize the paths (e.g. our own) need to explicitly use the os.Path(_, os.pwd) constructor so that they can handle relative paths and resolve them from the current working directory.
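
As one concrete instance of limitation 3, many directory walkers do not follow symlinks by default, so code that scans a folder can silently miss everything behind out/mill-workspace. The sketch below uses Python's `os.walk` purely as an example of such a walker; the directory layout and the `list_files` helper are hypothetical.

```python
import os
import tempfile

# A stand-in workspace containing one file, and a sandbox that links to it.
workspace = tempfile.mkdtemp()
with open(os.path.join(workspace, "build.txt"), "w") as f:
    f.write("x")
sandbox = tempfile.mkdtemp()
os.makedirs(os.path.join(sandbox, "out"))
os.symlink(workspace, os.path.join(sandbox, "out", "mill-workspace"))

def list_files(root, follow):
    """Return all files under root, relative to root, optionally following symlinks."""
    found = []
    for dirpath, _dirs, files in os.walk(root, followlinks=follow):
        for name in files:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)

default_view = list_files(sandbox, follow=False)   # os.walk's default behaviour
following_view = list_files(sandbox, follow=True)
```

With the default setting the walker never descends into the symlinked folder, so the two views differ even though the same files are reachable by path.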

Notes

There are some non-path-related changes in this PR:

  1. The Zinc incremental compiler has its own ReadWriteMapper mechanism for customizing serialization of paths, so we hook into that in ZincWorkerImpl

  2. We tweak the valueHash computation in GroupExecution to take the hash of the serialized JSON rather than of the original JVM object. Different os.Paths with different hashes may serialize to the same relative path after the working directory has been substituted, so we need to hash the JSON to get a stable hash.
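
The reasoning behind the valueHash change can be shown with a small Python sketch. It is only an illustration: SHA-256, the JSON shape, and the helper names are assumptions and are not what GroupExecution actually uses.

```python
import hashlib
import json
import posixpath

def serialize_result(path, workspace):
    """Serialize a task result, writing its path relative to the workspace root."""
    rel = posixpath.relpath(path, workspace)
    return json.dumps({"path": "out/mill-workspace/" + rel}, sort_keys=True)

def digest(text):
    """Hash a string; a stand-in for whatever hash the build tool really uses."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The same task output as seen from two different checkouts.
path_a, ws_a = "/Users/lihaoyi/mill/out/foo/compile.dest", "/Users/lihaoyi/mill"
path_b, ws_b = "/Users/someone-else/mill/out/foo/compile.dest", "/Users/someone-else/mill"

# Hashing the in-memory value (here, the absolute path) is machine-specific.
hash_of_object_a = digest(path_a)
hash_of_object_b = digest(path_b)

# Hashing the serialized form is stable across machines.
hash_of_json_a = digest(serialize_result(path_a, ws_a))
hash_of_json_b = digest(serialize_result(path_b, ws_b))
```

Only the hash taken after serialization agrees between the two checkouts, which is what a shared cache key needs.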

@lihaoyi
Member Author

lihaoyi commented Mar 4, 2025

This is getting pretty close to getting ./mill -w 'integration.feature[reproducibility].local.server' mill.integration.ReproducibilityTests.diff to pass. The current and seemingly last blocker is sbt/zinc#1540, where Zinc's analysis files have some inherent non-determinism, but everything else seems to be deterministic.

@lihaoyi lihaoyi changed the title from "Reproducible 2" to "2nd attempt at Reproducible out folder contents" on Mar 4, 2025
@roman-mibex-2

The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by java.lang.ProcessBuilder, which may include sub-processes spawned by third-party libraries we use

One random idea I had once, but never had enough time to explore: use a java.nio.file.FileSystem to provide virtual paths. The idea being that:

  • Generic Java libraries would go through our 'Mill' path resolution
  • .toStrings would do something like what you describe here, with sym-links to out/home etc., for everything that might end up in an external process.

However, that is a way larger API and I'm not sure if it gives more control in reality.

I guess if ProcessBuilder becomes a common issue deep inside other libraries, we could still think of adding a Java Agent to 'fix' paths.

@lihaoyi
Member Author

lihaoyi commented Mar 8, 2025

@roman-mibex-2 that could work, but one big issue is we need to support passing absolute paths to third-party subprocesses as well, and those are entirely out of the JVM's control. So we do need some kind of OS/FS-level handling to make those relative paths work

Using a Java agent to fix paths passed to third-party subprocesses won't work, because the subprocesses get a mix of files-on-disk, command-line strings, and environment variable strings, any of which could have the paths embedded within them. It's impossible to look at a string or file and identify if it is wholly or partially composed of absolute paths, e.g. how would we look at a gzipped messagepack blob on disk and fix up the paths there?
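
A small Python sketch of why byte-level fix-ups fail on opaque blobs. It substitutes gzipped JSON for the gzipped messagepack mentioned above to stay within the standard library, and `naive_fix` is a hypothetical stand-in for whatever an agent would do.

```python
import gzip
import json

workspace = "/Users/lihaoyi/mill"
payload = json.dumps(
    {"classpath": [workspace + "/out/foo/compile.dest/classes"] * 3}
).encode("utf-8")
blob = gzip.compress(payload)

def naive_fix(data, old, new):
    """Byte-level search and replace, as an external fixer might attempt."""
    return data.replace(old.encode("utf-8"), new.encode("utf-8"))

# The path is plainly visible in the raw payload but not in the compressed bytes,
# so the replacement finds nothing to rewrite.
fixed_blob = naive_fix(blob, workspace, "out/mill-workspace")
```

The absolute path is still inside the blob once decompressed, but a fixer that does not understand the container format has no way to find it.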

One generalization that could make your idea work is to use a FUSE filesystem to redirect the paths at the filesystem level. Bazel offers this IIRC on Linux (https://bazel.build/versions/7.5.0/docs/sandboxing#sandboxfs), not sure how hard it would be to port over to Mill. Running such a FUSE filesystem across different Mac/Linux/Windows environments would also be a challenge (e.g. needing kernel drivers or sudo).

@ajaychandran
Contributor

ajaychandran commented Mar 16, 2025

Maybe this is helpful.
Turborepo keeps track of environment variables used by tasks. This seems central to their remote caching feature.

@ajaychandran
Contributor

ajaychandran commented Mar 16, 2025

Another idea is to String.replace the serialized task values with "known things".

@lihaoyi lihaoyi force-pushed the main branch 2 times, most recently from 1d3b959 to 3ba698a on July 10, 2025 04:00
@lefou lefou added the "later" label (The issue is still relevant, but has no high priority right now) on Oct 30, 2025
@lihaoyi lihaoyi closed this Nov 2, 2025
@lihaoyi lihaoyi reopened this Feb 6, 2026
@lihaoyi
Member Author

lihaoyi commented Feb 6, 2026

Rebased this on top of latest main branch, needs sbt/zinc#1638 for the reproducibility test to pass

@lefou
Member

lefou commented Feb 6, 2026

I don't think using sym-links is what we want.

  1. It assumes a hard-coded out/ dir, which we don't want and no longer have. In fact, the out/ dir is configurable today, which is important for building projects from a read-only source directory.

  2. It depends on the state of the executing platform, and it's really hard to debug issues caused by someone or something mangling the sym-links.

  3. Since we need to convert paths to sym-link-paths, but the result is itself a valid path, there is much room for misinterpretation and error. What if users use the sym-links in input, that are supposed to be converted to a sym-link-path again? What if sym-linked paths contain a route to other sym-links?

I think the approach to split the problem into a) path mapping and b) paths in configuration data is superior to implicitly persisting via sym-links. I have two draft PRs (that were functional at multiple time points but need rebasing now) that demonstrate the concept and worked. #6031 is for the path mapping and #6129 for the configuration data. Since the latter is using a data structure to hold paths, there is no room for misinterpretation or malicious injection.

One lesson was that we don't need to use the path-mapping when spawning other processes. Or, more generally, we don't need the path-mapping at build-time¹, but only for serializing the cached results, since that's the only thing we want to share with other Mill processes over space and time. The sym-links try to solve something that's probably not needed at all.


¹ Some tools whose output we store directly need to know about it, e.g. Zinc for its incremental state, but there is typically a mechanism to handle that.

@lefou
Member

lefou commented Feb 6, 2026

Another idea is to String.replace the serialized task values with "known things".

We never want to do this on unstructured text. We need to be 100% certain the string to replace is the path we want to mangle.
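
A short Python sketch of the failure mode being described, contrasted with rewriting only fields that are known to hold paths. The `<workspace>` placeholder, the field names, and the helpers are all hypothetical and are not taken from #6031 or #6129.

```python
from pathlib import PurePosixPath

workspace = "/home/user/app"
doc = {
    "src": "/home/user/app/src",
    "note": "backup lives in /home/user/app-old/src",
}

# Naive approach: replace the workspace prefix anywhere it appears in any string.
naive = {k: v.replace(workspace, "<workspace>") for k, v in doc.items()}

def map_path(path, root):
    """Rewrite path only if it is really located under root, comparing whole path segments."""
    try:
        rel = PurePosixPath(path).relative_to(root)
    except ValueError:
        return path
    return str(PurePosixPath("<workspace>") / rel)

def rewrite_structured(data, path_fields, root):
    """Apply map_path only to the fields declared to contain paths."""
    return {k: map_path(v, root) if k in path_fields else v for k, v in data.items()}

structured = rewrite_structured(doc, {"src"}, workspace)
```

The naive replace also rewrites the unrelated sibling directory /home/user/app-old inside free text, while the structured version touches only the declared path field.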

eed3si9n pushed a commit to sbt/zinc that referenced this pull request Feb 6, 2026
Previously, Zinc could produce nondeterministic analysis output because `binaryClassName` used `put(binary, className)` from concurrent `externalLibraryDependency` callbacks. `binaryClassName` stores one representative class per binary JAR in a concurrent map, and the final value depended on callback scheduling, so representative selection was non-deterministic.

This PR replaces `put` with deterministic merge using `updateWith`, choosing the lexicographically smallest class name for each binary. `libraryClassName` only needs a stable representative class per binary for lookup/stamp checks; selecting a deterministic representative preserves behavior while removing scheduler dependence. This removes one source of nondeterminism in Zinc analysis serialization and downstream build cache hashes, which was blocking attempts at reproducible builds in Mill (e.g. com-lihaoyi/mill#4642)
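
The difference between the two strategies can be sketched in Python, simulating callback scheduling as different arrival orders. This is not the Zinc code: the function names are hypothetical and a plain dict stands in for the concurrent map.

```python
def record_last_write(callbacks):
    """Old behaviour: a plain put, so the last callback to arrive wins."""
    table = {}
    for binary, class_name in callbacks:
        table[binary] = class_name
    return table

def record_smallest(callbacks):
    """New behaviour: merge by keeping the lexicographically smallest class name."""
    table = {}
    for binary, class_name in callbacks:
        current = table.get(binary)
        table[binary] = class_name if current is None else min(current, class_name)
    return table

# The same two callbacks arriving in two different orders.
order_1 = [("lib.jar", "a.Foo"), ("lib.jar", "a.Bar")]
order_2 = list(reversed(order_1))
```

With last-write-wins the stored representative depends on arrival order; with the smallest-wins merge both orders yield the same table.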
@lihaoyi
Member Author

lihaoyi commented Feb 8, 2026

All of the problems you bring up are true, but we know that symlinked builds basically work despite those problems since that's the approach used by Bazel. Bazel does suffer all the problems, but it works well enough to be useful.

This is an extremely messy problem space, so it's basically impossible to make it work seamlessly 100% of the time. But I'm less confident that a bespoke design will cover everything we need, vs. Bazel's battle-tested approach with known flaws and shortcomings.

@lefou
Member

lefou commented Feb 8, 2026

I simply can't hear the "Bazel does it, so it's blessed" argument anymore. All Mill tries is to do stuff better than other tools, if we see some potential room for improvement.
The JVM world has enough build tools already. If we don't improve over what these already provide, there is no need for Mill. I have yet to find a single project driven by Bazel that's easy to use and understand. That can't be the yardstick.

@lihaoyi
Member Author

lihaoyi commented Feb 8, 2026

Bazel has a lot of issues, but in many areas it is AFAIK the state of the art, and one of those areas is the space of reproducible/relocatable builds.

If there are better designs we can compare it against I'm all ears, but from what I've seen the approaches taken by other tools such as SBT/Gradle/etc. are generally much messier, have much higher API surface area, and are easier for users and plugins to screw up.

Labels

later: The issue is still relevant, but has no high priority right now
run-all-tests: Disables selective test execution on this PR and just runs all tests

4 participants