2nd attempt at Reproducible out folder contents #4642

lihaoyi · 2025-03-04T02:41:29Z

The biggest challenge with making the out/ folder reproducible is removing absolute paths. This is hard for several reasons:

Paths are everywhere, sometimes as os.Paths, sometimes in JSON blobs written to disk, sometimes in strings like "-Xplugin=...".
These paths will then get interpreted by in-memory libraries (e.g. Zinc), Scala subprocesses we control (e.g. mill.testrunner.TestRunnerMain), or subprocesses we do not control (e.g. native-image).
These subprocesses may run in a variety of different working directories: some in the workspace root, some in a .dest folder, or elsewhere.
The same path may be passed to different subprocesses, written in different languages, running in different working directories, and should behave the same way.

All of the above works fine if we are using absolute paths: an absolute path means the same thing regardless of how it is serialized or who it is passed to. That is the reason OS-Lib uses absolute os.Paths by default. However, absolute paths are not reproducible: someone running code in the folder /Users/lihaoyi/mill or /Users/someone-else/mill will have different absolute paths, and would thus be unable to share caches keyed on the hashes of those paths.

Serializing Absolute Paths to Relative Paths

In order to make this work, we need to serialize absolute paths as relative paths. As we do not in general know where a serialized path is going to end up being used, we need to serialize absolute paths as relative paths that reference the same final destination regardless of the cwd of the process it is passed to. We can do this as follows:

Serialize all absolute paths relative to some stable root folder, e.g. anything in os.home / "foo/bar/baz" gets serialized as "out/mill-home/foo/bar/baz", anything in Task.workspace / "foo/bar/baz" gets serialized as "out/mill-workspace/foo/bar/baz"
Whenever we spawn external processes, we synthesize out/mill-home and out/mill-workspace symlinks that point to their respective destinations os.home and Task.workspace
Any subprocess that reads in these paths and then tries to dereference "out/mill-home/foo/bar/baz" or "out/mill-workspace/foo/bar/baz", regardless of language, will follow the filesystem symlink and end up reading from the right place on disk. (The notable exception is Scala subprocesses using OS-Lib, which is stricter than most platforms about deserializing relative paths as absolute paths.)

This is similar to the approach taken by Bazel's Symlink Sandbox, which generates local symlinks in the working directory of any subprocess that Bazel spawns.
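
The serialization scheme described above can be sketched roughly as follows. This is only an illustration using java.nio.file rather than OS-Lib; the object name PathMapping and the serialize/deserialize helpers are hypothetical and are not the code in this PR. Only the "out/mill-workspace" and "out/mill-home" prefixes come from the description.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper, not the PR's implementation.
object PathMapping {
  val WorkspacePrefix = "out/mill-workspace"
  val HomePrefix = "out/mill-home"

  /** Serialize an absolute path relative to the first known root that contains it. */
  def serialize(p: Path, workspace: Path, home: Path): String = {
    val abs = p.toAbsolutePath.normalize()
    // The workspace usually lives inside the home folder, so it is checked first.
    val roots = Seq(WorkspacePrefix -> workspace, HomePrefix -> home)
    roots.collectFirst {
      case (prefix, root) if abs.startsWith(root) =>
        val rel = root.relativize(abs).toString.replace('\\', '/')
        if (rel.isEmpty) prefix else s"$prefix/$rel"
    }.getOrElse(abs.toString) // outside all known roots: left as an absolute path
  }

  /** Inverse of `serialize`: map a prefixed relative string back to an absolute path. */
  def deserialize(s: String, workspace: Path, home: Path): Path = {
    def under(prefix: String, root: Path): Option[Path] =
      if (s == prefix) Some(root)
      else if (s.startsWith(prefix + "/")) Some(root.resolve(s.drop(prefix.length + 1)))
      else None
    under(WorkspacePrefix, workspace)
      .orElse(under(HomePrefix, home))
      .getOrElse(Paths.get(s))
  }
}
```

With a workspace of /Users/lihaoyi/mill or /Users/someone-else/mill, a file at foo/bar/baz inside the workspace serializes to the same string in both cases, which is what lets the serialized form be shared between machines.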

Implementation Details

We hook into OS-Lib to control the serialization of os.Path, allowing it to be written out as prefixed relative paths, and read back in as prefixed relative paths.
We also hook into OS-Lib to create the out/mill-home and out/mill-workspace symlinks every time you spawn a process, in that process' working directory
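
As a rough sketch of the second point, the symlinks could be created in a process' working directory along these lines. This uses plain java.nio.file rather than the OS-Lib hook the PR actually relies on; SymlinkSetup and ensureLinks are hypothetical names, and the sketch assumes a platform where the current user is allowed to create symlinks.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not the PR's OS-Lib hook.
object SymlinkSetup {
  /** Make `cwd/out/mill-workspace` and `cwd/out/mill-home` point at the given roots. */
  def ensureLinks(cwd: Path, workspace: Path, home: Path): Unit = {
    val outDir = cwd.resolve("out")
    Files.createDirectories(outDir)
    for ((name, target) <- Seq("mill-workspace" -> workspace, "mill-home" -> home)) {
      val link = outDir.resolve(name)
      val upToDate =
        Files.isSymbolicLink(link) && Files.readSymbolicLink(link) == target
      if (!upToDate) {
        Files.deleteIfExists(link) // removes a stale link, not the link's target
        Files.createSymbolicLink(link, target)
      }
    }
  }
}
```

Calling ensureLinks(cwd, workspace, home) just before spawning a subprocess in cwd means a relative path such as out/mill-workspace/foo/bar/baz resolves through the link, whichever language the subprocess is written in.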

Limitations

This approach to making Mill's paths reproducible suffers from many of the same weaknesses that Bazel has. In general, the symlinks work 99% of the time, but once in a while something can go wrong:

The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by java.lang.ProcessBuilder, which may include subprocesses spawned by third-party libraries we use
Any transitive subprocesses spawned by the subprocesses that Mill creates are out of our control, and thus may not have the proper symlinks set up if the working directory differs from their parent
The symlinks are not transparent: user code will be able to see that the mill-workspace folder is a symlink, and some code may not behave correctly when traversing symlinks
Some code demands absolute paths and provides no alternative, e.g. the native-image binary for generating Graal executables.
Any subprocesses using OS-Lib to deserialize the paths (e.g. our own) need to explicitly use the os.Path(_, os.pwd) constructor to allow it to handle relative paths and resolve them from the current working directory.

Notes

There are some non-path-related changes in this PR:

The Zinc incremental compiler has its own ReadWriteMapper mechanism for customizing serialization of paths, so we hook into that in ZincWorkerImpl
- There remains some non-determinism in Zinc that is outside our control (see "Nondeterministic behavior when running zinc multiple times in different folders", sbt/zinc#1540), but it looks like it should be fixable
We tweak the valueHash computation in GroupExecution to take the hash of the serialized JSON, rather than of the original JVM object. Different os.Paths with different hashes may serialize to the same relative path after the working directory has been substituted, so we need to hash the JSON to make sure we get a stable hash
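
To illustrate why the valueHash change matters, here is a minimal sketch, not the GroupExecution code itself. StableHash, its serialize stand-in, and the choice of SHA-256 are all assumptions made for the example; the point is only that hashing the serialized form gives the same result for two checkouts in different folders.

```scala
import java.nio.file.Path
import java.security.MessageDigest

// Hypothetical illustration; names and hash algorithm are assumptions.
object StableHash {
  /** Stand-in for the JSON serialization of a path inside the workspace. */
  def serialize(p: Path, workspace: Path): String = {
    val rel = workspace.relativize(p).toString.replace('\\', '/')
    "\"out/mill-workspace/" + rel + "\""
  }

  /** Hash of the serialized text rather than of the in-memory object. */
  def sha256Hex(s: String): String =
    MessageDigest
      .getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
}
```

Two paths such as /Users/lihaoyi/mill/foo/bar and /Users/someone-else/mill/foo/bar are different JVM objects, but both serialize to "out/mill-workspace/foo/bar" and therefore produce the same sha256Hex value.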

lihaoyi · 2025-03-04T07:00:42Z

This is getting pretty close to getting ./mill -w 'integration.feature[reproducibility].local.server' mill.integration.ReproducibilityTests.diff to pass. The current and seemingly last blocker is sbt/zinc#1540, where Zinc's analysis files have some inherent non-determinism, but everything else seems to be deterministic.

roman-mibex-2 · 2025-03-07T21:47:28Z

> The symlinks can be set up for any subprocess Mill creates via OS-Lib, but we cannot instrument subprocesses created by java.lang.ProcessBuilder, which may include sub-processes spawned by third-party libraries we use

One random idea I had once, but never had enough time to explore: use a java.nio.file FileSystem to provide virtual paths. With the idea that:

Generic Java libraries would go through our 'Mill' path resolution
.toStrings would do something like you describe here, with sym-links to out/home, etc: For everything that might end up in an external process.
However, that is a way larger API and I'm not sure if it gives more control in reality.

I guess, if the ProcessBuilder becomes a common issue deep inside other libraries: We still could think of adding a Java Agent to 'fix' paths.

lihaoyi · 2025-03-08T00:30:56Z

@roman-mibex-2 that could work, but one big issue is we need to support passing absolute paths to third-party subprocesses as well, and those are entirely out of the JVM's control. So we do need some kind of OS/FS-level handling to make those relative paths work

Using a Java agent to fix paths passed to third-party subprocesses won't work, because the subprocesses get a mix of files-on-disk, command-line strings, and environment variable strings, any of which could have the paths embedded within them. It's impossible to look at a string or file and identify if it is wholly or partially composed of absolute paths, e.g. how would we look at a gzipped messagepack blob on disk and fix up the paths there?

One generalization that could make your idea work is to use a FUSE filesystem to redirect the paths at the filesystem level. Bazel offers this IIRC on linux (https://bazel.build/versions/7.5.0/docs/sandboxing#sandboxfs), not sure how hard it would be to port over to Mill. Running such a FUSE filesystem across different Mac/Linux/Windows environments would also be a challenge (e.g. needing kernel drivers or sudo)

ajaychandran · 2025-03-16T02:42:48Z

Maybe this is helpful.
Turborepo keeps track of environment variables used by tasks. This seems central to their remote caching feature.

ajaychandran · 2025-03-16T02:57:21Z

Another idea is to String.replace the serialized task values with "known things".

lihaoyi · 2026-02-06T04:27:36Z

Rebased this on top of latest main branch, needs sbt/zinc#1638 for the reproducibility test to pass

lefou · 2026-02-06T10:06:08Z

I don't think using sym-links is what we want.

It assumes a hard-coded out/ dir, which we don't want and don't have any longer. In fact, we can configure the out/ today, and it's an important thing to build projects from a read-only source directory.
It's dependent on the state of the executing platform and it's really hard to debug issues caused by someone/something mangling the sym-links.
Since we need to convert paths to sym-link-paths, but the result is itself a valid path, there is much room for misinterpretation and error. What if users use the sym-links in input, that are supposed to be converted to a sym-link-path again? What if sym-linked paths contain a route to other sym-links?

I think the approach to split the problem into a) path mapping and b) paths in configuration data is superior to implicitly persisting via sym-links. I have two draft PRs (that were functional at multiple time points but need rebasing now) that demonstrate the concept and worked. #6031 is for the path mapping and #6129 for the configuration data. Since the latter is using a data structure to hold paths, there is no room for misinterpretation or malicious injection.

One learning outcome was that we don't need to use the path-mapping when spawning other processes. Or more generally, we don't need the path-mapping at build-time¹, but only for serializing the cached results, since that's the only thing we want to share with other Mill processes over space and time. The sym-links try to solve something that's probably not needed at all.

¹ Some tools whose output we store directly need to know it, e.g. Zinc for the incremental state, but there is typically a mechanism to handle that.

lefou · 2026-02-06T10:10:26Z

> Another idea is to String.replace the serialized task values with "known things".

We never want to do this on unstructured text. We need to be 100% certain the string to replace is the path we want to mangle.
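
A small sketch of the hazard: the helper below is hypothetical and deliberately naive, and it shows how a plain String.replace on unstructured text can rewrite something that is not the workspace path at all.

```scala
// Hypothetical, deliberately naive helper to illustrate the problem.
object NaiveReplace {
  /** Replace every textual occurrence of the workspace path with a placeholder. */
  def mangle(text: String, workspace: String): String =
    text.replace(workspace, "$WORKSPACE")
}
```

With workspace = "/home/u/app", the unrelated sibling folder in "/home/u/app-backup/x.jar" becomes "$WORKSPACE-backup/x.jar", which would resolve to a different location once the placeholder is substituted back on another machine.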

Previously, Zinc could produce nondeterministic analysis output because `binaryClassName` used `put(binary, className)` from concurrent `externalLibraryDependency` callbacks. `binaryClassName` stores one representative class per binary JAR in a concurrent map, and the final value depended on callback scheduling, so representative selection was non-deterministic. This PR replaces `put` with a deterministic merge using `updateWith`, choosing the lexicographically smallest class name for each binary. `libraryClassName` only needs a stable representative class per binary for lookup/stamp checks; selecting a deterministic representative preserves behavior while removing scheduler dependence. This removes one source of nondeterminism in Zinc analysis serialization and downstream build cache hashes, which was blocking attempts at reproducible builds in Mill (e.g. com-lihaoyi/mill#4642).
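
The kind of deterministic merge described here can be sketched as below. This is a simplified illustration, not Zinc's code: RepresentativeClasses is a hypothetical class, and plain String keys stand in for whatever key type Zinc actually uses for binaries. It relies on `updateWith` from scala.collection.concurrent.Map (Scala 2.13+).

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical, simplified stand-in for the map described above.
final class RepresentativeClasses {
  private val binaryClassName = TrieMap.empty[String, String]

  /** Keep the lexicographically smallest class name seen so far for each binary. */
  def record(binary: String, className: String): Unit = {
    binaryClassName.updateWith(binary) {
      case Some(existing) if existing.compareTo(className) <= 0 => Some(existing)
      case _ => Some(className)
    }
    ()
  }

  def snapshot: Map[String, String] = binaryClassName.toMap
}
```

Because the smaller of two names is the same regardless of which one arrives first, the final map no longer depends on the order in which callbacks run, unlike a plain `put` where the last writer wins.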

lihaoyi · 2026-02-08T00:52:46Z

All of the problems you bring up are true, but we know that symlinked builds basically work despite those problems since that's the approach used by Bazel. Bazel does suffer all the problems, but it works well enough to be useful.

This is an extremely messy problem space, so it's basically impossible to make it work seamlessly 100% of the time. But I'm less confident that a bespoke design will cover everything we need, vs. Bazel's battle-tested approach with known flaws and shortcomings

lefou · 2026-02-08T10:20:40Z

I simply can't hear the "Bazel does it, so it's blessed" argument anymore. All Mill tries is to do stuff better than other tools, if we see some potential room for improvement.
The JVM world has enough build tools already. If we don't improve over what these already provide, there is no need for Mill. I still have to find a single project driven by Bazel that's easy to use and understand. That can't be the measuring stick.

lihaoyi · 2026-02-08T11:04:27Z

Bazel has a lot of issues, but in many areas it is AFAIK the state of the art, and one of those areas is the space of reproducible/relocatable builds.

If there are better designs we can compare it against I'm all ears, but from what I've seen the approaches taken by other tools such as SBT/Gradle/etc. are generally much messier, have much higher API surface area, and are easier for users and plugins to screw up.

lihaoyi force-pushed the reproducible-2 branch from bd247af to 6398ba3 Compare March 4, 2025 06:29

lihaoyi mentioned this pull request Mar 4, 2025

Nondeterministic behavior when running zinc multiple times in different folders sbt/zinc#1540

Open

lihaoyi force-pushed the reproducible-2 branch from 85f2113 to f36c476 Compare March 4, 2025 06:53

lihaoyi changed the title ~~Reproducible 2~~ 2nd attempt at Reproducible out folder contents Mar 4, 2025

lihaoyi force-pushed the reproducible-2 branch from 61d1536 to db97983 Compare March 4, 2025 07:48

lihaoyi force-pushed the main branch 2 times, most recently from 1d3b959 to 3ba698a Compare July 10, 2025 04:00

lihaoyi mentioned this pull request Oct 29, 2025

Dynamically map known root paths in cache files #6031

Closed

lefou added the "later" label (The issue is still relevant, but has now high priority right now) Oct 30, 2025

lihaoyi closed this Nov 2, 2025

lihaoyi added 2 commits February 6, 2026 10:59

merged

3208a7c

merged

eb45996

lihaoyi mentioned this pull request Feb 6, 2026

Make libraryClassName relation deterministic under concurrency sbt/zinc#1638

Merged

merged

0c15f18

lihaoyi reopened this Feb 6, 2026

lihaoyi force-pushed the reproducible-2 branch from e218f04 to 0c15f18 Compare February 6, 2026 04:18

lihaoyi added 2 commits February 9, 2026 21:57

.

dcaa70a

.

ad5783e

lihaoyi added 2 commits February 10, 2026 10:32

.

81b8539

.

f6ba31d

lihaoyi force-pushed the reproducible-2 branch from d878516 to f6ba31d Compare February 10, 2026 03:52

lihaoyi added 3 commits February 10, 2026 19:24

.

0a5fd51

.

94e1ae7

.

758de9f

lihaoyi force-pushed the reproducible-2 branch from 6ccbf27 to 758de9f Compare February 10, 2026 22:36

lihaoyi added 5 commits February 11, 2026 06:37

.

a26809a

.

e789349

.

5ea9639

.

1456261

.

a89c243

lihaoyi force-pushed the reproducible-2 branch from 5659fc5 to a89c243 Compare February 10, 2026 23:27

lihaoyi added 2 commits February 11, 2026 08:01

.

bd65c3d

.

365a4b0

lihaoyi force-pushed the reproducible-2 branch from 1550fc5 to 365a4b0 Compare February 11, 2026 02:25

lihaoyi added 2 commits February 11, 2026 11:09

.

4801da3

.

7a5de49

lihaoyi force-pushed the reproducible-2 branch from 52df04d to 7a5de49 Compare February 11, 2026 03:25

lihaoyi added 2 commits February 11, 2026 16:43

.

50df994

.

67dd65b

lihaoyi force-pushed the reproducible-2 branch from 95b5744 to 67dd65b Compare February 11, 2026 08:44

lihaoyi added 4 commits February 11, 2026 16:48

.

2e9b801

.

67091df

.

8da2402

.

3d47776

lihaoyi force-pushed the reproducible-2 branch from 6df1ea9 to 3d47776 Compare February 11, 2026 09:33

.

4098c0e

lihaoyi force-pushed the reproducible-2 branch from 4c94600 to 4098c0e Compare February 11, 2026 12:33

[autofix.ci] apply automated fixes

81689ba
