arturaz (Collaborator) commented Sep 11, 2025

This PR optimizes and reworks the compilation pipeline with regard to SemanticDB generation.

Pre-PR situation

  • there were separate compile and semanticDbDataDetailed tasks.
  • compile performed the compilation without the SemanticDB plugin.
  • semanticDbDataDetailed performed the compilation with the SemanticDB plugin, but did not reuse compile's output.

As a result, compilation was performed twice in either of these cases:

  • the Mill CLI invoked compile and then semanticDbData.
  • Mill BSP invoked compile and then any other task that required semanticdb, or vice versa.

This wasted CPU cycles and worsened the developer experience.

Post-PR situation

Mill now decides whether compile produces semanticdb data. It is produced if either:

  • compile was directly invoked by a task that needs semanticdb.
  • there is at least one BSP client that requires semanticdb to be produced.

Implementation details

Introduction of MILL_BSP_OUTPUT_DIR

Previously, the MILL_OUTPUT_DIR environment variable set the output directory for both regular Mill and BSP Mill.

Because regular Mill now needs to know the location of BSP Mill's output folder, having a single variable is problematic:

  • you start BSP Mill with MILL_OUTPUT_DIR=out_bsp ./mill --bsp ...
  • you then want to run regular Mill and tell it where BSP Mill's output folder is.
  • but MILL_OUTPUT_DIR=out_bsp ./mill ... also changes regular Mill's own out folder.

Thus MILL_BSP_OUTPUT_DIR is introduced, which allows you to run:

  • MILL_BSP_OUTPUT_DIR=out_bsp ./mill --bsp ...
  • MILL_BSP_OUTPUT_DIR=out_bsp ./mill ... # this still uses the regular out/ folder, but knows where bsp mill out/ folder is located
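To make the split concrete, here is a minimal sketch of how the two variables could be resolved from the environment. The `OutDirs` object, the `Resolved` type and the `defaultBspOut` parameter are hypothetical names for illustration; Mill's real defaults and any fallback between the two variables are not modelled here.

```scala
object OutDirs {
  // Hypothetical holder for the two output folders a Mill process may need to know about.
  final case class Resolved(regularOut: String, bspOut: String)

  // `defaultBspOut` is passed in by the caller because the real default is an assumption here.
  def resolve(env: Map[String, String], defaultBspOut: String): Resolved =
    Resolved(
      // MILL_OUTPUT_DIR keeps controlling the regular out folder ("out" by default).
      regularOut = env.getOrElse("MILL_OUTPUT_DIR", "out"),
      // MILL_BSP_OUTPUT_DIR only says where BSP Mill's out folder is located.
      bspOut = env.getOrElse("MILL_BSP_OUTPUT_DIR", defaultBspOut)
    )
}
```

With only MILL_BSP_OUTPUT_DIR=out_bsp set, this yields the regular out/ folder plus out_bsp as the BSP location, which matches the second command above.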

BuildCtx.bspSemanticDbSessionsFolder

This is the folder where Mill BSP sessions that require semanticdb store an indicator file (the name is the process PID, the contents are irrelevant). The file tells the main Mill daemon and other BSP sessions that at least one Mill session will need semanticdb.

The reasoning is that if at least one of Mill's clients requests semanticdb, there is no point in running a regular compile without it: eventually we would have to rerun the compilation with semanticdb. We should therefore compile with semanticdb upfront to avoid paying the price of compiling twice (first without semanticdb, then with it).
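The indicator-file idea can be sketched roughly as follows. This is not the code from the PR: `SemanticDbSessions` and its methods are hypothetical names, the folder is passed in explicitly, and stale files left behind by crashed processes are not handled.

```scala
import java.nio.file.{Files, Path}

object SemanticDbSessions {
  // A BSP session that needs semanticdb drops an empty file named after its PID.
  def register(sessionsDir: Path, pid: Long): Path = {
    Files.createDirectories(sessionsDir)
    val marker = sessionsDir.resolve(pid.toString)
    if (!Files.exists(marker)) Files.createFile(marker)
    marker
  }

  // Remove the marker when the session ends.
  def unregister(sessionsDir: Path, pid: Long): Unit = {
    Files.deleteIfExists(sessionsDir.resolve(pid.toString))
    ()
  }

  // True if at least one marker file exists in the folder.
  def anySessionNeedsSemanticDb(sessionsDir: Path): Boolean =
    if (!Files.isDirectory(sessionsDir)) false
    else {
      val entries = Files.list(sessionsDir)
      try entries.findAny().isPresent
      finally entries.close()
    }

  // compile produces semanticdb if a task asked for it directly or any BSP session needs it.
  def shouldProduceSemanticDb(requestedByTask: Boolean, sessionsDir: Path): Boolean =
    requestedByTask || anySessionNeedsSemanticDb(sessionsDir)
}
```

A file per PID keeps sessions independent: each one only creates and deletes its own marker, and readers only need to check whether the folder is non-empty.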

CompilationResult.semanticDbFiles

Because we can't change the return type of compile due to binary compatibility, a semanticDbFiles field was added to CompilationResult, and compile fills it in if the compilation happened with semanticdb enabled.
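For readers unfamiliar with the shape of this, a simplified stand-in might look like the sketch below. The case class here is not Mill's actual CompilationResult (field types and the way binary compatibility is preserved are assumptions); it only illustrates how a caller could treat the new field as optional.

```scala
import java.nio.file.Path

object CompilationResultSketch {
  // Simplified stand-in: the optional field is empty when compile ran without semanticdb.
  final case class CompilationResult(
      analysisFile: Path,
      classes: Path,
      semanticDbFiles: Option[Path]
  )

  // Hypothetical helper: callers that need semanticdb get a clear error when it is missing.
  def semanticDbOrError(result: CompilationResult): Either[String, Path] =
    result.semanticDbFiles.toRight("compilation ran without semanticdb enabled")
}
```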

Removal of CompileFor and related tasks

These are not needed anymore with the single compile task.

Removal of separate SemanticDbJavaModule.semanticDbDataDetailed task

Its functionality was merged into compileInternal, which takes a compileSemanticDb parameter.

There was also a lot of code duplication between the compile and semanticDbDataDetailed tasks, for both Java and Scala modules.

Replace the implicit Task.dest usage in ZincWorker with explicit compileTo argument

This makes it clearer what the parameter is used for and allows the same value to be reused in the SemanticDbJavaModule.enhanceCompilationResultWithSemanticDb invocation.

Misc changes

  • Moved JavaModule#resolveRelativeToOut instance method to UnresolvedPath.resolveRelativeToOut.
  • Improved Server to provide better debugging output if the server cannot be launched.
  • testScala212Version updated to 2.12.20 because the new semanticdb plugin is not provided for the ancient 2.12.6 version that was used.

Fixes: #5744

@arturaz arturaz marked this pull request as draft September 11, 2025 10:35
@arturaz arturaz marked this pull request as ready for review October 3, 2025 10:59
lefou (Member) commented Oct 26, 2025

TBH, the current state (of this PR) looks way too complicated and makes too many assumptions for my taste. But I admit, the current codebase (in this context) is already far from being simple.

I'd like to split the problem and provide clear solutions separated from each other.

Issue 1: Bad compilation performance

  • we currently compile too much, since the semanticDbData task duplicates compilation work already done in the compile task.
  • by always compiling with the semanticDB generator enabled, we could optimize the compilation for the BSP use case and would also ensure sync'ed results, but we potentially leak unwanted semanticDB data downstream.

Issue 2: Decide when we need semanticDB data

  • explicit: user enabled semanticDB in the module via scalacOptions - the compile result will contain the semanticDB data files and we should not apply any extra processing
  • implicit: user uses Metals as the IDE - the compile task should not contain any semanticDB data files, as these are considered unwanted results (e.g. they should not appear downstream on classpaths or in jars)
  • no: we don't need semanticDB data at all - we should not generate it

Proposal for Issue 1:

Disclaimer: proposed task names are not final but chosen to make the concept clear

  • Create a new persistent compileWithMaybeSemanticDb task which does the actual compilation and includes semanticDB data if, for some reason, we need it.
  • Let the compile task use the result of compileWithMaybeSemanticDb but filter out semanticDB data, iff it was not explicitly requested by the module configuration, e.g. via scalacOptions.
  • Let the semanticDbData task use the result of compileWithMaybeSemanticDb and filter out any non-semanticDB data.
  • We keep the current concept of well-separated tasks with well-defined results. All downstream users, esp. the BSP client or the mill-scalafix plugin, stay as-is but perform better.
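A rough sketch of the filtering described in these bullets, operating on plain relative paths rather than real Mill task results. The `SplitCompileOutput` object and its method names are hypothetical, and it assumes semanticDB files can be recognised by their .semanticdb extension.

```scala
object SplitCompileOutput {
  // Assumption: semanticDB data files are identified by the ".semanticdb" suffix.
  def isSemanticDb(relPath: String): Boolean = relPath.endsWith(".semanticdb")

  // What compile would expose: everything if semanticDB was explicitly enabled
  // in the module configuration, otherwise the output with semanticDB files filtered out.
  def compileView(all: Seq[String], explicitlyEnabled: Boolean): Seq[String] =
    if (explicitlyEnabled) all else all.filterNot(isSemanticDb)

  // What semanticDbData would expose: only the semanticDB files.
  def semanticDbView(all: Seq[String]): Seq[String] = all.filter(isSemanticDb)
}
```

Both views read from the single compileWithMaybeSemanticDb output, so the compiler runs once no matter which task is requested first.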

Ideas for Issue 2:

  1. Maybe too simple, but we could always generate semanticDB data in compileWithMaybeSemanticDb and just don't use it downstream if nobody is interested in it. This has an overhead of up to 20 percent in case nobody is going to need it. (Very unlikely, but it may also conflict with other compiler plugins and fail the compilation that otherwise would succeed.)

  2. Smart-decision for semanticDB data need. Either project use of the semanticDbData task or any BSP use should permanently enable it. This must be a bullet-proof design, well-documented and users need a way to disable it (opt-out or opt-in).
     
    To detect BSP, we should just use the fact that a Mill-generated .bsp/mill-bsp.json file is present, since this won't require any extra bookkeeping. Users who don't want to use BSP can also safely remove that file. We could also write an extra file next to this location, so we can check its age, for example.
     
    Once the semanticDbData task is used/planned, we may record that fact under the namespace of a dedicated module-specific persistent task semanticDbDataGenerationWanted. The issue is that the semanticDbData task is a downstream dependency of the compileWithMaybeSemanticDb task, so we currently can't know whether the initial value is correct (and factually we always need to guess "enabled"). It would be really cool to have some way to detect early what the user is going to run, e.g. by exposing the current execution plan via the TaskCtx. That way we could have an early persistent task semanticDbDataGenerationWanted that decides based on its persistent state and on whether the semanticDbData task is requested. Then we could conservatively default to disabled, unless there is more evidence.
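To illustrate idea 2, the decision could be sketched like this. Everything here is hypothetical: `SemanticDbWanted`, the `persistedWanted` flag and the `semanticDbDataInPlan` flag stand in for state that Mill does not expose today (in particular, the execution plan is not available via the TaskCtx).

```scala
import java.nio.file.{Files, Path}

object SemanticDbWanted {
  // BSP is considered in use if the Mill-generated connection file is present.
  def bspConfigured(workspace: Path): Boolean =
    Files.isRegularFile(workspace.resolve(".bsp").resolve("mill-bsp.json"))

  // Conservative default: disabled unless BSP is configured, the flag was persisted
  // by an earlier run, or semanticDbData is part of the current execution plan.
  def wanted(workspace: Path, persistedWanted: Boolean, semanticDbDataInPlan: Boolean): Boolean =
    bspConfigured(workspace) || persistedWanted || semanticDbDataInPlan
}
```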


Development

Successfully merging this pull request may close these issues.

bspBuildTargetScalaMainClasses and bspBuildTargetScalaTestClasses depend on compile rather than semanticDbDataDetailed
