@david-christiansen
Contributor

david-christiansen commented Jan 19, 2026

This PR adds a SQLite database that contains all of the documentation info. The plan is that Verso can use this database rather than consulting the environment for docstrings, and that it will enable other tools to do useful things that we haven't yet anticipated.

This is a work in progress - I'm opening the PR so I can run Radar on it and see what the present overhead of building the DB is.
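As a rough illustration of the kind of query another tool might run against such a database, here is a minimal Python sketch. The table and column names (`declarations`, `name`, `module`, `docstring`) and the sample rows are hypothetical; the schema this PR actually creates is not shown in this thread.

```python
import sqlite3


def find_docstrings(conn, module):
    """Return (name, docstring) pairs for one module, sorted by name."""
    rows = conn.execute(
        "SELECT name, docstring FROM declarations WHERE module = ? ORDER BY name",
        (module,),
    )
    return rows.fetchall()


# Hypothetical schema and sample data, built in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE declarations ("
    "name TEXT PRIMARY KEY, module TEXT NOT NULL, docstring TEXT)"
)
conn.executemany(
    "INSERT INTO declarations VALUES (?, ?, ?)",
    [
        ("Demo.foo", "Demo.Basic", "Sample docstring for foo."),
        ("Demo.bar", "Demo.Basic", "Sample docstring for bar."),
        ("Demo.baz", "Demo.Extra", None),  # a declaration without a docstring
    ],
)

docs = find_docstrings(conn, "Demo.Basic")
```

A consumer such as Verso would presumably open the database file on disk rather than `:memory:`, but the query pattern would be similar.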

Remaining work:

  • Generate a SQLite database from all of doc-gen's data
  • Ensure that the overhead is acceptable when generating Mathlib docs
  • Generate documentation from DB
  • Verify that DB-generated HTML is close enough to the current HTML
  • Disable legacy HTML generation and organize Lake facets accordingly
  • Check performance again

Comments:

To check that HTML generated from the database is close enough to the current output, I wrote two Python scripts that check essentially the same properties in different ways. They check that the generated HTML is the same, modulo the following differences:

  1. Repeated imports in the import list may be deduplicated in the new HTML
  2. Links to nonexistent anchors may be removed or replaced in the new HTML
  3. Links to things that exist in the current HTML must be links in the new HTML, but they can point at something different (e.g. if a name resolves to something else in a more complete context, that's OK - there was exactly one instance of this in Lean itself)
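The three allowances above can be sketched roughly as follows. This is not one of the actual scripts, only a minimal illustration under simplifying assumptions: links are identified by their visible text, and the caller supplies the set of link targets that really exist (`valid_targets`). All function and class names here are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) for every <a href=...> element, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def imports_equivalent(old_imports, new_imports):
    """Rule 1: the lists must agree once repeated entries are dropped."""
    return list(dict.fromkeys(old_imports)) == list(dict.fromkeys(new_imports))


def links_compatible(old_html, new_html, valid_targets):
    """Rules 2 and 3: an old link whose target exists must still be a link
    in the new HTML (its href may differ); dangling old links may vanish."""
    new_texts = {text for text, _ in collect_links(new_html)}
    for text, href in collect_links(old_html):
        if href in valid_targets and text not in new_texts:
            return False
    return True


old = '<p><a href="#Foo">Foo</a> and <a href="#Gone">Bar</a></p>'
new = '<p><a href="Other.html#Foo">Foo</a> and Bar</p>'
ok = links_compatible(old, new, valid_targets={"#Foo"})
```

Here `Foo` stays a link with a different target (allowed by rule 3) and the dangling `Bar` link is dropped (allowed by rule 2), so the pair is accepted.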

This PR adds a SQLite database that contains all of the documentation
info.
@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 19, 2026

Benchmark results for 7f88ca7 against 837f89a are in! @david-christiansen

No significant changes detected.

@david-christiansen
Contributor Author

david-christiansen commented Jan 19, 2026

Significance detection isn't working yet because there isn't enough data. The result for this version isn't great, so more work is needed:

benchmark // metric            value        delta         change   unit  runner
mathlib-docs // instructions   258.8T       +175.4T       +210.4%        runner-mathlib1
mathlib-docs // maxrss         6 GiB        -4 MiB        -0.1%    B     runner-mathlib1
mathlib-docs // task-clock     14h 12m 51s  +10h 25m 39s  +275.4%  s     runner-mathlib1
mathlib-docs // wall-clock     12m 17s      +7m 25s       +152.8%  s     runner-mathlib1
own-docs // instructions       3.7T         +50.7G        +1.4%          runner-mathlib1
own-docs // maxrss             3 GiB        -21 MiB       -0.6%    B     runner-mathlib1
own-docs // task-clock         5m 51s       +6s           +1.9%    s     runner-mathlib1
own-docs // wall-clock         2m 39s       +2s           +1.3%    s     runner-mathlib1
radar/run/main // time         17m 9s       +7m 40s       +80.9%   s     runner-mathlib1
radar/run/main/script // time  17m 8s       +7m 40s       +81.1%   s     runner-mathlib1

task-clock and wall-clock for Mathlib builds are the most important measurements here.
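For readers checking the percentages, the following sketch recomputes them from the Mathlib timing rows. It assumes that the first figure in each row is the new measurement and the second is the change relative to the baseline commit, so the baseline is their difference; that reading of the Radar output is an assumption, though it is consistent with the numbers shown. The helper names are made up for illustration.

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1, "ms": 0.001}


def parse_duration(text):
    """Convert a duration such as '14h 12m 51s' or '706ms' to seconds."""
    parts = re.findall(r"(\d+)(ms|h|m|s)\b", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def percent_change(current, delta):
    """Relative change against the baseline, where baseline = current - delta."""
    baseline = current - delta
    return 100.0 * delta / baseline


# mathlib-docs // task-clock: 14h 12m 51s, +10h 25m 39s
task_clock = percent_change(
    parse_duration("14h 12m 51s"), parse_duration("10h 25m 39s")
)

# mathlib-docs // wall-clock: 12m 17s, +7m 25s
wall_clock = percent_change(parse_duration("12m 17s"), parse_duration("7m 25s"))
```

The task-clock figure agrees with the reported +275.4% after rounding. The wall-clock figure comes out a few tenths of a percent lower than the reported +152.8%, presumably because the displayed durations are rounded to whole seconds.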

Experiment to see if slowdown due to database file contention
@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 20, 2026

Benchmark results for 9272ece against 837f89a are in! @david-christiansen

No significant changes detected.

@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 20, 2026

Benchmark results for 418fa51 against 837f89a are in! @david-christiansen

No significant changes detected.

@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 20, 2026

Benchmark results for 76667ea against 837f89a are in! @david-christiansen

No significant changes detected.

@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 20, 2026

Benchmark results for e429f8e against 837f89a are in! @david-christiansen

No significant changes detected.

@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 20, 2026

Benchmark results for e1d2a8b against 837f89a are in! @david-christiansen

No significant changes detected.

@david-christiansen
Contributor Author

As of e429f8e, it's:

benchmark // metric            value        delta         change   unit
mathlib-docs // instructions   84.6T        +1.2T         +1.5%
mathlib-docs // maxrss         6 GiB        -4 MiB        -0.1%    B
mathlib-docs // task-clock     3h 49m 49s   +2m 37s       +1.2%    s
mathlib-docs // wall-clock     4m 58s       +6s           +2.3%    s
own-docs // instructions       3.7T         +40.6G        +1.1%
own-docs // maxrss             3 GiB        +39 MiB       +1.2%    B
own-docs // task-clock         5m 55s       +10s          +3.1%    s
own-docs // wall-clock         2m 41s       +4s           +3.1%    s
radar/run/main // time         19m 9s       +9m 40s       +102.0%  s
radar/run/main/script // time  19m 8s       +9m 40s       +102.3%  s

It seems that the Mathlib cache was only getting partial values.

@david-christiansen
Contributor Author

With the original PR code, it's comparable:

benchmark // metric            value        delta         change   unit
mathlib-docs // instructions   84.6T        +1.2T         +1.4%
mathlib-docs // maxrss         6 GiB        -3 MiB        -0.0%    B
mathlib-docs // task-clock     3h 50m 4s    +2m 52s       +1.3%    s
mathlib-docs // wall-clock     4m 58s       +6s           +2.3%    s
own-docs // instructions       3.7T         +52.2G        +1.4%
own-docs // maxrss             3 GiB        +75 MiB       +2.3%    B
own-docs // task-clock         5m 51s       +7s           +2.1%    s
own-docs // wall-clock         2m 40s       +3s           +1.9%    s
radar/run/main // time         19m 10s      +9m 41s       +102.2%  s
radar/run/main/script // time  19m 9s       +9m 41s       +102.4%  s

Extensions are not presently handled, but the fallback data are saved.
This is the first step towards rendering HTML from the DB instead of
directly from the environment. The serializable version of CodeWithInfos
used here can be saved in the DB. The generated HTML is the same, modulo
commit hashes and external URLs.
@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 27, 2026

Benchmark results for ec4cf3e against 837f89a are in! @david-christiansen

  • mathlib-docs//instructions: +1.3T (+1.6%)
  • mathlib-docs//maxrss: -2MiB (-0.0%)
  • mathlib-docs//task-clock: +3m 23s (+1.5%)
  • mathlib-docs//wall-clock: +7s (+2.7%)
  • own-docs//instructions: +91.8G (+2.5%)
  • own-docs//maxrss: +80MiB (+2.4%)
  • own-docs//task-clock: +11s (+3.4%)
  • own-docs//wall-clock: +4s (+2.7%)

No significant changes detected.

This is preliminary to generating HTML from the database. The output
is still unchanged, modulo commit hashes and source URLs.
@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 27, 2026

Benchmark results for 9489cfd against 837f89a are in! @david-christiansen

  • mathlib-docs//instructions: +1.2T (+1.4%)
  • mathlib-docs//maxrss: -3MiB (-0.1%)
  • mathlib-docs//task-clock: +2m 37s (+1.2%)
  • mathlib-docs//wall-clock: +9s (+3.4%)
  • own-docs//instructions: +51.9G (+1.4%)
  • own-docs//maxrss: -821MiB (-24.6%)
  • own-docs//task-clock: +706ms (+0.2%)
  • own-docs//wall-clock: -378ms (-0.2%)

No significant changes detected.

The scripts indicate that the output is the same, modulo minor
differences in automatic linking.
david-christiansen changed the title from "feat: save documentation info to SQLite databse" to "feat: save documentation info to SQLite database" on Jan 30, 2026
@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Jan 30, 2026

Benchmark results for c6f2804 against 4dbbb80 are in! @david-christiansen

  • mathlib-docs//instructions: +10.2T (+12.12%)
  • mathlib-docs//maxrss: +2MiB (+0.03%)
  • mathlib-docs//task-clock: +35m 45s (+15.15%)
  • mathlib-docs//wall-clock: +31s (+10.46%)
  • own-docs//instructions: +1.4T (+37.17%)
  • own-docs//maxrss: -846MiB (-26.01%)
  • own-docs//task-clock: +4m 23s (+74.36%)
  • own-docs//wall-clock: +3s (+2.15%)

No significant changes detected.

@david-christiansen
Contributor Author

Timing info for the new HTML generation step:

Loading shared index from database: /tmp/tmp.rzajn4HxJa/mathproject/.lake/build/lean-docs.db
Index loaded in 739ms (397460 declarations, 9978 modules)
Hierarchy took 0ms
Context took 16ms
Generating HTML in parallel to: /tmp/tmp.rzajn4HxJa/mathproject/.lake/build/doc-from-db
HTML took 22564ms
HTML index took 7307ms

This is about 30 seconds at the end, after everything is built, which seems not terrible. I'll try disabling the old HTML and see how the benchmark looks.
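The "about 30 seconds" figure can be checked by summing the phases in the log above. The following is a small sketch that parses the timing lines quoted in this comment; the parsing helper is illustrative only and is not part of doc-gen.

```python
import re

# Timing lines copied from the log above (the two path lines are omitted).
LOG = """\
Index loaded in 739ms (397460 declarations, 9978 modules)
Hierarchy took 0ms
Context took 16ms
HTML took 22564ms
HTML index took 7307ms
"""


def phase_times_ms(log):
    """Map each phase label to its duration in milliseconds."""
    times = {}
    for line in log.splitlines():
        match = re.match(r"(.+?) (?:took|loaded in) (\d+)ms", line)
        if match:
            times[match.group(1)] = int(match.group(2))
    return times


times = phase_times_ms(LOG)
total_ms = sum(times.values())  # 30626 ms, i.e. roughly 30.6 s
```

Almost all of that time is in the two HTML phases (22564 ms + 7307 ms), so index loading from the database is a small part of the total.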

@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Feb 2, 2026

Benchmark results for c169a17 against 4dbbb80 are in! @david-christiansen

  • mathlib-docs//instructions: -80.0T (-95.23%)
  • mathlib-docs//maxrss: -2GiB (-60.67%)
  • mathlib-docs//task-clock: -3h 47m 15s (-96.26%)
  • mathlib-docs//wall-clock: -2m 21s (-46.24%)
  • own-docs//instructions: +701.9G (+19.05%)
  • own-docs//maxrss: -870MiB (-26.75%)
  • own-docs//task-clock: +3m 39s (+61.79%)
  • own-docs//wall-clock: -4s (-2.72%)

No significant changes detected.

@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Feb 2, 2026

Benchmark results for 569cd63 against 4dbbb80 are in! @david-christiansen

  • 🟥 main exited with code 1

No significant changes detected.

@david-christiansen
Contributor Author

!bench

@leanprover-radar

leanprover-radar commented Feb 2, 2026

Benchmark results for fa6ae51 against 4dbbb80 are in! @david-christiansen

  • 🟥 main exited with code 1

No significant changes detected.
