From 47bba89982360036828eb8ae2800dfd3bffd4eb1 Mon Sep 17 00:00:00 2001
From: Pablo Galindo Salgado
Date: Mon, 6 Oct 2025 01:42:32 +0100
Subject: [PATCH 1/3] PEP 810: Add section about performance

---
 peps/pep-0810.rst | 114 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 114 insertions(+)

diff --git a/peps/pep-0810.rst b/peps/pep-0810.rst
index c619aaf0954..d44b6aafd28 100644
--- a/peps/pep-0810.rst
+++ b/peps/pep-0810.rst
@@ -734,6 +734,120 @@ Subinterpreters are supported. Each subinterpreter maintains its own
 ``sys.lazy_modules`` and import state, so lazy imports in one subinterpreter do
 not affect others.
 
+Performance
+-----------
+
+Lazy imports have **no measurable performance overhead**. The implementation
+is designed to be performance-neutral for both code that uses lazy imports and
+code that doesn't.
+
+Runtime performance
+~~~~~~~~~~~~~~~~~~~
+
+After reification (first use), lazy imports have **zero overhead**. The
+adaptive interpreter specializes the bytecode (typically after 2-3 accesses),
+eliminating any checks. For example, ``LOAD_GLOBAL`` becomes
+``LOAD_GLOBAL_MODULE``, which directly accesses the module identically to
+normal imports.
+
+The `pyperformance suite`_ confirms the implementation is performance-neutral.
+
+.. _pyperformance suite: https://github.com/facebookexperimental/
+   free-threading-benchmarking/blob/main/results/bm-20250922-3.15.0a0-27836e5/
+   bm-20250922-vultr-x86_64-DinoV-lazy_imports-3.15.0a0-27836e5-vs-base.svg
+
+Filter function performance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The filter function (set via ``sys.set_lazy_imports_filter()``) is called for
+every *potentially lazy* import to determine whether it should actually be
+lazy. The filter has **almost no measurable performance cost**. To measure
+this, we benchmarked importing all 278 top-level importable modules from the
+Python standard library (which transitively loads 392 total modules including
+all submodules and dependencies), then forced reification of every loaded
+module to ensure everything was fully materialized. We compared four different
+configurations:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 25 25
+
+   * - Configuration
+     - Mean ± Std Dev (ms)
+     - Overhead vs Baseline
+   * - **Eager imports** (baseline)
+     - 161.2 ± 4.3
+     - 0%
+   * - **Lazy + filter forcing eager**
+     - 161.7 ± 4.2
+     - +0.3% ± 3.7%
+   * - **Lazy + filter allowing lazy + reification**
+     - 162.0 ± 4.0
+     - +0.5% ± 3.7%
+   * - **Lazy + no filter + reification**
+     - 161.4 ± 4.3
+     - +0.1% ± 3.8%
+
+The four configurations:
+
+1. **Eager imports (baseline)**: Normal Python imports with no lazy machinery.
+   Standard Python behavior.
+
+2. **Lazy + filter forcing eager**: Filter function returns ``False`` for all
+   imports, forcing eager execution, then all imports are reified at script
+   end. Measures pure filter calling overhead since every import goes through
+   the filter but executes eagerly.
+
+3. **Lazy + filter allowing lazy + reification**: Filter function returns
+   ``True`` for all imports, allowing lazy execution. All imports are reified
+   at script end. Measures filter overhead when imports are actually lazy.
+
+4. **Lazy + no filter + reification**: No filter installed, imports are lazy
+   and reified at script end. Baseline for lazy behavior without filter.
+
+The benchmarks used `hyperfine `_,
+testing 278 standard library modules. Each ran in a fresh Python process.
+All configurations force the import of exactly the same set of modules
+(all modules loaded by the eager baseline) to ensure a fair comparison.
+
+The benchmark environment used CPU isolation with 32 logical CPUs (0-15 at
+3200 MHz, 16-31 at 2400 MHz), the performance scaling governor, Turbo Boost
+disabled, and full ASLR randomization. The overhead error bars are computed
+using standard error propagation for the formula ``(value - baseline) /
+baseline``, accounting for uncertainties in both the measured value and the
+baseline.
+
+Startup time improvements
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The primary performance benefit of lazy imports is reduced startup time by
+loading only the modules actually used at runtime, rather than optimistically
+loading entire dependency trees at startup.
+
+Real-world deployments at scale have demonstrated that the benefits can be
+massive, though of course this depends on the specific codebase and usage
+patterns. Organizations with large, interconnected codebases have reported
+substantial reductions in server reload times, ML training initialization,
+command-line tool startup, and Jupyter notebook loading. Memory usage
+improvements have also been observed as unused modules remain unloaded.
+
+For detailed case studies and performance data from production deployments,
+see:
+
+- `Python Lazy Imports With Cinder
+  `__
+  (Meta Instagram Server)
+- `Lazy is the new fast: How Lazy Imports and Cinder accelerate machine
+  learning at Meta
+  `__
+  (Meta ML Workloads)
+- `Inside HRT's Python Fork
+  `__
+  (Hudson River Trading)
+
+The benefits scale with codebase complexity: the larger and more
+interconnected the codebase, the more dramatic the improvements.
+
 Typing and tools
 ----------------
 
From 4803ff6f4aa72ef434038d13d1dbca3603c4e174 Mon Sep 17 00:00:00 2001
From: Pablo Galindo Salgado
Date: Mon, 6 Oct 2025 01:51:24 +0100
Subject: [PATCH 2/3] Small clarification

---
 peps/pep-0810.rst | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/peps/pep-0810.rst b/peps/pep-0810.rst
index d44b6aafd28..5bff6f2793a 100644
--- a/peps/pep-0810.rst
+++ b/peps/pep-0810.rst
@@ -765,8 +765,17 @@ lazy. The filter has **almost no measurable performance cost**. To measure
 this, we benchmarked importing all 278 top-level importable modules from the
 Python standard library (which transitively loads 392 total modules including
 all submodules and dependencies), then forced reification of every loaded
-module to ensure everything was fully materialized. We compared four different
-configurations:
+module to ensure everything was fully materialized.
+
+Note that these measurements establish the baseline overhead of the filter
+mechanism itself. Of course, any user-defined filter function that performs
+additional work beyond a trivial check will add overhead proportional to the
+complexity of that work. However, we expect that in practice this overhead
+will be dwarfed by the performance benefits gained from avoiding unnecessary
+imports. The benchmarks below measure the minimal cost of the filter dispatch
+mechanism when the filter function does essentially nothing.
+
+We compared four different configurations:
 
 .. list-table::
    :header-rows: 1
From 513734f586e073ec8328e307e20dfbaca6665250 Mon Sep 17 00:00:00 2001
From: Pablo Galindo Salgado
Date: Mon, 6 Oct 2025 18:44:49 +0100
Subject: [PATCH 3/3] fixup! Small clarification

---
 peps/pep-0810.rst | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/peps/pep-0810.rst b/peps/pep-0810.rst
index 5bff6f2793a..bf6f34830ce 100644
--- a/peps/pep-0810.rst
+++ b/peps/pep-0810.rst
@@ -761,11 +761,14 @@ Filter function performance
 
 The filter function (set via ``sys.set_lazy_imports_filter()``) is called for
 every *potentially lazy* import to determine whether it should actually be
-lazy. The filter has **almost no measurable performance cost**. To measure
-this, we benchmarked importing all 278 top-level importable modules from the
-Python standard library (which transitively loads 392 total modules including
-all submodules and dependencies), then forced reification of every loaded
-module to ensure everything was fully materialized.
+lazy. When no filter is set, this is simply a NULL check (testing whether a
+filter function has been registered), which is a highly predictable branch that
+adds essentially no overhead. When a filter is installed, it is called for each
+potentially lazy import, but this still has **almost no measurable performance
+cost**. To measure this, we benchmarked importing all 278 top-level importable
+modules from the Python standard library (which transitively loads 392 total
+modules including all submodules and dependencies), then forced reification of
+every loaded module to ensure everything was fully materialized.
 
 Note that these measurements establish the baseline overhead of the filter
 mechanism itself. Of course, any user-defined filter function that performs