@pieterh-oai pieterh-oai commented Dec 2, 2025

Summary

This PR extends generate.py to codegen Python shims for the entire Ruff AST,
replacing the catch-all Node from the previous PR (#21415). The shims work as follows:

  • Python AST node classes (and objects) are created entirely on the Rust side
    as PyO3 types.
  • From the Python code's point of view, node objects have the expected fields
    with names and types that line up with the Rust AST. The 'API' is a set of pyi
    files in ruff_linter/resources/.
  • The majority of AST node fields are lazily loaded (from the Rust AST to the
    projected Python AST) on first use. A subset of fields are marked as 'eager'
    (e.g. op for AugAssign is a string rather than a node subclass). The choice
    between lazy and eager projection is configured in ast.toml.
  • We use RawNode as a catch-all for any node types that are not covered by
    the projection code. It is effectively unused right now, but could come
    into play in the future if AST changes leave the codegen out of date.
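
To illustrate the lazy vs. eager distinction, here is a minimal pure-Python model. It is not the generated PyO3 code: LazyField, PROJECTED, and the dict standing in for the Rust-side node are all hypothetical names made up for this sketch. The only thing it tries to show is that an eager field is copied at construction time, while a lazy field is projected on first access and then cached.

```python
_UNSET = object()  # sentinel meaning "not projected yet"
PROJECTED = []     # records each projection so laziness is observable


class LazyField:
    """Descriptor that projects a field on first access, then caches it."""

    def __init__(self, loader):
        self._loader = loader

    def __set_name__(self, owner, name):
        self._name = name
        self._slot = "_cached_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = getattr(obj, self._slot, _UNSET)
        if value is _UNSET:
            PROJECTED.append(self._name)
            value = self._loader(obj)
            setattr(obj, self._slot, value)
        return value


class AugAssign:
    # Lazy: the child node is only built when a rule reads .target.
    target = LazyField(lambda node: {"kind": "Name", "id": node._raw["target"]})

    def __init__(self, raw):
        self._raw = raw      # stand-in for a handle to the Rust-side node
        self.op = raw["op"]  # eager: a plain string, copied at construction


node = AugAssign({"op": "+=", "target": "x"})
print(node.op)                     # eager field, no projection needed
print(PROJECTED)                   # nothing has been projected yet
print(node.target)                 # first access triggers the projection
print(node.target is node.target)  # later accesses reuse the cached object
```

In the real implementation the loader would call back into Rust; the caching behaviour is the part this sketch is meant to convey.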

The codegen in generate.py needs to generate 3 things:

  1. PyO3 types for each AST node type; see write_projection_bindings and
    the result in ruff_linter/src/external/ast/python/generated.rs.
  2. Per-node-type projection code; see write_projection_helpers and
    the corresponding output in projection.rs.
  3. Python .pyi interface files; see write_python_stub and
    the output in ruff_linter/resources/ruff_external.
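
As a rough idea of what the third output involves, the sketch below renders a .pyi stub from a node description. It is an assumption-heavy illustration, not the code in generate.py: render_stub and the dict-based spec are hypothetical, and the actual schema of ast.toml and the body of write_python_stub are not reproduced here.

```python
import ast


def render_stub(nodes):
    """Render stub text for a {node_name: {field_name: type_name}} mapping.

    Hypothetical helper: the real input format and output layout may differ.
    """
    lines = ["class Node: ...", ""]
    for name, fields in nodes.items():
        lines.append(f"class {name}(Node):")
        if not fields:
            lines.append("    ...")
        for field_name, type_name in fields.items():
            # Read-only properties mirror fields projected from the Rust AST.
            lines.append("    @property")
            lines.append(f"    def {field_name}(self) -> {type_name}: ...")
        lines.append("")
    return "\n".join(lines)


spec = {"AugAssign": {"target": "Expr", "op": "str", "value": "Expr"}}
stub = render_stub(spec)
ast.parse(stub)  # raises SyntaxError if the generated stub is malformed
print(stub)
```

Parsing the generated text with the standard ast module is a cheap sanity check that the stub is at least syntactically valid Python.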

Lifetime-related considerations

The challenging parts (at least for me) were handling lifetime differences
between PyO3 AST node objects and the Ruff types that they need to refer to:
the SourceFile and the corresponding Ruff AST node.

Once an object is on the Python-side heap, it's not really possible to limit its
lifetime (since we can't really make assertions about when its refcount will hit 0,
or whether it'll need to be GC'd). This means that we can't include direct references
to the shorter-lived Rust-side objects in any PyO3 objects.

It's not too hard to satisfy the Rust compiler by cloning Ruff AST nodes and Arc
references to the source, but that isn't efficient. Copying entire AST subtrees
defeats the purpose of making field access lazy in the first place, and holding
strong references to the SourceFile would mean that any file with at least 1 Python
rule execution (even if it yields 0 results) is held in memory until the end of the
process (or, at best, a future CPython GC cycle, I guess?).

On the flip side, if we avoid copying nodes and holding strong references to the
source, we need to manage the lifecycle explicitly so that the Python code fails in
a predictable way. We probably want to fail early if a Python rule does something it
shouldn't (like storing nodes in a global and then trying to use them across
check_expr invocations), rather than failing arbitrarily later.

This is what I ended up doing:

  • AstStore maps ID numbers to Ruff AST nodes to support lazy loading and
    (in a future PR) Semantic calls. This mapping is short-lived (per rule
    invocation), and the ExternalCheckerContext explicitly calls invalidate
    so that, as soon as rule dispatch ends, any further access to Python AST
    nodes fails (at least for fields that were not already loaded).

  • SourceFileHandle uses a thread-local global to store the current SourceFile;
    once the handle goes out of scope, the reference to SourceFile is unset and
    any subsequent calls to locator will fail.
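
The two mechanisms above can be modelled in a few lines of Python. This is only a sketch of the idea, not a port of store.rs or source.rs: StaleNodeError, source_scope, and locator_slice are hypothetical names, and plain strings stand in for Ruff AST nodes and the SourceFile.

```python
import threading
from contextlib import contextmanager


class StaleNodeError(RuntimeError):
    """Raised when a rule touches data after its invocation has ended."""


class AstStore:
    """Per-invocation map from ID numbers to (stand-in) AST nodes."""

    def __init__(self):
        self._nodes = {}
        self._next_id = 0
        self._valid = True

    def insert(self, node):
        node_id = self._next_id
        self._next_id += 1
        self._nodes[node_id] = node
        return node_id

    def get(self, node_id):
        if not self._valid:
            raise StaleNodeError("AST store was invalidated")
        return self._nodes[node_id]

    def invalidate(self):
        # Called when rule dispatch ends; drops all references to the nodes.
        self._valid = False
        self._nodes.clear()


_current = threading.local()  # one "current source" slot per OS thread


@contextmanager
def source_scope(source_text):
    """Make source_text the current source for the duration of the block."""
    previous = getattr(_current, "source", None)
    _current.source = source_text
    try:
        yield
    finally:
        _current.source = previous  # unset (or restore) on scope exit


def locator_slice(start, end):
    source = getattr(_current, "source", None)
    if source is None:
        raise StaleNodeError("no source file is active on this thread")
    return source[start:end]


store = AstStore()
node_id = store.insert("x += 1")
with source_scope("x += 1"):
    print(store.get(node_id), locator_slice(0, 1))
store.invalidate()  # from here on, store.get(node_id) raises StaleNodeError
```

A rule that stashes node_id in a global and calls store.get after invalidate fails immediately with StaleNodeError, which is the fail-early behaviour described above.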

The code in store.rs and source.rs does more or less the same thing; I'd
be interested in feedback on how to clean this up more / how to make this
more idiomatic.

Other tradeoffs:

  • Lazy loading means that we generate very few Python heap objects up front.
    Calls from Python into Rust appear to be relatively inexpensive because
    of the 'single interpreter with multiple attached OS threads that
    don't interact' setup.

  • This exposes the Ruff AST directly (though the codegen is structured
    in a way that could permit some rewriting).

    • The downside is that future changes to the AST will directly affect the
      Python API.
    • The advantage is that we can expose APIs like Semantic directly
      to Python rules. This means that Python rules look/feel very
      similar to equivalent rules implemented in Rust.
  • This is obviously a big PR, with 1kloc of new stuff in generate.py
    and 10kloc of generated code. I have put some, but not a lot, of work
    into trying to simplify the codegen parts.

Test Plan

cargo test --workspace --exclude ty
source ./scripts/nogil/nogil_env.sh ~/cpython_3_13_nogil/
cargo test --workspace --exclude ty --features ext-lint
uvx pre-commit run --all-files --show-diff-on-failure

@pieterh-oai pieterh-oai changed the title [ruff][ext-lint] 4: codegen for the full AST RFC [ruff][ext-lint] 4: codegen for the full AST Dec 2, 2025