RFC [ruff][ext-lint] 4: codegen for the full AST #21764
Open
+19,086
−470
Summary
This PR extends generate.py to codegen Python shims for the (entire) Ruff AST, replacing the catch-all Node from the previous PR (#21415). The shims work as follows:
as PyO3 types.
with names and types that line up with the Rust AST. The 'API' is a set of pyi files in ruff_linter/resources/.
projected Python AST) on first use. A subset of fields are marked as 'eager' (e.g. op for AugAssign is a string rather than a node subclass). Lazy vs. eager projection is configured in ast.toml.
RawNode as a catch-all for any node types that are somehow not covered by the projection code. This means it's effectively unused right now, but might be used in the future if AST changes cause the codegen to be out of date.
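To make the lazy vs. eager distinction concrete, here is a minimal pure-Python sketch of the idea. It is a model, not the generated PyO3 code: `ProjectedNode`, `project`, and the `EAGER_FIELDS` table are hypothetical names, and the table only stands in for whatever ast.toml actually contains (its schema is not shown in this description).

```python
from typing import Any, Callable

# Hypothetical stand-in for the eager/lazy configuration in ast.toml.
EAGER_FIELDS = {"AugAssign": {"op"}}


class ProjectedNode:
    """Python-side shim whose lazy fields are projected on first access."""

    def __init__(self, kind: str, eager: dict[str, Any],
                 lazy: dict[str, Callable[[], Any]]):
        self.kind = kind
        self._loaders = dict(lazy)
        # Eager fields are materialized immediately as plain attributes.
        self.__dict__.update(eager)

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails, i.e. the field is not loaded yet.
        loaders = self.__dict__.get("_loaders", {})
        if name not in loaders:
            raise AttributeError(name)
        value = loaders.pop(name)()
        setattr(self, name, value)  # cache so later reads skip the loader
        return value


def project(kind: str, raw: dict[str, Callable[[], Any]]) -> ProjectedNode:
    """Split zero-argument field loaders into eager and lazy sets."""
    eager_names = EAGER_FIELDS.get(kind, set())
    eager = {n: load() for n, load in raw.items() if n in eager_names}
    lazy = {n: load for n, load in raw.items() if n not in eager_names}
    return ProjectedNode(kind, eager, lazy)


# Usage: only the eager field is loaded when the node is built.
node = project("AugAssign", {"op": lambda: "+=", "target": lambda: "x"})
print(node.op, node.target)
```

In this model an eager field costs one load per node whether or not a rule reads it, while a lazy field costs nothing until it is first touched and is cached afterwards.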
The codegen in generate.py needs to generate 3 things:
write_projection_bindings and the result in ruff_linter/src/external/ast/python/generated.rs.
write_projection_helpers and the corresponding output in projection.rs.
pyi interface files; see write_python_stub and the output in ruff_linter/resources/ruff_external.
Lifetime-related considerations
The challenging parts (at least for me) were handling lifetime differences
between PyO3 AST node objects and the Ruff types that they need to refer to:
the SourceFile and the corresponding Ruff AST node.
Once on the Python-side heap, it's not really possible to limit object lifetime
(since we can't really make assertions about when their refcount will hit 0, or
whether they'll need to be GC'd). This means that we can't include direct references
to the shorter-lived Rust-side objects in any PyO3 objects.
It's not too hard to satisfy the Rust compiler by cloning Ruff AST nodes and
Arc references to the source, but that isn't efficient. Copying entire AST subtrees
defeats the purpose of making field access lazy in the first place, and holding
references through the SourceFile would mean that any file with at least 1 Python
rule execution (even if it yields 0 results) will be held in memory until end of process
(or, at best, a future CPython GC cycle, I guess?).
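The retention problem can be illustrated in plain Python. This is only an analogy for the Rust-side Arc situation, not the mechanism used in this PR: `SourceFile`, `StrongNode`, and `WeakNode` below are hypothetical toy classes, and the immediate release on `del` relies on CPython reference counting (hence the explicit `gc.collect()` calls).

```python
import gc
import weakref


class SourceFile:
    """Toy stand-in for a source file held on the Rust side."""

    def __init__(self, text: str):
        self.text = text


class StrongNode:
    """Holds a strong reference: the file lives as long as the node does."""

    def __init__(self, source: SourceFile):
        self.source = source


class WeakNode:
    """Holds a weak reference: the file can be released, and access then fails."""

    def __init__(self, source: SourceFile):
        self._source = weakref.ref(source)

    @property
    def source(self) -> SourceFile:
        source = self._source()
        if source is None:
            raise ReferenceError("source file has been released")
        return source


# A node that leaks out of a rule keeps the whole file alive.
source = SourceFile("x = 1")
watcher = weakref.ref(source)
leaked = StrongNode(source)
del source
gc.collect()
print("file still alive:", watcher() is not None)
```

The weak variant avoids the retention but moves the failure to whenever the file happens to be released, which is the "arbitrarily failing later" behaviour the next paragraph argues against.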
On the flip side, if we avoid copying nodes and strong references to the source,
we need to lifecycle-manage this so that the Python code fails in a predictable
way. We probably want to fail early if a Python rule does something it shouldn't
(like storing nodes in a global and then trying to use them across
check_expr invocations), rather than arbitrarily failing later.
This is what I ended up doing:
AstStore maps ID numbers to Ruff AST nodes to support lazy loading and (in a future PR) Semantic calls. This mapping is short-lived (per rule invocation), and the ExternalCheckerContext explicitly calls invalidate so that, as soon as rule dispatch ends, future access to Python AST nodes will fail (at least for any fields that were not already loaded).
SourceFileHandle uses a thread-local global to store the current SourceFile; once it goes out of scope, the reference to SourceFile is unset and any subsequent calls to locator will fail.
The code in store.rs and source.rs does more or less the same thing; I'd be interested in feedback on how to clean this up more / how to make this more idiomatic.
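The two mechanisms above can be modelled in a few lines of Python. This is a sketch of the behaviour described, not a port of store.rs or source.rs: the names AstStore, invalidate, and locator come from the description, while `StaleNodeError`, `source_scope`, and `dispatch` are hypothetical, and nodes and source files are modelled as plain strings.

```python
import threading
from contextlib import contextmanager


class StaleNodeError(RuntimeError):
    """Raised when a handle outlives the rule invocation that created it."""


class AstStore:
    """Maps integer IDs to nodes for the duration of one rule invocation."""

    def __init__(self):
        self._nodes = {}
        self._next_id = 0
        self._valid = True

    def insert(self, node):
        node_id = self._next_id
        self._next_id += 1
        self._nodes[node_id] = node
        return node_id

    def get(self, node_id):
        if not self._valid:
            raise StaleNodeError("AST node used after rule dispatch ended")
        return self._nodes[node_id]

    def invalidate(self):
        self._valid = False
        self._nodes.clear()  # drop references so nothing is kept alive


_current = threading.local()


@contextmanager
def source_scope(source_text):
    """Expose the source to the current thread only while the scope is open."""
    previous = getattr(_current, "source", None)
    _current.source = source_text
    try:
        yield
    finally:
        _current.source = previous


def locator():
    source = getattr(_current, "source", None)
    if source is None:
        raise StaleNodeError("source accessed outside rule dispatch")
    return source


def dispatch(rule, node, source_text):
    """Run one rule with a store and source that die when the call returns."""
    store = AstStore()
    node_id = store.insert(node)
    with source_scope(source_text):
        try:
            rule(store, node_id)
        finally:
            store.invalidate()
```

A rule that stashes `(store, node_id)` in a global and calls `store.get(node_id)` after `dispatch` returns gets a `StaleNodeError` on the first access rather than stale data. Both halves follow the same "set up, run, tear down" shape, which may be one way to unify them.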
Other tradeoffs:
Lazy loading means that we generate very few Python heap objects up front.
Calls from Python into Rust appear to be relatively inexpensive because
of the 'single interpreter with multiple attached OS threads that
don't interact' setup.
This exposes the Ruff AST directly (though the codegen is structured
in a way that could permit some rewriting).
Python API.
Semantic directly to Python rules. This means that Python rules look/feel very
similar to equivalent rules implemented in Rust.
This is obviously a big PR, with 1kloc of new stuff in generate.py and 10kloc
of generated code. I have put some, but not a lot, of work
into trying to simplify the codegen parts.
Test Plan