@mlechu mlechu commented Aug 14, 2025

Three simple SyntaxGraph/SyntaxTree utilities. The semantics described in
the docstrings are probably worth reading first; this PR description is mainly
justification.

Let me know if I guessed the division between syntax_graph.jl and ast.jl wrong
(underlying data structure vs. Julia syntax-y stuff? There is some overlap).

prune

I'm considering different ways of making JuliaLowering's provenance usable for
the rest of the compiler. One such way would be to store SyntaxTrees directly
in the sysimage, but they are currently far too large for that. One reason is
that nodes are never deleted through lowering (good thing). This change adds
the ability to extract a useful subset of nodes from a SyntaxTree once we're
done lowering.

This is implemented in two ways so we can see which we like better. prune_a
leaves nodes with multiple parents aliased, while prune_u first ensures that no
node has more than one parent (see unalias_nodes below). Unaliasing gives up the
space saved by the DAG representation, but allows us to store a level-order
traversal of the graph in the edge_ranges field and remove the edges field
entirely. The algorithm for prune_u is also simpler. Feedback welcome on
which you prefer @c42f; I don't intend to keep both.
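
As a rough illustration of the prune_u layout, here is a toy model of unaliasing into level order. It uses plain vectors rather than the real SyntaxGraph fields, and `unalias_level_order`, `children` and `orig` are hypothetical names, not code from this PR. Once every reference has its own copy and nodes are numbered in level order, each node's children are contiguous, so a range per node is enough and no separate edge list is needed.

```julia
# Toy model, not the JuliaLowering API: `children[i]` lists the child ids of
# node i, and a node may appear in several lists (a DAG).
function unalias_level_order(children::Vector{Vector{Int}}, root::Int)
    orig = [root]                    # new id -> old id, in level order
    edge_ranges = UnitRange{Int}[]   # new id -> range of new child ids
    i = 1
    while i <= length(orig)
        cs = children[orig[i]]
        first_child = length(orig) + 1
        append!(orig, cs)            # every reference gets its own copy
        push!(edge_ranges, first_child:length(orig))
        i += 1
    end
    return edge_ranges, orig
end

# Node 4 is shared by nodes 2 and 3, so it is duplicated in the output.
dag = [[2, 3], [4], [4], Int[]]
edge_ranges, orig = unalias_level_order(dag, 1)
```

In this example the four-node DAG becomes a five-node tree: the shared leaf appears twice in `orig`, and leaves get empty ranges.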

Worst case (i.e. SyntaxTrees are too large or otherwise not usable in the
sysimage) we still get a handy utility for tools like the language server, which
will probably be caching a lot of syntax trees. In general, I don't think most
consumers will need a chain of provenance that includes intermediate
lowering-produced nodes, so we might benefit from omitting them by default.

annotate_parent!

Parent links are present in SyntaxNode but not carried over to SyntaxTree,
since a node in a DAG can have several parents, but the utility below makes
them trivial to add. Given that syntaxnode.parent is used extensively in the
wild (I've missed it in language server development), I lean towards including
this function in JuliaLowering before a hundred packages need to implement it
separately.
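
A minimal sketch of the idea on a toy tree, assuming nodes are integer ids with a `children` vector per node. `parents_of` is a hypothetical helper, not the annotate_parent! in this PR. It also shows why unaliasing has to come first: a node with two parents has no single answer.

```julia
# Toy model, not the JuliaLowering API: `children[i]` lists the child ids of node i.
function parents_of(children::Vector{Vector{Int}}, root::Int)
    parent = zeros(Int, length(children))   # 0 means "no parent" (the root)
    stack = [root]
    while !isempty(stack)
        p = pop!(stack)
        for c in children[p]
            parent[c] == 0 || error("node $c has more than one parent; unalias first")
            parent[c] = p
            push!(stack, c)
        end
    end
    return parent
end

tree = [[2, 3], [4], Int[], Int[]]
parent = parents_of(tree, 1)
```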

unalias_nodes

Generally useful for analyzing SyntaxTrees. Examples where I've wanted this:

  • At the JuliaCon hackathon, I made a very early prototype of a replacement for TypedSyntax.jl.
    I don't think it's correct to annotate nodes with types without unaliasing
    first.
  • Before knowing about JuliaSyntaxFormatter, I was thinking about writing a
    simple formatter that produces deterministic output text given a syntax tree.
    This would involve a pass annotating each node with an indent level, which is
    an example of a property that would be different with different references to
    the same node.
  • Making the two utilities above easier/possible

TODO

  • Decide which prune to use; experiment with size
  • Add prune tests

@mlechu mlechu changed the title from "Ec/prune trees" to "SyntaxGraph utils: unalias_nodes, annotate_parent, prune" Aug 14, 2025
c42f commented Aug 15, 2025

I like this. Some high level comments:

  • prune - yep we absolutely need a way of compressing the provenance information for the sysimage and ji files. My gut feeling was that the original source code plus a linked set of byte range overlays representing provenance could work. But I don't have strong opinions about any of this yet.
  • annotate_parent! - I admit I've been quite tempted to delete the parent field from SyntaxNode and replace it with a cursor interface 😅 But I can also imagine cases where it's more efficient to annotate the whole tree with it. In any case it's interesting that you find it very useful.
  • unalias_nodes - ooh your typed_syntax prototype is extremely cool. I can see the desire to unalias when you want to do mutating operations. In JuliaLowering I originally used mutation in scope resolution to add the binding id, but quickly discovered this leads to problems when desugaring creates a duplicate node in the output. After that I abandoned mutation and used the provenance system instead. I think this is a fairly general strategy: rather than mutating a node, you can just duplicate it and add an attribute, then point to the original node in the provenance info. Would that also work for type annotations? Sometimes it might be simpler to write a pass using mutation though.
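
A toy sketch of the duplicate-and-annotate strategy described in the last point. The `Node` struct and `with_attr` helper are hypothetical stand-ins, not JuliaLowering's actual node type or provenance attribute. The original node is never mutated, so other references to it are unaffected, and the copy points back to it.

```julia
# Toy model only: a node with attributes and a provenance link.
struct Node
    kind::Symbol
    attrs::Dict{Symbol,Any}
    source::Union{Node,Nothing}   # provenance: the node this one was derived from
end

# Return a copy carrying one extra attribute; `node` itself is left untouched.
with_attr(node::Node, key::Symbol, val) =
    Node(node.kind, merge(node.attrs, Dict{Symbol,Any}(key => val)), node)

x = Node(:Identifier, Dict{Symbol,Any}(:name_val => "x"), nothing)
x_typed = with_attr(x, :type, Int)
```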

@mlechu mlechu force-pushed the ec/core-hook branch 2 times, most recently from 10db904 to efe3b63 August 15, 2025 23:45
Base automatically changed from ec/core-hook to main August 16, 2025 03:44
@mlechu mlechu changed the base branch from main to ec/graph-tweaks August 25, 2025 23:06
@mlechu mlechu force-pushed the ec/prune-trees branch 2 times, most recently from ef1749b to 5a1dbe8 August 25, 2025 23:54
Base automatically changed from ec/graph-tweaks to main August 29, 2025 00:26
mlechu commented Sep 4, 2025

Marking ready for review! Changes since the original version:

  • Decided to keep the unaliasing version of prune. Being able to remove the edges array in some future CompactSyntaxGraph looks to outweigh any space savings from the DAG representation (especially in the linear IR where there isn't much nesting). This future type would be immutable structure-wise, but trivial to "rehydrate" into a normal SyntaxGraph.
  • Removed annotate_parent!, since we wouldn't be using it internally and it's easy to implement if anyone wants it.

I won't optimize for space too much until we know what provenance will look like on disk, but this PR includes some easy savings without changes to SyntaxGraph.

julia> st0 = parsestmt(SyntaxTree, "begin; x1=1; x2=2; x3=3; x4=begin; 4; end; begin; end; end")

julia> st5 = JuliaLowering.lower(Module(), st0)

julia> stp = JuliaLowering.prune(st5)

julia> Base.summarysize(st5)
68452

julia> Base.summarysize(stp)
26360

Possible IR-specific improvements:

  • Deleting attributes that aren't semantically important to the IR or its .source
  • Storing required attributes more compactly
  • Improvements to SourceFile/SourceRef: currently calling prune with keep=nothing tends to increase the size even though fewer nodes are kept. This is because the parsed AST serves as a layer of indirection, and removing it causes each of the many nodes in the lowered tree to have a full SourceRef struct as an attribute.

@mlechu mlechu marked this pull request as ready for review September 4, 2025 21:28
@mlechu mlechu force-pushed the ec/prune-trees branch 2 times, most recently from d5f82aa to d29a684 October 15, 2025 16:00
Exactly what information we want to prune in the final implementation is not
determined, but `green_tree` is probably part of it.