parser_stream: Produce green tree traversal rather than token ranges (#560)
## Background
I've written about 5 parsers that use the general red tree/green tree
pattern. Now that we're using JuliaSyntax in base, I'd like to replace
some of them by a version based on JuliaSyntax, so that I can avoid
having to maintain multiple copies of similar infrastructure. As a result, I'm
taking a close look at some of the internals of JuliaSyntax.
## Current Design
One thing that I really like about JuliaSyntax is that the parser basically
produces a flat output buffer (well two in the current design, after
#19). In essence, the output
is a post-order depth-first traversal of the parse tree, each node annotated
with the token range it covers.
From there, it is possible to recover the parse tree without re-parsing
by partitioning the token list according to the ranges of the non-terminal
tokens. One particular application of this is to re-build a pointer-y green
tree structure that stores relative byte ranges and serves the same incremental
parsing purpose as green tree representations in other systems.
The single-output-buffer design is a great innovation over the pointer-y
system. It's much easier to handle and it also enforces important invariants
by construction (or at least makes them easy to check). However, I think
the whole post-parse tree construction logic is reducing the value of it
significantly. In particular, green trees are supposed to be able to serve
as compact, persistent representations of the parse tree. However, here the
compact, persistent representation (the output memory buffer) is not usable
as a green tree. We do have the pointer-y `GreenNode` tree, but this has
all the same downsides that the single buffer system was supposed to avoid.
It uses explicit vectors in every node and even constructing it from the
parser output allocates a nontrivial amount of memory to recover the tree
structure.
## Proposed design
This PR proposes to change the parser output to be directly usable as a
green tree in situ by changing the post-order DFS traversal to instead
produce (byte, node) spans. Note that this is the same data as in the
current `GreenNode`, except that there the node span is implicit in the length
of the child vector, whereas here the children are implicit in their position
in the output.
This does essentially mean semantically reverting #19,
but the representation proposed here is more compact than both main and
the pre-#19 representation. In particular, the output is now a sequence of:
```
struct RawGreenNode
    head::SyntaxHead                  # Kind,flags
    byte_span::UInt32                 # Number of bytes covered by this range
    # If NON_TERMINAL_FLAG is set, this is the total number of child nodes
    # Otherwise this is a terminal node (i.e. a token) and this is orig_kind
    node_span_or_orig_kind::UInt32
end
```
The structure is used for both terminals and non-terminals, with the interpretation
differing between them for the last field. This is marginally more compact than
the current token list representation on current `main`, because we do not store
the `next_byte` pointer (which would instead have to be recovered from the green
tree using the usual `O(log n)` algorithm).
However, because we store `node_span`, this data structure provides linear time
traversal (in reverse order) over the children of the current node. In particular,
this means that the tree structure is manifest and does not require the allocation
of temporary stacks to recover the tree structure. As a result, the output buffer
can now be used as an efficient, persistent, green tree representation.
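To illustrate the reverse traversal, here is a minimal sketch. It is not the code in this PR: `SketchNode` and `children_reversed` are hypothetical names, the flag is modelled as a plain `Bool` field rather than a bit in the head, and I'm assuming that `node_span` counts all descendants of a node in the buffer (not only its direct children).

```julia
# Hypothetical, simplified stand-in for RawGreenNode (illustration only).
struct SketchNode
    kind::Symbol
    is_nonterminal::Bool
    byte_span::UInt32
    node_span::UInt32   # assumed: number of descendant nodes; 0 for terminals
end

# Post-order buffer for `a+b*c`, i.e. (call a + (call b * c)).
# Children always precede their parent.
buf = [
    SketchNode(:Identifier, false, 1, 0),  # 1: a
    SketchNode(:+,          false, 1, 0),  # 2: +
    SketchNode(:Identifier, false, 1, 0),  # 3: b
    SketchNode(:*,          false, 1, 0),  # 4: *
    SketchNode(:Identifier, false, 1, 0),  # 5: c
    SketchNode(:call,       true,  3, 3),  # 6: b*c
    SketchNode(:call,       true,  5, 6),  # 7: a+b*c
]

# Buffer indices of the direct children of buf[i], last child first.
function children_reversed(buf::Vector{SketchNode}, i::Int)
    out = Int[]
    node = buf[i]
    if !node.is_nonterminal
        return out
    end
    lo = i - Int(node.node_span)   # index of the first descendant
    j = i - 1                      # the last child sits right before the parent
    while j >= lo
        push!(out, j)
        # Jump over child j and its entire subtree to reach the previous sibling.
        j -= 1 + (buf[j].is_nonterminal ? Int(buf[j].node_span) : 0)
    end
    return out
end

root_children = children_reversed(buf, 7)
inner_children = children_reversed(buf, 6)
```

No stack is needed: each step only reads the span stored in the child it has just visited.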
I think the primary weird thing about this design is that the iteration over the
children must happen in reverse order. The current GreenNode design has constant
time access to all children. Of course, a lookup table for this can be computed
in linear time with less memory than the `GreenNode` design, but it's
important to point out this limitation. That said, for transformation use cases
(e.g. to Expr or Syntax node), constant time access to the children is not really
required (although the children are being produced backwards, which looks a little
funny). To avoid any disruption to downstream users, however, the `GreenNode`
design itself is not changed to use this faster alternative. We can consider
doing so in a later PR.
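As a rough sketch of such a lookup table (again not code from this PR), the reverse walk can be run once per node and its result flipped, which gives constant-time access to the k-th child and its byte offset. `SketchNode` and `child_table` are hypothetical names, and I'm again assuming that `node_span` counts all descendants.

```julia
# Hypothetical, simplified stand-in for RawGreenNode (illustration only).
struct SketchNode
    kind::Symbol
    is_nonterminal::Bool
    byte_span::UInt32
    node_span::UInt32   # assumed: number of descendant nodes; 0 for terminals
end

# Returns (indices, offsets) for the direct children of buf[i] in source order:
# their buffer indices and each child's byte offset from the start of buf[i].
function child_table(buf::Vector{SketchNode}, i::Int)
    indices = Int[]
    if buf[i].is_nonterminal
        lo = i - Int(buf[i].node_span)
        j = i - 1
        while j >= lo                      # reverse walk, one step per child
            push!(indices, j)
            j -= 1 + (buf[j].is_nonterminal ? Int(buf[j].node_span) : 0)
        end
        reverse!(indices)                  # flip into source order
    end
    offsets = Vector{UInt32}(undef, length(indices))
    acc = UInt32(0)
    for (k, idx) in enumerate(indices)     # prefix sum of the children's byte spans
        offsets[k] = acc
        acc += buf[idx].byte_span
    end
    return (indices, offsets)
end

# Post-order buffer for `a+b*c`, i.e. (call a + (call b * c)).
buf = [
    SketchNode(:Identifier, false, 1, 0),  # 1: a
    SketchNode(:+,          false, 1, 0),  # 2: +
    SketchNode(:Identifier, false, 1, 0),  # 3: b
    SketchNode(:*,          false, 1, 0),  # 4: *
    SketchNode(:Identifier, false, 1, 0),  # 5: c
    SketchNode(:call,       true,  3, 3),  # 6: b*c
    SketchNode(:call,       true,  5, 6),  # 7: a+b*c
]

indices, offsets = child_table(buf, 7)
```

The table costs two small vectors per visited node and can be thrown away afterwards, while the buffer itself stays untouched.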
## Benchmark
The motivation for this change is not performance, but rather representational cleanliness.
That said, it's of course imperative that this not degrade performance.
Fortunately, the benchmarks show that this is in fact marginally faster for `Expr`
construction, largely because we get to avoid the additional memory allocation traffic
from having the tree structure explicitly represented. Parse time itself is essentially
unchanged (which is unsurprising, since we're primarily changing what's being put into
the output - although the parser does a few lookback-style operations in a few places).
Co-authored-by: Em Chu <[email protected]>