Commit 36450b0

Keno and mlechu authored
parser_stream: Produce green tree traversal rather than token ranges (#560)
## Background

I've written about 5 parsers that use the general red tree/green tree pattern. Now that we're using JuliaSyntax in base, I'd like to replace some of them by a version based on JuliaSyntax, so that I can avoid having multiple copies of similar infrastructure. As a result, I'm taking a close look at some of the internals of JuliaSyntax.

## Current Design

One thing that I really like about JuliaSyntax is that the parser basically produces a flat output buffer (well, two in the current design, after #19). In essence, the output is a post-order depth-first traversal of the parse tree, with each node annotated with the range it covers. From there, it is possible to recover the parse tree without re-parsing by partitioning the token list according to the ranges of the non-terminals. One particular application of this is to rebuild a pointer-y green tree structure that stores relative byte ranges and serves the same incremental parsing purpose as green tree representations in other systems.

The single-output-buffer design is a great innovation over the pointer-y system. It's much easier to handle and it also enforces important invariants by construction (or at least makes them easy to check). However, I think the whole post-parse tree construction logic significantly reduces its value. In particular, green trees are supposed to be able to serve as compact, persistent representations of parse trees. However, here the compact, persistent representation (the output memory buffer) is not usable as a green tree. We do have the pointer-y `GreenNode` tree, but this has all the same downsides that the single buffer system was supposed to avoid. It uses explicit vectors in every node, and even constructing it from the parser output allocates a nontrivial amount of memory to recover the tree structure.
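A minimal sketch of the recovery step described above, under simplifying assumptions: a single buffer of post-order entries, each carrying an absolute byte range, with no zero-width nodes. The names `SketchSpan`, `SketchTree` and `rebuild` are hypothetical and are not part of JuliaSyntax; the point is only to show where the temporary stack and the per-node child vectors come from.

```julia
# Hypothetical, simplified types for illustration; not JuliaSyntax's own.
struct SketchSpan
    kind::Symbol
    first_byte::Int
    last_byte::Int
    is_nonterminal::Bool
end

struct SketchTree
    kind::Symbol
    children::Vector{SketchTree}   # explicit vector in every node
end

# Rebuild a pointer-y tree from post-order spans by partitioning on ranges.
function rebuild(spans)
    stack = Tuple{SketchSpan,SketchTree}[]   # temporary stack allocation
    for s in spans
        children = SketchTree[]
        if s.is_nonterminal
            # Every pending entry that starts inside this range is a child.
            while !isempty(stack) && stack[end][1].first_byte >= s.first_byte
                pushfirst!(children, pop!(stack)[2])
            end
        end
        push!(stack, (s, SketchTree(s.kind, children)))
    end
    return last(stack)[2]
end

tok(kind, a, b) = SketchSpan(kind, a, b, false)
rng(kind, a, b) = SketchSpan(kind, a, b, true)

# Post-order output for the 7-byte input `a + b*c`.
spans = [
    tok(:Identifier, 1, 1), tok(:Whitespace, 2, 2), tok(:+, 3, 3),
    tok(:Whitespace, 4, 4), tok(:Identifier, 5, 5), tok(:*, 6, 6),
    tok(:Identifier, 7, 7),
    rng(:call, 5, 7),   # b*c
    rng(:call, 1, 7),   # a + b*c
]
tree = rebuild(spans)
```

Both the `stack` and the `children` vectors here are allocations made purely to recover structure that the parser already knew.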
## Proposed design

This PR proposes to change the parser output to be directly usable as a green tree in situ, by changing the post-order DFS traversal to instead produce (byte, node) spans. Note that this is the same data as in the current `GreenNode`, except that there the node span is implicit in the length of the child vector, whereas here the children are implicit in their position in the output. This does essentially mean semantically reverting #19, but the representation proposed here is more compact than both main and the pre-#19 representation. In particular, the output is now a sequence of:

```
struct RawGreenNode
    head::SyntaxHead                  # Kind,flags
    byte_span::UInt32                 # Number of bytes covered by this range
    # If NON_TERMINAL_FLAG is set, this is the total number of child nodes
    # Otherwise this is a terminal node (i.e. a token) and this is orig_kind
    node_span_or_orig_kind::UInt32
end
```

The structure is used for both terminals and non-terminals, with the interpretation of the last field differing between them. This is marginally more compact than the current token list representation on current `main`, because we do not store the `next_byte` pointer (which would instead have to be recovered from the green tree using the usual `O(log n)` algorithm). However, because we store `node_span`, this data structure provides linear time traversal (in reverse order) over the children of the current node. In particular, this means that the tree structure is manifest and does not require the allocation of temporary stacks to recover the tree structure. As a result, the output buffer can now be used as an efficient, persistent green tree representation.

I think the primary weird thing about this design is that the iteration over the children must happen in reverse order. The current `GreenNode` design has constant time access to all children. Of course, a lookup table for this can be computed in linear time with less memory than the `GreenNode` design, but it's important to point out this limitation. That said, for transformation use cases (e.g. to `Expr` or `SyntaxNode`), constant time access to the children is not really required (although the children are being produced backwards, which looks a little funny). To avoid any disruption to downstream users, the `GreenNode` design itself is not changed to use this faster alternative. We can consider doing so in a later PR.

## Benchmark

The motivation for this change is not performance, but rather representational cleanliness. That said, it's of course imperative that this not degrade performance. Fortunately, the benchmarks show that this is in fact marginally faster for `Expr` construction, largely because we get to avoid the additional memory allocation traffic from having the tree structure explicitly represented. Parse time itself is essentially unchanged (which is unsurprising, since we're primarily changing what's being put into the output, although the parser does a few lookback-style operations in a few places).

Co-authored-by: Em Chu <[email protected]>
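The reverse-order child traversal in the proposed design can be sketched as follows. This is an illustration under stated assumptions, not the code in this commit: `SketchNode` and `children_reverse` are hypothetical names, a `Symbol` and a `Bool` stand in for `SyntaxHead` and `NON_TERMINAL_FLAG`, and the node span is assumed to count every node in the subtree below a non-terminal.

```julia
# Hypothetical stand-in for the RawGreenNode layout; not the JuliaSyntax type.
struct SketchNode
    kind::Symbol          # stands in for `head::SyntaxHead`
    is_nonterminal::Bool  # stands in for NON_TERMINAL_FLAG
    byte_span::UInt32     # number of bytes covered by this node
    node_span::UInt32     # non-terminals only: nodes in the subtree below (assumed)
end

leaf(kind, nbytes) = SketchNode(kind, false, UInt32(nbytes), UInt32(0))
inner(kind, nbytes, nnodes) = SketchNode(kind, true, UInt32(nbytes), UInt32(nnodes))

# Post-order buffer for the 7-byte input `a + b*c`.
buf = [
    leaf(:Identifier, 1),  # 1: a
    leaf(:Whitespace, 1),  # 2
    leaf(:+, 1),           # 3
    leaf(:Whitespace, 1),  # 4
    leaf(:Identifier, 1),  # 5: b
    leaf(:*, 1),           # 6
    leaf(:Identifier, 1),  # 7: c
    inner(:call, 3, 3),    # 8: b*c
    inner(:call, 7, 8),    # 9: a + b*c
]

# Indices of the direct children of `buf[i]`, last child first.
function children_reverse(buf, i)
    out = Int[]
    node = buf[i]
    node.is_nonterminal || return out
    stop = i - Int(node.node_span)   # first index belonging to this subtree
    j = i - 1
    while j >= stop
        push!(out, j)
        child = buf[j]
        # Skip the child's whole subtree to land on its previous sibling.
        j -= 1 + (child.is_nonterminal ? Int(child.node_span) : 0)
    end
    return out
end

root_children = children_reverse(buf, 9)   # indices 8, 4, 3, 2, 1
```

No stack is needed to find the children; the only allocation in this sketch is the result vector, which an iterator-style interface could avoid. Reversing `root_children` gives the forward-order lookup table mentioned above in time linear in the number of children.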
1 parent eceaa39 commit 36450b0

19 files changed: +941 -624 lines changed

Project.toml

Lines changed: 0 additions & 2 deletions
@@ -7,8 +7,6 @@ version = "1.0.2"
 Serialization = "1.0"
 julia = "1.0"
 
-[deps]
-
 [extras]
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
 Serialization = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

docs/src/design.md

Lines changed: 21 additions & 17 deletions
@@ -56,43 +56,47 @@ We use a hand-written lexer (a heavily modified version of
 The main parser innovation is the `ParseStream` interface which provides a
 stream-like I/O interface for writing the parser. The parser does not
 depend on or produce any concrete tree data structure as part of the parsing
-phase but the output spans can be post-processed into various tree data
+phase but the output nodes can be post-processed into various tree data
 structures as required. This is like the design of rust-analyzer though with a
 simpler implementation.
 
 Parsing proceeds by recursive descent;
 
 * The parser consumes a flat list of lexed tokens as *input* using `peek()` to
 examine tokens and `bump()` to consume them.
-* The parser produces a flat list of text spans as *output* using `bump()` to
-transfer tokens to the output and `position()`/`emit()` for nonterminal ranges.
+* The parser produces a flat list of `RawGreenNode`s as *output* using `bump()` to
+transfer tokens to the output and `position()`/`emit()` for nonterminal nodes.
 * Diagnostics are emitted as separate text spans
 * Whitespace and comments are automatically `bump()`ed and don't need to be
 handled explicitly. The exception is syntactically relevant newlines in space
 sensitive mode.
 * Parser modes are passed down the call tree using `ParseState`.
 
-The output spans track the byte range, a syntax "kind" stored as an integer
-tag, and some flags. The kind tag makes the spans a [sum
-type](https://blog.waleedkhan.name/union-vs-sum-types/) but where the type is
-tracked explicitly outside of Julia's type system.
+The output nodes track the byte range, a syntax "kind" stored as an integer
+tag, and some flags. Each node also stores either the number of child nodes
+(for non-terminals) or the original token kind (for terminals). The kind tag
+makes the nodes a [sum type](https://blog.waleedkhan.name/union-vs-sum-types/)
+but where the type is tracked explicitly outside of Julia's type system.
 
-For lossless parsing the output spans must cover the entire input text. Using
+For lossless parsing the output nodes must cover the entire input text. Using
 `bump()`, `position()` and `emit()` in a natural way also ensures that:
-* Spans are cleanly nested with children contained entirely within their parents
-* Siblings spans are emitted in source order
-* Parent spans are emitted after all their children.
+* Nodes are cleanly nested with children contained entirely within their parents
+* Sibling nodes are emitted in source order
+* Parent nodes are emitted after all their children.
 
-These properties make the output spans naturally isomorphic to a
+These properties make the output nodes a post-order traversal of a
 ["green tree"](#raw-syntax-tree--green-tree)
-in the terminology of C#'s Roslyn compiler.
+in the terminology of C#'s Roslyn compiler, with the tree structure
+implicit in the node spans.
 
 ### Tree construction
 
-The `build_tree` function performs a depth-first traversal of the `ParseStream`
-output spans allowing it to be assembled into a concrete tree data structure,
-for example using the `GreenNode` data type. We further build on top of this to
-define `build_tree` for the AST type `SyntaxNode` and for normal Julia `Expr`.
+The `build_tree` function uses the implicit tree structure in the `ParseStream`
+output to assemble concrete tree data structures. Since the output is already
+a post-order traversal of `RawGreenNode`s with node spans encoding parent-child
+relationships, tree construction is straightforward. We build on top of this to
+define `build_tree` for various tree types including `GreenNode`, the AST type
+`SyntaxNode`, and for normal Julia `Expr`.
 
 ### Error recovery
 

src/JuliaSyntax.jl

Lines changed: 3 additions & 2 deletions
@@ -73,7 +73,7 @@ export @K_str, kind
 
 export SyntaxNode
 
-@_public GreenNode,
+@_public GreenNode, RedTreeCursor, GreenTreeCursor,
 span
 
 # Helper utilities
@@ -95,7 +95,8 @@ include("parser_api.jl")
 include("literal_parsing.jl")
 
 # Tree data structures
-include("green_tree.jl")
+include("tree_cursors.jl")
+include("green_node.jl")
 include("syntax_tree.jl")
 include("expr.jl")
 
