Commit d10b4a4

Add notes about different tree types
1 parent 838b4a3 commit d10b4a4

2 files changed

+120
-44
lines changed

README.md

Lines changed: 118 additions & 42 deletions
@@ -208,8 +208,124 @@ We want to encode both these cases in a way which is simplest for downstream
208208
tools to use. This is an open question, but for now we use `K"error"` as the
209209
kind, with the `TRIVIA_FLAG` set for unexpected syntax.
210210

211+
# Syntax trees
212+
213+
Julia's `Expr` abstract syntax tree can't store precise source locations or
214+
deal with syntax trivia like whitespace or comments. So we need some new tree
215+
types in `JuliaSyntax`.
216+
217+
JuliaSyntax currently deals in three types of trees:
218+
* `GreenNode` is a minimal *lossless syntax tree* where
219+
- Nodes store a kind and length in bytes, but no text
220+
- Syntax trivia are included in the list of children
221+
- Children are strictly in source order
222+
* `SyntaxNode` is an *abstract syntax tree* which has
223+
- An absolute position and pointer to the source text
224+
- Children strictly in source order
225+
- Leaf nodes store values, not text
226+
- Trivia are ignored, but there is a 1:1 mapping of non-trivia nodes to the
227+
    associated `GreenNode` nodes.
228+
* `Expr` is used as a conversion target for compatibility
229+
230+
Wherever possible, the tree structure of `GreenNode`/`SyntaxNode` is 1:1 with
231+
`Expr`. There are, however, some exceptions.
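As a rough illustration of the green tree idea described above, here is a minimal sketch in which nodes store only a kind and a byte length, and absolute positions are recovered by walking the children in source order. `ToyGreenNode`, `leaf`, `inner` and `leaf_ranges` are hypothetical names invented for this sketch; they are not the JuliaSyntax types or API.

```julia
# Toy stand-in for illustration only; this is NOT JuliaSyntax's GreenNode.
struct ToyGreenNode
    kind::Symbol                    # syntax kind, e.g. :Identifier or :Whitespace
    span::Int                       # length in bytes; no text, no absolute position
    trivia::Bool                    # true for whitespace, comments, etc.
    children::Vector{ToyGreenNode}  # strictly in source order, trivia included
end

leaf(kind, span; trivia=false) = ToyGreenNode(kind, span, trivia, ToyGreenNode[])

# An interior node's span is the sum of its children's spans.
function inner(kind, children::Vector{ToyGreenNode})
    ToyGreenNode(kind, sum(c.span for c in children), false, children)
end

# Recover the absolute byte range of every leaf from the spans alone.
function leaf_ranges(node::ToyGreenNode, first_byte::Int=1,
                     out=Tuple{Symbol,UnitRange{Int},Bool}[])
    if isempty(node.children)
        push!(out, (node.kind, first_byte:first_byte + node.span - 1, node.trivia))
    else
        pos = first_byte
        for child in node.children
            leaf_ranges(child, pos, out)
            pos += child.span
        end
    end
    return out
end

# Green tree for the source text "x = 1"
tree = inner(Symbol("="), [leaf(:Identifier, 1),
                           leaf(:Whitespace, 1; trivia=true),
                           leaf(Symbol("="), 1),
                           leaf(:Whitespace, 1; trivia=true),
                           leaf(:Integer, 1)])

ranges = leaf_ranges(tree)
```

Dropping the entries whose trivia flag is set leaves the non-trivia leaves with their absolute positions, which is roughly the information a `SyntaxNode`-style tree keeps.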
232+
233+
## Tree differences between GreenNode and Expr
234+
235+
First, `GreenNode` inherently stores source position, so there's no need for
236+
the `LineNumberNode`s used by `Expr`. There's also a small number of other
237+
differences:
211238

212-
### More about syntax kinds
239+
### Flattened generators
240+
241+
Flattened generators are uniquely problematic because the Julia AST doesn't
242+
respect a key rule we normally expect: that the children of an AST node are a
243+
*contiguous* range in the source text. This is because the `for`s in
244+
`[xy for x in xs for y in ys]` are parsed in the normal order of a for loop to
245+
mean
246+
247+
```
for x in xs
    for y in ys
        push!(collection, xy)
    end
end
```
252+
253+
so the `xy` prefix is in the *body* of the innermost for loop. Following this,
254+
the standard Julia AST is like so:
255+
256+
```
(flatten
  (generator
    (generator
      xy
      (= y ys))
    (= x xs)))
```
264+
265+
however, note that if this tree were flattened, the order would be
266+
`(xy) (y in ys) (x in xs)` and the `x` and `y` iterations are *opposite* of the
267+
source order.
268+
269+
Our green tree is strictly source-ordered, so we must deviate from the
270+
Julia AST. The natural representation seems to be to remove the generators and
271+
use a flattened structure:
272+
273+
```
(flatten
  xy
  (= x xs)
  (= y ys))
```
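The reordering can be checked against Julia's own `Expr` output. The sketch below parses the example with `Meta.parse` and unnests the generators into source order. `source_ordered` is a hypothetical helper written for this illustration, not part of JuliaSyntax, and it assumes the comprehension body is not itself a generator.

```julia
ex = Meta.parse("[xy for x in xs for y in ys]")
flat = ex.args[1]        # the (flatten ...) node inside the comprehension
outer_gen = flat.args[1] # generator carrying the `x in xs` iteration
inner_gen = outer_gen.args[1] # generator carrying `xy` and the `y in ys` iteration

# Hypothetical helper: unnest (generator (generator body iy) ix) into
# [body, ix, iy], i.e. the body followed by the iterations in source order.
function source_ordered(gen::Expr)
    iters = Any[]
    body = gen
    while body isa Expr && body.head == :generator
        append!(iters, body.args[2:end])  # the outermost `for` is visited first
        body = body.args[1]
    end
    pushfirst!(iters, body)
    return iters
end

children = source_ordered(outer_gen)
```

Here `children` holds `xy`, `x = xs` and `y = ys` in that order, which matches the children of the flattened green tree node.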
279+
280+
### Whitespace trivia inside strings
281+
282+
For triple quoted strings, the indentation isn't part of the string data so
283+
should also be excluded from the string content within the green tree. That is,
284+
it should be treated as separate whitespace trivia tokens. With this separation
285+
things like formatting should be much easier. The same reasoning goes for
286+
escaping newlines and following whitespace with backslashes in normal strings.
287+
288+
Detecting string trivia during parsing means that string content is split over
289+
several tokens. Here we wrap these in the `K"string"` kind (as is already used
290+
for interpolations). The individual chunks can then be reassembled during `Expr`
291+
construction. (A possible alternative might be to reuse the `K"String"` and
`K"CmdString"` kinds for groups of string chunks (without interpolation).)
293+
294+
Take as an example the following Julia fragment.
295+
296+
```julia
x = """
    $a
    b"""
```
301+
302+
Here this is parsed as `(= x (string-s a "\n" "b"))` (the `-s` flag in
303+
`string-s` means "triple quoted string").
304+
305+
Looking at the green tree, we see that the indentation before `$a` and `b` is
306+
marked as trivia:
307+
308+
```
julia> text = "x = \"\"\"\n    \$a\n    b\"\"\""
       show(stdout, MIME"text/plain"(), parseall(GreenNode, text, rule=:statement), text)
     1:23     │[=]
     1:1      │  Identifier         ✔   "x"
     2:2      │  Whitespace             " "
     3:3      │  =                      "="
     4:4      │  Whitespace             " "
     5:23     │  [string]
     5:7      │    """                  "\"\"\""
     8:8      │    String               "\n"
     9:12     │    Whitespace           "    "
    13:13     │    $                    "\$"
    14:14     │    Identifier       ✔   "a"
    15:15     │    String           ✔   "\n"
    16:19     │    Whitespace           "    "
    20:20     │    String           ✔   "b"
    21:23     │    """                  "\"\"\""
```
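To make the reassembly step concrete, here is a minimal sketch of how the non-trivia chunks under the `[string]` node might be merged back into `Expr` arguments. `Chunk` and `string_expr_args` are hypothetical names for this illustration only, the kind symbols are made up, and the sketch assumes the `✔` column in the dump above marks non-trivia nodes.

```julia
# Hypothetical token record for illustration; not a JuliaSyntax type.
struct Chunk
    kind::Symbol
    text::String
    trivia::Bool
end

# Children of the [string] node from the dump above, in source order.
chunks = [
    Chunk(:TripleQuote, "\"\"\"", true),
    Chunk(:String,      "\n",     true),   # newline right after the opening quotes
    Chunk(:Whitespace,  "    ",   true),   # indentation
    Chunk(:Dollar,      "\$",     true),
    Chunk(:Identifier,  "a",      false),
    Chunk(:String,      "\n",     false),
    Chunk(:Whitespace,  "    ",   true),   # indentation
    Chunk(:String,      "b",      false),
    Chunk(:TripleQuote, "\"\"\"", true),
]

# Skip trivia and join adjacent literal chunks into a single string.
function string_expr_args(chunks)
    args = Any[]
    for c in chunks
        c.trivia && continue
        if c.kind == :String
            if !isempty(args) && args[end] isa String
                args[end] *= c.text
            else
                push!(args, c.text)
            end
        else
            push!(args, Symbol(c.text))   # interpolated identifier
        end
    end
    return args
end

ex = Expr(:string, string_expr_args(chunks)...)
```

In this sketch the two literal chunks `"\n"` and `"b"` end up as a single `"\nb"` argument following the interpolated `a`.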
327+
328+
## More about syntax kinds
213329

214330
We generally track the type of syntax nodes with a syntax "kind", stored
215331
explicitly in each node as an integer tag. This effectively makes the node type a
@@ -239,6 +355,7 @@ There's arguably a few downsides:
239355
processes one specific kind but for generic code processing many kinds
240356
having a generic but *concrete* data layout should be faster.
241357

358+
242359
# Differences from the flisp parser
243360

244361
Practically the flisp parser is not quite a classic [recursive descent
@@ -360,47 +477,6 @@ parsing `key=val` pairs inside parentheses.
360477
`kw` for keywords.
361478

362479

363-
### Flattened generators
364-
365-
Flattened generators are uniquely problematic because the Julia AST doesn't
366-
respect a key rule we normally expect: that the children of an AST node are a
367-
*contiguous* range in the source text. This is because the `for`s in
368-
`[xy for x in xs for y in ys]` are parsed in the normal order of a for loop to
369-
mean
370-
371-
```
372-
for x in xs
373-
for y in ys
374-
push!(collection, xy)
375-
```
376-
377-
so the `xy` prefix is in the *body* of the innermost for loop. Following this,
378-
the standard Julia AST is like so:
379-
380-
```
381-
(flatten
382-
(generator
383-
(generator
384-
xy
385-
(= y ys))
386-
(= x xs)))
387-
```
388-
389-
however, note that if this tree were flattened, the order would be
390-
`(xy) (y in ys) (x in xs)` and the `x` and `y` iterations are *opposite* of the
391-
source order.
392-
393-
However, our green tree is strictly source-ordered, so we must deviate from the
394-
Julia AST. The natural representation seems to be to remove the generators and
395-
use a flattened structure:
396-
397-
```
398-
(flatten
399-
xy
400-
(= x xs)
401-
(= y ys))
402-
```
403-
404480
### Other oddities
405481

406482
* Operators with suffixes don't seem to always be parsed consistently as the

src/parse_stream.jl

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -599,7 +599,7 @@ end
599599

600600
"""
601601
build_tree(::Type{NodeType}, stream::ParseStream;
602-
wrap_toplevel_as_kind=nothing)
602+
wrap_toplevel_as_kind=nothing, kws...)
603603
604604
Construct a tree with `NodeType` nodes from a ParseStream using depth-first
605605
traversal. `NodeType` must have the constructors
@@ -616,7 +616,7 @@ a bottom-up tree builder interface similar to rust-analyzer. (In that case we'd
616616
traverse the list of ranges backward rather than forward.)
617617
"""
618618
function build_tree(::Type{NodeType}, stream::ParseStream;
619-
wrap_toplevel_as_kind=nothing) where NodeType
619+
wrap_toplevel_as_kind=nothing, kws...) where NodeType
620620
stack = Vector{NamedTuple{(:range,:node),Tuple{TaggedRange,NodeType}}}()
621621
for (span_index, range) in enumerate(stream.ranges)
622622
if kind(range) == K"TOMBSTONE"
