@@ -208,8 +208,124 @@ We want to encode both these cases in a way which is simplest for downstream
tools to use. This is an open question, but for now we use `K"error"` as the
kind, with the `TRIVIA_FLAG` set for unexpected syntax.

+ # Syntax trees
+
+ Julia's `Expr` abstract syntax tree can't store precise source locations or
+ deal with syntax trivia like whitespace or comments. So we need some new tree
+ types in `JuliaSyntax`.
+
+ JuliaSyntax currently deals in three types of trees:
+ * `GreenNode` is a minimal *lossless syntax tree* where
+    - Nodes store a kind and length in bytes, but no text
+    - Syntax trivia are included in the list of children
+    - Children are strictly in source order
+ * `SyntaxNode` is an *abstract syntax tree* which has
+    - An absolute position and pointer to the source text
+    - Children strictly in source order
+    - Leaf nodes store values, not text
+    - Trivia are ignored, but there is a 1:1 mapping of non-trivia nodes to the
+      associated `GreenNode`s.
+ * `Expr` is used as a conversion target for compatibility
+
+ Wherever possible, the tree structure of `GreenNode`/`SyntaxNode` is 1:1 with
+ `Expr`. There are, however, some exceptions.
+
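As a rough illustration of the `GreenNode` properties listed above (a kind and a byte length, no text, children in source order), here is a minimal sketch. `ToyGreenNode` and `leaf_ranges` are hypothetical names invented for this example and are not the JuliaSyntax types or API; the sketch also assumes ASCII source text so that byte ranges are valid string indices.

```julia
# Hypothetical toy type, NOT JuliaSyntax.GreenNode: a kind, a length in bytes,
# and children kept strictly in source order. No text is stored in the tree.
struct ToyGreenNode
    kind::Symbol
    span::Int
    children::Vector{ToyGreenNode}
end

# Leaf node: only a kind and a byte length.
ToyGreenNode(kind::Symbol, span::Int) = ToyGreenNode(kind, span, ToyGreenNode[])

# Interior node: its span is the sum of its children's spans.
ToyGreenNode(kind::Symbol, children::Vector{ToyGreenNode}) =
    ToyGreenNode(kind, sum(c.span for c in children), children)

# Recover absolute byte ranges (and the text they cover) for every leaf by
# accumulating spans while walking the children in order.
function leaf_ranges(node::ToyGreenNode, text::String, start::Int=1,
                     out=Tuple{Symbol,UnitRange{Int},String}[])
    if isempty(node.children)
        r = start:(start + node.span - 1)
        push!(out, (node.kind, r, text[r]))
    else
        pos = start
        for c in node.children
            leaf_ranges(c, text, pos, out)
            pos += c.span
        end
    end
    return out
end

text = "x = 1"
tree = ToyGreenNode(Symbol("="), [
    ToyGreenNode(:Identifier, 1),
    ToyGreenNode(:Whitespace, 1),   # trivia stays in the child list
    ToyGreenNode(Symbol("="), 1),
    ToyGreenNode(:Whitespace, 1),
    ToyGreenNode(:Integer, 1),
])
leaves = leaf_ranges(tree, text)
foreach(println, leaves)
```

Because trivia are kept and children are source-ordered, concatenating the leaf texts gives back the original source, which is the sense in which such a tree is lossless.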
+ ## Tree differences between GreenNode and Expr
+
+ First, `GreenNode` inherently stores source position, so there's no need for
+ the `LineNumberNode`s used by `Expr`. There are also a small number of other
+ differences:

- ### More about syntax kinds
+ ### Flattened generators
+
+ Flattened generators are uniquely problematic because the Julia AST doesn't
+ respect a key rule we normally expect: that the children of an AST node are a
+ *contiguous* range in the source text. This is because the `for`s in
+ `[xy for x in xs for y in ys]` are parsed in the normal order of a for loop to
+ mean
+
+ ```
+ for x in xs
+ for y in ys
+   push!(collection, xy)
+ ```
+
+ so the `xy` prefix is in the *body* of the innermost for loop. Following this,
+ the standard Julia AST is like so:
+
+ ```
+ (flatten
+   (generator
+     (generator
+       xy
+       (= y ys))
+     (= x xs)))
+ ```
+
+ Note, however, that if this tree were flattened, the order would be
+ `(xy) (y in ys) (x in xs)` and the `x` and `y` iterations are *opposite* of the
+ source order.
+
+ Our green tree, by contrast, is strictly source-ordered, so we must deviate from
+ the Julia AST. The natural representation seems to be to remove the generators
+ and use a flattened structure:
+
+ ```
+ (flatten
+   xy
+   (= x xs)
+   (= y ys))
+ ```
+
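The nested `Expr` shape described above can be inspected with Base Julia alone. This is a small sketch using `Meta.parse`; the variable names `flat`, `outer` and `inner` are chosen here purely for illustration.

```julia
# Parse the comprehension from the text above into a standard Julia `Expr`.
ex = Meta.parse("[xy for x in xs for y in ys]")

flat  = ex.args[1]      # the :flatten node inside the :comprehension
outer = flat.args[1]    # generator carrying the `x in xs` iteration
inner = outer.args[1]   # generator carrying `xy` and the `y in ys` iteration

println(ex.head, " > ", flat.head, " > ", outer.head, " > ", inner.head)
println("inner generator args: ", inner.args)
println("outer iteration:      ", outer.args[2])
```

The `y` iteration sits inside the inner generator next to `xy`, while the `x` iteration hangs off the outer generator, which is why a naive flattening of the `Expr` visits them in the opposite of source order.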
+ ### Whitespace trivia inside strings
+
+ For triple quoted strings, the indentation isn't part of the string data so
+ should also be excluded from the string content within the green tree. That is,
+ it should be treated as separate whitespace trivia tokens. With this separation
+ things like formatting should be much easier. The same reasoning goes for
+ escaping newlines and following whitespace with backslashes in normal strings.
+
+ Detecting string trivia during parsing means that string content is split over
+ several tokens. Here we wrap these in the K"string" kind (as is already used
+ for interpolations). The individual chunks can then be reassembled during Expr
+ construction. (A possible alternative might be to reuse the K"String" and
+ K"CmdString" kinds for groups of string chunks (without interpolation).)
+
+ Take as an example the following Julia fragment.
+
+ ```julia
+ x = """
+     $a
+     b"""
+ ```
+
+ Here this is parsed as `(= x (string-s a "\n" "b"))` (the `-s` flag in
+ `string-s` means "triple quoted string").
+
+ Looking at the green tree, we see the indentation before the `$a` and `b` is
+ marked as trivia:
+
+ ```
+ julia> text = "x = \"\"\"\n    \$a\n    b\"\"\""
+        show(stdout, MIME"text/plain"(), parseall(GreenNode, text, rule=:statement), text)
+      1:23     │[=]
+      1:1      │  Identifier         ✔   "x"
+      2:2      │  Whitespace             " "
+      3:3      │  =                      "="
+      4:4      │  Whitespace             " "
+      5:23     │  [string]
+      5:7      │    """                  "\"\"\""
+      8:8      │    String               "\n"
+      9:12     │    Whitespace           "    "
+     13:13     │    $                    "\$"
+     14:14     │    Identifier       ✔   "a"
+     15:15     │    String           ✔   "\n"
+     16:19     │    Whitespace           "    "
+     20:20     │    String           ✔   "b"
+     21:23     │    """                  "\"\"\""
+ ```
+
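For comparison with the green tree above, the `Expr` produced for the same fragment can be inspected with Base Julia's `Meta.parse`. This is only a sketch of the `Expr` side and does not call any JuliaSyntax function; `src` and `str` are names chosen for this example.

```julia
# The same 23-byte source text as in the green tree listing above.
src = "x = \"\"\"\n    \$a\n    b\"\"\""
println(ncodeunits(src), " bytes")

ex  = Meta.parse(src)   # the assignment `x = ...`
str = ex.args[2]        # right-hand side: a :string expression (interpolation)

# Show the interpolated variable and the string chunks; the indentation that
# the green tree marks as Whitespace trivia is not expected to appear here.
dump(str)
```

The `Expr` has no way to say which bytes of the source were indentation, which is the information the separate `Whitespace` trivia tokens preserve in the green tree.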
+ ## More about syntax kinds

We generally track the type of syntax nodes with a syntax "kind", stored
explicitly in each node as an integer tag. This effectively makes the node type a
@@ -239,6 +355,7 @@ There's arguably a few downsides:
processes one specific kind but for generic code processing many kinds
having a generic but *concrete* data layout should be faster.

+
# Differences from the flisp parser

Practically the flisp parser is not quite a classic [recursive descent
@@ -360,47 +477,6 @@ parsing `key=val` pairs inside parentheses.
`kw` for keywords.


- ### Flattened generators
-
- Flattened generators are uniquely problematic because the Julia AST doesn't
- respect a key rule we normally expect: that the children of an AST node are a
- *contiguous* range in the source text. This is because the `for`s in
- `[xy for x in xs for y in ys]` are parsed in the normal order of a for loop to
- mean
-
- ```
- for x in xs
- for y in ys
-   push!(collection, xy)
- ```
-
- so the `xy` prefix is in the *body* of the innermost for loop. Following this,
- the standard Julia AST is like so:
-
- ```
- (flatten
-   (generator
-     (generator
-       xy
-       (= y ys))
-     (= x xs)))
- ```
-
- Note, however, that if this tree were flattened, the order would be
- `(xy) (y in ys) (x in xs)` and the `x` and `y` iterations are *opposite* of the
- source order.
-
- Our green tree, by contrast, is strictly source-ordered, so we must deviate from
- the Julia AST. The natural representation seems to be to remove the generators
- and use a flattened structure:
-
- ```
- (flatten
-   xy
-   (= x xs)
-   (= y ys))
- ```
-
### Other oddities

* Operators with suffixes don't seem to always be parsed consistently as the