@@ -16,13 +16,13 @@ A Julia frontend, written in Julia.
 
 ### Design Opinions
 
-* Parser implementation should be independent from tree data structures so
+* Parser implementation should be independent from tree data structures. So
   we have the `ParseStream` interface.
 * Tree data structures should be *layered* to balance losslessness with
   abstraction and generality. So we have `SyntaxNode` (an AST) layered on top
   of `GreenNode` (a lossless parse tree). We might need other tree types later.
-* Fancy parser generators are marginal for production compilers. We use a
-  boring but flexible recursive descent parser.
+* Fancy parser generators still seem marginal for production compilers. We use
+  a boring but flexible recursive descent parser.
 
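The layering described in these bullets can be sketched roughly as follows. This is a hypothetical, much-simplified model for illustration; `MiniGreenNode` and `MiniSyntaxNode` are invented names, not the actual `GreenNode`/`SyntaxNode` definitions.

```julia
# Green tree layer: lossless, stores only a kind and a byte width (no
# absolute positions), so subtrees are position-independent.
struct MiniGreenNode
    kind::Symbol                     # e.g. :call, :Identifier, :Whitespace
    span::Int                        # width in bytes
    children::Vector{MiniGreenNode}
end

# AST layer: points back into the green tree, adding an absolute source
# position so nodes remain traceable to the original text.
struct MiniSyntaxNode
    green::MiniGreenNode
    position::Int                    # absolute byte offset in the source
end

# Green nodes for the source text "x + x" (trivia included, losslessly)
ws   = MiniGreenNode(:Whitespace, 1, MiniGreenNode[])
x    = MiniGreenNode(:Identifier, 1, MiniGreenNode[])
plus = MiniGreenNode(:Operator, 1, MiniGreenNode[])
call = MiniGreenNode(:call, 5, [x, ws, plus, ws, x])

root = MiniSyntaxNode(call, 1)
```

Note how the green layer stores only relative widths; absolute offsets belong to the AST layer, which is what makes green subtrees shareable.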
 # Examples
 
@@ -118,8 +118,9 @@ We use a version of [Tokenize.jl](https://github.com/JuliaLang/Tokenize.jl)
 which has been modified to better match the needs of parsing:
 * Newline-containing whitespace is emitted as a separate kind
 * Tokens inside string interpolations are emitted separately from the string
-* Strings delimiters are separate tokens and the `String` kind
-* Additional contextural keywords (`as`, `var`, `doc`) have been added and
+* String delimiters are separate tokens and the actual string always has the
+  `String` kind
+* Additional contextual keywords (`as`, `var`, `doc`) have been added and
   moved to a subcategory of keywords.
 * Nonterminal kinds were added (though these should probably be factored out again)
 * Various bugs fixed and additions for newer Julia versions
@@ -143,9 +144,10 @@ Parsing proceeds by recursive descent;
 examine tokens and `bump()` to consume them.
 * The parser produces a flat list of text spans as *output* using `bump()` to
   transfer tokens to the output and `position()`/`emit()` for nonterminal ranges.
-* Diagnostics are emitted as separate text span
-* Whitespace and comments are automatically `bump()`ed, with the exception of
-  syntactically relevant newlines in space sensitive mode.
+* Diagnostics are emitted as separate text spans
+* Whitespace and comments are automatically `bump()`ed and don't need to be
+  handled explicitly. The exception is syntactically relevant newlines in space
+  sensitive mode.
 * Parser modes are passed down the call tree using `ParseState`.
 
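The `bump()`/`position()`/`emit()` output protocol might be sketched like this. This is a hypothetical toy model of the flat span list, not the real `ParseStream` API; `MiniSpan`, `MiniOutput`, `bump!`, `mark_output`, and `emit!` are all invented names.

```julia
struct MiniSpan
    kind::Symbol
    firstbyte::Int
    lastbyte::Int
end

mutable struct MiniOutput
    tokens::Vector{MiniSpan}   # lexer output, in source order
    next::Int                  # index of the next unconsumed token
    spans::Vector{MiniSpan}    # parser output: a flat list of spans
end

# bump(): transfer the next token to the output list
function bump!(o::MiniOutput)
    push!(o.spans, o.tokens[o.next])
    o.next += 1
end

# position(): remember where a nonterminal will start in the output
mark_output(o::MiniOutput) = length(o.spans) + 1

# emit(): append a nonterminal span covering everything bumped since the mark
function emit!(o::MiniOutput, mark::Int, kind::Symbol)
    push!(o.spans, MiniSpan(kind, o.spans[mark].firstbyte, o.spans[end].lastbyte))
end

# "Parse" the tokens of `a + b` into a flat list ending with a :call span
o = MiniOutput([MiniSpan(:Identifier, 1, 1),
                MiniSpan(:Operator, 3, 3),
                MiniSpan(:Identifier, 5, 5)], 1, MiniSpan[])
m = mark_output(o)
bump!(o); bump!(o); bump!(o)
emit!(o, m, :call)
```

The point of the flat representation is that trees are built only afterward, by `build_tree`, from this list of kinds and byte ranges.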
 The output spans track the byte range, a syntax "kind" stored as an integer
@@ -172,7 +174,7 @@ define `build_tree` for the AST type `SyntaxNode` and for normal Julia `Expr`.
 
 ### Error recovery
 
-The goal of the parser is to produce well-formed heirarchical structure from
+The goal of the parser is to produce well-formed hierarchical structure from
 the source text. For interactive tools we need this to work even when the
 source text contains errors; it's the job of the parser to include the recovery
 heuristics to make this work.
@@ -278,7 +280,7 @@ name of compatibility, perhaps with a warning.)
   broken-looking AST like `(macrocall (. A (quote (. B @x))))`. It should
   probably be rejected.
 * Operator prefix call syntax doesn't work in cases like `+(a;b,c)` where
-  parameters are separated by commas. A tuple is produced instead.
+  keyword parameters are separated by commas. A tuple is produced instead.
 * `const` and `global` allow chained assignment, but the right hand side is not
   constant. `a` is const here but not `b`.
 ```
@@ -292,7 +294,7 @@ name of compatibility, perhaps with a warning.)
 * In try-catch-finally, the `finally` clause is allowed before the `catch`, but
   always executes afterward. (Presumably this was a mistake? It seems pretty awful!)
 * When parsing `"[x \n\n ]"` the flisp parser gets confused, but `"[x \n ]"` is
-  correctly parsed as `Expr(:vect)`
+  correctly parsed as `Expr(:vect)` (maybe fixed in 1.7?)
 * `f(x for x in in xs)` is accepted, and parsed very strangely.
 * Octal escape sequences saturate rather than being reported as errors. Eg,
   `"\777"` results in `"\xff"`. This is inconsistent with
@@ -388,13 +390,13 @@ seems to be to flatten the generators:
 
 ### Other oddities
 
-* Operators with sufficies don't seem to always be parsed consistently as the
+* Operators with suffixes don't seem to always be parsed consistently as the
   same operator without a suffix. Unclear whether this is by design or mistake.
   For example, `[x +y] ==> (hcat x (+ y))`, but `[x +₁y] ==> (hcat (call +₁ x y))`
 
 * `global const x=1` is normalized by the parser into `(const (global (= x 1)))`.
-  I suppose this is somewhat useful for AST consumers, but it seems a bit weird
-  and unnecessary.
+  I suppose this is somewhat useful for AST consumers, but reversing the source
+  order is pretty weird and inconvenient when moving to a lossless parser.
 
 * `let` bindings might be stored in a block, or they might not be, depending on
   special cases:
@@ -413,21 +415,39 @@ seems to be to flatten the generators:
   Presumably because of the need to add a line number node in the flisp parser
   `if a xx elseif b yy end ==> (if a (block xx) (elseif (block b) (block yy)))`
 
-* Spaces are alloweed between import dots — `import . .A` is allowed, and
+* Spaces are allowed between import dots — `import . .A` is allowed, and
   parsed the same as `import ..A`
 
 * `import A..` produces `(import (. A .))` which is arguably nonsensical, as `.`
   can't be a normal identifier.
 
-* When lexing raw strings, more than two backslashes are treated strangely at
-  the end of the string: `raw"\\\\ "` contains four backslashes, whereas
-  `raw"\\\\"` contains only two.
+* The raw string escaping rules are *super* confusing for backslashes near vs
+  at the end of the string: `raw"\\\\ "` contains four backslashes, whereas
+  `raw"\\\\"` contains only two. It's unclear whether anything can be done
+  about this, however.
 
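The counts quoted in this bullet follow from Julia's raw string rule and can be checked directly:

```julia
# In raw strings, a run of backslashes immediately before the closing
# delimiter is halved (2n backslashes produce n); backslashes anywhere
# else are literal.
@assert length(raw"\\\\ ") == 5  # four literal backslashes plus a space
@assert length(raw"\\\\") == 2   # the run precedes the delimiter: halved
```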
 * In braces after macrocall, `@S{a b}` is invalid but both `@S{a,b}` and
   `@S {a b}` parse. Conversely, `@S[a b]` parses.
 
 # Resources
 
+## Julia issues
+
+Here are a few links to relevant Julia issues. No doubt there are many more.
+
+#### Macro expansion
+
+* Automatic hygiene for macros https://github.com/JuliaLang/julia/pull/6910 —
+  would be interesting to implement this in a new frontend.
+
+#### Lowering
+
+* A partial implementation of lowering in Julia https://github.com/JuliaLang/julia/pull/32201 —
+  some of this should be ported.
+* The closure capture problem https://github.com/JuliaLang/julia/issues/15276 —
+  would be interesting to see whether we can tackle some of the harder cases in
+  a new implementation.
+
 ## C# Roslyn
 
 [Persistence, façades and Roslyn’s red-green trees](https://ericlippert.com/2012/06/08/red-green-trees/)
@@ -437,7 +457,7 @@ seems to be to flatten the generators:
 
 ## Rust-analyzer
 
-`rust-analyzer` seems to be very close to what I'm buildin here, and has come
+`rust-analyzer` seems to be very close to what I'm building here, and has come
 to the same conclusions on green tree layout with explicit trivia nodes. Their
 document on internals
 [here](https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md)
@@ -591,7 +611,7 @@ The simplest idea possible is to have:
 * Children are in source order
 
 
-Call represents a challange for the AST vs Green tree in terms of node
+Call represents a challenge for the AST vs Green tree in terms of node
 placement / iteration for infix operators vs normal prefix function calls.
 
 - The normal problem of `a + 1` vs `+(a, 1)`
@@ -602,7 +622,7 @@ example with something like the normal Julia AST's iteration order.
 
 ### Abstract syntax tree
 
-By pointing to green tree nodes, AST nodes become tracable back to the original
+By pointing to green tree nodes, AST nodes become traceable back to the original
 source.
 
 Unlike most languages, designing a new AST is tricky because the existing
@@ -632,7 +652,7 @@ SourceString <: AbstractString
 ```
 
 Having source location attached to symbols would potentially solve most of the
-hygine problem. There's still the problem of macro helper functions which use
+hygiene problem. There's still the problem of macro helper functions which use
 symbol literals; we can't very well be changing the meaning of `:x`! Perhaps
 the trick there is to try capturing the current module at the location of the
 interpolation syntax. Eg, if you do `:(y + $x)`, lowering expands this to
@@ -695,7 +715,7 @@ function g()
 end
 ```
 
-It seems like ideal error recorvery would need to backtrack in this case. For
+It seems like ideal error recovery would need to backtrack in this case. For
 example:
 
 - Pop back to the frame which was parsing `f()`
@@ -741,10 +761,11 @@ f(a,
 # Fun research questions
 
 * Given source and syntax tree, can we regress/learn a generative model of
-  indentiation from the syntax tree? Source formatting involves a big pile of
+  indentation from the syntax tree? Source formatting involves a big pile of
   heuristics to get something which "looks nice"... and ML systems have become
-  very good at heuristics. Also, we've got huge piles of traininig data — just
+  very good at heuristics. Also, we've got huge piles of training data — just
   choose some high quality, tastefully hand-formatted libraries.
 
 * Similarly, can we learn fast and reasonably accurate recovery heuristics for
-  when the parser encounters broken syntax rather than hand-coding these?
+  when the parser encounters broken syntax rather than hand-coding these? How
+  do we set the parser up so that training works and inference is nonintrusive?