
Commit e87e8a4

README tweaks, add some links
1 parent 5b1c479 commit e87e8a4

File tree

1 file changed: +47 −26 lines


README.md

Lines changed: 47 additions & 26 deletions
@@ -16,13 +16,13 @@ A Julia frontend, written in Julia.
 
 ### Design Opinions
 
-* Parser implementation should be independent from tree data structures so
+* Parser implementation should be independent from tree data structures. So
   we have the `ParseStream` interface.
 * Tree data structures should be *layered* to balance losslessness with
   abstraction and generality. So we have `SyntaxNode` (an AST) layered on top
   of `GreenNode` (a lossless parse tree). We might need other tree types later.
-* Fancy parser generators are marginal for production compilers. We use a
-  boring but flexible recursive descent parser.
+* Fancy parser generators still seem marginal for production compilers. We use
+  a boring but flexible recursive descent parser.
 
 # Examples
 
@@ -118,8 +118,9 @@ We use a version of [Tokenize.jl](https://github.com/JuliaLang/Tokenize.jl)
 which has been modified to better match the needs of parsing:
 * Newline-containing whitespace is emitted as a separate kind
 * Tokens inside string interpolations are emitted separately from the string
-* Strings delimiters are separate tokens and the `String` kind
-* Additional contextural keywords (`as`, `var`, `doc`) have been added and
+* String delimiters are separate tokens and the actual string always has the
+  `String` kind
+* Additional contextual keywords (`as`, `var`, `doc`) have been added and
   moved to a subcategory of keywords.
 * Nonterminal kinds were added (though these should probably be factored out again)
 * Various bugs fixed and additions for newer Julia versions
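The token-kind distinctions in the hunk above (newline-containing whitespace as its own kind, contextual keywords as a subcategory) can be sketched with an enum. A minimal illustration; the `Kind` values and `is_contextual_keyword` helper are hypothetical names, not the tokenizer's real kinds.

```julia
# Illustrative token kinds: NewlineWs is distinct from plain Whitespace, and
# the contextual keywords sit in a contiguous subcategory of the enum.
@enum Kind begin
    Whitespace
    NewlineWs      # whitespace containing a newline, emitted separately
    Identifier
    StringLit
    KwAs           # contextual keywords: keywords only in certain positions
    KwVar
    KwDoc
end

# Enum values are ordered, so a subcategory is just a range check.
is_contextual_keyword(k::Kind) = KwAs <= k <= KwDoc

is_contextual_keyword(KwVar)        # true
is_contextual_keyword(Whitespace)   # false
```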
@@ -143,9 +144,10 @@ Parsing proceeds by recursive descent;
   examine tokens and `bump()` to consume them.
 * The parser produces a flat list of text spans as *output* using `bump()` to
   transfer tokens to the output and `position()`/`emit()` for nonterminal ranges.
-* Diagnostics are emitted as separate text span
-* Whitespace and comments are automatically `bump()`ed, with the exception of
-  syntactically relevant newlines in space sensitive mode.
+* Diagnostics are emitted as separate text spans
+* Whitespace and comments are automatically `bump()`ed and don't need to be
+  handled explicitly. The exception is syntactically relevant newlines in
+  space-sensitive mode.
 * Parser modes are passed down the call tree using `ParseState`.
 
 The output spans track the byte range, a syntax "kind" stored as an integer
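The flat span output in the hunk above works roughly like this: tokens are bumped straight to the output list, and a nonterminal is emitted afterward covering a remembered start position. A hedged sketch only; `Span`, `Output`, `bump!`, `markpos`, and `emit!` are illustrative stand-ins for the real `bump()`/`position()`/`emit()` machinery.

```julia
# Illustrative flat-span output: terminals and nonterminals all land in one
# vector of (kind, byte range) records, in emission order.
struct Span
    kind::Symbol
    first_byte::Int
    last_byte::Int
end

mutable struct Output
    spans::Vector{Span}
    next_byte::Int
end

# Transfer a token of the given width to the output (like bump()).
function bump!(out::Output, kind::Symbol, width::Int)
    push!(out.spans, Span(kind, out.next_byte, out.next_byte + width - 1))
    out.next_byte += width
end

# Remember the start of a rule (like position()).
markpos(out::Output) = out.next_byte

# Emit a nonterminal covering everything since `start` (like emit()).
emit!(out::Output, kind::Symbol, start::Int) =
    push!(out.spans, Span(kind, start, out.next_byte - 1))

# Parsing `a+b`: bump the three tokens, then emit the enclosing call range.
out = Output(Span[], 1)
start = markpos(out)
bump!(out, :Identifier, 1)   # `a`
bump!(out, :Plus, 1)         # `+`
bump!(out, :Identifier, 1)   # `b`
emit!(out, :call, start)
out.spans[end]               # Span(:call, 1, 3)
```

Emitting parents after their children is what lets the output stay a flat, append-only list rather than a tree built up front.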
@@ -172,7 +174,7 @@ define `build_tree` for the AST type `SyntaxNode` and for normal Julia `Expr`.
 
 ### Error recovery
 
-The goal of the parser is to produce well-formed heirarchical structure from
+The goal of the parser is to produce well-formed hierarchical structure from
 the source text. For interactive tools we need this to work even when the
 source text contains errors; it's the job of the parser to include the recovery
 heuristics to make this work.
@@ -278,7 +280,7 @@ name of compatibility, perhaps with a warning.)
   broken-looking AST like `(macrocall (. A (quote (. B @x))))`. It should
   probably be rejected.
 * Operator prefix call syntax doesn't work in cases like `+(a;b,c)` where
-  parameters are separated by commas. A tuple is produced instead.
+  keyword parameters are separated by commas. A tuple is produced instead.
 * `const` and `global` allow chained assignment, but the right hand side is not
   constant: `a` is const here but not `b`.
   ```
@@ -292,7 +294,7 @@ name of compatibility, perhaps with a warning.)
 * In try-catch-finally, the `finally` clause is allowed before the `catch`, but
   always executes afterward. (Presumably this was a mistake? It seems pretty awful!)
 * When parsing `"[x \n\n ]"` the flisp parser gets confused, but `"[x \n ]"` is
-  correctly parsed as `Expr(:vect)`
+  correctly parsed as `Expr(:vect)` (maybe fixed in 1.7?)
 * `f(x for x in in xs)` is accepted, and parsed very strangely.
 * Octal escape sequences saturate rather than being reported as errors. Eg,
   `"\777"` results in `"\xff"`. This is inconsistent with
@@ -388,13 +390,13 @@ seems to be to flatten the generators:
 
 ### Other oddities
 
-* Operators with sufficies don't seem to always be parsed consistently as the
+* Operators with suffixes don't always seem to be parsed consistently as the
   same operator without a suffix. Unclear whether this is by design or mistake.
   For example, `[x +y] ==> (hcat x (+ y))`, but `[x +₁y] ==> (hcat (call +₁ x y))`
 
 * `global const x=1` is normalized by the parser into `(const (global (= x 1)))`.
-  I suppose this is somewhat useful for AST consumers, but it seems a bit weird
-  and unnecessary.
+  I suppose this is somewhat useful for AST consumers, but reversing the source
+  order is pretty weird and inconvenient when moving to a lossless parser.
 
 * `let` bindings might be stored in a block, or they might not be, depending on
   special cases:
@@ -413,21 +415,39 @@ seems to be to flatten the generators:
   Presumably because of the need to add a line number node in the flisp parser
   `if a xx elseif b yy end ==> (if a (block xx) (elseif (block b) (block yy)))`
 
-* Spaces are alloweed between import dots — `import . .A` is allowed, and
+* Spaces are allowed between import dots — `import . .A` is allowed, and
   parsed the same as `import ..A`
 
 * `import A..` produces `(import (. A .))` which is arguably nonsensical, as `.`
   can't be a normal identifier.
 
-* When lexing raw strings, more than two backslashes are treated strangely at
-  the end of the string: `raw"\\\\ "` contains four backslashes, whereas
-  `raw"\\\\"` contains only two.
+* The raw string escaping rules are *super* confusing for backslashes near vs.
+  at the end of the string: `raw"\\\\ "` contains four backslashes, whereas
+  `raw"\\\\"` contains only two. It's unclear whether anything can be done
+  about this, however.
 
 * In braces after macrocall, `@S{a b}` is invalid but both `@S{a,b}` and
   `@S {a b}` parse. Conversely, `@S[a b]` parses.
 
 # Resources
 
+## Julia issues
+
+Here are a few links to relevant Julia issues. No doubt there are many more.
+
+#### Macro expansion
+
+* Automatic hygiene for macros (https://github.com/JuliaLang/julia/pull/6910):
+  it would be interesting to implement this in a new frontend.
+
+#### Lowering
+
+* A partial implementation of lowering in Julia
+  (https://github.com/JuliaLang/julia/pull/32201): some of this should be ported.
+* The closure capture problem (https://github.com/JuliaLang/julia/issues/15276):
+  it would be interesting to see whether we can tackle some of the harder cases
+  in a new implementation.
+
 ## C# Roslyn
 
 [Persistence, façades and Roslyn’s red-green trees](https://ericlippert.com/2012/06/08/red-green-trees/)
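The raw string behavior called out in the hunk above can be checked directly in a Julia session. The rule is that inside `raw"..."` backslashes are literal, except that a run of backslashes immediately before a double quote is halved (so the delimiter itself remains escapable) — which is exactly why the two nearly identical literals below differ:

```julia
# Backslashes away from the closing quote are literal; a run of backslashes
# touching the closing quote collapses 2n backslashes to n.
length(raw"\\\\ ")   # 5: four literal backslashes plus a space
length(raw"\\\\")    # 2: the run precedes the closing quote, so 4 become 2
```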
@@ -437,7 +457,7 @@ seems to be to flatten the generators:
 
 ## Rust-analyzer
 
-`rust-analyzer` seems to be very close to what I'm buildin here, and has come
+`rust-analyzer` seems to be very close to what I'm building here, and has come
 to the same conclusions on green tree layout with explicit trivia nodes. Their
 document on internals
 [here](https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md)
@@ -591,7 +611,7 @@ The simplest idea possible is to have:
 * Children are in source order
 
 
-Call represents a challange for the AST vs Green tree in terms of node
+Call represents a challenge for the AST vs green tree in terms of node
 placement / iteration for infix operators vs normal prefix function calls.
 
 - The normal problem of `a + 1` vs `+(a, 1)`
@@ -602,7 +622,7 @@ example with something like the normal Julia AST's iteration order.
 
 ### Abstract syntax tree
 
-By pointing to green tree nodes, AST nodes become tracable back to the original
+By pointing to green tree nodes, AST nodes become traceable back to the original
 source.
 
 Unlike most languages, designing a new AST is tricky because the existing
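The green/AST layering in the hunk above — position-free green nodes with an AST layer that restores absolute offsets — can be sketched in a few lines. This is an illustrative model under assumed names (`GreenNode`, `SyntaxNode`, `source_range`), not the package's actual types:

```julia
# A green node is lossless but position-free: it stores only a kind, a byte
# width, and children. That makes subtrees shareable and cheap to rebuild.
struct GreenNode
    kind::Symbol
    span::Int                  # width in bytes; no absolute position
    children::Vector{GreenNode}
end

# An AST node points at a green node plus an absolute offset, making it
# traceable back to the original source bytes.
struct SyntaxNode
    green::GreenNode
    byte_start::Int
end

# Absolute byte range recovered from the relative span.
source_range(n::SyntaxNode) = n.byte_start : n.byte_start + n.green.span - 1

src = "a + b"
leaf(k) = GreenNode(k, 1, GreenNode[])
call = GreenNode(:call, 5, [leaf(:Identifier), leaf(:Whitespace), leaf(:Plus),
                            leaf(:Whitespace), leaf(:Identifier)])
node = SyntaxNode(call, 1)
src[source_range(node)]        # "a + b"
```

Note the green node keeps the whitespace children explicitly — that is what makes the lower layer lossless while the AST layer above it can ignore trivia.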
@@ -632,7 +652,7 @@ SourceString <: AbstractString
 ```
 
 Having source location attached to symbols would potentially solve most of the
-hygine problem. There's still the problem of macro helper functions which use
+hygiene problem. There's still the problem of macro helper functions which use
 symbol literals; we can't very well be changing the meaning of `:x`! Perhaps
 the trick there is to try capturing the current module at the location of the
 interpolation syntax. Eg, if you do `:(y + $x)`, lowering expands this to
@@ -695,7 +715,7 @@ function g()
 end
 ```
 
-It seems like ideal error recorvery would need to backtrack in this case. For
+It seems like ideal error recovery would need to backtrack in this case. For
 example:
 
 - Pop back to the frame which was parsing `f()`
@@ -741,10 +761,11 @@ f(a,
 # Fun research questions
 
 * Given source and syntax tree, can we regress/learn a generative model of
-  indentiation from the syntax tree? Source formatting involves a big pile of
+  indentation from the syntax tree? Source formatting involves a big pile of
   heuristics to get something which "looks nice"... and ML systems have become
-  very good at heuristics. Also, we've got huge piles of traininig data — just
+  very good at heuristics. Also, we've got huge piles of training data — just
   choose some high quality, tastefully hand-formatted libraries.
 * Similarly, can we learn fast and reasonably accurate recovery heuristics for
-  when the parser encounters broken syntax rather than hand-coding these?
+  when the parser encounters broken syntax rather than hand-coding these? How
+  do we set the parser up so that training works and inference is nonintrusive?
