Commit 6465cce

Rework JuliaSyntax.parse() public API
`parse()` and `parseall()` were generally pretty inconvenient to use. This change reworks what I had called `parseall()` to be more similar to `Meta.parse()` and adds `parseall()` and `parseatom()` in analogy to the `Base.Meta` versions of these functions. The lower level function `parse!()` is provided to work with `ParseStream` for cases where more control is required.
1 parent e5c7603 commit 6465cce

8 files changed: +258 -253 lines changed

README.md

Lines changed: 37 additions & 38 deletions
@@ -42,23 +42,24 @@ A talk from JuliaCon 2022 covered some aspects of this package.
4242
# Examples
4343

4444
Here's what parsing of a small piece of code currently looks like in various
45-
forms. We'll use the `parseall` convenience function to demonstrate, but
46-
there's also a more flexible parsing interface with `JuliaSyntax.parse()`.
45+
forms. We'll use the `JuliaSyntax.parse` function to demonstrate; there's also
46+
`JuliaSyntax.parse!`, which offers more fine-grained control.
4747

4848
First, a source-ordered AST with `SyntaxNode` (`call-i` in the dump here means
4949
the `call` has the infix `-i` flag):
5050

5151
```julia
52-
julia> parseall(SyntaxNode, "(x + y)*z", filename="foo.jl")
52+
julia> using JuliaSyntax: JuliaSyntax, SyntaxNode, GreenNode
53+
54+
julia> JuliaSyntax.parse(SyntaxNode, "(x + y)*z", filename="foo.jl")
5355
line:col│ byte_range │ tree │ file_name
54-
1:1 │ 1:9 │[toplevel] │foo.jl
55-
1:1 │ 1:9 │  [call-i]
56-
1:2 │ 2:6 │    [call-i]
57-
1:2 │ 2:2 │      x
58-
1:4 │ 4:4 │      +
59-
1:6 │ 6:6 │      y
60-
1:8 │ 8:8 │    *
61-
1:9 │ 9:9 │    z
56+
1:1 │ 1:9 │[call-i] │foo.jl
57+
1:2 │ 2:6 │  [call-i]
58+
1:2 │ 2:2 │    x
59+
1:4 │ 4:4 │    +
60+
1:6 │ 6:6 │    y
61+
1:8 │ 8:8 │  *
62+
1:9 │ 9:9 │  z
6263
```
6364

6465
Internally this has a full representation of all syntax trivia (whitespace and
@@ -69,45 +70,43 @@ despite being important for parsing.
6970

7071
```julia
7172
julia> text = "(x + y)*z"
72-
greentree = parseall(GreenNode, text)
73-
1:9 │[toplevel]
74-
1:9 │  [call]
75-
1:1 │    (
76-
2:6 │    [call]
77-
2:2 │      Identifier ✔
78-
3:3 │      Whitespace
79-
4:4 │      +
80-
5:5 │      Whitespace
81-
6:6 │      Identifier ✔
82-
7:7 │    )
83-
8:8 │    *
84-
9:9 │    Identifier ✔
73+
greentree = JuliaSyntax.parse(GreenNode, text)
74+
1:9 │[call]
75+
1:1 │  (
76+
2:6 │  [call]
77+
2:2 │    Identifier ✔
78+
3:3 │    Whitespace
79+
4:4 │    +
80+
5:5 │    Whitespace
81+
6:6 │    Identifier ✔
82+
7:7 │  )
83+
8:8 │  *
84+
9:9 │  Identifier ✔
8585
```
8686

8787
`GreenNode` stores only byte ranges, but the token strings can be shown by
8888
supplying the source text string:
8989

9090
```julia
9191
julia> show(stdout, MIME"text/plain"(), greentree, text)
92-
1:9 │[toplevel]
93-
1:9 │  [call]
94-
1:1 │    ( "("
95-
2:6 │    [call]
96-
2:2 │      Identifier ✔ "x"
97-
3:3 │      Whitespace " "
98-
4:4 │      + "+"
99-
5:5 │      Whitespace " "
100-
6:6 │      Identifier ✔ "y"
101-
7:7 │    ) ")"
102-
8:8 │    * "*"
103-
9:9 │    Identifier ✔ "z"
92+
1:9 │[call]
93+
1:1 │  ( "("
94+
2:6 │  [call]
95+
2:2 │    Identifier ✔ "x"
96+
3:3 │    Whitespace " "
97+
4:4 │    + "+"
98+
5:5 │    Whitespace " "
99+
6:6 │    Identifier ✔ "y"
100+
7:7 │  ) ")"
101+
8:8 │  * "*"
102+
9:9 │  Identifier ✔ "z"
104103
```
105104

106105
Julia `Expr` can also be produced:
107106

108107
```julia
109-
julia> parseall(Expr, "(x + y)*z")
110-
:($(Expr(:toplevel, :((x + y) * z))))
108+
julia> JuliaSyntax.parse(Expr, "(x + y)*z")
109+
:((x + y) * z)
111110
```
112111

113112
# Using JuliaSyntax as the default parser

src/hooks.jl

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ function _core_parser_hook(code, filename, lineno, offset, options)
157157
return Core.svec(nothing, last_byte(stream))
158158
end
159159
end
160-
parse(stream; rule=rule)
160+
parse!(stream; rule=rule)
161161
if rule === :statement
162162
bump_trivia(stream)
163163
end
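
The hook above drives the parser through a `ParseStream` and the renamed `parse!`. Below is a minimal sketch of the same low-level flow from user code, assuming the JuliaSyntax version from this commit. The names `ParseStream`, `parse!` and `build_tree` are qualified with `JuliaSyntax.` because the sketch does not assume they are exported, and the input string is made up for illustration.

```julia
using JuliaSyntax

# Wrap the source text; parsing starts at byte index 1 by default.
stream = JuliaSyntax.ParseStream("a + b")

# Parse a single statement. `parse!` mutates `stream` and returns it.
JuliaSyntax.parse!(stream; rule=:statement)

# Convert the output recorded in the stream into the tree type of choice.
tree = JuliaSyntax.build_tree(Expr, stream)

# Diagnostics stay on the stream, so the caller decides how to report them.
had_problems = !isempty(stream.diagnostics)
```

Unlike the convenience functions, this path does not throw a `ParseError` on its own; checking `stream.diagnostics` is left to the caller.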

src/parse_stream.jl

Lines changed: 22 additions & 4 deletions
@@ -155,10 +155,28 @@ const NO_POSITION = ParseStreamPosition(0, 0)
155155

156156
#-------------------------------------------------------------------------------
157157
"""
158-
ParseStream provides an IO interface for the parser. It
159-
- Wraps the lexer with a lookahead buffer
160-
- Removes insignificant whitespace and comment tokens, shifting them into the
161-
output implicitly (newlines may be significant depending on `skip_newlines`)
158+
ParseStream(text::AbstractString, index::Integer=1; version=VERSION)
159+
ParseStream(text::IO; version=VERSION)
160+
ParseStream(text::Vector{UInt8}, index::Integer=1; version=VERSION)
161+
ParseStream(ptr::Ptr{UInt8}, len::Integer, index::Integer=1; version=VERSION)
162+
163+
Construct a `ParseStream` from input which may come in various forms:
164+
* A string (zero copy for `String` and `SubString`)
165+
* An `IO` object (zero copy for `IOBuffer`). The `IO` object must be seekable.
166+
* A buffer of bytes (zero copy). The caller is responsible for preserving
167+
buffers passed as `(ptr,len)`.
168+
169+
A byte `index` may be provided as the position to start parsing.
170+
171+
ParseStream provides an IO interface for the parser which provides lexing of
172+
the source text input into tokens, manages insignificant whitespace tokens on
173+
behalf of the parser, and stores output tokens and tree nodes in a pair of
174+
output arrays.
175+
176+
`version` (default `VERSION`) may be used to set the syntax version to
177+
any Julia version `>= v"1.0"`. We aim to parse all Julia syntax which has been
178+
added after v"1.0", emitting an error if it's not compatible with the requested
179+
`version`.
162180
"""
163181
mutable struct ParseStream
164182
# `textbuf` is a buffer of UTF-8 encoded text of the source code. This is a

src/parser_api.jl

Lines changed: 75 additions & 113 deletions
@@ -3,43 +3,6 @@
33
# This is defined separately from parser.jl so that:
44
# * parser.jl doesn't need to refer to any tree data structures
55
# * It's clear which parts are the public API
6-
#
7-
# What should the general parsing API look like? Some points to consider:
8-
#
9-
# * After parsing atoms or statements or most other internal rules, it's
10-
# usual to start in the middle of the input text and end somewhere else in
11-
# the middle of the input text. So we should take an index for the start of
12-
# parsing and supply an index back to the caller after parsing.
13-
#
14-
# * `parseall` is a special case where we expect to consume all the input.
15-
# Perhaps this is the API which throws an error if we don't consume it all,
16-
# and doesn't accept an index as input?
17-
#
18-
# * The ParseStream is the fundamental interface which wraps the code string
19-
# and index up together for input and contains the output events, diagnostics
20-
# and current stream position after parsing. The user should potentially be
21-
# able to use this directly. It does, however assume a Julia-compatible token
22-
# stream.
23-
#
24-
# * It could be useful to support an IO-based interface so that users can parse
25-
# Julia code intermixed with other DSLs. Documenter.jl and string macros come
26-
# to mind as examples which could use this. A tricky part is deciding where
27-
# the input ends: For string macros this is done by the parser, but for
28-
# Documenter it's probably just done beforehand according to the Markdown
29-
# code block rules.
30-
#
31-
# * The API should have an interface where a simple string is passed in. How
32-
# does SourceFile relate to this?
33-
#
34-
# * It's neat for `parse` to be overloadable to produce various output data
35-
# structures; GreenNode, SyntaxNode, Expr, (etc?) in the same way that
36-
# Base.parse can be used for non-Julia code. (Heh... though
37-
# `Base.parse(Expr, "...")` would also make a certain amount of sense.)
38-
#
39-
# * What's the no-copy API look like? A String can be put into an IOBuffer via
40-
# unsafe_wrap(Vector{UInt8}, str) ... A SubString likewise. Also there's the
41-
# `codeunits` function to hold a GC-safe view of string data as an array (but
42-
# we can't use a Vector{UInt8})
436

447
struct ParseError <: Exception
458
source::SourceFile
@@ -65,39 +28,19 @@ Base.display_error(io::IO, err::ParseError, bt) = Base.showerror(io, err, bt)
6528

6629

6730
"""
68-
# Input and output:
69-
stream = parse(stream::ParseStream; kws...)
70-
(tree, diagnostics) = parse(TreeType, io::IOBuffer; kws...)
71-
(tree, diagnostics, index) = parse(TreeType, str::AbstractString, [index::Integer]; kws...)
72-
# Keywords
73-
parse(...; rule=:toplevel, version=VERSION, ignore_trivia=true)
74-
75-
Parse Julia source code from `input`, returning the output in a format
76-
compatible with `input`:
77-
78-
* When `input` is a `ParseStream`, the stream itself is returned and the
79-
`ParseStream` interface can be used to process the output.
80-
* When `input` is a seekable `IO` subtype, the output is `(tree, diagnostics)`.
81-
The buffer `position` will be set to the next byte of input.
82-
* When `input` is an `AbstractString, Integer`, or `Vector{UInt8}, Integer` the
83-
output is `(tree, diagnostics, index)`, where `index` (default 1) is the next
84-
byte of input.
31+
parse!(stream::ParseStream; rule=:toplevel)
32+
33+
Parse Julia source code from a [`ParseStream`](@ref) object. Output tree data
34+
structures may be extracted from `stream` with the [`build_tree`](@ref) function.
8535
8636
`rule` may be any of
87-
* `toplevel` (default) — parse a whole "file" of top level statements. In this
37+
* `:toplevel` (default) — parse a whole "file" of top level statements. In this
8838
mode, the parser expects to fully consume the input.
89-
* `statement` — parse a single statement, or statements separated by semicolons.
90-
* `atom` — parse a single syntax "atom": a literal, identifier, or
39+
* `:statement` — parse a single statement, or statements separated by semicolons.
40+
* `:atom` — parse a single syntax "atom": a literal, identifier, or
9141
parenthesized expression.
92-
93-
`version` (default `VERSION`) may be used to set the syntax version to
94-
any Julia version `>= v"1.0"`. We aim to parse all Julia syntax which has been
95-
added after v"1.0", emitting an error if it's not compatible with the requested
96-
`version`.
97-
98-
See also [`parseall`](@ref) for a simpler but less powerful interface.
9942
"""
100-
function parse(stream::ParseStream; rule::Symbol=:toplevel)
43+
function parse!(stream::ParseStream; rule::Symbol=:toplevel)
10144
ps = ParseState(stream)
10245
if rule === :toplevel
10346
parse_toplevel(ps)
@@ -111,56 +54,37 @@ function parse(stream::ParseStream; rule::Symbol=:toplevel)
11154
stream
11255
end
11356

114-
function parse(::Type{T}, io::IO;
115-
rule::Symbol=:toplevel, version=VERSION, kws...) where {T}
57+
"""
58+
parse!(TreeType, io::IO; rule=:toplevel, version=VERSION)
59+
60+
Parse Julia source code from a seekable `IO` object. The output is a tuple
61+
`(tree, diagnostics)`. When `parse!` returns, the stream `io` is positioned
62+
directly after the last byte which was consumed during parsing.
63+
"""
64+
function parse!(::Type{TreeType}, io::IO;
65+
rule::Symbol=:toplevel, version=VERSION, kws...) where {TreeType}
11666
stream = ParseStream(io; version=version)
117-
parse(stream; rule=rule)
118-
tree = build_tree(T, stream; kws...)
67+
parse!(stream; rule=rule)
68+
tree = build_tree(TreeType, stream; kws...)
11969
seek(io, last_byte(stream))
12070
tree, stream.diagnostics
12171
end
12272

123-
# Generic version of parse for all other cases where an index must be passed
124-
# back - ie strings and buffers
125-
function parse(::Type{T}, input...;
126-
rule::Symbol=:toplevel, version=VERSION, kws...) where {T}
127-
stream = ParseStream(input...; version=version)
128-
parse(stream; rule=rule)
129-
tree = build_tree(T, stream; kws...)
130-
tree, stream.diagnostics, last_byte(stream) + 1
131-
end
132-
133-
134-
"""
135-
parseall(TreeType, input...;
136-
rule=:toplevel,
137-
version=VERSION,
138-
ignore_trivia=true)
139-
140-
Experimental convenience interface to parse `input` as Julia code, emitting an
141-
error if the entire input is not consumed. `input` can be a string or any other
142-
valid input to the `ParseStream` constructor. By default `parseall` will ignore
143-
whitespace and comments before and after valid code but you can turn this off
144-
by setting `ignore_trivia=false`.
145-
146-
A `ParseError` will be thrown if any errors occurred during parsing.
147-
148-
See [`parse`](@ref) for a more complete and powerful interface to the parser,
149-
as well as a description of the `version` and `rule` keywords.
150-
"""
151-
function parseall(::Type{T}, input...; rule=:toplevel, version=VERSION,
152-
ignore_trivia=true, filename=nothing) where {T}
153-
stream = ParseStream(input...; version=version)
73+
function _parse(rule::Symbol, need_eof::Bool, ::Type{T}, text, index=1; version=VERSION,
74+
ignore_trivia=true, filename=nothing, ignore_warnings=false) where {T}
75+
stream = ParseStream(text, index; version=version)
15476
if ignore_trivia && rule != :toplevel
15577
bump_trivia(stream, skip_newlines=true)
15678
empty!(stream)
15779
end
158-
parse(stream; rule=rule)
159-
if (ignore_trivia && peek(stream, skip_newlines=true) != K"EndMarker") ||
160-
(!ignore_trivia && (peek(stream, skip_newlines=false, skip_whitespace=false) != K"EndMarker"))
161-
emit_diagnostic(stream, error="unexpected text after parsing $rule")
80+
parse!(stream; rule=rule)
81+
if need_eof
82+
if (ignore_trivia && peek(stream, skip_newlines=true) != K"EndMarker") ||
83+
(!ignore_trivia && (peek(stream, skip_newlines=false, skip_whitespace=false) != K"EndMarker"))
84+
emit_diagnostic(stream, error="unexpected text after parsing $rule")
85+
end
16286
end
163-
if any_error(stream.diagnostics)
87+
if any_error(stream.diagnostics) || (!ignore_warnings && !isempty(stream.diagnostics))
16488
throw(ParseError(stream, filename=filename))
16589
end
16690
# TODO: Figure out a more satisfying solution to the wrap_toplevel_as_kind
@@ -169,13 +93,51 @@ function parseall(::Type{T}, input...; rule=:toplevel, version=VERSION,
16993
# not absolute positions.
17094
# * Dropping it would be ok for SyntaxNode and Expr...
17195
tree = build_tree(T, stream; wrap_toplevel_as_kind=K"toplevel", filename=filename)
172-
if !isempty(stream.diagnostics)
173-
# Crudely format any warnings to the current logger.
174-
buf = IOBuffer()
175-
show_diagnostics(IOContext(buf, stdout), stream,
176-
SourceFile(sourcetext(stream, steal_textbuf=true), filename=filename))
177-
@warn Text(String(take!(buf)))
178-
end
179-
tree
96+
tree, last_byte(stream) + 1
18097
end
18198

99+
"""
100+
parse(TreeType, text, [index];
101+
version=VERSION,
102+
ignore_trivia=true,
103+
filename=nothing,
104+
ignore_warnings=false)
105+
106+
# Or, with the same arguments
107+
parseall(...)
108+
parseatom(...)
109+
110+
Parse Julia source code string `text` into a data structure of type `TreeType`.
111+
`parse` parses a single Julia statement, `parseall` parses top level statements
112+
at file scope and `parseatom` parses a single Julia identifier or other "syntax
113+
atom".
114+
115+
If `text` is passed without `index`, all the input text must be consumed and a
116+
tree data structure is returned. When an integer byte `index` is passed, a
117+
tuple `(tree, next_index)` will be returned containing the next index in `text`
118+
to resume parsing. By default whitespace and comments before and after valid
119+
code are ignored but you can turn this off by setting `ignore_trivia=false`.
120+
121+
`version` (default `VERSION`) may be used to set the syntax version to
122+
any Julia version `>= v"1.0"`. We aim to parse all Julia syntax which has been
123+
added after v"1.0", emitting an error if it's not compatible with the requested
124+
`version`.
125+
126+
Pass `filename` to set any file name information embedded within the output
127+
tree, if applicable. This will also annotate errors and warnings with the
128+
source file name.
129+
130+
A `ParseError` will be thrown if any errors or warnings occurred during
131+
parsing. To avoid exceptions due to warnings, use `ignore_warnings=true`.
132+
"""
133+
parse(::Type{T}, text::AbstractString; kws...) where {T} = _parse(:statement, true, T, text; kws...)[1]
134+
parseall(::Type{T}, text::AbstractString; kws...) where {T} = _parse(:toplevel, true, T, text; kws...)[1]
135+
parseatom(::Type{T}, text::AbstractString; kws...) where {T} = _parse(:atom, true, T, text; kws...)[1]
136+
137+
@eval @doc $(@doc parse) parseall
138+
@eval @doc $(@doc parse) parseatom
139+
140+
parse(::Type{T}, text::AbstractString, index::Integer; kws...) where {T} = _parse(:statement, false, T, text, index; kws...)
141+
parseall(::Type{T}, text::AbstractString, index::Integer; kws...) where {T} = _parse(:toplevel, false, T, text, index; kws...)
142+
parseatom(::Type{T}, text::AbstractString, index::Integer; kws...) where {T} = _parse(:atom, false, T, text, index; kws...)
143+
