Attempt to define the AST #93
Conversation
test/misc_ir.jl
Outdated
```diff
 LoweringError:
 ::T
-└─┘ ── `::` must be written `value::type` outside function argument lists
+└─┘ ── invalid syntax: unknown kind `::` or number of arguments (1)
```
Many of these messages got better / clearer, but several seem to be somewhat worse now, in the sense that they either expose the user to more syntax- / lowering-isms (e.g. "bracescat" / "keyword-separating" / "kind") or refer to less specific constructs (e.g. "this syntax" vs. "`module`").
Do you think we can improve that?
Yes, I should tweak some of the errors. Some of the worse ones, including `ccall` and friends, are because pre-desugaring (random AST tweaks performed at the same time as macro expansion) changes them from identifiers to a different kind, `K"core"`, which we would need to check for everywhere. I think we should move away from pre-desugaring instead.
"keyword-separating" is an attempt at clarity given the many valid uses of semicolons outside of calls.
Errors like this one (reachable via keyboard mash and parse) should probably avoid syntax-isms, though exposing more of them was intentional in some cases. A lot of the vaguer errors will only be seen by macro writers, and we point at source text anyway, so repeating the exact text in the error is less necessary than before.
I can bring back the old error in this case. It doesn't feel super helpful given that I don't know what the "user" intended, but it does seem better than the default.
src/syntax_graph.jl
Outdated
| ([K"=" [K"call" _...] _...], run=if_valid_get_args(st[1]), when=!isnothing(run)) -> | ||
| "deprecated call-equals form with args $run" |
I don't think we need `run`? We have normal semicolon-separated block syntax for a more general version of that, as long as we specify that any variables defined in the `when` clause are also available in the body:
([K"=" [K"call" _...] _...], when=(run=if_valid_get_args(st[1]); !isnothing(run)) ->
"deprecated call-equals form with args $run"There was a problem hiding this comment.
It isn't needed, but removing it doesn't really simplify the implementation (basically `when=(code; true)`), and it's convenient for Revise-REPL-printf debugging.
I guess my thought was that I'd prefer to read code looking like the following rather than needing to use the special variable name `run`:

```julia
([K"=" [K"call" _...] _...], when=(args=if_valid_get_args(st[1]); !isnothing(args))) ->
    "deprecated call-equals form with args $args"
```

Maybe I'm misunderstanding what it's for.
I'm good with removing the assignment part
src/validation.jl
Outdated
```julia
# one of:
# (as (importpath . . . x y z) ident)
# (importpath . . . x y z)
# where y, z may be quoted for no reason (do we need to allow this?)
```
Unfortunately, yes, due at least to the `..` operator (maybe there are no other cases):

```julia
import A.:(..)
```

In future syntax evolution, I'd like to argue that this should be deprecated in favor of

```julia
import A.var".."
```

Julia has other cases where "quoted for no reason" occurs and is allowed by lowering, but is semantically quite questionable and should be deprecated if possible. For example, defining methods for operators:

```julia
function Base.:(+)(x, y)
    body
end
```

Possibly we could actually fix this with the new parser. If not, we could specify that `var` should be used, as that's much more semantically reasonable than a random quote node where an identifier is expected.
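For reference, the "quoted for no reason" shape is easy to see at the `Expr` level in the REPL; this snippet is just an illustration, not code from the PR, and the exact printed structure may differ slightly between parser versions.

```julia
# An ordinary import path is plain identifiers all the way down:
dump(Meta.parse("import A.B"))

# With a quoted operator, the final path component comes through quoted,
# which is the case the validator currently has to tolerate:
dump(Meta.parse("import A.:(..)"))
```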
The pattern matching here is cool; I think it could even be useful to get it in as a separate PR, maybe with a couple of uses in desugaring to test out the benefits. (I assume it's a zero-cost abstraction? It certainly seems like it can be.) I also think it's excellent to have an executable specification of what's valid input AST. I love this.

However, I do have some design misgivings about the details: it feels rather like a repetition of large parts of the matching code within desugaring, even though in a nicely condensed form. Also it contains (IIUC?) random pieces of logic which are currently in other passes.

I like that you've put proper error reporting in here. That seems really important to make this practical for end users rather than macro authors, keeping in mind that non-macro-authors are the vast majority of Julia users. For error reporting I think we need the principle that "if the parser can produce a form, we should assume an end user will see the error", and go to the effort of trying to write good messages. With that in mind, there are a fair few cases here where I think the errors got worse because we now talk about the names of kinds.

In the current state, I'm feeling this is ok, but also I feel like it doesn't hit satisfying tradeoffs in terms of avoiding duplication with the rest of lowering. However, I have a few thoughts about future design/implementation directions which might get us from "eh, that's implicitly a lot of code duplication" to "wow, this was a great improvement":
Even better if we can combine both of these ideas.
You understand how lowering should work far better than I do, so my comments might not be super helpful. But with that disclaimer, I'd still like to share my thoughts. I think the basic idea of this PR is to separate validation from transformation and have the validation act as a readable specification. But the implementation still ends up being recursive, and from a general user's perspective, I think this validation (even if it looks much easier to read than desugaring, probably thanks to the beautiful pattern-matching system) would still be hard to use directly as a specification? In many cases, users will probably still rely on error messages.
So I agree with @c42f's point above. Also, if we think about implementing Claire's point (2), we'd likely need to make use of validation information during transformation. So, based on a rough implementation sketch I can think of right now, doing both in one place would probably result in a simpler implementation. But other points @mlechu brought up might need more thought.
Regarding this, my understanding is that doing validation first can help avoid unnecessary validation when recursive transformations happen in the transformation phase. From a performance perspective, though, I don't think the order of nodes explored will change, so there might not be a huge performance difference. Or maybe code generated by macros needs fundamentally different validation/transformation than regular source input? I do think we need to improve validation messages for invalid code generated from macro contexts. JL already keeps the macro expansion context, so I believe we can use that to dispatch appropriate messages even with a one-path validation implementation.
This point is very interesting. Does avoiding this problem mean we need to do extensive validation as a separate pass first? The simplest (most naive) solution I can think of is to add an attribute to nodes that have already been desugared and use that flag to decide if desugaring is still needed. However, we might need to think about how this changes the complexity compared to doing validation as a separate pass first.
Pattern-matching can be faster than the equivalent hand-written checks.
To be clear, my intention with this PR is to delete more than its weight in desugaring code. As satisfying as that would be now, we're actively trying to achieve parity with flisp, so I want to wait before doing that.
I may have failed to argue my point. I don't think validation and desugaring should happen in the same pass. Desugaring often needs to eat its own output, which doesn't need to follow the rules of its input and doesn't need to go through validation. Squeezing desugaring into the framework in validation.jl would also be much more painful than bringing the pattern matching to desugaring.

I don't think desugaring should be an AST specification; the aim is to have one AST specification that macro authors follow (or see errors about) and that desugaring gets to assume its inputs follow. This isn't the same as keeping two specifications synchronized. If desugaring handles a superset of the specification, we don't need to care; the validator has run. This way, we don't get the "how did this fall through lowering?" or "macro author found a clever trick: produce a lowering-internal form" situations we know and love from flisp. There would be code duplication with this approach, but the current state is the most duplicated it can be (all destructuring and error checking implemented twice). Things will only shrink from here, especially if we manage to simplify the AST and desugaring.
I don't think we should do this, and would at least like to wait before trying it out. The benefit of partial lowering doesn't seem worth the cost of every node now being valid or an error node, which sounds like "two cases" but actually means "two cases, per place in the SyntaxTree, per lowering pass." I fear it would be like
Agree; some thoughts are in this conversation. I still need to bring some validation.jl errors up to baseline, but the goal is for the new errors to be at least as good as the old ones.
I'm actually not arguing that validation and desugaring should happen in the same pass :-) Instead, what I'm hoping/dreaming of is using one canonical set of patterns to drive both desugaring and validation. For example, matching all the variants of function signatures is an annoying piece of code which both validation and desugaring need to do. It would be wonderful if they could both share the same matching code. The tricky thing about this vague dream is that matching comes in two parts (1) the pattern specification and (2) the actions to execute when patterns match. The patterns might be shared, but the actions very much are not. The big design unknown with this idea is whether there's a convenient way to factor the actions out separately. If not, the idea is doomed.
It currently does, though I feel this is not ideal. It might be feasible to stop doing this, and there are a few ideas we could try. But it's definitely a bunch of work. Some ideas:
At the outset I favor the second of these ideas, but it could be a lot of refactoring. Would need some prototyping to test how feasible it is.
Right, so addressing this concern is what I mean by "structurally valid" in the PR description.
I see, and agree this would be nice if a convenient factoring exists.
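To make the "shared patterns, separate actions" idea concrete, here is a tiny illustration. It uses JuliaSyntax's `kind`/`children`/`K"..."` API as a stand-in for JuliaLowering's `SyntaxTree`, and none of the function names below come from this PR; it only sketches how a matcher could own the destructuring while validation and desugaring supply different actions.

```julia
using JuliaSyntax: kind, children, @K_str

# Shared matcher: recognise an assignment whose left-hand side is a call
# (cf. the clause quoted earlier) and return its parts, or `nothing` on no match.
function match_call_equals(st)
    kind(st) == K"=" || return nothing
    cs = children(st)
    (cs !== nothing && length(cs) == 2 && kind(cs[1]) == K"call") || return nothing
    return (callex = cs[1], rhs = cs[2])
end

# Validation's action: only decide whether the shape is acceptable.
is_valid_call_equals(st) = match_call_equals(st) !== nothing

# Desugaring's action: consume the destructured parts (body left as a stub).
function desugar_call_equals(st)
    m = match_call_equals(st)
    m === nothing && return st
    # ... build the lowered form from m.callex and m.rhs ...
    return st
end
```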
This is the goal! I think knowing what our input looks like will help. Producing surface AST and re-desugaring is convenient, but we could probably take a more direct route through desugaring in some places. Knowing what desugaring's output looks like would also help, but doesn't feel critical given that it's only a problem for us, and not macro authors. If we do end up wanting a similar tree specification for intermediate trees, it would make sense to take the alternate simple-errors approach from above and only run it in testing.
If we receive a tree
If you mean to only allow
Desugaring is massive: in both flisp and JuliaLowering, it is larger in bytes
than all other lowering passes combined. Being the first pass after parsing and
macro-expansion, it's also the judge of whether or not an AST is structurally
valid, causing all sorts of issues:
1. It isn't readable, despite being the de-facto specification of the lowerable AST. Combined with any incompleteness or outdatedness in the AST developer documentation, macro authors are left to guess what inputs they'll see and what outputs they can produce.
2. Desugaring is responsible for both checking input and transforming it. This is because macros can produce anything, and parsing produces a superset of lowering-recognized ASTs anyway. The checking here is especially unwanted given that much of desugaring's input is output from other desugaring functions (related to (4) below), which shouldn't go through the same level of checking as arbitrary user input.
3. Desugaring is limited in its capacity to produce errors. It won't detect multiple errors per pass, and doesn't point at more than one tree per error. There's no reason it can't, but the benefit of upgraded errors isn't clearly worth adding even more desugaring complexity.
4. How each expression gets through desugaring is generally unknown. In both implementations, we often take a sledgehammer approach where it isn't clear whether desugaring will terminate. (There have been cases where it hasn't: "Stackoverflow while attempting to show error due to malformed `:(...)` syntax", JuliaLang/julia#51572; "Lowering of invalid `.` Expr loops forever", JuliaLang/julia#57733.)
5. Likely due to (1) making this difficult to even plan to implement, syntax evolution at the AST level is unimplemented.
This PR
The goal of this PR is to fix issues 1-3 and allow us to address issues 4 and 5
over time by pulling desugaring's "define and check the AST" responsibility into
a separate pass.
The new `valid_st1([vcx::Validation1Context,] st::SyntaxTree)::Bool` function checks that `st` is a structurally valid input to lowering / output of macro expansion. It and its callees serve three main purposes:

- act as a readable, executable specification of the lowerable AST, which we can improve over time;
- let desugaring that assumes its input has passed `valid_st1` focus on transformations instead of checking what it has at every step; and
- produce clear errors when a tree is structurally invalid.

`valid_st1` is complete enough to pass all JuliaLowering tests, and I've set it up to run right before desugaring. Some error messages have changed (some a bit better, others a bit worse). We can work on more helpful errors over time.

`valid_st0` is also provided (nearly the same AST, but quotes and macros are unexpanded). This is also necessary as a callee of `valid_st1`, since embedded `module` and `toplevel` in a macro-expanded expression should not be expanded.

These functions only check the JuliaSyntax/JuliaLowering-produced AST, which differs from the Expr AST only by rearranging Expr in a few well-defined cases. If we wanted a version checking the Expr AST, most of the current version of validation.jl could be taken and reused.
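As a rough sketch of where this sits in the pipeline: everything except `valid_st1`/`valid_st0` below is a made-up stand-in, stubbed out so the sketch runs, and the real pass names and calling conventions differ.

```julia
# Hypothetical pipeline sketch, not JuliaLowering's actual API.
macroexpand_all(ctx, st) = st   # stand-in for the macro-expansion pass
desugar(ctx, st)         = st   # stand-in for the desugaring pass
valid_st1(st)            = true # stand-in for the PR's validator

function lower_sketch(ctx, st0)
    st1 = macroexpand_all(ctx, st0)
    # valid_st1 runs right before desugaring; internally it defers to valid_st0
    # for embedded `module`/`toplevel` subtrees, which stay unexpanded.
    valid_st1(st1) || error("structurally invalid AST")
    return desugar(ctx, st1)    # desugaring may now assume structural validity
end
```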
@stm"syntax tree match"This PR also contains a
SyntaxTreepattern-matching macro, inspired by racketmatch, but much simpler.This is to keep validation functions grammar-like, and to keep the amount of
validation code manageable. Much more detail is available in the docstring.
This macro is arguably even better for post-validation desugaring since the
structure of its inputs are known.
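For a sense of what each pattern replaces, here is a hand-rolled structural check of the kind a validator otherwise has to spell out. It is written against JuliaSyntax's `SyntaxNode` API rather than the PR's `SyntaxTree` and is an illustration, not code from this PR; it corresponds to the `::T` error shown in the test diff above.

```julia
using JuliaSyntax: SyntaxNode, parsestmt, kind, children, @K_str

# Outside argument lists, `::` is only structurally valid with two children
# (value::type); a bare `::T` has one child and should be rejected.
function valid_coercion(st)
    kind(st) == K"::" || return true   # not a `::` node; nothing to check here
    return length(children(st)) == 2
end

valid_coercion(parsestmt(SyntaxNode, "x::T"))  # two children: accepted
valid_coercion(parsestmt(SyntaxNode, "::T"))   # one child: rejected
```

With `@stm`-style patterns, this sort of kind-and-arity destructuring collapses into a single clause like the ones quoted in the review comments above.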
Here's an example of a desugaring helper we could refactor:
Assuming the AST is valid, we can reduce this to the following. Note that
adding the comparison case with ">:" is a bugfix.
For the curious, the relevant validation function looks something like:
RFC: Interaction with "strict mode"
If I understand the discussions on a potential strict mode correctly, this PR solves
a different problem. Strict mode would disallow specific inputs that can
produce undesirable behaviour (but not undesirable to enough people to be
completely disallowed). This PR aims to define the AST-space that strict mode
operates in.
This is still somewhat related to strict mode, since `valid_st1` would be the easiest place to pass in options and enforce any AST-related rules. However, I think we should avoid this for now given our improved grip on the AST structure. Can we instead take all questionable or ambiguous syntax and disallow it in a future AST version? See the section below.
RFC: Syntax evolution
I'd like to make non-macro-breaking AST evolution possible. I suggest we have validation.jl serve as the specification of an AST version and write a pass transforming each AST of version `n` to version `n+1`, which runs after expanding macros that expect an older AST. This assumes a syntax version is defined at the project level and tied to a major version, with no mixtures of macros expecting different AST versions. We may want something more complex.
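A minimal sketch of that mechanism, with every name below invented for illustration (none of this exists in JuliaLowering) and the pieces stubbed out so the sketch runs:

```julia
# Hypothetical stand-ins for the pieces the RFC describes.
expand_macro(ctx, st)        = st   # the macro-expansion step itself
upgrade_ast(st, v::Int)      = st   # rewrite a version-v tree into version v+1
macro_target_ast_version(st) = 1    # AST version the macro's project targets

function expand_with_evolution(ctx, st, current_version::Int)
    v   = macro_target_ast_version(st)
    out = expand_macro(ctx, st)
    # Upgrade one AST version at a time, right after expansion, so that
    # validation and desugaring only ever see the current AST version.
    for i in v:(current_version - 1)
        out = upgrade_ast(out, i)
    end
    return out
end
```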
Note that many discussions on "syntax evolution" are about changes in parsing,
which would need a separate versioning mechanism. Maybe each parser version
targets one AST version, and the user-facing syntax version is just the parser
version?
Rust's editions are usually mentioned here. I'm not familiar with this
mechanism, but I know it's well-regarded and that I should do some research here
before making any more suggestions.
Potential next steps
- We won't need AST evolution as described above until macros start taking SyntaxTree, so now is the time to break and improve things.
- Improve the errors we get from validation.
- Run all code through `valid_st1` before flisp lowering to make sure `valid_st1` actually allows all syntax we plan to support. I'm prepared to add nasty we-should-have-never-allowed-this cases here. A few practical reasons made me wait: we would need to either live without pattern matching, or flisp-lower and write Expr->SyntaxTree for lowered code.
Misc
Why now?
I think writing down our input language will help us achieve compatibility with
existing code. We'll likely see some crazy broken syntax in our long tail of
testing. Validation will let us properly disallow it with a useful error,
provide a place to write weird syntax down if we plan to support it, and
hopefully (through syntax evolution) let us disallow it in future versions.
Alternative approach
I considered an alternative `valid_st1` that isn't responsible for generating nice user-facing errors, and can simply return a boolean, describing errors only if it's easy to do so. This approach could keep `valid_st1` simpler and more grammar-like: we can just describe the set of valid ASTs, return false eagerly on error, and not waste bytes figuring out or reporting what exactly went wrong.
This would be optimal for a human-readable specification. However, the bytes not spent here on disambiguating and describing errors would still need to exist in the first lowering pass, and we wouldn't get most of the benefits of this PR.
Still, I'd like to keep the main part of each validation function free of
error-determination code to keep it more like a grammar, even if it means
redoing some work in the error case. This is so someone using it as a reference
can immediately see what's allowed, and doesn't need to read it top-to-bottom.
I haven't succeeded in all cases.