Skip to content

Rewrite MASM parser to produce a concrete syntax tree #1901

@bitwalker

Description

@bitwalker

There are a few goals we have that the current parser gets in the way of:

  1. It does not preserve trivia (whitespace, comments), which is critical for implementing a miden-format tool.
  2. It performs constant propagation throughout parsing, losing important context needed for an LSP. This too prevents us from implementing a miden-format tool, because the resulting output would have all the constants erased
  3. It provides poor error recovery at best. This is problematic for an LSP, and particularly integrating into text editors/IDEs, as the user experience is really bad.

To address these issues, the current parser needs to be rewritten to have the following characteristics:

  • It produces a concrete syntax tree (CST), which can then be converted to an abstract syntax tree (AST). CSTs preserve all of the textual trivia and tokens, and furthermore, allow parts to be invalid/missing while still being able to continue parsing the rest of the tree. For example, if you start typing a new procedure definition in the middle of a file, e.g. export.foo| (where | here represents the cursor), the file would fail to parse today because the syntax is invalid on its own. With a CST-based parser however, the parser can easily recover by ignoring the invalid subtree, and continuing with the rest of the file. The mechanics of this are too much to get into here, but suffice to say that this is the way to implement parsers that are designed to provide a smooth text editing/IDE experience.
  • Constant propagation needs to be deferred to assembly-time. We don't actually need to do it during parsing, but it simplified some things, so that is what we did. However, by doing so, we've hamstrung our ability to reformat MASM and provide more useful diagnostics related to constants.
  • We should move from using a parser generator (LALRPOP) to a handwritten recursive-descent parser. In order to properly implement error recovery over the CST, we need granular control over parsing that we just don't get with parser generators. This is actually a fairly trivial lift, as MASM has a very simple grammar, but the bigger task is ensuring that the new parser actually accepts partial syntax trees (i.e. it doesn't just bail at the first problem, it can parse even in situations like I described earlier, where a definition is only partially-complete).

There are some great tools for building parsers in this style, e.g. rowan. rust-analyzer actually uses rowan internally (in addition to its syntax and parser crates to complete the picture).

This is a prerequisite to building a proper MASM formatter and LSP.

Metadata

Metadata

Assignees

Labels

assemblyRelated to Miden assembly

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions