Replies: 1 comment
-
Hey @Fredderic, we have a WIP guide on this, see eclipse-langium/langium-website#219. It outlines how to support keywords as identifiers. Note that all of the Langium-grammar-to-Chevrotain token computation happens in an overridable service, the `TokenBuilder`. You might want to take a look at that.
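For reference, a minimal sketch of what overriding the `TokenBuilder` can look like. The class name, the `ID` terminal name, and the word regex are illustrative; `buildKeywordToken` is the protected hook on `DefaultTokenBuilder`, but its exact signature and where the `Keyword` type lives (`GrammarAST.Keyword` vs. a direct `Keyword` export) vary between Langium versions, so check the version you're on:

```ts
import { DefaultTokenBuilder, type GrammarAST } from 'langium';
import type { TokenType } from 'chevrotain';

export class KeywordAwareTokenBuilder extends DefaultTokenBuilder {
    protected override buildKeywordToken(
        keyword: GrammarAST.Keyword,
        terminalTokens: TokenType[],
        caseInsensitive: boolean
    ): TokenType {
        // Let the default builder create the keyword token first.
        const tokenType = super.buildKeywordToken(keyword, terminalTokens, caseInsensitive);
        // For word-like keywords, point LONGER_ALT at the ID terminal so that,
        // for example, `iffy` lexes as an identifier rather than `if` + `fy`.
        // Recent Langium versions already do something similar by default; the
        // point is that this method is the hook where you can change the behavior.
        const idToken = terminalTokens.find(t => t.name === 'ID');
        if (idToken && /^[a-zA-Z_]\w*$/.test(keyword.value)) {
            tokenType.LONGER_ALT = idToken;
        }
        return tokenType;
    }
}
```

You then register it in your language's dependency injection module, typically via a `parser: { TokenBuilder: () => new KeywordAwareTokenBuilder() }` entry in the module object that the generated Langium project already contains.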
-
I'm trying to migrate a C-like language extension over to Langium. My hand-rolled parser is great as long as the document syntax is correct, but not so good while you're actively writing it; it doesn't handle missing or extra tokens or anything like that (though it does have proper error messages). It was also ported from another language at a time when I knew neither TypeScript nor anything about implementing VS Code extensions, so it's in a less than ideal style. But I've run into the same problems as many other people, and can't seem to find a simple working solution anywhere (or an explanation of how others fixed it that makes sense). I was hoping that once I get the language parsing done, I can draw on the way Langium structures things as I bring over the rest of it (optimiser and debugger).
Typically in my hand-rolled parsers, I'll have the tokenizer consume anything that looks like an identifier and then check it against a keyword list, changing its type accordingly. Similarly for symbols: I consume the longest symbol I can match (a regex of the symbol tokens, ordered longest to shortest) and emit those as "keyword" tokens too, while strings and numbers are emitted as whole tokens of their respective types. The parser then just consumes the token stream, checking keyword tokens by simple exact string comparison. It loses a little generality, but it seems to be extremely common for programming-language grammars and avoids a whole bunch of nasty lexing issues, so I'm wondering whether Langium has a similar capability (a rough sketch of what I mean is below).
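Purely as illustration (none of this is Langium or Chevrotain API; the keyword and symbol tables are just examples):

```ts
// Keyword table checked after an identifier-shaped run has been consumed.
const KEYWORDS = new Set(['if', 'else', 'while', 'return']);
// Symbols ordered longest first, so '<<=' wins over '<<' and '<'.
const SYMBOLS = ['<<=', '>>=', '<<', '>>', '<=', '>=', '==', '=', '<', '>', '+', '-'];
const SYMBOL_RE = new RegExp(
    SYMBOLS.map(s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|'), 'y');
const WORD_RE = /[A-Za-z_]\w*/y;

interface Token {
    type: 'keyword' | 'identifier' | 'symbol';
    text: string;
    offset: number;
}

function nextToken(input: string, offset: number): Token | undefined {
    // Consume the longest identifier-shaped run first, then decide whether it
    // is a keyword with a table lookup, so `iffy` can never be misread as `if`.
    WORD_RE.lastIndex = offset;
    const word = WORD_RE.exec(input);
    if (word) {
        return { type: KEYWORDS.has(word[0]) ? 'keyword' : 'identifier', text: word[0], offset };
    }
    // Otherwise try the symbol regex; longer alternatives are listed first,
    // so the longest matching symbol wins.
    SYMBOL_RE.lastIndex = offset;
    const sym = SYMBOL_RE.exec(input);
    return sym ? { type: 'symbol', text: sym[0], offset } : undefined;
}
```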
There's that `longer_alt` thing (demonstrated in Chevrotain's `keywords_vs_identifiers.js` example, which seems like a weird, hacky way of doing the same job), but I can't figure out how to actually use it alongside a `.langium` file, and would it even work for symbol tokens too? I also saw that Chevrotain can define custom token matchers (though again, via the `parser.ts` route), so is there a setting to just use one for all keywords? That is, every time Langium encounters a keyword string, it would use a specified matcher that attempts to parse an entire identifier and then does a simple string compare of the result (more of a yacc/bison style); that would probably solve a lot of people's issues. Custom matchers can also carry a payload, such as the parsed value (the inner text of a string, the numeric form of a number, the workarounds for JS not supporting negative NaNs, etc.), which I generally add in my tokenizers. It might even be worthwhile to have a mechanism for `.langium` files to specify that keywords matching a given pattern should use a specific matcher imported from another `.ts` file. (Unless, of course, Langium just shoves everything into one big regex matcher, or something.)

Alternatively, is there a hook that would let me edit the generated list of keywords at generate or import time? That would allow me to sort them myself, tack on an "end of keyword" look-ahead, apply the aforementioned custom matcher and pre-processing, or do whatever else I need to make it actually work.