Replies: 1 comment
-
Hey @Fredderic, we have a WIP guide on this, see eclipse-langium/langium-website#219. It outlines how to support keywords as identifiers. Note that all of the Langium-grammar-to-Chevrotain token computation happens in an overridable service, the `TokenBuilder`. You might want to take a look at that.
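For reference, a minimal sketch of what overriding the `TokenBuilder` can look like. The class name, the `ID` terminal name, and the word regex are illustrative; `buildKeywordToken` is the protected hook on `DefaultTokenBuilder`, but its exact signature and where the `Keyword` type lives (`GrammarAST.Keyword` vs. a direct `Keyword` export) vary between Langium versions, so check the version you're on:

```ts
import { DefaultTokenBuilder, type GrammarAST } from 'langium';
import type { TokenType } from 'chevrotain';

export class KeywordAwareTokenBuilder extends DefaultTokenBuilder {
    protected override buildKeywordToken(
        keyword: GrammarAST.Keyword,
        terminalTokens: TokenType[],
        caseInsensitive: boolean
    ): TokenType {
        // Let the default builder create the keyword token first.
        const tokenType = super.buildKeywordToken(keyword, terminalTokens, caseInsensitive);
        // For word-like keywords, point LONGER_ALT at the ID terminal so that,
        // for example, `iffy` lexes as an identifier rather than `if` + `fy`.
        // Recent Langium versions already do something similar by default; the
        // point is that this method is the hook where you can change the behavior.
        const idToken = terminalTokens.find(t => t.name === 'ID');
        if (idToken && /^[a-zA-Z_]\w*$/.test(keyword.value)) {
            tokenType.LONGER_ALT = idToken;
        }
        return tokenType;
    }
}
```

You then register it in your language's dependency injection module, typically via a `parser: { TokenBuilder: () => new KeywordAwareTokenBuilder() }` entry in the module object that the generated Langium project already contains.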
-
I'm trying to migrate a C-like language extension over to Langium. My hand-rolled parser is great as long as the document syntax is correct, but not so good while you're actively writing it; it doesn't handle missing or extra tokens or anything like that (though it does have proper error messages). It was also ported from another language at a time when I knew neither TypeScript nor anything about implementing VS Code extensions, so it's in a less than ideal style. But I've run into the same problems as many other people, and can't seem to find a simple working solution anywhere (or an explanation of how others fixed it that makes sense). I was hoping that once I get the language parsing done, I can draw on the way Langium structures things as I bring over the rest of it (optimiser and debugger).
Typically in my hand-rolled parsers, I'll have the tokenizer consume anything that looks like an identifier and then check it against a keyword list, changing its type accordingly. Similarly for symbols: I consume the longest symbol I can match (a regex of the symbol tokens, ordered longest to shortest) and emit those as "keyword" tokens too, while strings and numbers are emitted as whole tokens of their respective types. The parser then just consumes the token stream, checking keyword tokens by simple exact string comparison. It loses a little generality, but it seems to be extremely common for programming-language grammars and avoids a whole bunch of nasty lexing issues, so I'm wondering whether Langium has a similar capability (a rough sketch of what I mean is below).
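Purely as illustration (none of this is Langium or Chevrotain API; the keyword and symbol tables are just examples):

```ts
// Keyword table checked after an identifier-shaped run has been consumed.
const KEYWORDS = new Set(['if', 'else', 'while', 'return']);
// Symbols ordered longest first, so '<<=' wins over '<<' and '<'.
const SYMBOLS = ['<<=', '>>=', '<<', '>>', '<=', '>=', '==', '=', '<', '>', '+', '-'];
const SYMBOL_RE = new RegExp(
    SYMBOLS.map(s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')).join('|'), 'y');
const WORD_RE = /[A-Za-z_]\w*/y;

interface Token {
    type: 'keyword' | 'identifier' | 'symbol';
    text: string;
    offset: number;
}

function nextToken(input: string, offset: number): Token | undefined {
    // Consume the longest identifier-shaped run first, then decide whether it
    // is a keyword with a table lookup, so `iffy` can never be misread as `if`.
    WORD_RE.lastIndex = offset;
    const word = WORD_RE.exec(input);
    if (word) {
        return { type: KEYWORDS.has(word[0]) ? 'keyword' : 'identifier', text: word[0], offset };
    }
    // Otherwise try the symbol regex; longer alternatives are listed first,
    // so the longest matching symbol wins.
    SYMBOL_RE.lastIndex = offset;
    const sym = SYMBOL_RE.exec(input);
    return sym ? { type: 'symbol', text: sym[0], offset } : undefined;
}
```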
There's that `longer_alt` thing (demonstrated in Chevrotain's `keywords_vs_identifiers.js` example, which seems like a weird, hacky way of doing the same job), but I can't figure out how to actually use it alongside a `.langium` file, and would it even work for symbol tokens too? I also saw that Chevrotain can define custom token matchers (though again, via the `parser.ts` route), so is there a setting to just use one for all keywords? That is, every time Langium encounters a keyword string, it would use a specified matcher that attempts to parse an entire identifier and then does a simple string compare of the result (more of a yacc/bison style); that would probably solve a lot of people's issues. Custom matchers can also carry a payload, such as the parsed value (the inner text of a string, the numeric form of a number, the workarounds for JS not supporting negative NaNs, etc.), which I generally add in my tokenizers. It might even be worthwhile to have a mechanism for `.langium` files to specify that keywords matching a given pattern should use a specific matcher imported from another `.ts` file. (Unless, of course, Langium just shoves everything into one big regex matcher, or something.)

Alternatively, is there a hook that would let me edit the generated list of keywords at generate or import time? That would allow me to sort them myself, tack on an "end of keyword" look-ahead, apply the aforementioned custom matcher and pre-processing, or do whatever else I need to make it actually work.