Commit 8f9692c

Finish guide
1 parent 128d64f commit 8f9692c

hugo/content/guides/multi-mode-lexing.md

Lines changed: 98 additions & 39 deletions
@@ -7,66 +7,96 @@ Many modern programming languages such as [JavaScript](https://developer.mozilla
77
They are a way to easily concatenate or interpolate string values while maintaining great code readability.
88
This guide will show you how to support template literals in Langium.
99

10+
For this specific example, our template literals start and end with backticks `` ` `` and are interrupted by expressions that are wrapped in curly braces `{}`.
11+
So in our example, usage of template literals might look something like this:
12+
13+
```js
14+
println(`hello {name}!`);
15+
```
16+
17+
Conceptually, template strings work by reading a start terminal which starts with `` ` `` and ends with `{`,
18+
followed by an expression and then an end terminal which is effectively just the start terminal in reverse using `}` and `` ` ``.
19+
Since we don't want to restrict users to only a single expression in their template literals, we also need a "middle" terminal reading from `}` to `{`.
20+
Of course, a user might also write a template literal that contains no expressions at all.
21+
So we additionally need a "full" terminal that reads from the start of the literal all the way to the end in one go.
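
To make the four terminal shapes concrete, here is a minimal sketch that mirrors them with plain JavaScript regular expressions. It does not use Langium; the names `terminals` and `matchAt` are hypothetical and exist only for this illustration. The shared fragment assumes the same escaping scheme as the grammar (`{{` and a doubled backtick).

```ts
// Shared fragment: any character except '{' and '`', or an escaped '{{' / '``'.
const IN_TEMPLATE_LITERAL = "(?:[^{`]|\\{\\{|``)";

// Plain-RegExp stand-ins for the four terminals (sticky, so they only match at lastIndex).
const terminals = {
    FULL: new RegExp("`" + IN_TEMPLATE_LITERAL + "*`", "y"),
    START: new RegExp("`" + IN_TEMPLATE_LITERAL + "*\\{", "y"),
    MIDDLE: new RegExp("\\}" + IN_TEMPLATE_LITERAL + "*\\{", "y"),
    END: new RegExp("\\}" + IN_TEMPLATE_LITERAL + "*`", "y"),
};

type TerminalName = keyof typeof terminals;

// Hypothetical helper: try to match one terminal at a fixed offset of the input.
function matchAt(name: TerminalName, input: string, offset: number): string | undefined {
    const regex = terminals[name];
    regex.lastIndex = offset;
    const match = regex.exec(input);
    return match ? match[0] : undefined;
}

const input = "`hello {name}!`";
const start = matchAt("START", input, 0);                // "`hello {"
const end = matchAt("END", input, input.indexOf("}"));   // "}!`"
const full = matchAt("FULL", input, 0);                  // undefined, the literal contains an expression
```

Only a literal without any expression, such as `` `plain` ``, is matched by the `FULL` pattern in one go.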
22+
23+
To achieve this, we will define a `TemplateLiteral` parser rule and a few terminals.
24+
These terminals will adhere to the requirements that we just defined.
25+
To make it a bit easier to read and maintain, we also define a special terminal fragment that we can reuse in all our terminal definitions:
26+
1027
```antlr
1128
TemplateLiteral:
1229
// Either just the full content
13-
content+=TemplateContent |
14-
// Or template string parts with expressions in between
30+
content+=TEMPLATE_LITERAL_FULL |
31+
// Or template literal parts with expressions in between
1532
(
16-
content+=TemplateContentStart
17-
content+=Expression?
33+
content+=TEMPLATE_LITERAL_START
34+
content+=Expression?
1835
(
19-
content+=TemplateContentMiddle
36+
content+=TEMPLATE_LITERAL_MIDDLE
2037
content+=Expression?
21-
)*
22-
content+=TemplateContentEnd
23-
);
38+
)*
39+
content+=TEMPLATE_LITERAL_END
40+
)
41+
;
2442
25-
TemplateContent returns TextLiteral:
26-
value=RICH_TEXT;
43+
terminal TEMPLATE_LITERAL_FULL:
44+
'`' IN_TEMPLATE_LITERAL* '`';
2745
28-
TemplateContentStart returns TextLiteral:
29-
value=RICH_TEXT_START;
46+
terminal TEMPLATE_LITERAL_START:
47+
'`' IN_TEMPLATE_LITERAL* '{';
3048
31-
TemplateContentMiddle returns TextLiteral:
32-
value=RICH_TEXT_INBETWEEN;
49+
terminal TEMPLATE_LITERAL_MIDDLE:
50+
'}' IN_TEMPLATE_LITERAL* '{';
3351
34-
TemplateContentEnd returns TextLiteral:
35-
value=RICH_TEXT_END;
52+
terminal TEMPLATE_LITERAL_END:
53+
'}' IN_TEMPLATE_LITERAL* '`';
3654
37-
terminal RICH_TEXT:
38-
'`' IN_RICH_TEXT* '`';
55+
// '{{' is handled in a special way so we can escape normal '{' characters
56+
// '``' is doing the same for the '`' character
57+
terminal fragment IN_TEMPLATE_LITERAL:
58+
/[^{`]|{{|``/;
59+
```
3960

40-
terminal RICH_TEXT_START:
41-
'`' IN_RICH_TEXT* '{';
61+
If we go ahead and start parsing files with these changes, most things should work as expected.
62+
However, depending on the structure of your existing grammar, some of these new terminals might be in conflict with existing terminals of your language.
63+
For example, if your language supports block statements, chaining multiple blocks together will make this issue apparent:
4264

43-
terminal RICH_TEXT_INBETWEEN:
44-
'}' IN_RICH_TEXT* '{';
65+
```js
66+
{
67+
console.log('hi');
68+
}
69+
{
70+
console.log('hello');
71+
}
72+
```
4573

46-
terminal RICH_TEXT_END:
47-
'}' IN_RICH_TEXT* '`';
74+
The `} ... {` block in this example won't be parsed as separate `}` and `{` tokens, but instead as a single `TEMPLATE_LITERAL_MIDDLE` token, resulting in a parser error due to the unexpected token.
75+
This doesn't make a lot of sense, since we aren't in the middle of a template literal at this point anyway.
76+
However, our lexer doesn't know yet that the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals are only allowed to show up within a `TemplateLiteral` rule.
77+
To rectify this, we will need to make use of lexer modes. They will give us the necessary context to know whether we're inside a template literal or outside of it.
78+
Depending on the currently selected mode, we can lex different terminals. In our case, we want to exclude the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals while we are outside of a template literal.
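
The idea of lexer modes can be illustrated without Langium or Chevrotain. The following is a rough sketch of a hand-written lexer with a mode stack, under the assumption that the first matching definition in a mode wins. The `TokenDef` interface, the `lex` function and the extra `ID` and `WS` tokens are hypothetical and not part of any library API.

```ts
interface TokenDef {
    name: string;
    pattern: RegExp;      // sticky, so it only matches at lastIndex
    pushMode?: string;    // enter this mode after the token
    popMode?: boolean;    // leave the current mode after the token
    skip?: boolean;       // do not emit the token (whitespace)
}

const IN = "(?:[^{`]|\\{\\{|``)";
const FULL: TokenDef = { name: "TEMPLATE_LITERAL_FULL", pattern: new RegExp("`" + IN + "*`", "y") };
const START: TokenDef = { name: "TEMPLATE_LITERAL_START", pattern: new RegExp("`" + IN + "*\\{", "y"), pushMode: "template_mode" };
const MIDDLE: TokenDef = { name: "TEMPLATE_LITERAL_MIDDLE", pattern: new RegExp("\\}" + IN + "*\\{", "y") };
const END: TokenDef = { name: "TEMPLATE_LITERAL_END", pattern: new RegExp("\\}" + IN + "*`", "y"), popMode: true };
const LBRACE: TokenDef = { name: "{", pattern: new RegExp("\\{", "y") };
const RBRACE: TokenDef = { name: "}", pattern: new RegExp("\\}", "y") };
const ID: TokenDef = { name: "ID", pattern: new RegExp("[A-Za-z_]\\w*", "y") };
const WS: TokenDef = { name: "WS", pattern: new RegExp("\\s+", "y"), skip: true };

const modes: Record<string, TokenDef[]> = {
    // Regular mode drops the middle and end terminals.
    regular_mode: [WS, FULL, START, LBRACE, RBRACE, ID],
    // Template mode drops the '}' keyword instead.
    template_mode: [WS, FULL, START, MIDDLE, END, LBRACE, ID],
};

function lex(input: string): string[] {
    const modeStack = ["regular_mode"];
    const names: string[] = [];
    let offset = 0;
    while (offset < input.length) {
        const active = modes[modeStack[modeStack.length - 1]];
        let matched = false;
        for (const def of active) {
            def.pattern.lastIndex = offset;
            const match = def.pattern.exec(input);
            if (match === null) continue;
            offset += match[0].length;
            if (!def.skip) names.push(def.name);
            if (def.pushMode) modeStack.push(def.pushMode);
            if (def.popMode) modeStack.pop();
            matched = true;
            break;
        }
        if (!matched) throw new Error(`Unexpected character at offset ${offset}`);
    }
    return names;
}

const blocks = lex("{ a }\n{ b }");        // ["{", "ID", "}", "{", "ID", "}"]
const literal = lex("`hello {name}!`");    // ["TEMPLATE_LITERAL_START", "ID", "TEMPLATE_LITERAL_END"]
```

Because the regular mode never offers `TEMPLATE_LITERAL_MIDDLE`, the `}` and `{` of two consecutive blocks stay separate tokens, while the same characters inside a template literal are still recognized as part of the literal.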
4879

49-
terminal fragment IN_RICH_TEXT:
50-
/[^{`]|{{|``/;
51-
```
80+
The following implementation of a `TokenBuilder` will do the job for us. It creates two lexing modes, which are almost identical except for the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals.
81+
We will also need to make sure that the modes are switched based on the `TEMPLATE_LITERAL_START` and `TEMPLATE_LITERAL_END` terminals. We use `PUSH_MODE` and `POP_MODE` for this.
5282

5383
```ts
54-
import { DefaultTokenBuilder, Grammar, isTokenTypeArray, Keyword, TerminalRule } from "langium";
84+
import { DefaultTokenBuilder, isTokenTypeArray, GrammarAST } from "langium";
5585
import { IMultiModeLexerDefinition, TokenType, TokenVocabulary } from "chevrotain";
5686

5787
const REGULAR_MODE = 'regular_mode';
5888
const TEMPLATE_MODE = 'template_mode';
5989

6090
export class CustomTokenBuilder extends DefaultTokenBuilder {
6191

62-
override buildTokens(grammar: Grammar, options?: { caseInsensitive?: boolean }): TokenVocabulary {
92+
override buildTokens(grammar: GrammarAST.Grammar, options?: { caseInsensitive?: boolean }): TokenVocabulary {
6393
const tokenTypes = super.buildTokens(grammar, options);
6494

6595
if(isTokenTypeArray(tokenTypes)) {
66-
// Regular mode just drops rich text middle & end
96+
// Regular mode just drops template literal middle & end
6797
const regularModeTokens = tokenTypes
68-
.filter(token => !['RICH_TEXT_INBETWEEN','RICH_TEXT_END'].includes(token.name));
69-
// Template mode needs to exclude the '}' keyword, which causes confusion while lexing
98+
.filter(token => !['TEMPLATE_LITERAL_MIDDLE','TEMPLATE_LITERAL_END'].includes(token.name));
99+
// Template mode needs to exclude the '}' keyword
70100
const templateModeTokens = tokenTypes
71101
.filter(token => !['}'].includes(token.name));
72102

@@ -84,33 +114,62 @@ export class CustomTokenBuilder extends DefaultTokenBuilder {
84114
}
85115

86116
protected override buildKeywordToken(
87-
keyword: Keyword,
117+
keyword: GrammarAST.Keyword,
88118
terminalTokens: TokenType[],
89119
caseInsensitive: boolean
90120
): TokenType {
91121
let tokenType = super.buildKeywordToken(keyword, terminalTokens, caseInsensitive);
92122

93123
if (tokenType.name === '}') {
94-
// The default } token will use [RICH_TEXT_INBETWEEN, RICH_TEXT_END] as longer alts
124+
// The default } token will use [TEMPLATE_LITERAL_MIDDLE, TEMPLATE_LITERAL_END] as longer alts
95125
// We need to delete the LONGER_ALT, they are not valid for the regular lexer mode
96126
delete tokenType.LONGER_ALT;
97127
}
98-
99128
return tokenType;
100129
}
101130

102-
protected override buildTerminalToken(terminal: TerminalRule): TokenType {
131+
protected override buildTerminalToken(terminal: GrammarAST.TerminalRule): TokenType {
103132
let tokenType = super.buildTerminalToken(terminal);
104133

105134
// Update token types to enter & exit template mode
106-
if(tokenType.name === 'RICH_TEXT_START') {
135+
if(tokenType.name === 'TEMPLATE_LITERAL_START') {
107136
tokenType.PUSH_MODE = TEMPLATE_MODE;
108-
} else if(tokenType.name === 'RICH_TEXT_END') {
137+
} else if(tokenType.name === 'TEMPLATE_LITERAL_END') {
109138
tokenType.POP_MODE = true;
110139
}
111-
112140
return tokenType;
113141
}
142+
}
143+
```
144+
145+
With this change in place, the parser will work as expected. There is one last issue which we need to resolve in order to get everything working perfectly.
146+
When inspecting our AST, the `TemplateLiteral` object will contain strings that still include input artifacts (mainly `` ` ``, `{` and `}`).
147+
These aren't actually part of the semantic value of these strings, so we should get rid of them.
148+
We will need to create a custom `ValueConverter` and remove these artifacts:
149+
150+
```ts
151+
import { CstNode, GrammarAST, DefaultValueConverter, ValueType, convertString } from 'langium';
152+
153+
export class CustomValueConverter extends DefaultValueConverter {
114154

155+
protected override runConverter(rule: GrammarAST.AbstractRule, input: string, cstNode: CstNode): ValueType {
156+
if (rule.name.startsWith('TEMPLATE_LITERAL')) {
157+
// 'convertString' simply removes the first and last character of the input
158+
return convertString(input);
159+
} else {
160+
return super.runConverter(rule, input, cstNode);
161+
}
162+
}
115163
}
116164
```
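
To see what the conversion does to the individual parts of a literal, here is a small stand-in that only mimics the behavior described in the code comment above (dropping the first and last character). The function `stripDelimiters` is hypothetical and is not the actual `convertString` implementation from Langium, which may do more than this.

```ts
// Hypothetical stand-in: remove the leading and trailing delimiter character.
function stripDelimiters(input: string): string {
    return input.substring(1, input.length - 1);
}

// The two tokens produced for `hello {name}!`
const parts = ["`hello {", "}!`"].map(stripDelimiters);
// parts is ["hello ", "!"]
```

Each terminal carries exactly one delimiter character on either side, which is why removing the first and last character is sufficient for all four of them.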
165+
166+
Of course, let's not forget to bind all of these services:
167+
168+
```ts
169+
export const CustomModule = {
170+
parser: {
171+
TokenBuilder: () => new CustomTokenBuilder(),
172+
ValueConverter: () => new CustomValueConverter()
173+
},
174+
};
175+
```
