Token categories #11
Replies: 5 comments 21 replies
-
I would find it quite useful indeed. The syntax used makes me think of annotations, and I think that a general mechanism for providing annotations would be useful. Plugin writers could specify code that receives as input the parse tree obtained from parsing an annotated grammar file, and they could use that information to generate different things, including files for syntax highlighting.
-
Trying to sum up, and adding my two cents:
What's unclear at this point is whether these annotations are available:
I'd recommend the former in order to keep things small, and also because, with a decently serialised parse tree, there are plenty of XPath-like tools out there that can be used to locate specific nodes. But maybe I'm missing something?
-
Why can't we use the existing lexer-command feature to mark tokens? I mean introducing a new command, like this:
KEYWORD: 'keyword' -> category(KEYWORD);
KEYWORD2: 'keyword2' -> category(KEYWORD);
...
ID: -> category(ID);
...
NEWLINE: [\r\n]+ -> channel(HIDDEN), category(SPLITTER);
...
LP: '(' -> category(OPEN_PARENTHESIS);
RP: ')' -> category(CLOSE_PARENTHESIS);
...
LINE_COMMENT: '//' ~[\r\n]* -> channel(HIDDEN), category(COMMENT);
The category type could itself take different forms.
All categories are accessible in the built tree and can be used when needed (for instance, for code highlighting).
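If the generator also emitted the categories as plain data (for example, a token-name-to-category map), a highlighting tool could consume them without knowing anything about the grammar. A minimal Python sketch of that idea; the map contents and style names here are assumptions for illustration, not actual generated output:

```python
# Hypothetical output of the generator: token name -> category name.
TOKEN_CATEGORIES = {
    "KEYWORD": "KEYWORD",
    "KEYWORD2": "KEYWORD",
    "ID": "ID",
    "LP": "OPEN_PARENTHESIS",
    "RP": "CLOSE_PARENTHESIS",
    "LINE_COMMENT": "COMMENT",
}

# Editor-side mapping from category to display style (names are illustrative).
STYLE_FOR_CATEGORY = {
    "KEYWORD": "bold",
    "COMMENT": "italic-gray",
}

def style_for_token(token_name: str) -> str:
    """Resolve a token's display style via its category, falling back to plain."""
    category = TOKEN_CATEGORIES.get(token_name, "DEFAULT")
    return STYLE_FOR_CATEGORY.get(category, "plain")
```

With this split, only the two maps vary per grammar; the highlighter itself stays generic, e.g. `style_for_token("KEYWORD2")` yields `"bold"` even though the editor never heard of `KEYWORD2`.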
-
I have my LSP server for ANTLR4 grammars working again, and I'd like to explore how we intend to use annotations/token categories to mark up input for an LSP server.

For an ANTLR4 grammar, parser and lexer rules are defined by the parserRuleSpec and lexerRuleSpec productions. Because LSP distinguishes between a "def" and a "ref" of a symbol, I can't summarily annotate a RULE_REF as always being a parser rule; I have to distinguish between a "def" and a "ref" for a parser rule name. For a "def", it's the RULE_REF in the production to the left of the ':'. I could "annotate" these two rules (parserRuleSpec and lexerRuleSpec) at the RULE_REF and TOKEN_REF with a "definition" category for parser and lexer rules, respectively.

An alternative would be to annotate the RULE_REF and TOKEN_REF rules in the lexer grammar, as we discussed earlier in this thread. Unfortunately, I cannot annotate rules for RULE_REF and TOKEN_REF because the grammar does not actually define these rules--they live in target-specific code! Perhaps this is why tree-sitter puts this information in separate files, outside the grammar.
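A hedged sketch of what such a definition-site annotation could look like, attached to the ANTLRv4 meta-grammar's parserRuleSpec and lexerRuleSpec productions; the @category(...) syntax and the category names are purely illustrative, not an existing ANTLR feature:

```antlr
// Hypothetical '@category(...)' annotation -- not valid ANTLR4 today.
// The defining occurrence of a rule name is marked at its RULE_REF/TOKEN_REF.
parserRuleSpec
    : ruleModifiers? RULE_REF @category(parserRuleDef)
      argActionBlock? ruleReturns? throwsSpec? localsSpec?
      rulePrequel* COLON ruleBlock SEMI exceptionGroup
    ;

lexerRuleSpec
    : FRAGMENT? TOKEN_REF @category(lexerRuleDef)
      optionsSpec? COLON lexerRuleBlock SEMI
    ;
```

Any RULE_REF or TOKEN_REF appearing elsewhere (inside ruleBlock or lexerRuleBlock) would then be classified as a "ref" rather than a "def".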
-
From a parser point of view, apart from EOF, tokens do not convey any meaning.
A developer may see things differently: they may want their favorite IDE to decorate code based on token categories such as literals, keywords, and flow-control constructs.
As of this writing, most IDEs require tokens to be redefined and categorized in order to support syntax highlighting.
To facilitate the generation of basic IDE support, it could be useful to categorize tokens.
A tool that knows only the IDE would then be able to generate a basic editor.
A token category could be applied to a token as follows:
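For instance, reusing the category(...) lexer-command idea floated elsewhere in this thread (illustrative syntax only, not an implemented feature):

```antlr
// Illustrative only: attach a category to each token at definition time.
IF      : 'if'     -> category(KEYWORD);
WHILE   : 'while'  -> category(KEYWORD);
INT_LIT : [0-9]+   -> category(LITERAL);
```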
Alternate proposals are welcome.