[LLDB] Add Lexer (with tests) for DIL (Data Inspection Language). #123521
@@ -0,0 +1,156 @@ (new file: DILLexer.h)

```cpp
//===-- DILLexer.h ----------------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
#define LLDB_VALUEOBJECT_DILLEXER_H_
```

Suggested change:

```diff
-#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
-#define LLDB_VALUEOBJECT_DILLEXER_H_
+#ifndef LLDB_VALUEOBJECT_DILLEXER_H
+#define LLDB_VALUEOBJECT_DILLEXER_H
```
It fits in nicer with the other headers (though maybe you don't need it if you remove the UINT_MAX thing below)
Suggested change:

```diff
-#include <limits.h>
+#include <climits>
```
Suggested change:

```diff
-namespace lldb_private {
-namespace dil {
+namespace lldb_private::dil {
```
Optional idea: Make this a non-class enum inside the (DIL)Token class so that it can be referred to as Token::coloncolon
Since we already have a `dil::` namespace, maybe we can drop the DIL prefix?
Suggested change:

```diff
-      : m_kind(kind), m_spelling(spelling), m_start_pos(start) {}
+      : m_kind(kind), m_spelling(std::move(spelling)), m_start_pos(start) {}
```
could we use the assignment operator instead (token = Token(kind, spelling, start)) ?
I think if we accept StringRef as input, we shouldn't copy the data and work with the provided string view (and assume we don't outlive it).
If you need the lexer to own the text (and you probably don't), then accept std::string and move from it.
Suggested change:

```diff
-  bool Lex(DILToken &result, bool look_ahead = false);
+  std::optional<DILToken> Lex(bool look_ahead = false);
```
Suggested change:

```diff
-  bool Is_Word(std::string::iterator start, uint32_t &length);
+  bool IsWord(std::string::iterator start, uint32_t &length);
```
Should all of these APIs really be public?
A few of them can be private (I'll take care of that). Many of them could be protected (mostly just called from the parser, which I can designate as a friend), except that if I make them protected then my unittest, which needs them, can't access them. I've been trying to find a way to allow my unit tests to access these methods when they're protected, but I can't seem to make it work. Do you know the magic secret for this?
There are workarounds, but no magic secrets. If something is called by the parser (which is the only user of the class anyway), then it should be public (btw, friends can see even private members, not just the public ones). And unit tests should generally test using the public APIs (as that's what the users will use). There are exceptions to that, but they usually involve situations where the public API requires arguments that would be difficult/impossible to mock in a test, which shouldn't be the case here.
Separating the public API from the private implementation details would really help this class, as I really don't know which one of these functions is meant to be called from the outside.
I feel like there are too many ways to navigate the token stream here. You can either call GetCurrentToken+IncrementTokenIdx, or GetNextToken (which I guess increments the index automatically), or LookAhead+AcceptLookAhead.

I think it would be better to start with something simple (we can add more or revamp the existing API if it turns out to be clunky). What would you say to something like:

```cpp
const Token &LookAhead(uint32_t N /* add `=1` if you want */);
const Token &GetCurrentToken() { return LookAhead(0); } // just a fancy name for a look-ahead of zero
void Advance(uint32_t N = 1); // advance the token stream
```
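As a self-contained illustration of that three-function shape, here is a minimal sketch. The names (`TokenStream`, `Token`) and the pad-with-EOF behaviour are assumptions for the example, not part of the actual patch:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the types under discussion.
enum class TokenKind { identifier, coloncolon, eof };
struct Token {
  TokenKind kind;
  std::string spelling;
};

class TokenStream {
public:
  explicit TokenStream(std::vector<Token> tokens)
      : m_tokens(std::move(tokens)) {}

  // Look ahead N tokens without consuming anything; reads past the end
  // yield an EOF token instead of failing.
  const Token &LookAhead(uint32_t n = 1) {
    size_t idx = m_pos + n;
    return idx < m_tokens.size() ? m_tokens[idx] : m_eof;
  }

  // The current token is just a look-ahead of zero.
  const Token &GetCurrentToken() { return LookAhead(0); }

  // Advance the token stream by N tokens.
  void Advance(uint32_t n = 1) { m_pos += n; }

private:
  std::vector<Token> m_tokens;
  size_t m_pos = 0;
  Token m_eof{TokenKind::eof, ""};
};
```

The point of the sketch is that all linear navigation collapses into `LookAhead`/`Advance`, with `GetCurrentToken` as a convenience wrapper.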
The parser really needs a way to save & restore/reset the token index, because there are places in the parser where it does tentative parsing & then decides to rollback. It does so by saving the current token index, then doing the tentative parsing (which can advance the index some number of tokens), and then (for rolling back) setting the current token index back to the saved value.
So I don't think the simple API you've outlined above would be sufficient.
I'm fine with tentative parse and roll back APIs. I'm commenting on the other APIs which advance through the token stream linearly (but in a very baroque fashion). IOW, my proposal was to replace GetCurrentToken, IncrementTokenIdx, GetNextToken, LookAhead and AcceptLookAhead with the three functions above (exact names TBD), and keep GetCurrentTokenIdx and ResetTokenIdx as they are.
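The save-index/roll-back pattern being discussed could be sketched like this. `SimpleLexer` and `TryParseParenthesized` are hypothetical stand-ins; only the `GetCurrentTokenIdx`/`ResetTokenIdx` mechanics mirror the conversation:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Minimal lexer stand-in: tokens are pre-lexed into a vector, and the
// "current token" is just an index into that vector.
class SimpleLexer {
public:
  explicit SimpleLexer(std::vector<std::string> tokens)
      : m_lexed_tokens(std::move(tokens)) {}

  uint32_t GetCurrentTokenIdx() const { return m_tokens_idx; }

  void ResetTokenIdx(uint32_t new_value) {
    assert(new_value < m_lexed_tokens.size());
    m_tokens_idx = new_value;
  }

  const std::string &GetCurrentToken() const {
    return m_lexed_tokens[m_tokens_idx];
  }
  void Advance() { ++m_tokens_idx; }

private:
  std::vector<std::string> m_lexed_tokens;
  uint32_t m_tokens_idx = 0;
};

// Tentative parse of "(" <id> ")": save the index, try to consume the
// tokens, and roll back to the saved index if the attempt fails.
bool TryParseParenthesized(SimpleLexer &lexer) {
  uint32_t saved = lexer.GetCurrentTokenIdx(); // save position
  if (lexer.GetCurrentToken() == "(") {
    lexer.Advance();
    lexer.Advance(); // skip the identifier
    if (lexer.GetCurrentToken() == ")") {
      lexer.Advance();
      return true;
    }
  }
  lexer.ResetTokenIdx(saved); // roll back on failure
  return false;
}
```

Because the tokens are never re-lexed, rollback is just an index assignment, which is why the two functions above suffice.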
I think this API might be simpler. The lexer doesn't actually need to re-lex; the results will always be the same. We only need to roll back occasionally, but we'll always be processing the same sequence of tokens the second time.

So the lexer can always add the tokens to the m_lexed_tokens vector, and we only need GetCurrentTokenIdx() and ResetTokenIdx() to do the rollback.
That would definitely be nice, and if it's true, I'd consider taking this even further: given that we're going to be processing about a line of text at most, we might be able to just eagerly lex the whole string and avoid the whole on-demand parsing business. If so, maybe we don't even need the lexer class, and the lexing could consist of a single function like `std::vector<Token> Lex(string)`?
Re werat's comment: This is already what the lexer does. It never re-lexes anything. It stores the lexed tokens in a vector as it lexes them, and keeps track, via an index into the vector, of which token is the 'current' one (as far as the parser is concerned). Accepting the lookahead basically just involves updating the index.
That's good to hear, but it still makes the implementation complicated. Separating lexing from the traversal through the lexed tokens would make things much easier to follow. I understand traditional lexers can't do that, but they are parsing megabytes of text. We're going to be parsing approximately one line of code here.
Yesterday I also thought of another advantage of early/eager lexing -- we can report errors (things like invalid tokens and such) early. Then the other functions ("get next token" and stuff) shouldn't have any failure modes except running off the end of the token stream (which I think we could handle by pretending the stream ends with an infinite amount of EOF tokens).
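An eager, single-function lexer of that shape could look like the following sketch. The token kinds and lexing rules are invented for illustration and are far simpler than what DIL actually needs:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

enum class Kind { identifier, l_paren, r_paren, coloncolon, eof, unknown };
using Token = std::pair<Kind, std::string>;

// Lex the whole input up front. Invalid characters become `unknown`
// tokens (reportable immediately), and the stream is terminated with an
// eof token so consumers never run off the end.
std::vector<Token> Lex(const std::string &input) {
  std::vector<Token> tokens;
  size_t i = 0;
  while (i < input.size()) {
    if (std::isspace(static_cast<unsigned char>(input[i]))) {
      ++i; // skip whitespace
      continue;
    }
    if (input.compare(i, 2, "::") == 0) {
      tokens.push_back({Kind::coloncolon, "::"});
      i += 2;
      continue;
    }
    if (input[i] == '(') { tokens.push_back({Kind::l_paren, "("}); ++i; continue; }
    if (input[i] == ')') { tokens.push_back({Kind::r_paren, ")"}); ++i; continue; }
    if (std::isalpha(static_cast<unsigned char>(input[i])) || input[i] == '_') {
      size_t start = i;
      while (i < input.size() &&
             (std::isalnum(static_cast<unsigned char>(input[i])) || input[i] == '_'))
        ++i;
      tokens.push_back({Kind::identifier, input.substr(start, i - start)});
      continue;
    }
    // Anything else is an error token, detectable before parsing begins.
    tokens.push_back({Kind::unknown, std::string(1, input[i])});
    ++i;
  }
  tokens.push_back({Kind::eof, ""});
  return tokens;
}
```

With this shape, the parser only ever walks a `std::vector<Token>`, so "get next token" cannot fail and rollback is just an index reset.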
Shouldn't Lex() do this automatically?
Not really. Lex() is called from LookAhead, when we definitely do not want to automatically increment the token index.
Suggested change:

```diff
-  bool ResetTokenIdx(uint32_t new_value) {
-    if (new_value > m_lexed_tokens.size() - 1)
-      return false;
-    m_tokens_idx = new_value;
-    return true;
-  }
+  void ResetTokenIdx(uint32_t new_value) {
+    assert(new_value < m_lexed_tokens.size());
+    m_tokens_idx = new_value;
+  }
```
(AIUI, the only usage of this function will be to restore a previous (and valid) position, so any error here is definitely a bug)
```diff
@@ -1,4 +1,5 @@
 add_lldb_library(lldbValueObject
+  DILLexer.cpp
   ValueObject.cpp
   ValueObjectCast.cpp
   ValueObjectChild.cpp
```
@@ -0,0 +1,205 @@ (new file: DILLexer.cpp)

```cpp
//===-- DILLexer.cpp ------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// This implements the recursive descent parser for the Data Inspection
// Language (DIL), and its helper functions, which will eventually underlie the
// 'frame variable' command. The language that this parser recognizes is
// described in lldb/docs/dil-expr-lang.ebnf
//
//===----------------------------------------------------------------------===//

#include "lldb/ValueObject/DILLexer.h"
#include "llvm/ADT/StringMap.h"

namespace lldb_private {

namespace dil {

// For fast keyword lookup. More keywords will be added later.
const llvm::StringMap<dil::TokenKind> Keywords = {
    {"namespace", dil::TokenKind::kw_namespace},
};

const std::string DILToken::getTokenName(dil::TokenKind kind) {
  switch (kind) {
  case dil::TokenKind::coloncolon:
    return "coloncolon";
  case dil::TokenKind::eof:
    return "eof";
  case dil::TokenKind::identifier:
    return "identifier";
  case dil::TokenKind::kw_namespace:
    return "namespace";
  case dil::TokenKind::l_paren:
    return "l_paren";
  case dil::TokenKind::r_paren:
    return "r_paren";
  case dil::TokenKind::unknown:
    return "unknown";
  default:
    return "token_name";
  }
}

static bool Is_Letter(char c) {
  if (('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z'))
    return true;
  return false;
}

static bool Is_Digit(char c) { return ('0' <= c && c <= '9'); }
```

Suggested change:

```diff
-static bool Is_Letter(char c) {
-  if (('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z'))
-    return true;
-  return false;
-}
-static bool Is_Digit(char c) { return ('0' <= c && c <= '9'); }
+static bool IsLetter(char c) {
+  return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
+}
+static bool IsDigit(char c) { return '0' <= c && c <= '9'; }
```
it doesn't look like this is used for anything
It's used on lines 74 & 77.
Yes, but you break out of the loop as soon as you set it to true. (I don't think the meaning of the code would change if you just deleted all references to the variable.)
labath marked this conversation as resolved.
AFAICT, this is the only use of the start argument, which makes for a very weird API. Perhaps the function could make a note of the starting position internally, and then return the range it found to the caller (I'd suggest a return type of `iterator_range<string::iterator>`, with the empty range meaning "no word found").
An iterator is sort of a natural representation of a position in a string, but it also makes it hard to use many of the nice (safe) string APIs. For example, if we had a `StringRef m_rest` to represent the unparsed portion of the string, then this block could be written as:

```cpp
constexpr std::pair<TokenKind, const char *> operators[] = {
    {TokenKind::l_paren, "("},
    {TokenKind::r_paren, ")"},
    {TokenKind::coloncolon, "::"},
    {TokenKind::colon, ":"},
};
size_t position = m_rest.data() - m_input.data();
for (auto [kind, str] : operators) {
  if (m_rest.consume_front(str)) {
    UpdateLexedTokens(result, kind, str, position);
    return true;
  }
}
```

(Notice how I do not have to worry about running beyond the end of the string, how the "current position" is updated automatically, and how the tokens which are prefixes of one another are handled by putting the longer string first.)

There are also other places where this could be useful. E.g. skipping over whitespace could be implemented as `m_rest = m_rest.ltrim()` and similar.
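To see the `consume_front` idiom in isolation, here is a self-contained sketch using `std::string_view` as a stand-in for `llvm::StringRef`. `ConsumeFront` and `MatchOperator` are hypothetical names for the example, not part of the patch:

```cpp
#include <cassert>
#include <string_view>
#include <utility>

enum class TokenKind { l_paren, r_paren, coloncolon, colon, none };

// Mimics llvm::StringRef::consume_front: on a prefix match, strip the
// prefix from `rest` and report success.
static bool ConsumeFront(std::string_view &rest, std::string_view prefix) {
  if (rest.size() < prefix.size() ||
      rest.compare(0, prefix.size(), prefix) != 0)
    return false;
  rest.remove_prefix(prefix.size());
  return true;
}

// Match the next operator token. "::" is listed before ":" so the longer
// spelling wins when one token is a prefix of another.
static TokenKind MatchOperator(std::string_view &rest) {
  static const std::pair<TokenKind, const char *> operators[] = {
      {TokenKind::l_paren, "("},
      {TokenKind::r_paren, ")"},
      {TokenKind::coloncolon, "::"},
      {TokenKind::colon, ":"},
  };
  for (auto [kind, str] : operators)
    if (ConsumeFront(rest, str))
      return kind;
  return TokenKind::none;
}
```

On a successful match `rest` has already advanced past the operator, so the caller never manages a position by hand.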
What is the meaning of an empty token?
Roughly the same as unknown. I will change the TokenKind above to unknown.
Should we check the return value here?