-
Notifications
You must be signed in to change notification settings - Fork 114
V2: оптимизация лескического разбора #1640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughRefactors the identifiers trie, centralizes special-word classification with a new WordType enum and trie, adds SourceCodeIterator.ReadNextChar, updates string/word lexer states to use the new traversal and word-type lookup, adds UnaryPlus/UnaryMinus tokens, and hardens ScriptException.Message bounds handling. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/OneScript.Language/ScriptException.cs (1)
95-120: Property getter mutates instance state.The
Messageproperty getter modifiesColumnNumber(lines 102-105) when it exceedscodeLine.Length. Property getters should be side-effect free; mutating state during message formatting can lead to unexpected behavior, especially ifMessageis accessed multiple times or during debugging/logging.Consider clamping the column value locally for display purposes without modifying the stored
ColumnNumber:🔎 Proposed fix to eliminate side effect
public override string Message { get { var sb = new StringBuilder(MessageWithoutCodeFragment); sb.AppendLine(); var codeLine = Code?.Replace('\t', ' ')?.TrimEnd() ?? String.Empty; - if (ColumnNumber > codeLine.Length) - { - ColumnNumber = codeLine.Length; - } + var displayColumn = ColumnNumber > codeLine.Length ? codeLine.Length : ColumnNumber; - if (ColumnNumber != ErrorPositionInfo.OUT_OF_TEXT) + if (displayColumn != ErrorPositionInfo.OUT_OF_TEXT) { - sb.Append(codeLine[..ColumnNumber]); + sb.Append(codeLine[..displayColumn]); sb.Append("<<?>>"); - sb.AppendLine(codeLine[ColumnNumber..]); + sb.AppendLine(codeLine[displayColumn..]); } else { sb.AppendLine(codeLine); } return sb.ToString(); } }src/OneScript.Language/LanguageDef.cs (1)
465-469: IsNullString should use the _specwords trie.At line 77,
"NULL"is added to_specwordswithWordType.Null, butIsNullString(lines 466-469) still usesstring.Compareinstead of leveraging the trie lookup. This is inconsistent with the other classification methods.🔎 Proposed refactor to use _specwords
[MethodImpl(MethodImplOptions.AggressiveInlining)] public static bool IsNullString(string value) { - return string.Compare(value, "Null", StringComparison.OrdinalIgnoreCase) == 0; + return _specwords.TryGetValue(value, out var wordType) && wordType == WordType.Null; }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
src/OneScript.Language/IdentifiersTrie.cs(4 hunks)src/OneScript.Language/LanguageDef.cs(9 hunks)src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs(8 hunks)src/OneScript.Language/LexicalAnalysis/StringLexerState.cs(1 hunks)src/OneScript.Language/LexicalAnalysis/Token.cs(2 hunks)src/OneScript.Language/LexicalAnalysis/WordLexerState.cs(1 hunks)src/OneScript.Language/ScriptException.cs(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.cs
📄 CodeRabbit inference engine (.cursor/rules/langversion.mdc)
Do not use C# language features that are not available in C# 8 when generating code for projects using .NET 8.0 with LangVersion 8
Files:
src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cssrc/OneScript.Language/ScriptException.cssrc/OneScript.Language/LexicalAnalysis/StringLexerState.cssrc/OneScript.Language/LanguageDef.cssrc/OneScript.Language/IdentifiersTrie.cssrc/OneScript.Language/LexicalAnalysis/Token.cssrc/OneScript.Language/LexicalAnalysis/WordLexerState.cs
🧠 Learnings (2)
📚 Learning: 2025-06-19T08:42:20.073Z
Learnt from: EvilBeaver
Repo: EvilBeaver/OneScript PR: 1553
File: src/OneScript.Language/SyntaxAnalysis/ImportDirectivesHandler.cs:50-52
Timestamp: 2025-06-19T08:42:20.073Z
Learning: In OneScript lexers, whitespace is automatically handled/skipped and lexers never return lexems for whitespace characters. The NonWhitespaceLexerState is only invoked for actual non-whitespace characters, not for spaces or tabs.
Applied to files:
src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cssrc/OneScript.Language/LexicalAnalysis/StringLexerState.cssrc/OneScript.Language/LanguageDef.cssrc/OneScript.Language/LexicalAnalysis/WordLexerState.cs
📚 Learning: 2025-12-18T16:13:05.448Z
Learnt from: Mr-Rm
Repo: EvilBeaver/OneScript PR: 1636
File: src/OneScript.StandardLibrary/StringOperations.cs:155-157
Timestamp: 2025-12-18T16:13:05.448Z
Learning: Guideline: In OneScript, when a method parameter has a concrete type (e.g., string) and is called from scripts with omitted arguments, the runtime passes an empty string "" rather than null. Direct (non-script) calls to the method may still pass null, so implement defensive null checks for these parameters in public methods that can be called from scripts. Treat empty string as a legitimate value for script calls, and explicitly handle null for direct calls (e.g., fail fast, throw ArgumentNullException, or normalize to "" where appropriate). This should apply to all C# methods that may be invoked from OneScript with optional string parameters across the codebase.
Applied to files:
src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cssrc/OneScript.Language/ScriptException.cssrc/OneScript.Language/LexicalAnalysis/StringLexerState.cssrc/OneScript.Language/LanguageDef.cssrc/OneScript.Language/IdentifiersTrie.cssrc/OneScript.Language/LexicalAnalysis/Token.cssrc/OneScript.Language/LexicalAnalysis/WordLexerState.cs
🧬 Code graph analysis (4)
src/OneScript.Language/LexicalAnalysis/StringLexerState.cs (3)
src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (5)
SourceCodeIterator(14-252)SourceCodeIterator(32-36)SourceCodeIterator(38-41)ReadNextChar(167-182)MoveNext(89-108)src/OneScript.Language/LexicalAnalysis/WordLexerState.cs (1)
Lexem(12-86)src/OneScript.Language/SpecialChars.cs (1)
SpecialChars(12-54)
src/OneScript.Language/LanguageDef.cs (1)
src/OneScript.Language/IdentifiersTrie.cs (1)
IdentifiersTrie(14-175)
src/OneScript.Language/IdentifiersTrie.cs (1)
src/OneScript.Core/Commons/IndexedNameValueCollection.cs (3)
Add(24-33)Add(35-44)TryGetValue(81-94)
src/OneScript.Language/LexicalAnalysis/WordLexerState.cs (4)
src/OneScript.Language/LexicalAnalysis/CodeRange.cs (2)
CodeRange(12-16)CodeRange(18-21)src/OneScript.Language/SpecialChars.cs (2)
SpecialChars(12-54)IsDelimiter(49-52)src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (3)
MoveNext(89-108)GetContents(248-251)ReadNextChar(167-182)src/OneScript.Language/LexicalAnalysis/OperatorLexerState.cs (2)
Lexem(12-45)Lexem(47-61)
🔇 Additional comments (15)
src/OneScript.Language/LexicalAnalysis/Token.cs (1)
46-66: Improved token organization.The reorganization of unary and binary operators with explicit grouping comments enhances code readability and makes the operator taxonomy clearer. The comment "recommend to be in continuous block" for binary operators is helpful for maintainability.
src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (3)
21-21: Good optimization with cached length.Introducing
_codeLengthto cache the source code length is a sensible optimization that eliminates repeatedLengthproperty accesses throughout the iterator's lifetime. The field is correctly initialized inInitOnStringand consistently used for all boundary checks.Also applies to: 48-48
167-182: Verify state consistency when using ReadNextChar.The new
ReadNextCharmethod skips whitespace and returns the current character, but unlikeMoveToContent, it doesn't update_startPosition. This could lead to issues if callers assume the content start position is properly set after calling this method.In
WordLexerState.csline 69,ReadNextChar()is called to check for'('after a built-in function identifier. This usage appears safe since it's only for lookahead. However, ensure that any future uses ofReadNextChardon't rely on_startPositionbeing updated.Consider adding documentation to clarify the method's behavior:
🔎 Suggested documentation
+/// <summary> +/// Skips whitespace and returns the next non-whitespace character. +/// Note: Unlike MoveToContent, this does not update _startPosition. +/// Use this only for lookahead scenarios where content extraction is not needed. +/// </summary> public char ReadNextChar() { while (Char.IsWhiteSpace(_currentSymbol)) { if (_currentSymbol == '\n') { _onNewLine = true; } if (!MoveNext()) { break; } } return _currentSymbol; }
200-203: Good defensive coding.The early return in
GetCodeLinewhenstart >= _codeLengthprevents potential out-of-bounds access and handles edge cases gracefully by returning an empty string.src/OneScript.Language/LexicalAnalysis/WordLexerState.cs (2)
16-23: Efficient token extraction.The refactored approach advances the iterator to the next delimiter in a single loop, then extracts content once. This is more efficient than incrementally checking and extracting content character-by-character.
65-83: Verify built-in function validation doesn't consume characters.At line 69,
ReadNextChar()is used to peek ahead for'('after a built-in function identifier. While this correctly advances past whitespace to check for the opening parenthesis, be aware that:
ReadNextChar()advances the iterator's position but doesn't update_startPosition(see SourceCodeIterator.cs review).- The character at the new position is returned but not "consumed" in the lexical sense.
- This appears correct for lookahead validation, but ensure the next lexer state properly handles the current position after this check.
The logic invalidates the token if
'('isn't found, treating the identifier as a regular user symbol instead of a built-in function call. This behavior seems intentional.Verify that after
ReadNextChar()returns a character that's not'(', the subsequent lexer iteration correctly processes that character (it should, since the iterator is positioned at that character).src/OneScript.Language/LexicalAnalysis/StringLexerState.cs (2)
14-37: Comment handling in string context seems overly restrictive.The
SkipSpacesAndCommentsmethod throws "Некорректный символ" (Incorrect symbol) when it encounters a single/not followed by another/(lines 21, 24).In the context of string literal parsing, this behavior may be correct if
/is not allowed between string concatenations. However, consider whether this restriction is intentional or if a single/should be handled differently (e.g., as the start of an operator token that ends the string literal context).Verify that this strict comment enforcement is the desired behavior for the string lexer state, particularly when a
/appears in the position where string concatenation or line continuation is expected.
39-94: Clean refactor with ReadNextChar integration.The refactored string parsing logic successfully integrates
ReadNextChar()for whitespace handling while preserving the core string literal parsing behavior. The direct inline return of theLexem(lines 71-75) is cleaner than using an intermediate variable.Minor improvement: The error message changes (removing exclamation marks) provide consistency across error reporting.
src/OneScript.Language/LanguageDef.cs (4)
26-80: Excellent refactoring with centralized classification.The introduction of the
WordTypeenum and the_specwordstrie centralizes special word classification, replacing scattered boolean checks with a unified, efficient lookup mechanism. This improves maintainability and performance by:
- Providing a clear taxonomy of word types
- Enabling single-lookup classification via
TryGetValue- Reducing code duplication across classification methods
The initialization is comprehensive and covers all necessary mappings for Russian and English keywords.
270-280: Efficient token lookup.Using
TryGetValuewith an inlineoutparameter is more efficient than a two-step lookup pattern. The code is compatible with C# 8 requirements.
294-315: Switch statement improves clarity and performance.Converting
IsBinaryOperatorto a switch statement improves both readability and performance. The explicit listing of all binary operators makes the logic clear and aligns with the binary operator grouping in theTokenenum.
428-445: More complete block-ending token detection.The expanded
IsEndOfBlockTokennow includes additional tokens (EndIf,EndProcedure,EndFunction,Else,ElseIf,Exception) that logically terminate code blocks. This makes the method more comprehensive and correct for parsing block structures.src/OneScript.Language/IdentifiersTrie.cs (3)
18-44: Improved TrieNode design.The refactored
TrieNodewith internal fields and explicit constructors is cleaner and more efficient. ThehasValueflag properly distinguishes between nodes that are part of a path versus terminal nodes with stored values. The case-insensitive character matching viacharLandcharUis well-designed.
81-136: Good delegation pattern with TryGetValue.Making
TryGetValuethe single authoritative lookup implementation and having bothContainsKeyandGetdelegate to it eliminates code duplication and ensures consistency. The traversal logic correctly:
- Navigates sibling chains for each character
- Returns early with
defaultif the path doesn't exist- Checks the
hasValueflag to distinguish between intermediate and terminal nodes
46-79: The Add method implementation is correct and well-tested.The traversal logic, while intricate, correctly implements a case-insensitive trie structure. Comprehensive unit tests in
TrieTests.csvalidate all scenarios including single/multi-character keys, shared prefixes, case-insensitive retrieval, and partial-key exclusion. The code uses no C# features beyond 3.0 and is fully compatible with LangVersion 8. No changes needed.
|
@cursor review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the final PR Bugbot will review for you during this billing cycle
Your free Bugbot reviews will reset on January 14
Details
You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.
To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.
| if (ColumnNumber > codeLine.Length) | ||
| { | ||
| ColumnNumber = codeLine.Length; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Message getter mutates ColumnNumber causing state corruption
The Message property getter modifies the ColumnNumber property when it exceeds codeLine.Length. Since ColumnNumber is a setter that persists to _codePosition.ColumnNumber, this causes permanent state corruption. The original column value is lost after the first access to Message. This affects subsequent accesses to Message, MessageWithoutCodeFragment, and any code that reads ColumnNumber or calls GetPosition(). A local variable should be used for the adjusted column value instead of modifying the property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Здесь исправляется некорректное значение ColumnNumber для крайне редких случаев синтаксической ошибки в последней строке кода, причем не завершенной символом перевода строки. Далее только демонстрация ошибки и прерывание компиляции, никаких "subsequent accesses to Message" и т. п.
Более глобально: ColumnNumber нужен только для указания на позицию ошибки
| if (iterator.ReadNextChar() != '(') | ||
| { | ||
| tok = Token.NotAToken; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadNextChar ignores StayOnSameLine flag unlike SkipSpaces
The new ReadNextChar method does not respect the StayOnSameLine flag, unlike the original SkipSpaces method it replaces in WordLexerState. When StayOnSameLine is true (e.g., during NextLexemOnSameLine for preprocessor directives), the old code would stop at newlines when checking if a built-in function is followed by (. The new code skips past newlines regardless, potentially recognizing function calls that span multiple lines when they shouldn't be recognized in same-line contexts.
Additional Locations (1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadNextChar вызывается только для случаев, когда признак StayOnSameLine не имеет значения. Кроме того, d отличие от SkipSpaces, отсутствует проверка на начало кода, так как функция заведомо вызывается после хотя бы одного прочитанного символа, а так же проверка на конец кода - в этом случае будет возвращен заведомо ошибочный символ '\0'.
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.
Note
Refactors lexing internals and identifier trie, adds unary-plus/minus tokens and word classification via a unified trie, introduces a char-read API, and clamps exception message positions.
SourceCodeIterator.ReadNextChar()and cache_codeLength; adjustCurrentColumncalc.StringLexerStateandWordLexerStateto useReadNextChar(), simplify delimiter handling, and streamline string literal continuation.Token.UnaryPlus/UnaryMinusand set priorities.IdentifiersTrie<WordType>withLanguageDef.GetWordType(); update boolean/undefined/logical/null/preproc detection to use it.IdentifiersTrie<T>internals (node fields, add/find logic, value flags) and rewriteAdd,TryGetValue,Get,ContainsKeyfor faster traversal.ScriptException.Message, safely handle empty/short code lines and clamp column index to avoid out-of-bounds.Written by Cursor Bugbot for commit a7c5fad. This will update automatically on new commits. Configure here.