
Conversation

@Mr-Rm
Collaborator

@Mr-Rm Mr-Rm commented Dec 21, 2025

Summary by CodeRabbit

  • New Features

    • Added word classification API and a value-lookup method for identifiers.
    • Added a public method to advance/read next character.
    • Introduced unary-plus and unary-minus token kinds.
  • Bug Fixes

    • Improved exception message rendering to avoid out-of-bounds errors for short or missing code lines.
  • Refactor

    • Reworked identifier storage and lexical/token recognition internals for clearer traversal and classification.



Note

Refactors lexing internals and identifier trie, adds unary-plus/minus tokens and word classification via a unified trie, introduces a char-read API, and clamps exception message positions.

  • Lexer/Core APIs:
    • Add SourceCodeIterator.ReadNextChar() and cache _codeLength; adjust CurrentColumn calc.
    • Rework StringLexerState and WordLexerState to use ReadNextChar(), simplify delimiter handling, and streamline string literal continuation.
  • Token/Language definition:
    • Introduce Token.UnaryPlus/UnaryMinus and set priorities.
    • Consolidate special words into IdentifiersTrie<WordType> with LanguageDef.GetWordType(); update boolean/undefined/logical/null/preproc detection to use it.
    • Minor switches/simplifications in operator/literal checks.
  • Identifiers storage:
    • Refactor IdentifiersTrie<T> internals (node fields, add/find logic, value flags) and rewrite Add, TryGetValue, Get, ContainsKey for faster traversal.
  • Error handling:
    • In ScriptException.Message, safely handle empty/short code lines and clamp column index to avoid out-of-bounds.

Written by Cursor Bugbot for commit a7c5fad. This will update automatically on new commits.

@coderabbitai
Contributor

coderabbitai bot commented Dec 21, 2025

Walkthrough

Refactors the identifiers trie, centralizes special-word classification with a new WordType enum and trie, adds SourceCodeIterator.ReadNextChar, updates string/word lexer states to use the new traversal and word-type lookup, adds UnaryPlus/UnaryMinus tokens, and hardens ScriptException.Message bounds handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Trie Data Structure: src/OneScript.Language/IdentifiersTrie.cs | Restructured TrieNode to use internal fields (charL, charU, sibl, next), added hasValue/value, introduced constructors, simplified traversal, and added TryGetValue(string, out T). |
| Word Type Classification: src/OneScript.Language/LanguageDef.cs | Added public nested WordType enum and private _specwords trie; consolidated multiple boolean maps into _specwords; added GetWordType(string) and updated token/literal/operator checks to use WordType. |
| Lexer Core & State Machines: src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs, src/OneScript.Language/LexicalAnalysis/StringLexerState.cs, src/OneScript.Language/LexicalAnalysis/WordLexerState.cs | Introduced _codeLength and ReadNextChar() in SourceCodeIterator; replaced length checks to use _codeLength; StringLexerState and WordLexerState refactored to use ReadNextChar and centralized GetWordType-based token logic; WordLexerState now computes CodeRange for locations. |
| Token Definitions: src/OneScript.Language/LexicalAnalysis/Token.cs | Added UnaryPlus and UnaryMinus enum members and reorganized token grouping/comments. |
| Exception Safety: src/OneScript.Language/ScriptException.cs | Made Message rendering null-safe for Code and clamps ColumnNumber to code line length to avoid out-of-bounds indexing. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to IdentifiersTrie.TryGetValue and hasValue lifecycle.
  • Verify LanguageDef._specwords coverage and all call sites updated to use GetWordType.
  • Validate SourceCodeIterator.ReadNextChar behavior across lexer states and newline tracking.

Suggested reviewers

  • EvilBeaver

Poem

🐰 A trie now hums with nodes aligned,
WordTypes sort the words we find,
ReadNextChar hops, then bounds unfold,
Lexers dance as tokens told,
Small changes, big paths — a rabbit's cheerful mind. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective of the changeset: optimizing lexical parsing through refactoring of IdentifiersTrie, LanguageDef, SourceCodeIterator, and related lexer state classes. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ea8b74 and a7c5fad.

📒 Files selected for processing (1)
  • src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/OneScript.Language/ScriptException.cs (1)

95-120: Property getter mutates instance state.

The Message property getter modifies ColumnNumber (lines 102-105) when it exceeds codeLine.Length. Property getters should be side-effect free; mutating state during message formatting can lead to unexpected behavior, especially if Message is accessed multiple times or during debugging/logging.

Consider clamping the column value locally for display purposes without modifying the stored ColumnNumber:

🔎 Proposed fix to eliminate side effect
 public override string Message
 {
     get
     {
         var sb = new StringBuilder(MessageWithoutCodeFragment);
         sb.AppendLine();
         var codeLine = Code?.Replace('\t', ' ')?.TrimEnd() ?? String.Empty;
-        if (ColumnNumber > codeLine.Length)
-        {
-            ColumnNumber = codeLine.Length;
-        }
+        var displayColumn = ColumnNumber > codeLine.Length ? codeLine.Length : ColumnNumber;

-        if (ColumnNumber != ErrorPositionInfo.OUT_OF_TEXT)
+        if (displayColumn != ErrorPositionInfo.OUT_OF_TEXT)
         {
-            sb.Append(codeLine[..ColumnNumber]);
+            sb.Append(codeLine[..displayColumn]);
             sb.Append("<<?>>");
-            sb.AppendLine(codeLine[ColumnNumber..]);
+            sb.AppendLine(codeLine[displayColumn..]);
         }
         else
         {
             sb.AppendLine(codeLine);
         }

         return sb.ToString();
     }
 }
src/OneScript.Language/LanguageDef.cs (1)

465-469: IsNullString should use the _specwords trie.

At line 77, "NULL" is added to _specwords with WordType.Null, but IsNullString (lines 466-469) still uses string.Compare instead of leveraging the trie lookup. This is inconsistent with the other classification methods.

🔎 Proposed refactor to use _specwords
 [MethodImpl(MethodImplOptions.AggressiveInlining)]
 public static bool IsNullString(string value)
 {
-    return string.Compare(value, "Null", StringComparison.OrdinalIgnoreCase) == 0;
+    return _specwords.TryGetValue(value, out var wordType) && wordType == WordType.Null;
 }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f4b8b27 and 8ea8b74.

📒 Files selected for processing (7)
  • src/OneScript.Language/IdentifiersTrie.cs (4 hunks)
  • src/OneScript.Language/LanguageDef.cs (9 hunks)
  • src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (8 hunks)
  • src/OneScript.Language/LexicalAnalysis/StringLexerState.cs (1 hunks)
  • src/OneScript.Language/LexicalAnalysis/Token.cs (2 hunks)
  • src/OneScript.Language/LexicalAnalysis/WordLexerState.cs (1 hunks)
  • src/OneScript.Language/ScriptException.cs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.cs

📄 CodeRabbit inference engine (.cursor/rules/langversion.mdc)

Do not use C# language features that are not available in C# 8 when generating code for projects using .NET 8.0 with LangVersion 8

Files:

  • src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs
  • src/OneScript.Language/ScriptException.cs
  • src/OneScript.Language/LexicalAnalysis/StringLexerState.cs
  • src/OneScript.Language/LanguageDef.cs
  • src/OneScript.Language/IdentifiersTrie.cs
  • src/OneScript.Language/LexicalAnalysis/Token.cs
  • src/OneScript.Language/LexicalAnalysis/WordLexerState.cs
🧠 Learnings (2)
📚 Learning: 2025-06-19T08:42:20.073Z
Learnt from: EvilBeaver
Repo: EvilBeaver/OneScript PR: 1553
File: src/OneScript.Language/SyntaxAnalysis/ImportDirectivesHandler.cs:50-52
Timestamp: 2025-06-19T08:42:20.073Z
Learning: In OneScript lexers, whitespace is automatically handled/skipped and lexers never return lexems for whitespace characters. The NonWhitespaceLexerState is only invoked for actual non-whitespace characters, not for spaces or tabs.

Applied to files:

  • src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs
  • src/OneScript.Language/LexicalAnalysis/StringLexerState.cs
  • src/OneScript.Language/LanguageDef.cs
  • src/OneScript.Language/LexicalAnalysis/WordLexerState.cs
📚 Learning: 2025-12-18T16:13:05.448Z
Learnt from: Mr-Rm
Repo: EvilBeaver/OneScript PR: 1636
File: src/OneScript.StandardLibrary/StringOperations.cs:155-157
Timestamp: 2025-12-18T16:13:05.448Z
Learning: Guideline: In OneScript, when a method parameter has a concrete type (e.g., string) and is called from scripts with omitted arguments, the runtime passes an empty string "" rather than null. Direct (non-script) calls to the method may still pass null, so implement defensive null checks for these parameters in public methods that can be called from scripts. Treat empty string as a legitimate value for script calls, and explicitly handle null for direct calls (e.g., fail fast, throw ArgumentNullException, or normalize to "" where appropriate). This should apply to all C# methods that may be invoked from OneScript with optional string parameters across the codebase.

Applied to files:

  • src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs
  • src/OneScript.Language/ScriptException.cs
  • src/OneScript.Language/LexicalAnalysis/StringLexerState.cs
  • src/OneScript.Language/LanguageDef.cs
  • src/OneScript.Language/IdentifiersTrie.cs
  • src/OneScript.Language/LexicalAnalysis/Token.cs
  • src/OneScript.Language/LexicalAnalysis/WordLexerState.cs
🧬 Code graph analysis (4)
src/OneScript.Language/LexicalAnalysis/StringLexerState.cs (3)
src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (5)
  • SourceCodeIterator (14-252)
  • SourceCodeIterator (32-36)
  • SourceCodeIterator (38-41)
  • ReadNextChar (167-182)
  • MoveNext (89-108)
src/OneScript.Language/LexicalAnalysis/WordLexerState.cs (1)
  • Lexem (12-86)
src/OneScript.Language/SpecialChars.cs (1)
  • SpecialChars (12-54)
src/OneScript.Language/LanguageDef.cs (1)
src/OneScript.Language/IdentifiersTrie.cs (1)
  • IdentifiersTrie (14-175)
src/OneScript.Language/IdentifiersTrie.cs (1)
src/OneScript.Core/Commons/IndexedNameValueCollection.cs (3)
  • Add (24-33)
  • Add (35-44)
  • TryGetValue (81-94)
src/OneScript.Language/LexicalAnalysis/WordLexerState.cs (4)
src/OneScript.Language/LexicalAnalysis/CodeRange.cs (2)
  • CodeRange (12-16)
  • CodeRange (18-21)
src/OneScript.Language/SpecialChars.cs (2)
  • SpecialChars (12-54)
  • IsDelimiter (49-52)
src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (3)
  • MoveNext (89-108)
  • GetContents (248-251)
  • ReadNextChar (167-182)
src/OneScript.Language/LexicalAnalysis/OperatorLexerState.cs (2)
  • Lexem (12-45)
  • Lexem (47-61)
🔇 Additional comments (15)
src/OneScript.Language/LexicalAnalysis/Token.cs (1)

46-66: Improved token organization.

The reorganization of unary and binary operators with explicit grouping comments enhances code readability and makes the operator taxonomy clearer. The comment "recommend to be in continuous block" for binary operators is helpful for maintainability.

src/OneScript.Language/LexicalAnalysis/SourceCodeIterator.cs (3)

21-21: Good optimization with cached length.

Introducing _codeLength to cache the source code length is a sensible optimization that eliminates repeated Length property accesses throughout the iterator's lifetime. The field is correctly initialized in InitOnString and consistently used for all boundary checks.

Also applies to: 48-48


167-182: Verify state consistency when using ReadNextChar.

The new ReadNextChar method skips whitespace and returns the current character, but unlike MoveToContent, it doesn't update _startPosition. This could lead to issues if callers assume the content start position is properly set after calling this method.

In WordLexerState.cs line 69, ReadNextChar() is called to check for '(' after a built-in function identifier. This usage appears safe since it's only for lookahead. However, ensure that any future uses of ReadNextChar don't rely on _startPosition being updated.

Consider adding documentation to clarify the method's behavior:

🔎 Suggested documentation
+/// <summary>
+/// Skips whitespace and returns the next non-whitespace character.
+/// Note: Unlike MoveToContent, this does not update _startPosition.
+/// Use this only for lookahead scenarios where content extraction is not needed.
+/// </summary>
 public char ReadNextChar()
 {
     while (Char.IsWhiteSpace(_currentSymbol))
     {
         if (_currentSymbol == '\n')
         {
             _onNewLine = true;
         }
         if (!MoveNext())
         {
             break;
         }
     }

     return _currentSymbol;
 }

200-203: Good defensive coding.

The early return in GetCodeLine when start >= _codeLength prevents potential out-of-bounds access and handles edge cases gracefully by returning an empty string.

src/OneScript.Language/LexicalAnalysis/WordLexerState.cs (2)

16-23: Efficient token extraction.

The refactored approach advances the iterator to the next delimiter in a single loop, then extracts content once. This is more efficient than incrementally checking and extracting content character-by-character.


65-83: Verify built-in function validation doesn't consume characters.

At line 69, ReadNextChar() is used to peek ahead for '(' after a built-in function identifier. While this correctly advances past whitespace to check for the opening parenthesis, be aware that:

  1. ReadNextChar() advances the iterator's position but doesn't update _startPosition (see SourceCodeIterator.cs review).
  2. The character at the new position is returned but not "consumed" in the lexical sense.
  3. This appears correct for lookahead validation, but ensure the next lexer state properly handles the current position after this check.

The logic invalidates the token if '(' isn't found, treating the identifier as a regular user symbol instead of a built-in function call. This behavior seems intentional.

Verify that after ReadNextChar() returns a character that's not '(', the subsequent lexer iteration correctly processes that character (it should, since the iterator is positioned at that character).

src/OneScript.Language/LexicalAnalysis/StringLexerState.cs (2)

14-37: Comment handling in string context seems overly restrictive.

The SkipSpacesAndComments method throws "Некорректный символ" (Incorrect symbol) when it encounters a single / not followed by another / (lines 21, 24).

In the context of string literal parsing, this behavior may be correct if / is not allowed between string concatenations. However, consider whether this restriction is intentional or if a single / should be handled differently (e.g., as the start of an operator token that ends the string literal context).

Verify that this strict comment enforcement is the desired behavior for the string lexer state, particularly when a / appears in the position where string concatenation or line continuation is expected.


39-94: Clean refactor with ReadNextChar integration.

The refactored string parsing logic successfully integrates ReadNextChar() for whitespace handling while preserving the core string literal parsing behavior. The direct inline return of the Lexem (lines 71-75) is cleaner than using an intermediate variable.

Minor improvement: The error message changes (removing exclamation marks) provide consistency across error reporting.

src/OneScript.Language/LanguageDef.cs (4)

26-80: Excellent refactoring with centralized classification.

The introduction of the WordType enum and the _specwords trie centralizes special word classification, replacing scattered boolean checks with a unified, efficient lookup mechanism. This improves maintainability and performance by:

  1. Providing a clear taxonomy of word types
  2. Enabling single-lookup classification via TryGetValue
  3. Reducing code duplication across classification methods

The initialization is comprehensive and covers all necessary mappings for Russian and English keywords.
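
As a rough illustration of the single-lookup classification described above, here is a minimal sketch. It substitutes a case-insensitive Dictionary for the PR's IdentifiersTrie<WordType>; the WordTypeSketch enum, the word list, and the Classify helper are assumptions made for this example and do not reproduce the actual LanguageDef contents.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for LanguageDef.WordType; the real enum members may differ.
public enum WordTypeSketch { None, Boolean, Undefined, Null, Logical }

public static class WordClassifierSketch
{
    // A case-insensitive dictionary stands in for the PR's IdentifiersTrie<WordType>.
    // The word list is illustrative only.
    private static readonly Dictionary<string, WordTypeSketch> SpecWords =
        new Dictionary<string, WordTypeSketch>(StringComparer.OrdinalIgnoreCase)
        {
            { "True", WordTypeSketch.Boolean },
            { "Истина", WordTypeSketch.Boolean },
            { "False", WordTypeSketch.Boolean },
            { "Ложь", WordTypeSketch.Boolean },
            { "Undefined", WordTypeSketch.Undefined },
            { "Неопределено", WordTypeSketch.Undefined },
            { "NULL", WordTypeSketch.Null },
            { "And", WordTypeSketch.Logical },
            { "И", WordTypeSketch.Logical },
        };

    // One lookup classifies the word; anything not registered maps to None.
    public static WordTypeSketch Classify(string word)
    {
        return SpecWords.TryGetValue(word, out var type) ? type : WordTypeSketch.None;
    }

    public static bool IsBoolean(string word)
    {
        return Classify(word) == WordTypeSketch.Boolean;
    }

    public static void Main()
    {
        Console.WriteLine(Classify("true"));
        Console.WriteLine(Classify("null"));
        Console.WriteLine(Classify("myVariable"));
    }
}
```

With this shape, each predicate such as IsBoolean becomes a comparison against the result of one shared lookup instead of maintaining its own word map.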


270-280: Efficient token lookup.

Using TryGetValue with an inline out parameter is more efficient than a two-step lookup pattern. The code is compatible with C# 8 requirements.


294-315: Switch statement improves clarity and performance.

Converting IsBinaryOperator to a switch statement improves both readability and performance. The explicit listing of all binary operators makes the logic clear and aligns with the binary operator grouping in the Token enum.


428-445: More complete block-ending token detection.

The expanded IsEndOfBlockToken now includes additional tokens (EndIf, EndProcedure, EndFunction, Else, ElseIf, Exception) that logically terminate code blocks. This makes the method more comprehensive and correct for parsing block structures.

src/OneScript.Language/IdentifiersTrie.cs (3)

18-44: Improved TrieNode design.

The refactored TrieNode with internal fields and explicit constructors is cleaner and more efficient. The hasValue flag properly distinguishes between nodes that are part of a path versus terminal nodes with stored values. The case-insensitive character matching via charL and charU is well-designed.


81-136: Good delegation pattern with TryGetValue.

Making TryGetValue the single authoritative lookup implementation and having both ContainsKey and Get delegate to it eliminates code duplication and ensures consistency. The traversal logic correctly:

  1. Navigates sibling chains for each character
  2. Returns early with default if the path doesn't exist
  3. Checks the hasValue flag to distinguish between intermediate and terminal nodes

46-79: The Add method implementation is correct and well-tested.

The traversal logic, while intricate, correctly implements a case-insensitive trie structure. Comprehensive unit tests in TrieTests.cs validate all scenarios including single/multi-character keys, shared prefixes, case-insensitive retrieval, and partial-key exclusion. The code uses no C# features beyond 3.0 and is fully compatible with LangVersion 8. No changes needed.
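
For readers unfamiliar with the structure under review, the following is a minimal sketch of a case-insensitive trie built from sibling chains. The field names (charL, charU, sibl, next, hasValue, value) follow the review summary, but the class name TrieSketch, the node layout, and the Add/TryGetValue algorithms are assumptions for illustration and are not the PR's implementation.

```csharp
using System;

// Illustrative only: not the code from IdentifiersTrie.cs.
public class TrieSketch<T>
{
    private class Node
    {
        internal char charL;    // lower-case form of the key character
        internal char charU;    // upper-case form of the key character
        internal Node sibl;     // next alternative at the same depth
        internal Node next;     // head of the chain for the following character
        internal bool hasValue; // true only for nodes that end a stored key
        internal T value;

        internal Node(char ch)
        {
            charL = char.ToLowerInvariant(ch);
            charU = char.ToUpperInvariant(ch);
        }
    }

    private Node _root; // head of the top-level sibling chain

    public void Add(string key, T value)
    {
        if (string.IsNullOrEmpty(key))
            throw new ArgumentException("Key must be non-empty", nameof(key));

        Node node = null;
        for (int i = 0; i < key.Length; i++)
        {
            Node head = i == 0 ? _root : node.next;
            Node found = FindInChain(head, key[i]);
            if (found == null)
            {
                // Prepend a new node to the sibling chain at this depth.
                found = new Node(key[i]);
                found.sibl = head;
                if (i == 0) _root = found; else node.next = found;
            }
            node = found;
        }
        node.hasValue = true;
        node.value = value;
    }

    public bool TryGetValue(string key, out T value)
    {
        value = default(T);
        if (string.IsNullOrEmpty(key)) return false;

        Node node = null;
        Node head = _root;
        foreach (char ch in key)
        {
            node = FindInChain(head, ch);
            if (node == null) return false; // path does not exist
            head = node.next;
        }
        if (!node.hasValue) return false;   // key is only a prefix of stored keys
        value = node.value;
        return true;
    }

    // ContainsKey delegates to TryGetValue, mirroring the delegation noted in the review.
    public bool ContainsKey(string key) { return TryGetValue(key, out _); }

    private static Node FindInChain(Node head, char ch)
    {
        for (Node n = head; n != null; n = n.sibl)
            if (n.charL == ch || n.charU == ch) return n;
        return null;
    }
}

public static class TrieSketchDemo
{
    public static void Main()
    {
        var trie = new TrieSketch<int>();
        trie.Add("If", 1);
        trie.Add("Если", 2);
        Console.WriteLine(trie.TryGetValue("IF", out var v) && v == 1);
        Console.WriteLine(trie.ContainsKey("I")); // a prefix alone is not a key
    }
}
```

Storing both case forms per node lets a lookup compare characters directly, without normalizing the input string first.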

@EvilBeaver
Owner

@cursor review


@cursor cursor bot left a comment


if (ColumnNumber > codeLine.Length)
{
    ColumnNumber = codeLine.Length;
}

Message getter mutates ColumnNumber causing state corruption

The Message property getter modifies the ColumnNumber property when it exceeds codeLine.Length. Because the ColumnNumber setter persists the value to _codePosition.ColumnNumber, the original column value is permanently lost after the first access to Message. This affects subsequent accesses to Message, MessageWithoutCodeFragment, and any code that reads ColumnNumber or calls GetPosition(). A local variable should be used for the adjusted column value instead of modifying the property.

Collaborator Author


This corrects an invalid ColumnNumber value in the extremely rare case of a syntax error on the last line of code, when that line is not terminated by a newline character. After that, the error is only displayed and compilation is aborted; there are no "subsequent accesses to Message" and the like.
More broadly: ColumnNumber is needed only to point at the error position.

if (iterator.ReadNextChar() != '(')
{
    tok = Token.NotAToken;
}

ReadNextChar ignores StayOnSameLine flag unlike SkipSpaces

The new ReadNextChar method does not respect the StayOnSameLine flag, unlike the original SkipSpaces method it replaces in WordLexerState. When StayOnSameLine is true (e.g., during NextLexemOnSameLine for preprocessor directives), the old code would stop at newlines when checking if a built-in function is followed by (. The new code skips past newlines regardless, potentially recognizing function calls that span multiple lines when they shouldn't be recognized in same-line contexts.

Collaborator Author


ReadNextChar is called only in cases where the StayOnSameLine flag does not matter. In addition, unlike SkipSpaces, it has no check for the start of the code, since the function is known to be called after at least one character has been read, and no check for the end of the code: in that case the deliberately invalid character '\0' is returned.
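
To make the end-of-code behavior concrete, here is a minimal sketch of a whitespace-skipping lookahead that yields '\0' once the input is exhausted. MiniIterator and its members are a hypothetical simplification written for this example; it is not the real SourceCodeIterator and omits _startPosition, newline flags, and StayOnSameLine entirely.

```csharp
using System;

// Hypothetical, stripped-down iterator; not the real SourceCodeIterator.
public class MiniIterator
{
    private readonly string _code;
    private readonly int _codeLength; // cached once, as the PR does with _codeLength
    private int _index;
    private char _currentSymbol;

    public MiniIterator(string code)
    {
        _code = code ?? string.Empty;
        _codeLength = _code.Length;
        _index = 0;
        _currentSymbol = _codeLength > 0 ? _code[0] : '\0';
    }

    public char CurrentSymbol { get { return _currentSymbol; } }

    public bool MoveNext()
    {
        if (_index + 1 < _codeLength)
        {
            _index++;
            _currentSymbol = _code[_index];
            return true;
        }

        // Past the end: expose a sentinel that no caller treats as a valid character.
        _index = _codeLength;
        _currentSymbol = '\0';
        return false;
    }

    // Skips whitespace (including newlines) and returns the character it stops on.
    public char ReadNextChar()
    {
        while (char.IsWhiteSpace(_currentSymbol))
        {
            if (!MoveNext())
            {
                break;
            }
        }
        return _currentSymbol;
    }
}

public static class MiniIteratorDemo
{
    public static void Main()
    {
        var call = new MiniIterator("Len  (x)");
        call.MoveNext(); call.MoveNext(); call.MoveNext(); // step past "Len"
        Console.WriteLine(call.ReadNextChar() == '(');

        var bare = new MiniIterator("Len  \n");
        bare.MoveNext(); bare.MoveNext(); bare.MoveNext();
        Console.WriteLine(bare.ReadNextChar() == '\0');
    }
}
```

In this sketch a caller that compares the result with '(' needs no separate end-of-input check, because the sentinel can never equal '('.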

@EvilBeaver EvilBeaver merged commit 7e7de3a into EvilBeaver:develop Dec 23, 2025
3 checks passed