
Optimised llamacpp_backend.cpp #330

Merged
shubhammalhotra28 merged 1 commit into RunanywhereAI:main from VyasGuru:llamabackend
Feb 5, 2026

Conversation

@VyasGuru
Collaborator

@VyasGuru VyasGuru commented Feb 4, 2026

Changes

  • Switched to a LUT- and DFA-based UTF-8 scanning approach.
  • Converted the O(n^2) stop-sequence scan to constant time using a rolling tail buffer.
  • Reserved buffer capacity up front so that += concatenation doesn't trigger repeated allocations and memcpys.
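The effect of the third point can be illustrated with a small, self-contained sketch (not taken from llamacpp_backend.cpp; `count_reallocs` is a hypothetical helper): reserving capacity up front keeps repeated `+=` appends inside one allocation.

```cpp
#include <cstddef>
#include <string>

// Count how many reallocations a string suffers while appending
// `appends` copies of a 3-byte token, optionally reserving up front.
// A capacity change after += implies a reallocation plus a memcpy.
size_t count_reallocs(size_t reserve_bytes, size_t appends) {
    std::string s;
    if (reserve_bytes > 0) s.reserve(reserve_bytes);
    size_t reallocs = 0;
    size_t cap = s.capacity();
    for (size_t i = 0; i < appends; ++i) {
        s += "tok";                    // the += concatenation pattern
        if (s.capacity() != cap) {     // capacity grew => buffer was moved
            ++reallocs;
            cap = s.capacity();
        }
    }
    return reallocs;
}
```

With `count_reallocs(3000, 1000)` the 1000 three-byte appends fit in the reserved block and no reallocation occurs, while `count_reallocs(0, 1000)` reallocates several times as capacity grows geometrically.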

Results

  • Tested against the old code.
  • 2.8x ~ 3.3x improvement

Important

Optimized llamacpp_backend.cpp with DFA-based UTF-8 validation and efficient stop sequence detection, improving performance by 2.8x ~ 3.3x.

  • UTF-8 Scanning:
    • Replaced is_valid_utf8() with Utf8State struct using DFA for UTF-8 validation.
    • Utilizes a LUT for efficient byte processing.
  • Stop Sequence Detection:
    • Replaced O(n^2) stop scanning with a rolling buffer in generate_stream().
    • Uses stop_window to efficiently detect stop sequences.
  • Memory Optimization:
    • Reserves memory for stop_window and partial_utf8_buffer to reduce allocations.
  • Performance:
    • Achieves 2.8x ~ 3.3x performance improvement in text generation.
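The incremental-validation idea behind the UTF-8 change can be sketched with a hand-rolled state machine. The PR itself uses Bjoern Hoehrmann's table-driven DFA; this simplified stand-in (with illustrative names `Utf8Scanner` and `valid_prefix`) only checks lead/continuation byte structure and does not reject overlong encodings or surrogates, but it shows the same "state 0 means character boundary" contract.

```cpp
#include <cstdint>
#include <string>

// Simplified incremental UTF-8 scanner: feed bytes one at a time;
// process() returns true when the byte completes a character.
struct Utf8Scanner {
    int remaining = 0;  // continuation bytes still expected (-1 = invalid)

    bool process(uint8_t b) {
        if (remaining == 0) {
            if (b < 0x80) return true;                   // ASCII
            else if ((b & 0xE0) == 0xC0) remaining = 1;  // 2-byte lead
            else if ((b & 0xF0) == 0xE0) remaining = 2;  // 3-byte lead
            else if ((b & 0xF8) == 0xF0) remaining = 3;  // 4-byte lead
            else remaining = -1;                         // invalid lead byte
            return false;
        }
        if (remaining > 0 && (b & 0xC0) == 0x80) {
            return --remaining == 0;                     // valid continuation
        }
        remaining = -1;                                  // broken sequence
        return false;
    }
};

// Longest prefix of buf consisting only of complete characters --
// the `valid_upto` idea from the PR: bytes past this point are kept
// in a partial buffer until the next token completes them.
size_t valid_prefix(const std::string& buf) {
    Utf8Scanner s;
    size_t valid_upto = 0;
    for (size_t i = 0; i < buf.size(); ++i) {
        if (s.process(static_cast<uint8_t>(buf[i]))) valid_upto = i + 1;
    }
    return valid_upto;
}
```

For example, `"a"` followed by only the first byte of a two-byte character yields a valid prefix of 1, so the dangling lead byte stays buffered until its continuation byte arrives.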

This description was created by Ellipsis for a37fb27.

Summary by CodeRabbit

  • Bug Fixes

    • Prevents invalid UTF-8 characters in streamed output by buffering and validating partial multibyte sequences.
    • Reliably detects stop sequences across token boundaries, avoiding accidental output beyond stop points.
    • Ensures remaining buffered text is emitted correctly at the end when no stop or cancellation occurs.
  • Refactor

    • Streamlined streaming generation to accumulate and emit text via sliding windows and UTF-8 state validation.
    • Consolidated stop-sequence logic for more predictable behavior.
    • Replaced scattered logs with a unified generation summary log.

Greptile Overview

Greptile Summary

This PR implements significant performance optimizations to the LlamaCpp text generation backend, achieving 2.8x-3.3x speedup through three key improvements:

Key Changes:

  • Replaced simple loop-based UTF-8 validation with Bjoern Hoehrmann's DFA-based state machine for faster, more efficient validation
  • Converted O(n²) stop sequence scanning (checking entire accumulated text) to O(1) using a rolling tail buffer of fixed size
  • Added memory reservations (reserve()) to prevent repeated allocations during string concatenation

Implementation Details:

  • UTF-8 validation now uses a lookup table (LUT) with 364 bytes to drive a deterministic finite automaton
  • Stop sequence detection maintains a rolling window of max 2×MAX_STOP_LEN characters instead of accumulating all generated text
  • Partial UTF-8 sequences are buffered separately until complete characters are validated
  • Only validated UTF-8 text beyond the stop detection window is streamed to callbacks

The optimizations maintain correct behavior for stop sequence detection and UTF-8 handling while dramatically reducing memory allocations and string scanning overhead.
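The rolling-window mechanism described above can be modeled with a short sketch. Names (`StopScanner`, `feed`, `finish`, `safe_output`) are illustrative, not the actual generate_stream() code: the key invariant is that only the last `max_stop_len - 1` bytes can still be the prefix of a stop sequence, so everything before them is safe to emit.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Rolling tail buffer for stop-sequence detection. Assumes at least
// one non-empty stop sequence.
struct StopScanner {
    std::vector<std::string> stops;
    size_t max_stop_len = 0;
    std::string window;       // rolling tail buffer (bounded size)
    std::string safe_output;  // text known to precede any stop

    explicit StopScanner(std::vector<std::string> s) : stops(std::move(s)) {
        for (const auto& st : stops)
            max_stop_len = std::max(max_stop_len, st.size());
        window.reserve(2 * max_stop_len);
    }

    // Feed a validated chunk; returns true if a stop sequence was hit.
    bool feed(const std::string& chunk) {
        window += chunk;
        for (const auto& st : stops) {
            size_t pos = window.find(st);  // search bounded window, not all text
            if (pos != std::string::npos) {
                safe_output += window.substr(0, pos);  // emit text before stop
                window.clear();
                return true;
            }
        }
        // Emit all but the last max_stop_len-1 bytes, which could still
        // start a stop sequence that spans into the next chunk.
        if (max_stop_len > 0 && window.size() > max_stop_len) {
            size_t safe_len = window.size() - (max_stop_len - 1);
            safe_output += window.substr(0, safe_len);
            window.erase(0, safe_len);
        }
        return false;
    }

    // No stop hit: flush whatever remains in the window.
    void finish() { safe_output += window; window.clear(); }
};
```

Because the window never grows past roughly twice the longest stop sequence, each `find()` is O(1) with respect to the total generated length, which is the O(n²) → O(1) conversion claimed above.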

Confidence Score: 4/5

  • This PR is safe to merge with one minor issue to verify
  • The optimizations are well-designed and use established algorithms (Bjoern Hoehrmann's UTF-8 DFA). The rolling window approach for stop sequence detection is sound. However, there's one potential issue with how the UTF-8 scanner state is reinitialized in each iteration that should be verified through testing.
  • The main implementation file should be tested thoroughly to ensure the UTF-8 state machine correctly handles multi-byte sequences that span token boundaries

Important Files Changed

Filename: sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Overview: Replaced UTF-8 validation with a DFA-based approach and optimized stop sequence detection from O(n²) to O(1) using a rolling window buffer

Sequence Diagram

sequenceDiagram
    participant Caller
    participant generate_stream
    participant Sampler as llama_sampler
    participant Utf8State
    participant StopWindow as Rolling Window
    participant Callback

    Caller->>generate_stream: TextGenerationRequest
    activate generate_stream
    
    generate_stream->>generate_stream: Initialize stop_window (reserve MAX_STOP_LEN*2)
    generate_stream->>generate_stream: Initialize partial_utf8_buffer (reserve 8)
    
    loop For each token (up to max_tokens)
        generate_stream->>Sampler: sample next token
        Sampler-->>generate_stream: new_token_id
        
        generate_stream->>generate_stream: Convert token to string
        generate_stream->>generate_stream: Append to partial_utf8_buffer
        
        generate_stream->>Utf8State: Create scanner_state
        
        loop Scan UTF-8 bytes with DFA
            generate_stream->>Utf8State: process(byte)
            Utf8State-->>generate_stream: state == 0 (complete char)
            generate_stream->>generate_stream: Track valid_upto position
        end
        
        alt valid_upto > 0
            generate_stream->>generate_stream: Extract valid_chunk
            generate_stream->>StopWindow: Append valid_chunk
            generate_stream->>generate_stream: Erase from partial_utf8_buffer
            
            loop Check stop sequences (O(1) in rolling window)
                generate_stream->>StopWindow: find(stop_seq)
                alt Stop sequence found
                    generate_stream->>Callback: Send text before stop
                    generate_stream-->>Caller: Break (stop detected)
                end
            end
            
            alt stop_window.size() > MAX_STOP_LEN
                generate_stream->>generate_stream: Calculate safe_len
                generate_stream->>Callback: Send safe text
                generate_stream->>StopWindow: Erase sent portion
            end
        end
        
        generate_stream->>Sampler: Decode next token batch
    end
    
    generate_stream->>Callback: Send remaining stop_window
    deactivate generate_stream
    generate_stream-->>Caller: Generation complete


@coderabbitai

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

The llamacpp backend generation logic is refactored to use a DFA-based UTF-8 state machine and implement streaming stop-sequence detection via partial buffers and sliding windows. Token text is accumulated incrementally, validated for UTF-8 safety, and checked against stop sequences before callbacks are invoked. No public APIs are modified.

Changes

Cohort / File(s) Summary
UTF-8 Validation & Stop-Sequence Handling
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Replaced ad-hoc UTF-8 validation with DFA-based Utf8State machine; introduced streaming UTF-8 safe handling using partial_utf8_buffer and partial validator; consolidated stop-sequence logic with STOP_SEQUENCES list, MAX_STOP_LEN, and stop_window buffer for incremental detection; refactored generation flow to accumulate token text and detect stop sequences across emitted tokens; added sliding window cleanup to emit safe prefixes and truncate the window; adjusted final callback behavior to emit remaining window content; replaced conditional max_tokens logging with unified summary log.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 UTF-8 hops with validation grace,
State machines dance at streaming pace,
Windows slide where sequences hide,
Buffers bloom with tokens applied,
Callbacks bounce—oh what a race!

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 33.33%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ⚠️ Warning: The PR description provides technical implementation details and performance results, but does not follow the repository's standardized template structure with required sections like Type of Change, Testing, Platform-Specific Testing, Labels, and Checklist. Resolution: fill out the template with Type of Change (select Refactoring), a Testing section indicating if tests were added/passed, Platform-Specific Testing checkboxes, appropriate Labels (Commons), and the completed Checklist items.
  • Title check — ❓ Inconclusive: The title 'Optimised llamacpp_backend.cpp' is vague and describes only the file being changed, not the actual improvements made. Resolution: consider a more specific title, such as 'Optimize UTF-8 validation and stop sequence detection in llamacpp_backend' or 'Implement DFA-based UTF-8 validation and rolling buffer for stop sequences'.



@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed everything up to a37fb27 in 13 seconds. Click for details.
  • Reviewed 222 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 draft comments. View those below.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (1)

402-417: ⚠️ Potential issue | 🟠 Major

Token count will be incorrect after this change.

The generate() function counts callback invocations as tokens_generated (line 406). After the buffering optimization, callbacks emit multi-token chunks rather than individual tokens, so result.tokens_generated will report the number of chunks emitted—not actual tokens generated by the model.

This affects:

  • result.tokens_generated accuracy
  • finish_reason logic at line 422 (compares chunk count to max_tokens)

Consider exposing the actual token count from generate_stream().

🐛 Proposed fix: Add output parameter for generated token count
 bool LlamaCppTextGeneration::generate_stream(const TextGenerationRequest& request,
                                              TextStreamCallback callback,
-                                             int* out_prompt_tokens) {
+                                             int* out_prompt_tokens,
+                                             int* out_generated_tokens) {
     // ... existing code ...

+    if (out_generated_tokens) {
+        *out_generated_tokens = tokens_generated;
+    }
+
     LOGI("Generation complete: %d tokens", tokens_generated);
     return !cancel_requested_.load();
 }

Then update generate():

+    int actual_tokens_generated = 0;
     bool success = generate_stream(
         request,
         [&](const std::string& token) -> bool {
             generated_text += token;
-            tokens_generated++;
             return !cancel_requested_.load();
         },
-        &prompt_tokens);
+        &prompt_tokens,
+        &actual_tokens_generated);
+
+    result.tokens_generated = actual_tokens_generated;
🧹 Nitpick comments (2)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (2)

517-529: Consider adding a safety bound on partial_utf8_buffer size.

If the tokenizer ever produces invalid UTF-8 that cannot complete a valid sequence, partial_utf8_buffer will grow unboundedly since valid_upto would remain 0. While this shouldn't happen with well-behaved tokenizers, a defensive check would prevent potential memory issues.

🛡️ Optional defensive check
         if (valid_upto > 0) {
             std::string valid_chunk = partial_utf8_buffer.substr(0, valid_upto);
             stop_window.append(valid_chunk);
             partial_utf8_buffer.erase(0, valid_upto);
+        } else if (partial_utf8_buffer.size() > 16) {
+            // Safety valve: discard suspected invalid bytes if buffer grows too large
+            LOGE("Invalid UTF-8 detected in token stream, clearing buffer");
+            partial_utf8_buffer.clear();
         }

575-577: Incomplete UTF-8 at end of generation is silently dropped.

If partial_utf8_buffer contains residual bytes when generation ends (e.g., model stopped mid-codepoint), they are neither emitted nor logged. Consider adding a debug/warning log to aid troubleshooting.

🔍 Add visibility for dropped bytes
     if (!cancel_requested_.load() && !stop_sequence_hit && !stop_window.empty()) {
         callback(stop_window);
     }
+
+    if (!partial_utf8_buffer.empty()) {
+        LOGI("Discarded %zu incomplete UTF-8 bytes at end of generation",
+             partial_utf8_buffer.size());
+    }
 
     llama_memory_clear(llama_get_memory(context_), true);

Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 2 comments


Comment on lines +26 to +46
bool process(uint8_t byte) {
static const uint8_t utf8d[] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

bytes += 1;
for (int i = 1; i < num; ++i) {
if ((*bytes & 0xC0) != 0x80)
return false;
bytes += 1;
}
uint32_t type = utf8d[byte];
state = utf8d[256 + state * 16 + type];
return (state == 0);
Contributor


the process() method doesn't update state correctly - the return value indicates if a character completed a valid sequence, but the state itself is internal to the DFA

Suggested change
bool process(uint8_t byte) {
static const uint8_t utf8d[] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};
bytes += 1;
for (int i = 1; i < num; ++i) {
if ((*bytes & 0xC0) != 0x80)
return false;
bytes += 1;
}
uint32_t type = utf8d[byte];
state = utf8d[256 + state * 16 + type];
return (state == 0);
bool process(uint8_t byte) {
static const uint8_t utf8d[] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};
uint32_t type = utf8d[byte];
uint32_t new_state = utf8d[256 + state * 16 + type];
state = new_state;
return (state == 0);
}

Comment on lines +517 to 524
Utf8State scanner_state;
size_t valid_upto = 0;
for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i]));
if (scanner_state.state == 0) {
valid_upto = i + 1;
}
}
Contributor


scanner_state should be reused or reset across iterations - creating a new Utf8State inside the loop means each token's UTF-8 validation starts fresh, but partial_utf8_buffer can contain incomplete sequences from previous iterations

Suggested change
Utf8State scanner_state;
size_t valid_upto = 0;
for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i]));
if (scanner_state.state == 0) {
valid_upto = i + 1;
}
}
Utf8State scanner_state;
size_t valid_upto = 0;
for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
bool complete = scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i]));
if (complete) {
valid_upto = i + 1;
}
}

@VyasGuru
Collaborator Author

VyasGuru commented Feb 4, 2026

Scanner state is being used correctly; incomplete sequences are carried over to the next round. All is good from my side.

@shubhammalhotra28 shubhammalhotra28 merged commit cf742f8 into RunanywhereAI:main Feb 5, 2026
8 of 12 checks passed
@shubhammalhotra28
Contributor

LGTM, merging. Thanks @VyasGuru

ManthanNimodiya pushed a commit to ManthanNimodiya/runanywhere-sdks that referenced this pull request Feb 23, 2026