Optimised llamacpp_backend.cpp (#330)
📝 Walkthrough

The llamacpp backend generation logic is refactored to use a DFA-based UTF-8 state machine and to implement streaming stop-sequence detection via partial buffers and sliding windows. Token text is accumulated incrementally, validated for UTF-8 safety, and checked against stop sequences before callbacks are invoked. No public APIs are modified.
Important

Looks good to me! 👍

Reviewed everything up to a37fb27 in 13 seconds.

- Reviewed 222 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 0 draft comments
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (1)
402-417: ⚠️ Potential issue | 🟠 Major: Token count will be incorrect after this change.

The generate() function counts callback invocations as tokens_generated (line 406). After the buffering optimization, callbacks emit multi-token chunks rather than individual tokens, so result.tokens_generated will report the number of chunks emitted, not the actual number of tokens generated by the model.

This affects:

- result.tokens_generated accuracy
- finish_reason logic at line 422 (which compares the chunk count to max_tokens)

Consider exposing the actual token count from generate_stream().

🐛 Proposed fix: Add output parameter for generated token count
```diff
 bool LlamaCppTextGeneration::generate_stream(const TextGenerationRequest& request,
                                              TextStreamCallback callback,
-                                             int* out_prompt_tokens) {
+                                             int* out_prompt_tokens,
+                                             int* out_generated_tokens) {
     // ... existing code ...
+    if (out_generated_tokens) {
+        *out_generated_tokens = tokens_generated;
+    }
+
     LOGI("Generation complete: %d tokens", tokens_generated);
     return !cancel_requested_.load();
 }
```

Then update generate():

```diff
+    int actual_tokens_generated = 0;
     bool success = generate_stream(
         request,
         [&](const std::string& token) -> bool {
             generated_text += token;
-            tokens_generated++;
             return !cancel_requested_.load();
         },
-        &prompt_tokens);
+        &prompt_tokens,
+        &actual_tokens_generated);
+
+    result.tokens_generated = actual_tokens_generated;
```
🧹 Nitpick comments (2)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (2)
517-529: Consider adding a safety bound on partial_utf8_buffer size.

If the tokenizer ever produces invalid UTF-8 that cannot complete a valid sequence, partial_utf8_buffer will grow unboundedly since valid_upto would remain 0. While this shouldn't happen with well-behaved tokenizers, a defensive check would prevent potential memory issues.

🛡️ Optional defensive check

```diff
     if (valid_upto > 0) {
         std::string valid_chunk = partial_utf8_buffer.substr(0, valid_upto);
         stop_window.append(valid_chunk);
         partial_utf8_buffer.erase(0, valid_upto);
+    } else if (partial_utf8_buffer.size() > 16) {
+        // Safety valve: discard suspected invalid bytes if buffer grows too large
+        LOGE("Invalid UTF-8 detected in token stream, clearing buffer");
+        partial_utf8_buffer.clear();
     }
```
575-577: Incomplete UTF-8 at end of generation is silently dropped.

If partial_utf8_buffer contains residual bytes when generation ends (e.g., the model stopped mid-codepoint), they are neither emitted nor logged. Consider adding a debug/warning log to aid troubleshooting.

🔍 Add visibility for dropped bytes

```diff
     if (!cancel_requested_.load() && !stop_sequence_hit && !stop_window.empty()) {
         callback(stop_window);
     }
+
+    if (!partial_utf8_buffer.empty()) {
+        LOGI("Discarded %zu incomplete UTF-8 bytes at end of generation",
+             partial_utf8_buffer.size());
+    }
 
     llama_memory_clear(llama_get_memory(context_), true);
```
Quoted code under review:

```cpp
bool process(uint8_t byte) {
    static const uint8_t utf8d[] = {
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
        8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
        0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
        0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
        0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
        1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
        1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
        1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
    };

    bytes += 1;
    for (int i = 1; i < num; ++i) {
        if ((*bytes & 0xC0) != 0x80)
            return false;
        bytes += 1;
    }
    uint32_t type = utf8d[byte];
    state = utf8d[256 + state * 16 + type];
    return (state == 0);
```
the process() method doesn't update state correctly - the return value indicates if a character completed a valid sequence, but the state itself is internal to the DFA
Suggested change:

```cpp
bool process(uint8_t byte) {
    static const uint8_t utf8d[] = {
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 00..1f
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 20..3f
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 40..5f
        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, // 60..7f
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9, // 80..9f
        7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, // a0..bf
        8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, // c0..df
        0xa,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x3,0x4,0x3,0x3, // e0..ef
        0xb,0x6,0x6,0x6,0x5,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8,0x8, // f0..ff
        0x0,0x1,0x2,0x3,0x5,0x8,0x7,0x1,0x1,0x1,0x4,0x6,0x1,0x1,0x1,0x1, // s0..s0
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1, // s1..s2
        1,2,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1, // s3..s4
        1,2,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,3,1,1,1,1,1,1, // s5..s6
        1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
    };
    uint32_t type = utf8d[byte];
    uint32_t new_state = utf8d[256 + state * 16 + type];
    state = new_state;
    return (state == 0);
}
```
Quoted code under review:

```cpp
Utf8State scanner_state;
size_t valid_upto = 0;
for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
    scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i]));
    if (scanner_state.state == 0) {
        valid_upto = i + 1;
    }
}
```
scanner_state should be reused or reset across iterations - creating a new Utf8State inside the loop means each token's UTF-8 validation starts fresh, but partial_utf8_buffer can contain incomplete sequences from previous iterations
Suggested change:

```cpp
Utf8State scanner_state;
size_t valid_upto = 0;
for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
    bool complete = scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i]));
    if (complete) {
        valid_upto = i + 1;
    }
}
```
Scanner state is being used correctly. Incomplete sequences are considered in the next round. All is good from my side.
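The reply above can be checked with a small simulation (a sketch using a simplified scanner in place of Utf8State; drain_valid is an invented helper mirroring the valid_upto/erase logic): a three-byte character split across two tokens survives, because the incomplete tail stays in the buffer for the next round even though each round starts a fresh scanner.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-in for Utf8State (sequence-structure checks only).
struct Scanner {
    int remaining = 0;
    bool process(uint8_t b) {
        if (remaining == 0) {
            if (b < 0x80) return true;
            if ((b & 0xE0) == 0xC0) remaining = 1;
            else if ((b & 0xF0) == 0xE0) remaining = 2;
            else if ((b & 0xF8) == 0xF0) remaining = 3;
            return false;
        }
        if ((b & 0xC0) != 0x80) { remaining = 0; return false; }
        return --remaining == 0;
    }
};

// One round of the loop under review: scan the buffer with a FRESH
// scanner, move out the longest complete prefix, and keep any
// incomplete trailing bytes for the next round.
std::string drain_valid(std::string& partial_utf8_buffer) {
    Scanner scanner_state;  // fresh each round, as in the PR
    size_t valid_upto = 0;
    for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
        if (scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i])))
            valid_upto = i + 1;
    }
    std::string out = partial_utf8_buffer.substr(0, valid_upto);
    partial_utf8_buffer.erase(0, valid_upto);
    return out;
}
```

A fresh scanner per round is safe precisely because the scan always restarts from the beginning of the retained buffer, so no cross-round scanner state is needed.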
LGTM, merging. Thanks @VyasGuru |
Optimised llamacpp_backend.cpp

Important

Optimized llamacpp_backend.cpp with DFA-based UTF-8 validation and efficient stop sequence detection, improving performance by 2.8x ~ 3.3x.

Changes

- Replaced is_valid_utf8() with a Utf8State struct using a DFA for UTF-8 validation.
- Implemented streaming stop sequence detection in generate_stream().
- Added a rolling stop_window to efficiently detect stop sequences.
- Preallocated stop_window and partial_utf8_buffer to reduce allocations.
Greptile Overview

Greptile Summary

This PR implements significant performance optimizations to the LlamaCpp text generation backend, achieving a 2.8x-3.3x speedup through three key improvements:

Key Changes:

- DFA-based UTF-8 validation (the Utf8State struct) replacing is_valid_utf8()
- Streaming stop-sequence detection using a rolling stop_window
- Buffer preallocation (reserve()) to prevent repeated allocations during string concatenation

Implementation Details:
The optimizations maintain correct behavior for stop sequence detection and UTF-8 handling while dramatically reducing memory allocations and string scanning overhead.
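The stop-sequence handling described above can be sketched as a simplified model (not the actual backend code; feed_chunk, emit, and max_stop_len are illustrative names): the rolling window holds back just enough trailing text to catch a stop sequence split across chunks, emitting everything else promptly.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One streaming step: append a chunk to the rolling window, check for
// stop sequences, and flush bytes that can no longer be part of one.
// Returns true when a stop sequence was hit.
bool feed_chunk(std::string& stop_window,
                const std::string& chunk,
                const std::vector<std::string>& stop_sequences,
                size_t max_stop_len,
                const std::function<void(const std::string&)>& emit) {
    stop_window.append(chunk);
    for (const auto& stop : stop_sequences) {
        size_t pos = stop_window.find(stop);
        if (pos != std::string::npos) {
            if (pos > 0) emit(stop_window.substr(0, pos));  // text before the stop
            return true;
        }
    }
    // The last max_stop_len-1 bytes could still begin a stop sequence
    // whose remainder arrives in a later chunk; hold them back.
    if (stop_window.size() > max_stop_len) {
        size_t safe_len = stop_window.size() - (max_stop_len - 1);
        emit(stop_window.substr(0, safe_len));
        stop_window.erase(0, safe_len);
    }
    return false;
}
```

Even when the stop marker arrives split across three chunks, nothing after it leaks to the callback and all preceding text is emitted.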
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant Caller
    participant generate_stream
    participant Sampler as llama_sampler
    participant Utf8State
    participant StopWindow as Rolling Window
    participant Callback

    Caller->>generate_stream: TextGenerationRequest
    activate generate_stream
    generate_stream->>generate_stream: Initialize stop_window (reserve MAX_STOP_LEN*2)
    generate_stream->>generate_stream: Initialize partial_utf8_buffer (reserve 8)
    loop For each token (up to max_tokens)
        generate_stream->>Sampler: sample next token
        Sampler-->>generate_stream: new_token_id
        generate_stream->>generate_stream: Convert token to string
        generate_stream->>generate_stream: Append to partial_utf8_buffer
        generate_stream->>Utf8State: Create scanner_state
        loop Scan UTF-8 bytes with DFA
            generate_stream->>Utf8State: process(byte)
            Utf8State-->>generate_stream: state == 0 (complete char)
            generate_stream->>generate_stream: Track valid_upto position
        end
        alt valid_upto > 0
            generate_stream->>generate_stream: Extract valid_chunk
            generate_stream->>StopWindow: Append valid_chunk
            generate_stream->>generate_stream: Erase from partial_utf8_buffer
            loop Check stop sequences (O(1) in rolling window)
                generate_stream->>StopWindow: find(stop_seq)
                alt Stop sequence found
                    generate_stream->>Callback: Send text before stop
                    generate_stream-->>Caller: Break (stop detected)
                end
            end
            alt stop_window.size() > MAX_STOP_LEN
                generate_stream->>generate_stream: Calculate safe_len
                generate_stream->>Callback: Send safe text
                generate_stream->>StopWindow: Erase sent portion
            end
        end
        generate_stream->>Sampler: Decode next token batch
    end
    generate_stream->>Callback: Send remaining stop_window
    deactivate generate_stream
    generate_stream-->>Caller: Generation complete
```