Conversation
* feat(lora): add LoRA adapter support across SDK + demo app
Implement LoRA (Low-Rank Adaptation) adapter hot-swapping for llama.cpp
backend across all 6 SDK layers (C++ -> C API -> Component -> JNI ->
Kotlin Bridge -> Kotlin Public API).
- Add load/remove/clear/query LoRA adapter operations
- Use vtable dispatch in component layer to decouple librac_commons
from librac_backend_llamacpp (fixes linker errors)
- Add LoRA vtable entries to rac_llm_service_ops_t
- Fix AttachCurrentThread cast for Android NDK C++ JNI build
- Add RunAnyWhereLora Android demo app with Material 3 Q&A UI
- Add comprehensive implementation docs with C/C++ API reference
* feat(ci): add selectable build targets to Build All workflow + fix Swift concurrency errors
Rewrite build-all-test.yml with 9 boolean checkbox inputs so each build
target can be toggled independently from the GitHub Actions UI:
- C++ Android Backends (arm64-v8a, armeabi-v7a, x86_64 matrix)
- C++ iOS Backends (XCFramework)
- Kotlin SDK (JVM + Android)
- Swift SDK (iOS/macOS)
- Web SDK (TypeScript)
- Flutter SDK (Dart analyze via Melos)
- React Native SDK (TypeScript via Lerna)
- Android Example Apps (RunAnywhereAI + RunAnyWhereLora)
- IntelliJ Plugin
Fix two Swift strict-concurrency errors that fail the Swift SDK build:
- LiveTranscriptionSession: add @unchecked Sendable (safe because the class
  is @MainActor, so all access is serialized)
- RunAnywhere+VisionLanguage: add Sendable conformance to rac_vlm_image_t
so the C struct can cross the Task boundary in the streaming builder;
simplify StreamingCollector to start timing at init
* fix(swift): resolve strict concurrency errors in LiveTranscriptionSession and VLM streaming
LiveTranscriptionSession.swift:
- Replace [weak self] captures with a strong `let session = self` before
  closures to avoid capturing a mutable var in @Sendable/Task contexts (the
  class is @MainActor @unchecked Sendable, so a strong reference is safe,
  bounded by the stream lifecycle)
- Wrap deprecated startStreamingTranscription call in @available helper
to silence deprecation warning until migration to transcribeStream API
RunAnywhere+VisionLanguage.swift:
- Add `let capturedCImage = cImage` before AsyncThrowingStream closure
so the Task captures an immutable let instead of a mutable var
- Add `extension rac_vlm_image_t: @unchecked Sendable {}` for the C
struct to cross Task concurrency boundaries safely
- Simplify StreamingCollector to initialize startTime at init instead
of requiring a separate async start() call
* fix(jni): address CodeRabbit review findings in LoRA JNI functions
- Replace raw -1 returns with RAC_ERROR_INVALID_HANDLE/RAC_ERROR_INVALID_ARGUMENT
to match codebase error handling conventions
- Use getCString() helper instead of raw GetStringUTFChars/ReleaseStringUTFChars
- Add missing result logging to racLlmComponentRemoveLora and racLlmComponentClearLora
- Use rac_free() instead of free() in racLlmComponentGetLoraInfo for consistency
- Clarify LoRA adapter memory ownership comments (adapters freed automatically
with model per llama.cpp b8011 API — llama_adapter_lora_free is deprecated)
* iOS initial changes
* Minimal sample needed to test LoRA
* Updating docs
* Addressed the comments
First version of the optimised RAG pipeline. Not polished yet; once tested, I'll micro-optimise, benchmark, and finish.
Optimised RAG Prototype
Ellipsis: Looks good to me! 👍 Reviewed everything up to 9e4f2df.
Walkthrough

This PR adds context-aware generation capabilities to the LlamaCpp backend, including KV-cache management, confidence probing, and system prompt injection. It extends the RAG pipeline with sentence-level text splitting and adaptive context accumulation based on confidence scoring, while optimizing vector store performance through reduced expansion parameters and i8 quantization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RAGBackend
    participant TextGen as TextGenerator
    participant VectorStore
    participant Chunker
    participant LLM as LlamaCpp Engine
    Client->>RAGBackend: query(text)
    RAGBackend->>TextGen: inject_system_prompt(kICLSystemPrompt)
    TextGen->>LLM: tokenize & cache system prompt
    RAGBackend->>VectorStore: retrieve top_k candidates
    VectorStore-->>RAGBackend: parent chunks + similarities
    RAGBackend->>Chunker: split_into_sentences(chunk)
    Chunker-->>RAGBackend: sentence array
    loop for each sentence
        RAGBackend->>TextGen: append_context(sentence)
        TextGen->>LLM: tokenize & cache sentence
        RAGBackend->>TextGen: probe_confidence(context, query)
        TextGen->>LLM: forward pass on "Yes"/"No" logits
        LLM-->>TextGen: confidence score (0.0-1.0)
        TextGen-->>RAGBackend: confidence float
        alt confidence >= threshold
            RAGBackend->>RAGBackend: accumulate_context(sentence)
        else confidence < threshold OR partial context
            RAGBackend->>TextGen: clear_context()
            TextGen->>LLM: reset KV cache
        end
    end
    RAGBackend->>TextGen: generate_from_context(accumulated_context + query_suffix)
    TextGen->>LLM: generate with accumulated KV state
    LLM-->>TextGen: generated text + metadata
    TextGen-->>RAGBackend: GenerationResult
    RAGBackend->>RAGBackend: enrich metadata (sentences_used, confidence, sources)
    RAGBackend-->>Client: GenerationResult with provenance
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Path: `sdk/runanywhere-commons/src/backends/rag/rag_backend.cpp`
Line: 401

```cpp
for (const auto& sentence_result : search_results) {
    const std::string& sentence_text = sentence_result.text;
    std::string append_text = (sentences_used == 0) ? sentence_text : ("\n" + sentence_text);
    text_generator->append_context(append_text);
```

Comment: Return value not checked: if `append_context()` fails (e.g., context full), the loop continues anyway, leading to incorrect confidence probing.
Path: `sdk/runanywhere-commons/src/backends/rag/vector_store_usearch.cpp`
Line: 57-60

```cpp
// Create metric for cosine similarity. Using i8 instead of float to save RAM (quality isn't affected much).
metric_punned_t metric(
    static_cast<std::size_t>(config.dimension),
    metric_kind_t::cos_k,
```

Comment: Quantization change from f32 to i8 significantly reduces memory but may affect retrieval quality. Verify that recall/precision metrics meet requirements before deploying.
Path: `sdk/runanywhere-commons/src/backends/rag/vector_store_usearch.h`
Line: 51-52

```cpp
size_t expansion_add = 40;    // Construction search depth (even a smaller one should be good enough)
size_t expansion_search = 20; // Query search depth (even a smaller one should be good enough)
```

Comment: HNSW parameters reduced by ~70% (expansion_add: 128→40, expansion_search: 64→20). This trades recall quality for speed. Verify search quality meets requirements.
Path: `sdk/runanywhere-commons/src/backends/rag/rag_chunker.cpp`
Line: 32

```cpp
    return text.length() / config_.chars_per_token;
}

// used for focus mode in RAG (not final yet, will minmax this further, but this is a working version)
```

Comment: WIP comment ("not final yet, will minmax this further"). Remove before merging or track in an issue.
Path: `sdk/runanywhere-commons/src/backends/rag/rag_backend.cpp`
Line: 18-23

```cpp
static const std::string kICLSystemPrompt =
    "You are a question-answering assistant. Given context passages and a question, "
    "determine if the passages contain enough information to answer the question.\n\n"
    "Example 1 (Sufficient context):\n"
    "Context: \"The Eiffel Tower was completed in 1889 for the World's Fair in Paris.\"\n"
    "Question: \"When was the Eiffel Tower built?\"\n"
```

Comment: ICL prompt examples show explicit context in the prompt format, but actual probing relies on implicit KV-cache context. This mismatch may reduce the effectiveness of confidence probing.
Path: `sdk/runanywhere-commons/src/backends/rag/rag_backend.cpp`
Line: 228-252

```cpp
std::vector<ScoredSentence> scored_sentences;

for (const auto& parent : parent_chunks) {
    auto sentences = chunker->split_into_sentences(parent.text);
    LOGI("Parent chunk '%s' split into %zu sentences", parent.chunk_id.c_str(), sentences.size());

    for (const auto& sentence : sentences) {
        if (sentence.size() < 3) {
            continue;
        }

        try {
            auto sentence_embedding = embedding_provider->embed(sentence);
            float sim = cosine_similarity(query_embedding, sentence_embedding);

            scored_sentences.push_back({
                sentence,
                sim,
                parent.chunk_id,
                parent.metadata
            });
        } catch (const std::exception& e) {
            LOGE("Failed to embed sentence, skipping: %s", e.what());
        }
    }
```

Comment: Sentence-level embedding for every sentence across 5 parent chunks creates significant computational overhead compared to the previous chunk-only approach. Consider caching sentence embeddings if parent chunks are frequently accessed.
Path: `sdk/runanywhere-commons/src/backends/rag/rag_backend.h`
Line: 29

```diff
 size_t embedding_dimension = 384;
-size_t top_k = 3;
+size_t top_k = 10; // Need to get Golden document
 float similarity_threshold = 0.15f;
```

Comment: Move the inline comment to a separate line above for better readability.
Important

Enhance RAG backend with adaptive text generation, sentence-level chunking, and optimized vector store for improved performance and accuracy.

- Add `probe_confidence()`, `inject_system_prompt()`, `append_context()`, `generate_from_context()`, and `clear_context()` methods to `LlamaCppTextGeneration` in `llamacpp_backend.cpp` and `LlamaCppGenerator` in `llamacpp_generator.cpp`.
- Refactor `RAGBackend` in `rag_backend.cpp` using the new text generation methods.
- Add `split_into_sentences()` to `DocumentChunker` in `rag_chunker.cpp` for sentence-level chunking.
- Update `RAGBackend` in `rag_backend.cpp` to use sentence-level chunking for more focused search results.
- Use `i8` quantization in `VectorStoreUSearch` in `vector_store_usearch.cpp` for reduced memory usage.
- Update `vector_store_usearch.cpp` to ensure top-K results are returned.
- Update `RAGBackendConfig` in `rag_backend.h` to increase the default `top_k` to 10.

This description was generated automatically for 9e4f2df.
Greptile Summary
Implements adaptive RAG optimization with sentence-level retrieval and confidence-based context accumulation. The system now retrieves parent chunks, splits them into sentences, embeds sentences individually, then incrementally adds sentences to KV cache until confidence threshold is reached via logit probing.
Key changes:
- New methods (`inject_system_prompt`, `append_context`, `probe_confidence`, `generate_from_context`, `clear_context`) for stateful generation
- `top_k` increased from 3 to 10

Critical issues:
- `rag_backend.cpp:401`: missing error handling; the `append_context()` return value is not checked, so the loop continues even if the context append fails
- A WIP comment in `rag_chunker.cpp` suggests incomplete implementation

Performance implications:

Confidence Score: 3/5
- `vector_store_usearch.cpp/h` (quantization/HNSW changes affect retrieval quality) and `rag_backend.cpp` (missing error handling in the adaptive loop)

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant RAGBackend
    participant VectorStore
    participant Embedder
    participant Generator
    participant KVCache
    User->>RAGBackend: query(text)
    RAGBackend->>Generator: clear_context()
    Generator->>KVCache: Clear all state
    RAGBackend->>Generator: inject_system_prompt(ICL)
    Generator->>KVCache: Add ICL prompt at pos 0
    RAGBackend->>Embedder: embed(query)
    Embedder-->>RAGBackend: query_embedding
    RAGBackend->>VectorStore: search(query_embedding, top_k=5)
    Note over VectorStore: Retrieve 5 parent chunks
    VectorStore-->>RAGBackend: parent_chunks[5]
    loop For each parent chunk
        RAGBackend->>RAGBackend: split_into_sentences()
        loop For each sentence
            RAGBackend->>Embedder: embed(sentence)
            Embedder-->>RAGBackend: sentence_embedding
            RAGBackend->>RAGBackend: score = cosine_similarity()
        end
    end
    RAGBackend->>RAGBackend: sort sentences by similarity
    Note over RAGBackend: Keep top 10 sentences
    loop Until confidence > 0.8 OR all sentences added
        RAGBackend->>Generator: append_context(sentence)
        Generator->>KVCache: Append sentence tokens
        RAGBackend->>Generator: probe_confidence("", query)
        Generator->>KVCache: Add probe tokens temporarily
        Generator->>Generator: Extract Yes/No logits
        Generator->>Generator: Compute softmax confidence
        Generator->>KVCache: Remove probe tokens
        Generator-->>RAGBackend: confidence_score
        alt confidence > 0.8
            Note over RAGBackend: Threshold reached, stop
        end
    end
    RAGBackend->>Generator: generate_from_context(query_suffix)
    Generator->>KVCache: Add query tokens
    loop Token generation
        Generator->>Generator: Sample next token
        Generator->>KVCache: Append token
    end
    Generator-->>RAGBackend: generated_text
    RAGBackend-->>User: result + metadata
```

Last reviewed commit: 9e4f2df