
C++ Commons: Benchmark Timing Infrastructure — Complete Implementation #343

Open

AmanSwar wants to merge 1297 commits into RunanywhereAI:main from AmanSwar:Benchmark_Basics

Conversation

AmanSwar commented Feb 10, 2026

Summary

Complete benchmark timing instrumentation for the RunAnywhere C++ Commons layer: six precise timestamps captured during LLM inference (t0/t2/t3/t4/t5/t6), extended device metrics interfaces, JSON/CSV logging, statistical analysis with percentiles, and 29 unit tests — all with zero overhead when not benchmarking.

What Changed (This Commit)

Bug Fixes

  • Status/error semantic mismatch: Added an error_code field to rac_benchmark_timing_t — status holds RAC_BENCHMARK_STATUS_* (0-3), error_code holds the underlying rac_result_t
  • Incomplete error paths: All 3 error paths in llm_component.cpp now set t6_request_end_ms and error_code (previously only set status)
  • Lost prompt token count: rac_llm_llamacpp.cpp was passing nullptr for out_prompt_tokens — now captures actual count from backend
  • JNI local ref leak: Added DeleteLocalRef(callbackClass) after GetMethodID
  • JNI JSON missing field: Added benchmark_error_code to JSON output
  • Pre-existing build fix: Added missing #include <algorithm> in model_paths.cpp

New Features

Extended Metrics (interfaces only)

  • rac_benchmark_metrics.h: new rac_benchmark_extended_metrics_t struct (memory, CPU temp, battery, GPU, thermal state)
  • Callback-based provider pattern: platform SDKs (Swift/Kotlin) register a provider during init
  • Lock-free provider registry implementation

Automatic Logging

  • rac_benchmark_timing_to_json() — Full timing + derived metrics (TTFT, prefill, decode, E2E, TPS)
  • rac_benchmark_timing_to_csv() — CSV header/row serialization
  • rac_benchmark_timing_log() — Log summary via RAC logging system

Statistical Analysis

  • rac_benchmark_stats_create/destroy/record/reset/count — Opaque collector API
  • rac_benchmark_stats_get_summary() — P50/P95/P99 percentiles, mean, stddev, min/max for TTFT, prefill, decode TPS, E2E
  • Outlier detection (E2E > mean + 2*stddev)
  • rac_benchmark_stats_summary_to_json() — Export summary as JSON
  • Thread-safe via mutex, only records successful observations

Unit Tests — 29/29 Passing

| Suite | Tests | Coverage |
|-------|-------|----------|
| MonotonicClock | 5 | Non-negative, monotonic, accuracy, distinct, thread safety (8 threads × 10K calls) |
| TimingStruct | 6 | Init zeroes all fields, null safety, status codes, timestamp ordering, error paths, derived metrics |
| BenchmarkLog | 6 | JSON fields, null handling, CSV header/row, logging no-crash |
| BenchmarkStats | 12 | Create/destroy, record/count, reset, percentiles, outlier detection, JSON export |

Files Changed (25 files, +2504 lines)

New Files (11)

| File | Purpose |
|------|---------|
| include/rac/core/rac_benchmark_metrics.h | Extended metrics struct + provider API |
| include/rac/core/rac_benchmark_log.h | JSON/CSV/log serialization API |
| include/rac/core/rac_benchmark_stats.h | Stats collector API + summary struct |
| src/core/rac_benchmark_metrics.cpp | Lock-free metrics provider registry |
| src/core/rac_benchmark_log.cpp | Logging implementation with derived metrics |
| src/core/rac_benchmark_stats.cpp | Stats collector with percentiles + outlier detection |
| tests/CMakeLists.txt | GTest via FetchContent, test discovery |
| tests/benchmark/test_monotonic_clock.cpp | Clock tests |
| tests/benchmark/test_timing_struct.cpp | Struct tests |
| tests/benchmark/test_benchmark_log.cpp | Logging tests |
| tests/benchmark/test_benchmark_stats.cpp | Stats tests |

Modified Files (7)

| File | Change |
|------|--------|
| include/rac/core/rac_benchmark.h | Added error_code field |
| src/features/llm/llm_component.cpp | Set t6 + error_code on all error paths |
| src/backends/llamacpp/rac_llm_llamacpp.cpp | Capture prompt tokens from backend |
| src/features/llm/rac_llm_service.cpp | Clarifying comment on fallback path |
| src/jni/runanywhere_commons_jni.cpp | Fix local ref leak + add error_code to JSON |
| src/infrastructure/model_management/model_paths.cpp | Add missing #include <algorithm> |
| CMakeLists.txt | Add 3 new source files to RAC_CORE_SOURCES |

Validation Checklist

  • All 6 timestamps captured correctly (t0/t4/t6 in component, t2/t3/t5 in backend)
  • Timestamps monotonically increasing (verified by unit test)
  • Token counts match actual generation (fixed prompt token capture)
  • Status code set correctly on all paths (fixed with error_code field)
  • NULL timing pointer has zero overhead
  • LlamaCPP backend captures backend timestamps
  • Component timestamps work for all backends (fallback documented)
  • No memory leaks (stack-allocated by caller, no heap allocs)
  • Thread-safe operation (verified: 8 threads × 10K calls)
  • Cross-platform build tested (Linux verified, iOS/Android via CI)

Build & Test

```bash
cd sdk/runanywhere-commons
cmake -B build-test -DRAC_BUILD_TESTS=ON -DRAC_BUILD_BACKENDS=OFF -DCMAKE_BUILD_TYPE=Debug
cmake --build build-test
./build-test/tests/rac_benchmark_tests
# [PASSED] 29 tests
```
----
> [!IMPORTANT]
> Added comprehensive benchmark timing and metrics infrastructure to capture, analyze, and log performance data for LLM inference operations across the SDK.
> 
>   ## Benchmark Timing Infrastructure
>   - **Core Timing API** (`rac_benchmark.h`, `rac_benchmark.cpp`):
>     - New `rac_benchmark_timing_t` struct to capture timestamps at key inference points: request start, prefill start/end, first token, last token, and request end.
>     - Added `rac_monotonic_now_ms()` function for monotonic time measurement from process start.
>     - Added `rac_benchmark_timing_init()` to initialize timing structs.
>     - Includes status codes for SUCCESS, ERROR, TIMEOUT, and CANCELLED states.
>   - **Benchmark Logging** (`rac_benchmark_log.h`, `rac_benchmark_log.cpp`):
>     - `rac_benchmark_timing_to_json()` serializes timing data to JSON with derived metrics (TTFT, prefill time, decode time, throughput).
>     - `rac_benchmark_timing_to_csv()` serializes to CSV format with optional header generation.
>     - `rac_benchmark_timing_log()` logs results via RAC logging system.
>   - **Extended Metrics** (`rac_benchmark_metrics.h`, `rac_benchmark_metrics.cpp`):
>     - New `rac_benchmark_extended_metrics_t` struct for device/platform metrics: memory usage, peak memory, CPU temperature, battery level, GPU utilization, and thermal state.
>     - `rac_benchmark_set_metrics_provider()` registers platform-specific metrics collection callback.
>     - `rac_benchmark_capture_metrics()` captures current device metrics using registered provider.
>     - Uses atomic operations and double-buffering for lock-free provider access.
>   - **Statistical Analysis** (`rac_benchmark_stats.h`, `rac_benchmark_stats.cpp`):
>     - `BenchmarkStatsCollector` class for recording and analyzing timing observations.
>     - Computes percentiles (P50, P95, P99), min/max, mean, standard deviation, and outlier detection.
>     - Derives metrics: TTFT, prefill time, decode throughput, end-to-end latency.
>     - JSON serialization of summary statistics.
>   ## LLM Integration
>   - **LlamaCpp Backend** (`llamacpp_backend.h`, `llamacpp_backend.cpp`, `rac_llm_llamacpp.h`, `rac_llm_llamacpp.cpp`):
>     - New `rac_llm_llamacpp_generate_stream_with_timing()` function captures timing at prefill start, prefill end, and last token points.
>     - Integrated into backend vtable via `llamacpp_vtable_generate_stream_with_timing()`.
>   - **LLM Component** (`rac_llm_component.h`, `llm_component.cpp`):
>     - New `rac_llm_component_generate_stream_with_timing()` function with optional benchmark timing support.
>     - Captures t0 (request start), t4 (first token), and t6 (request end) on component side.
>     - Zero overhead when timing is not requested.
>   - **LLM Service** (`rac_llm_service.h`, `rac_llm_service.cpp`):
>     - Added `generate_stream_with_timing` function pointer to service operations.
>     - New `rac_llm_generate_stream_with_timing()` public API with fallback to regular streaming if backend doesn't support timing.
>   - **JNI Bindings** (`runanywhere_commons_jni.cpp`):
>     - New `racLlmComponentGenerateStreamWithTiming()` JNI function for Kotlin integration.
>     - Returns JSON response with generated text, token counts, and detailed benchmark timing fields.
>   ## Build and Tests
>   - **Build Configuration** (`CMakeLists.txt`):
>     - Added four new source files to RAC_CORE_SOURCES: `rac_benchmark.cpp`, `rac_benchmark_metrics.cpp`, `rac_benchmark_log.cpp`, `rac_benchmark_stats.cpp`.
>   - **Unit Tests** (`tests/CMakeLists.txt`, `tests/benchmark/`):
>     - New test suite with GoogleTest framework covering:
>       - Monotonic clock functionality and thread safety.
>       - Timing struct initialization and field validation.
>       - Benchmark log JSON/CSV serialization.
>       - Benchmark stats collection, percentile calculation, and outlier detection.
>   - **Minor Updates**:
>     - Added `#include <algorithm>` to `model_paths.cpp`.
> 
> <sup>This description was created by Ellipsis for 148cf39da68f84b8063551f753b24b7ccb64f9b3.</sup>

## Summary by CodeRabbit

* **New Features**
  * Added opt-in LLM inference benchmarking with detailed timing (TTFT, prefill, decode, E2E) and optional streaming-with-timing APIs.
  * Added serialization and logging of benchmark data (JSON, CSV) and a JNI bridge that can return timing-augmented results.
  * Added system/platform metrics provider and a stats collector to aggregate percentiles, outliers and throughput.

* **Tests**
  * Added comprehensive unit tests for clock, timing structs, serialization, metrics and stats.

* **Chores**
  * Integrated new benchmark sources into build configuration.


<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

Complete benchmark timing infrastructure for RunAnywhere C++ Commons layer. Captures 6 precise timestamps (t0/t2/t3/t4/t5/t6) during LLM inference, extended device metrics interfaces, JSON/CSV logging, statistical analysis with percentiles, and 29 comprehensive unit tests.

**Key Changes:**
- **Core timing infrastructure**: Monotonic clock API using `std::chrono::steady_clock`, thread-safe timing struct with status and error_code fields
- **Bug fixes**: Fixed 3 error paths in `llm_component.cpp` to properly set t6 and error_code, fixed lost prompt token count in llamacpp backend, fixed JNI local ref leak
- **Extended metrics**: Lock-free provider registry for platform-specific device metrics (memory, CPU temp, battery, GPU, thermal state)
- **Logging & serialization**: JSON/CSV export with derived metrics (TTFT, prefill, decode TPS, E2E latency)
- **Statistical analysis**: P50/P95/P99 percentiles, mean, stddev, outlier detection (E2E > mean + 2*stddev)
- **Comprehensive tests**: 29 tests covering monotonic clock, timing struct, logging, stats - including thread safety test with 8 threads × 10K calls

**Architecture Highlights:**
- Zero overhead when not benchmarking (opt-in via pointer parameter)
- Clean separation: Component captures t0/t4/t6, backend captures t2/t3/t5
- Thread-safe: Atomic operations for metrics provider, mutex for stats collector
- Proper memory management: Stack-allocated structs, heap allocations only for JSON/CSV strings

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with high confidence - well-designed architecture, comprehensive tests, and proper bug fixes
- Score of 5 reflects exceptional implementation quality: (1) Complete test coverage with 29 passing tests including thread safety validation, (2) Clean architecture with zero overhead when not benchmarking, (3) Critical bug fixes properly addressing error paths and memory leaks, (4) Thread-safe design using appropriate synchronization primitives, (5) Excellent documentation and clear API design following C best practices
- No files require special attention - all implementations are production-ready

<h3>Important Files Changed</h3>




| Filename | Overview |
|----------|----------|
| sdk/runanywhere-commons/include/rac/core/rac_benchmark.h | Added core benchmark timing struct with 6 timestamps (t0-t6), status codes, and error handling - clean API design |
| sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp | Implemented serialization with safe diff calculations and proper memory management - includes TTFT, prefill, decode TPS, E2E metrics |
| sdk/runanywhere-commons/src/core/rac_benchmark_stats.cpp | Implemented stats collector with percentile calculations and outlier detection - thread-safe with mutex, proper memory management |
| sdk/runanywhere-commons/src/features/llm/llm_component.cpp | Added generate_stream_with_timing function capturing t0/t4/t6, fixed all 3 error paths to set t6 and error_code properly |
| sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp | Implemented backend timing capture for t2/t3/t5 during prefill and decode phases - properly integrated with llama.cpp |
| sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp | Added JNI wrapper with timing, fixed DeleteLocalRef leak, added benchmark_error_code to JSON output |




<h3>Sequence Diagram</h3>

```mermaid
sequenceDiagram
    participant User
    participant Component as LLM Component
    participant Backend as LlamaCpp Backend
    participant Timing as Benchmark Timing

    User->>Component: generate_stream_with_timing(prompt, timing_out)
    Component->>Timing: t0 = rac_monotonic_now_ms() [request start]
    Component->>Backend: rac_llm_generate_stream_with_timing()
    
    Backend->>Timing: t2 = rac_monotonic_now_ms() [prefill start]
    Backend->>Backend: llama_decode(prompt_batch)
    Backend->>Timing: t3 = rac_monotonic_now_ms() [prefill end]
    
    loop Token Generation
        Backend->>Backend: llama_sampler_sample()
        Backend->>Component: token_callback(token)
        alt First Token
            Component->>Timing: t4 = rac_monotonic_now_ms() [first token]
        end
        Component->>User: on_token(token)
    end
    
    Backend->>Timing: t5 = rac_monotonic_now_ms() [last token]
    Backend-->>Component: success
    
    Component->>Timing: t6 = rac_monotonic_now_ms() [request end]
    Component->>Timing: Set prompt_tokens, output_tokens, status
    Component->>User: complete_callback(result, timing)

```

shubhammalhotra28 and others added 30 commits January 7, 2026 09:38
- RABackendLlamaCPP checksum updated for core-v0.1.4
- RABackendONNX checksum updated for core-v0.1.4
- Kotlin SDK version: 0.1.3 → 0.1.4
- runanywhere.coreVersion: 0.1.3 → 0.1.4
- runanywhere.commonsVersion: 0.1.3 → 0.1.4

This file was overriding the defaults in build.gradle.kts
…ce agent capabilities

- Added base64 encoding/decoding utilities for audio data handling.
- Implemented memory usage retrieval for iOS and Android platforms.
- Introduced global component handles for LLM, STT, TTS, and VAD to streamline resource management.
- Enhanced voice agent functionality with methods for processing voice turns and generating responses.
- Updated TypeScript definitions for TTS output to include additional fields for backward compatibility.
- Improved error handling and logging across various components for better debugging and user experience.
- Updated ModelSelectionSheet to check for valid CPU cores before rendering.
- Added consumer rules for ArchiveUtility in Android to ensure proper JNI access.
- Enhanced cpp-adapter.cpp with logging and exception handling for better debugging during JNI calls.
- Refactored ArchiveUtility.kt to utilize Apache Commons Compress for efficient tar.gz extraction and improved error logging.
- Cleaned up unused code and improved overall structure for better maintainability.
- Updated COMMONS_VERSION, LLAMACPP_VERSION, ONNX_BACKEND_VERSION, and CORE_VERSION to 0.1.4 in respective podspec and build.gradle files.
- Adjusted version files in iOS and Android frameworks to reflect the new versioning.
- Ensured consistency in versioning across all relevant components for better maintainability.
- Updated .gitignore to exclude downloaded XCFrameworks and native libraries for iOS and Android.
- Removed obsolete version files and header files related to LlamaCpp and ONNX packages.
- Ensured that the project structure is cleaner by eliminating unnecessary files, improving maintainability.
…d_rearch

[Kotlin-SDK] [Kotlin-Example] KMP SDK refactor and aligning to iOS
- Downgraded 'record' dependency version from ^6.1.2 to ^6.1.0 in pubspec.yaml.
- Enhanced error handling for plugin registration in GeneratedPluginRegistrant.java and GeneratedPluginRegistrant.m.
- Removed unused ChangeNotifierProvider from runanywhere_ai_app.dart.
- Updated model download progress handling in model_manager.dart and model_selection_sheet.dart.
- Cleaned up unused code and improved comments across various files for better maintainability.
- Added new files for ONNX and LlamaCpp plugins, including their respective podspecs and Gradle configurations.
- Removed deprecated native libraries and related files from the project.
- Introduced SDKLogger for structured logging in both iOS and Android.
- Replaced print and NSLog statements with SDKLogger methods for consistency.
- Added SwiftLint rules to enforce SDKLogger usage in Swift files.
- Updated various components to utilize the new logging system, enhancing error tracking and debugging capabilities.
- Configured logging settings in package.json to disallow console.log usage, ensuring all logs are routed through SDKLogger.
- Introduced a new Lefthook configuration file for managing Git hooks.
- Updated Podfile to use static linkage for RACommons symbols and improved comments for clarity.
- Removed deprecated entries from project.pbxproj to streamline the build process.
- Enhanced model management in runanywhere_ai_app.dart by simplifying model registration and unloading logic.
- Cleaned up unused code and improved comments across various files for better maintainability.
- Added new markdown files for Flutter SDK to Swift SDK migration analysis and parity migration plan.
…ed stability

- Added a new isolate for LLM generation to prevent heap corruption from C++ background threads.
- Introduced a helper class for isolate communication and error handling.
- Updated the generate method to utilize the new isolate approach, enhancing performance and reliability.
- Enhanced the LLM generation process to support real-time streaming of tokens using isolates.
- Introduced a new `LLMStreamingResult` class to encapsulate the streaming output, final result, and cancellation functionality.
- Updated the `generateStream` method to return a stream of tokens and a future for final metrics, allowing for better user interaction and control.
- Improved error handling and communication between isolates for robust streaming operations.
- Refactored ModelManager to reduce complexity by removing event listeners and internal state management.
- Updated methods to directly interact with the RunAnywhere SDK for model loading, unloading, and state checks.
- Enhanced chat and voice features to utilize SDK state directly, improving synchronization and reducing reliance on ModelManager.
- Cleaned up unused code and improved comments for better maintainability across various files.
- Added TTS synthesis capabilities in the DartBridgeTTS class, allowing for speech generation from text with customizable rate, pitch, and volume.
- Enhanced the TTSResult class to encapsulate audio samples, sample rate, and duration for synthesized speech.
- Integrated TTS synthesis into the RunAnywhere class, providing a straightforward API for users to synthesize speech.
- Implemented audio playback from synthesized samples, ensuring seamless user experience.
- Updated comments and documentation for clarity and maintainability across the TTS-related files.
- Implemented checks for voice agent readiness in the SDK, ensuring all required components (STT, LLM, TTS) are loaded before starting a voice session.
- Refactored voice session handling to utilize the new SDK API for starting sessions, improving integration and reliability.
- Introduced detailed error handling for voice agent initialization failures, providing clearer feedback to users.
- Added new types for voice agent component states, enhancing the API for checking model readiness.
- Updated documentation and comments for clarity and maintainability across voice-related files.
- Added a check to determine if audio data is already in WAV format before conversion, enhancing efficiency.
- Updated the DartBridgeVoiceAgent to reflect that synthesized audio is now in WAV format, eliminating unnecessary conversions.
- Refactored the RunAnywhere class to directly use WAV audio data for playback, streamlining the audio processing workflow.
- Improved documentation and comments for clarity regarding audio data handling.
…le files

- Streamlined import statements in various files to use package imports for consistency and clarity.
- Removed unnecessary export statements and organized the structure of the public API for better maintainability.
- Enhanced code readability by applying consistent formatting and indentation in the conversation store and event publisher classes.
- Updated comments and documentation to reflect changes and improve understanding of the codebase.
- Ensured that fire-and-forget patterns are clearly indicated with unawaited calls in relevant sections.
shubhammalhotra28 and others added 22 commits February 1, 2026 14:07
…uz/ios-ui-update

iOS UI enhancements: chat stability and UX improvements
Enable running the SDK on Android Studio x86_64 emulators by adding
x86_64 to the abiFilters in backend modules. The native libraries are
already built for x86_64 in the CI/CD pipeline.

Files updated:
- runanywhere-core-llamacpp (Kotlin)
- runanywhere-core-onnx (Kotlin)
- @runanywhere/core (React Native)
- @runanywhere/onnx (React Native)
- @runanywhere/llamacpp (React Native)

Fixes RunanywhereAI#325
- Add ModelRequiredOverlay for onboarding experience
- Add VoiceAssistantParticleView for visual effects
- Add ModelLogoHelper utility and logo assets
- Refactor all screens (Chat, STT, TTS, Voice, Settings) with UI improvements
- Configure production API credentials
- Bump version to 0.1.5

+1,897 insertions, -1,436 deletions
* Optimise resampling and heap thrashing

opti

* improved even more for integer ratios

* Update whispercpp_backend.cpp
Autonomous Android agent that navigates phone UI to accomplish user goals.
Uses accessibility service for screen reading/actions, GPT-4o Vision for
visual understanding, unified tool calling for all UI actions, and
on-device LLM fallback via RunAnywhere SDK. Includes voice mode with
Whisper STT and TTS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-add files excluded by root .gitignore but required to build
the android-use-agent project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Block Bixby packages from app search results
- Add Samsung-specific package fallbacks
- Fix X/Twitter launch with explicit ComponentName intent
- Update prompts to use current app names ("X" not "Twitter")

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…4-emulator-support

Add x86_64 ABI support for Android emulator development
- Bump swift-asn1 dependency version to 1.5.1
- Refactor ChatScreen to improve input handling and logging
- Introduce ModelRequiredOverlay for model selection prompts
- Add VoiceAssistantParticleView for enhanced visual effects
- Implement utility functions for model logo retrieval
- Add new drawable resources for model logos

+1,200 insertions, -800 deletions
…rt and model loaded notifications

- Introduce MarkdownText component for rendering markdown in chat messages.
- Add ModelLoadedToast component to notify users when a model is loaded.
- Update ChatScreen, SpeechToTextScreen, and TextToSpeechScreen to integrate model loaded notifications.
- Refactor MessageBubbleView to conditionally render markdown content for assistant messages.
- Improve UI responsiveness with animations and state management for model loading.

+500 insertions, -200 deletions
…ywhereAI#331)

* feat(android): major UI refactor and production setup (v0.1.5)

- Add ModelRequiredOverlay for onboarding experience
- Add VoiceAssistantParticleView for visual effects
- Add ModelLogoHelper utility and logo assets
- Refactor all screens (Chat, STT, TTS, Voice, Settings) with UI improvements
- Configure production API credentials
- Bump version to 0.1.5

+1,897 insertions, -1,436 deletions

* feat(android): update dependencies and enhance chat UI

- Bump swift-asn1 dependency version to 1.5.1
- Refactor ChatScreen to improve input handling and logging
- Introduce ModelRequiredOverlay for model selection prompts
- Add VoiceAssistantParticleView for enhanced visual effects
- Implement utility functions for model logo retrieval
- Add new drawable resources for model logos

+1,200 insertions, -800 deletions

* feat(android): enhance chat and speech-to-text UI with markdown support and model loaded notifications

- Introduce MarkdownText component for rendering markdown in chat messages.
- Add ModelLoadedToast component to notify users when a model is loaded.
- Update ChatScreen, SpeechToTextScreen, and TextToSpeechScreen to integrate model loaded notifications.
- Refactor MessageBubbleView to conditionally render markdown content for assistant messages.
- Improve UI responsiveness with animations and state management for model loading.

+500 insertions, -200 deletions

---------

Co-authored-by: Sanchit Monga <sanchitmonga22@gmail.com>
…d/android-use-agent

feat: Add Android Use Agent to Playground
…al UTF-8 decode) (RunanywhereAI#342)

* Fix JNI streaming crash by decoding UTF-8 incrementally in Kotlin

* Remove excessive test logs
…ng, stats, and tests

Fix reviewer-flagged bugs: add error_code field to timing struct, set t6/error_code
on all error paths, capture prompt tokens from backend, fix JNI local ref leak.

Add extended metrics provider interface, JSON/CSV/log serialization, statistical
analysis with percentiles and outlier detection. All 29 unit tests passing.


coderabbitai bot commented Feb 10, 2026

📝 Walkthrough
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.65%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title 'C++ Commons: Benchmark Timing Infrastructure — Complete Implementation' accurately summarizes the main addition: a comprehensive benchmark timing system for the C++ Commons layer with timing captures, metrics, logging, and statistics. |
| Description check | ✅ Passed | The description is comprehensive, detailing bug fixes (status/error semantic mismatch, incomplete error paths, lost prompt tokens, JNI leaks), new features (extended metrics, logging, statistical analysis), unit tests (29 passing), and build validation. |



@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed everything up to 148cf39 in 24 seconds.
  • Reviewed 2791 lines of code in 25 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 draft comments. View those below.

Workflow ID: wflow_2ZDcSsTCNq9Ttggo



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 674-730: The timing-variant loop leaks stop-sequence text and may
drop pre-stop text; update the stop-sequence handling in this loop (symbols:
stop_sequences, accumulated_text, cached_token_chars, callback, is_valid_utf8)
to match the original generate_stream sliding-window behavior: scan only the
trailing window of accumulated_text (bounded by the longest stop sequence) to
detect a stop and compute the exact stop position, then flush callback with only
the text before the stop (including any pending valid UTF-8 prefix in
cached_token_chars) and discard the stop sequence before breaking; also avoid
rescanning the full accumulated_text each token to remove the O(n²) behavior.
Ensure the final flush after the loop never includes the stop-sequence bytes.

In `@sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp`:
- Around line 117-125: Replace the plain %d used for timing->error_code with the
portable int32 format macro to match the rac_result_t type and project
convention: change the format specifier in the snprintf call in
rac_benchmark_log.cpp to use PRId32 for timing->error_code (the same call that
already uses PRId64 for the int64_t fields), ensuring the corresponding format
string and argument order remain consistent.

In `@sdk/runanywhere-commons/src/core/rac_benchmark_metrics.cpp`:
- Around line 46-60: The write path in rac_benchmark_set_metrics_provider can
race when two threads read the same g_provider_index and both write to
g_provider_storage[next], causing torn provider structs; guard the assignment to
g_provider_storage[next].fn and .user_data, the store to g_provider, and the
g_provider_index update with a short-lived mutex (e.g., a static std::mutex) so
only one thread updates the double-buffer slot at a time; keep reader side
lock-free and only serialize the rare registration path to prevent mismatched
fn/user_data observations.

In `@sdk/runanywhere-commons/src/features/llm/llm_component.cpp`:
- Around line 910-917: The code is recording estimated token counts
(final_result.prompt_tokens and final_result.completion_tokens) instead of the
backend's actual counts; update the data flow so timing_out->prompt_tokens and
timing_out->output_tokens use the actual tokenizer counts produced by the
backend (the out_prompt_tokens/out_completion_tokens set in
llamacpp_backend.cpp) rather than estimate_tokens: ensure the backend populates
these actual counts into the service-layer result object (final_result) and that
llm_component.cpp reads those actual fields when setting
timing_out->prompt_tokens and timing_out->output_tokens (or, if the fields
already exist under different names, map them into final_result before this
block) so benchmarking (e.g., prefill throughput) uses real token counts.

In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 1024-1031: The JNI registration for onToken in
racLlmComponentGenerateStreamWithTiming is using the wrong signature
"(Ljava/lang/String;)Z" while the callback handler llm_stream_callback_token
constructs a jbyteArray and calls CallBooleanMethod; update the GetMethodID call
that assigns onTokenMethod (using tokenCallback) to use the byte-array signature
"([B)Z" so the method signature matches llm_stream_callback_token and is
consistent with the non-timing variant (the code paths around tokenCallback,
onTokenMethod, and llm_stream_callback_token must use the same "([B)Z"
signature).
🧹 Nitpick comments (18)
sdk/runanywhere-commons/tests/benchmark/test_benchmark_stats.cpp (1)

19-43: Consider using RAC_SUCCESS instead of literal 0 for error_code.

Line 40 sets timing.error_code = 0 — using RAC_SUCCESS would be more consistent with the rest of the codebase and self-documenting.

Suggested change
-    timing.error_code = 0;
+    timing.error_code = RAC_SUCCESS;
sdk/runanywhere-commons/tests/CMakeLists.txt (2)

36-39: Redundant include directory — already inherited from rac_commons PUBLIC interface.

rac_commons exposes $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include> as a PUBLIC include directory (parent CMakeLists.txt, line 228). Since rac_benchmark_tests links against rac_commons, this target_include_directories block is unnecessary and can be removed.

Suggested removal
-target_include_directories(rac_benchmark_tests
-    PRIVATE
-        ${CMAKE_SOURCE_DIR}/include
-)

7-11: Consider pinning GoogleTest with a commit hash for reproducible builds.

Using GIT_TAG v1.14.0 is fine for a release tag, but a full commit SHA would provide stronger supply-chain guarantees (tags can be force-pushed).
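A sketch of the pinned form, assuming the existing FetchContent setup; the SHA below is a placeholder to be replaced with the actual commit that the v1.14.0 tag resolves to:

```cmake
# Pin GoogleTest to an immutable commit instead of a movable tag.
include(FetchContent)
FetchContent_Declare(
    googletest
    GIT_REPOSITORY https://github.com/google/googletest.git
    GIT_TAG        <full-40-char-commit-sha>   # commit that v1.14.0 points at
)
FetchContent_MakeAvailable(googletest)
```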

sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp (1)

228-278: Consider extracting request-building into a helper to reduce duplication.

Lines 242–255 are a near-exact copy of the request-building logic in rac_llm_llamacpp_generate_stream (lines 200–213). A small static helper (e.g., build_request_from_options) would keep both paths in sync and reduce maintenance burden.

♻️ Example helper extraction
+static runanywhere::TextGenerationRequest build_request(const char* prompt,
+                                                        const rac_llm_options_t* options) {
+    runanywhere::TextGenerationRequest request;
+    request.prompt = prompt;
+    if (options != nullptr) {
+        request.max_tokens = options->max_tokens;
+        request.temperature = options->temperature;
+        request.top_p = options->top_p;
+        if (options->stop_sequences != nullptr && options->num_stop_sequences > 0) {
+            for (int32_t i = 0; i < options->num_stop_sequences; i++) {
+                if (options->stop_sequences[i]) {
+                    request.stop_sequences.push_back(options->stop_sequences[i]);
+                }
+            }
+        }
+    }
+    return request;
+}

Then both rac_llm_llamacpp_generate_stream and rac_llm_llamacpp_generate_stream_with_timing (and even rac_llm_llamacpp_generate) can call build_request(prompt, options).

sdk/runanywhere-commons/tests/benchmark/test_timing_struct.cpp (1)

56-73: TimestampOrdering test only asserts its own test data, not production invariants.

This test manually assigns ordered values and then checks they're ordered — it doesn't exercise any production code that enforces or validates timestamp ordering. It's effectively a no-op. Consider either removing it or turning it into a test for a validation function if one exists (or is planned).
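For illustration, the kind of production code the test could exercise: a hypothetical monotonicity validator (no such helper exists in the current API; the struct and function names here are stand-ins):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for rac_benchmark_timing_t (fields mirror the t0/t2/t3/t4/t5/t6
// timestamps described in the PR summary).
struct Timing {
    int64_t t0, t2, t3, t4, t5, t6;
};

// Hypothetical validator: true when timestamps are non-decreasing.
static bool timing_is_ordered(const Timing& t) {
    return t.t0 <= t.t2 && t.t2 <= t.t3 && t.t3 <= t.t4 &&
           t.t4 <= t.t5 && t.t5 <= t.t6;
}
```

A test against such a function would assert both an ordered and a deliberately out-of-order timing, which actually exercises production logic rather than the test's own data.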

sdk/runanywhere-commons/src/core/rac_benchmark.cpp (1)

49-53: Add defensive static_assert to guard against future struct evolution.

memset is currently safe since rac_benchmark_timing_t contains only POD types (int64_t, int32_t). However, the struct could inadvertently acquire non-trivial members in the future (e.g., std::string, smart pointers), which would corrupt silently when memset'd. Adding a compile-time assertion prevents this:

🛡️ Proposed defensive assertion
+#include <type_traits>
+
 void rac_benchmark_timing_init(rac_benchmark_timing_t* timing) {
     if (timing != nullptr) {
+        static_assert(std::is_trivially_copyable_v<rac_benchmark_timing_t>,
+                      "memset init requires trivially copyable type");
         std::memset(timing, 0, sizeof(rac_benchmark_timing_t));
     }
 }
sdk/runanywhere-commons/tests/benchmark/test_benchmark_log.cpp (1)

42-74: Good test coverage for JSON serialization.

The test verifies field presence but not derived metric values. For example, with the test data: ttft_ms should be 65.00 (1065-1000), prefill_ms should be 50.00 (1060-1010), decode_ms should be 1005.00 (2065-1060), e2e_ms should be 1070.00 (2070-1000). Adding value assertions would catch regressions in the computation logic.

💡 Optional: add value assertions for derived metrics
     // Verify derived metrics exist
-    EXPECT_NE(s.find("\"ttft_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"prefill_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"decode_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"e2e_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"decode_tps\":"), std::string::npos);
+    EXPECT_NE(s.find("\"ttft_ms\":65.00"), std::string::npos);
+    EXPECT_NE(s.find("\"prefill_ms\":50.00"), std::string::npos);
+    EXPECT_NE(s.find("\"decode_ms\":1005.00"), std::string::npos);
+    EXPECT_NE(s.find("\"e2e_ms\":1070.00"), std::string::npos);
+    // decode_tps = 100 / 1005.0 * 1000.0 ≈ 99.50
+    EXPECT_NE(s.find("\"decode_tps\":99.50"), std::string::npos);
sdk/runanywhere-commons/include/rac/features/llm/rac_llm_service.h (1)

42-55: Vtable documentation could mention error-path timing behavior.

The doc describes which timestamps (t2, t3, t5) backends should capture on success, but doesn't specify what implementations should do with timing_out on error (e.g., should t3/t5 still be stamped?). The LlamaCPP backend does stamp t3/t5 on prefill decode failure — documenting this expectation helps future backend implementors. Based on learnings: "Public C API headers in include/rac/ must document vtable operations, error codes, and lifecycle requirements."
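A possible wording for the vtable doc, reflecting the LlamaCPP behavior described above (this phrasing is a suggestion, not the current header text):

```cpp
/**
 * Timing (optional): when timing_out is non-NULL, implementations stamp
 * t2 (prefill start), t3 (prefill end), and t5 (last token) on success.
 * On prefill/decode failure, implementations should still stamp t3 and t5
 * with the failure time so derived metrics never read zero timestamps.
 */
```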

sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (2)

587-738: Consider extracting shared logic to reduce duplication with generate_stream.

generate_stream_with_timing duplicates ~120 lines from generate_stream (prompt building, tokenization, batch setup, decode loop, cleanup). The only additions are ~15 lines of timing instrumentation and the (currently divergent) stop-sequence handling. This duplication is a maintenance risk — the two code paths can drift as shown by the stop-sequence handling differences.

Consider refactoring to a single internal method that takes an optional timing_out parameter, similar to how generate_stream(request, callback) already delegates to generate_stream(request, callback, nullptr) for the out_prompt_tokens parameter.


678-686: static local variable declared inside a loop body.

The static const std::vector<std::string> stop_sequences at line 678 is declared inside the while loop. While functionally correct (initialized once due to static), this is unusual and inconsistent with the original generate_stream which declares STOP_SEQUENCES before the loop (line 481). Moving it before the loop improves readability.

sdk/runanywhere-commons/include/rac/core/rac_benchmark.h (1)

88-102: Consider using an enum for benchmark status codes.

The status codes are #define macros, which work fine in C but lose type safety. Since this is a C API header (with extern "C"), macros are acceptable. However, if you anticipate C++ consumers predominantly, a C enum with fixed underlying type could provide better tooling support (autocomplete, switch exhaustiveness).

This is purely optional given the established pattern in the codebase.
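For reference, the enum variant could look like the following; enumerator names and values beyond "status holds 0-3" are assumptions, not the actual macro names:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum alternative to the RAC_BENCHMARK_STATUS_* macros,
// with a fixed underlying type so the ABI matches the current int32_t field.
typedef enum rac_benchmark_status : int32_t {
    RAC_BENCHMARK_STATUS_PENDING   = 0,
    RAC_BENCHMARK_STATUS_RUNNING   = 1,
    RAC_BENCHMARK_STATUS_COMPLETED = 2,
    RAC_BENCHMARK_STATUS_ERROR     = 3
} rac_benchmark_status_t;
```

The fixed underlying type keeps `sizeof(rac_benchmark_status_t)` identical to the existing `int32_t` status field, so the struct layout is unchanged.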

sdk/runanywhere-commons/src/features/llm/llm_component.cpp (1)

721-951: Substantial code duplication with rac_llm_component_generate_stream.

The new function (~230 lines) largely duplicates rac_llm_component_generate_stream (lines 525-719). The timing-specific additions are ~30 lines (timing init, t0/t6 capture, error path timing, and timing finalization). This duplication mirrors the backend-level duplication noted in llamacpp_backend.cpp.

Consider refactoring both functions into a single internal implementation that accepts an optional timing_out parameter. The non-timing public function would delegate with timing_out = nullptr.

sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (2)

1017-1018: Unused config variable.

config is assigned at line 1018 via getNullableCString but never referenced. The same pattern exists in the other streaming functions (lines 803, 904). Consider either removing these or actually using them to parse options from configJson.


1076-1118: Consider extracting the JSON-escape-and-build logic into a helper.

The JSON text escaping switch-case block and result JSON construction are duplicated five times in this file (lines 557–586, 841–873, 957–987, 1077–1118, and the STT transcribe path). A small helper like escapeJsonString(const std::string&) would reduce ~200 lines of repetition and centralize correctness (e.g., the current escaper doesn't handle control characters below 0x20 other than \n\r\t, which can produce invalid JSON).
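A sketch of such a helper, including `\uXXXX` escapes for the control characters the current escaper misses (the name matches the suggestion above; exact behavior is an assumption):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Escapes quotes, backslashes, and all control characters below 0x20
// so the output is always a valid JSON string body.
static std::string escapeJsonString(const std::string& in) {
    std::string out;
    out.reserve(in.size() + 8);
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                    out += buf;  // e.g. 0x01 -> \u0001
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}
```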

sdk/runanywhere-commons/include/rac/core/rac_benchmark_metrics.h (1)

98-106: Document NULL-pointer behavior for rac_benchmark_capture_metrics.

The doc comment says out "must not be NULL", but doesn't specify what happens if the caller violates this. For a public C API, consider documenting whether passing NULL is undefined behavior or a silent no-op (consistent with the implementation in rac_benchmark_metrics.cpp). As per coding guidelines, public C API headers must document error codes and lifecycle requirements.

sdk/runanywhere-commons/src/core/rac_benchmark_stats.cpp (2)

259-322: snprintf with %.2f can produce nan or inf for edge-case double values, yielding invalid JSON.

If any summary field is NaN or infinity (e.g., from a degenerate stddev computation), snprintf(buf, ..., "%.2f", val) will output "nan" or "inf", which are not valid JSON number literals. This is unlikely given the current guards but could surface if the code evolves.

💡 Defensive check example
+    auto safe_double = [](char* buf, size_t sz, double val) {
+        if (std::isfinite(val)) {
+            snprintf(buf, sz, "%.2f", val);
+        } else {
+            snprintf(buf, sz, "0.00");
+        }
+    };
+
     // TTFT
-    snprintf(buf, sizeof(buf), "%.2f", summary->ttft_p50_ms);
+    safe_double(buf, sizeof(buf), summary->ttft_p50_ms);
     json += "\"ttft_p50_ms\":" + std::string(buf) + ",";

184-194: Population stddev (÷ n) vs. sample stddev (÷ n−1).

The current implementation divides by n (population stddev). For benchmark analysis with small sample sizes, Bessel's correction (dividing by n - 1) is often preferred. This is a minor point since the size check on line 185 already handles n <= 1, and for large sample sizes the difference is negligible.
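The difference fits in one helper; Bessel's correction only changes the divisor:

```cpp
#include <cmath>
#include <vector>

// Population stddev divides by n; sample stddev (Bessel's correction)
// divides by n - 1, which is unbiased for small benchmark sample sizes.
static double stddev(const std::vector<double>& v, bool sample) {
    const size_t n = v.size();
    if (n < 2) return 0.0;  // mirrors the existing n <= 1 guard
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= static_cast<double>(n);
    double ss = 0.0;
    for (double x : v) ss += (x - mean) * (x - mean);
    return std::sqrt(ss / static_cast<double>(sample ? (n - 1) : n));
}
```

For {2, 4, 4, 4, 5, 5, 7, 9} the population stddev is exactly 2.0, while the sample stddev is slightly larger (√(32/7) ≈ 2.14).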

sdk/runanywhere-commons/include/rac/core/rac_benchmark_stats.h (1)

91-97: Document the allocation failure error code for rac_benchmark_stats_create.

The doc says it returns RAC_SUCCESS or RAC_ERROR_NULL_POINTER, but the implementation (line 215 of rac_benchmark_stats.cpp) also returns RAC_ERROR_INITIALIZATION_FAILED when new (std::nothrow) fails. As per coding guidelines, public C API headers should document error codes.

📝 Suggested doc update
 /**
  * Creates a new benchmark stats collector.
  *
  * `@param` out_handle Output: collector handle
- * `@return` RAC_SUCCESS or RAC_ERROR_NULL_POINTER
+ * `@return` RAC_SUCCESS, RAC_ERROR_NULL_POINTER, or RAC_ERROR_INITIALIZATION_FAILED
  */

Comment on lines +46 to +60
void rac_benchmark_set_metrics_provider(rac_benchmark_metrics_provider_fn provider,
void* user_data) {
if (provider == nullptr) {
g_provider.store(nullptr, std::memory_order_release);
return;
}

// Use double-buffering to avoid data races on the provider struct
int idx = g_provider_index.load(std::memory_order_relaxed);
int next = 1 - idx;
g_provider_storage[next].fn = provider;
g_provider_storage[next].user_data = user_data;
g_provider.store(&g_provider_storage[next], std::memory_order_release);
g_provider_index.store(next, std::memory_order_relaxed);
}

⚠️ Potential issue | 🟡 Minor

Concurrent calls to rac_benchmark_set_metrics_provider can race on the same storage slot.

If two threads call this function simultaneously and both read the same g_provider_index, they'll write to the same g_provider_storage[next] slot non-atomically (.fn and .user_data are plain fields). The reader could observe a torn MetricsProvider (mismatched fn/user_data).

Since the comment states registration is rare and happens during initialization, this is unlikely in practice but worth documenting or guarding with a simple mutex on the write path.

🔒 Suggested fix: add a mutex to the write path only
+#include <mutex>
+
 namespace {
 
 struct MetricsProvider {
     rac_benchmark_metrics_provider_fn fn = nullptr;
     void* user_data = nullptr;
 };
 
 std::atomic<MetricsProvider*> g_provider{nullptr};
 MetricsProvider g_provider_storage[2];
 std::atomic<int> g_provider_index{0};
+std::mutex g_provider_write_mutex;
 
 }  // namespace

 void rac_benchmark_set_metrics_provider(rac_benchmark_metrics_provider_fn provider,
                                          void* user_data) {
     if (provider == nullptr) {
         g_provider.store(nullptr, std::memory_order_release);
         return;
     }
 
+    std::lock_guard<std::mutex> lock(g_provider_write_mutex);
     int idx = g_provider_index.load(std::memory_order_relaxed);
     int next = 1 - idx;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void rac_benchmark_set_metrics_provider(rac_benchmark_metrics_provider_fn provider,
void* user_data) {
if (provider == nullptr) {
g_provider.store(nullptr, std::memory_order_release);
return;
}
// Use double-buffering to avoid data races on the provider struct
int idx = g_provider_index.load(std::memory_order_relaxed);
int next = 1 - idx;
g_provider_storage[next].fn = provider;
g_provider_storage[next].user_data = user_data;
g_provider.store(&g_provider_storage[next], std::memory_order_release);
g_provider_index.store(next, std::memory_order_relaxed);
}
#include <mutex>
namespace {
struct MetricsProvider {
rac_benchmark_metrics_provider_fn fn = nullptr;
void* user_data = nullptr;
};
std::atomic<MetricsProvider*> g_provider{nullptr};
MetricsProvider g_provider_storage[2];
std::atomic<int> g_provider_index{0};
std::mutex g_provider_write_mutex;
} // namespace
void rac_benchmark_set_metrics_provider(rac_benchmark_metrics_provider_fn provider,
void* user_data) {
if (provider == nullptr) {
g_provider.store(nullptr, std::memory_order_release);
return;
}
std::lock_guard<std::mutex> lock(g_provider_write_mutex);
int idx = g_provider_index.load(std::memory_order_relaxed);
int next = 1 - idx;
g_provider_storage[next].fn = provider;
g_provider_storage[next].user_data = user_data;
g_provider.store(&g_provider_storage[next], std::memory_order_release);
g_provider_index.store(next, std::memory_order_relaxed);
}

Comment on lines +1024 to +1031
jclass callbackClass = env->GetObjectClass(tokenCallback);
jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "(Ljava/lang/String;)Z");
env->DeleteLocalRef(callbackClass);

if (!onTokenMethod) {
LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method");
return nullptr;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's get the file and check the critical lines
if [ -f "sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp" ]; then
  echo "=== Lines around 1025 (timing variant) ==="
  sed -n '1020,1035p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
  
  echo ""
  echo "=== Lines around 911 (non-timing variant) ==="
  sed -n '905,920p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
  
  echo ""
  echo "=== Lines around 718-726 (llm_stream_callback_token) ==="
  sed -n '715,730p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
else
  echo "File not found"
fi

Repository: RunanywhereAI/runanywhere-sdks

Length of output: 1794


🏁 Script executed:

#!/bin/bash
# Search for the callback interface definitions in Kotlin/Java
echo "=== Searching for onToken callback interface ==="
rg -n "onToken|fun onToken" --type=kt --type=java -B 2 -A 2

Repository: RunanywhereAI/runanywhere-sdks

Length of output: 166


🏁 Script executed:

#!/bin/bash
# Check the context of llm_stream_callback_token to understand the shared context
echo "=== Looking at llm_stream_callback_token and callback context ==="
sed -n '690,750p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp

Repository: RunanywhereAI/runanywhere-sdks

Length of output: 1893


🏁 Script executed:

#!/bin/bash
# Search for where callbacks are set up in both functions
echo "=== Timing variant setup (around function definition) ==="
sed -n '998,1050p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp

echo ""
echo "=== Non-timing variant setup ==="
sed -n '884,920p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp

Repository: RunanywhereAI/runanywhere-sdks

Length of output: 3403


🏁 Script executed:

#!/bin/bash
# Search for where the callbacks are passed to C API functions
echo "=== Searching for rac_llm_component_generate_stream calls ==="
rg -n "rac_llm_component_generate_stream" sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp -B 2 -A 2

Repository: RunanywhereAI/runanywhere-sdks

Length of output: 1881


🏁 Script executed:

#!/bin/bash
# Look for Java callback interface definitions
echo "=== Searching for callback interface or listener definitions ==="
fd "\.java" sdk/runanywhere-commons --exec grep -l "onToken" {} \;
fd "\.kt" sdk/runanywhere-commons --exec grep -l "onToken" {} \;

Repository: RunanywhereAI/runanywhere-sdks

Length of output: 138


Critical: JNI method signature mismatch causes runtime failure.

Line 1025 registers onToken with signature (Ljava/lang/String;)Z (expects String), but both the timing and non-timing variants use the same callback handler, llm_stream_callback_token (lines 718–726), which creates a jbyteArray and passes it to CallBooleanMethod. This signature mismatch will cause a JNI method lookup failure or type error at runtime.

The non-timing variant at line 911 correctly uses ([B)Z (byte array). The timing variant must match:

     jclass callbackClass = env->GetObjectClass(tokenCallback);
-    jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "(Ljava/lang/String;)Z");
+    jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "([B)Z");
     env->DeleteLocalRef(callbackClass);
📝 Committable suggestion


Suggested change
jclass callbackClass = env->GetObjectClass(tokenCallback);
jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "(Ljava/lang/String;)Z");
env->DeleteLocalRef(callbackClass);
if (!onTokenMethod) {
LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method");
return nullptr;
}
jclass callbackClass = env->GetObjectClass(tokenCallback);
jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "([B)Z");
env->DeleteLocalRef(callbackClass);
if (!onTokenMethod) {
LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method");
return nullptr;
}

…ce condition, token counts

- Stop-sequence sliding window (llamacpp_backend.cpp): port the
  Utf8State/stop_window approach from generate_stream to the timing
  variant generate_stream_with_timing, matching the non-timing variant's
  behavior exactly.

- PRId32 format (rac_benchmark_log.cpp): change %d to PRId32 for all
  int32_t fields to match the PRId64 convention for int64_t fields.

- Mutex guard (rac_benchmark_metrics.cpp): add static std::mutex around
  the double-buffer write path in rac_benchmark_set_metrics_provider to
  prevent torn fn/user_data pairs. Reader side remains lock-free.

- Actual token counts (llamacpp_backend.cpp + llm_component.cpp): write
  tokens_generated to timing_out->output_tokens in the backend; read
  backend-populated prompt_tokens/output_tokens in the component layer
  instead of overwriting with estimate_tokens() heuristics.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 587-770: The two functions generate_stream_with_timing and
generate_stream are duplicated; refactor by merging them into a single
implementation (e.g., keep generate_stream and add an optional
rac_benchmark_timing_t* timing_out = nullptr parameter) and move all shared
logic (prompt building via build_prompt, tokenization via common_tokenize,
decode loop using llama_sampler_sample/accept, llama_decode, stop-sequence
handling, UTF-8 buffering, callbacks, llama_batch management, and
llama_memory_clear) into that unified function; guard the timing writes
(t2_prefill_start_ms, t3_prefill_end_ms, t5_last_token_ms, output_tokens) with
null checks as currently done in generate_stream_with_timing, update callers
(including generate_stream_with_timing wrapper or remove it) to pass timing_out
when needed, and remove the duplicate generate_stream_with_timing body so only
the single unified function remains.
- Around line 587-770: The method
LlamaCppTextGeneration::generate_stream_with_timing computes prompt_tokens but
never stores it into timing_out, so downstream benchmarking falls back to
estimates; fix by, after computing prompt_tokens (the variable set from
tokens_list.size()) and after checking timing_out != nullptr, assign
timing_out->prompt_tokens = prompt_tokens (ensure timing_out is non-null before
writing). Place this write early (where out_prompt_tokens is set) in
generate_stream_with_timing so the actual tokenizer count is preserved for later
checks.

In `@sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp`:
- Around line 75-85: The JSON numeric formatting in rac_benchmark_timing_to_json
is locale-dependent because it uses snprintf with "%.2f" (for ttft_ms,
prefill_ms, decode_ms_val, e2e_ms, tps), which can emit commas as decimal
separators on some locales; change this to a locale-independent formatter:
either wrap the formatting in a RAII locale guard that sets LC_NUMERIC to "C"
while formatting and restores the previous locale, or (preferably) replace
snprintf usage with a locale-independent API such as std::to_chars (C++17) or
platform-specific snprintf_l/uselocale calls to produce a dot decimal separator
for all locales; ensure the replacement formats values to two decimal places and
updates the same JSON string assembly that currently references ttft_ms,
prefill_ms, decode_ms_val, e2e_ms, and tps.

In `@sdk/runanywhere-commons/src/features/llm/llm_component.cpp`:
- Around line 890-900: The completion token fallback currently always calls
estimate_tokens when timing_out->output_tokens is not set by the backend; change
the fallback to prefer a locally-known token count before estimating: in the
branch that sets final_result.completion_tokens (checking timing_out and
timing_out->output_tokens), if timing_out->output_tokens is missing or zero,
assign final_result.completion_tokens from ctx.completion_tokens (or
ctx.prompt_tokens if ctx.completion_tokens is not available) and only call
estimate_tokens(ctx.full_text.c_str()) as the last resort; update the logic
around timing_out, final_result.completion_tokens, ctx.completion_tokens, and
estimate_tokens accordingly so LlamaCPP backend omissions don’t force
estimation.
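For the locale-independence fix above, a minimal `std::to_chars`-based formatter could look like this (function name is illustrative; floating-point `to_chars` requires a recent standard library, e.g. GCC 11+ or MSVC):

```cpp
#include <cassert>
#include <charconv>
#include <string>

// std::to_chars always emits '.' as the decimal separator, regardless of
// the process locale, unlike snprintf("%.2f", ...).
static std::string format2(double v) {
    char buf[32];
    auto res = std::to_chars(buf, buf + sizeof(buf), v,
                             std::chars_format::fixed, 2);
    return std::string(buf, res.ptr);
}
```

The same helper can then format ttft_ms, prefill_ms, decode_ms, e2e_ms, and tps in the JSON assembly without any locale guard.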
🧹 Nitpick comments (2)
sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp (1)

149-153: Format specifiers for int32_t fields use %d — acceptable for logging but inconsistent with the CSV path.

The CSV serialization (line 120) correctly uses PRId32 for prompt_tokens, output_tokens, status, and error_code, but the log line here uses plain %d. This works in practice on all target platforms (where int32_t is int) but is technically non-portable.
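The portable form in one place, for comparison (field names here are illustrative, not the actual log format):

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// PRId64/PRId32 expand to the correct conversion specifiers for the
// fixed-width types on every platform, unlike plain %d for int32_t.
static std::string format_timing_line(int64_t t0_ms, int32_t error_code) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "t0=%" PRId64 " error_code=%" PRId32,
                  t0_ms, error_code);
    return buf;
}
```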

sdk/runanywhere-commons/src/features/llm/llm_component.cpp (1)

721-962: Consider unifying with rac_llm_component_generate_stream to reduce duplication.

This ~240-line function is structurally identical to rac_llm_component_generate_stream (lines 525–719) with timing instrumentation added. The event-emission boilerplate and validation logic are repeated verbatim. A shared helper or a single implementation with an optional timing_out parameter would reduce the maintenance surface.

Lower priority than the backend refactor since most of the duplicated code here is straightforward boilerplate.

Comment on lines +587 to +770
bool LlamaCppTextGeneration::generate_stream_with_timing(const TextGenerationRequest& request,
TextStreamCallback callback,
int* out_prompt_tokens,
rac_benchmark_timing_t* timing_out) {
std::lock_guard<std::mutex> lock(mutex_);

if (!is_ready()) {
LOGE("Model not ready for generation");
return false;
}

cancel_requested_.store(false);

std::string prompt = build_prompt(request);
LOGI("Generating with timing, prompt length: %zu", prompt.length());

const auto tokens_list = common_tokenize(context_, prompt, true, true);

int n_ctx = llama_n_ctx(context_);
int prompt_tokens = static_cast<int>(tokens_list.size());

if (out_prompt_tokens) {
*out_prompt_tokens = prompt_tokens;
}

int available_tokens = n_ctx - prompt_tokens - 4;

if (available_tokens <= 0) {
LOGE("Prompt too long: %d tokens, context size: %d", prompt_tokens, n_ctx);
return false;
}

int effective_max_tokens = std::min(request.max_tokens, available_tokens);
if (effective_max_tokens < request.max_tokens) {
LOGI("Capping max_tokens: %d → %d (context=%d, prompt=%d tokens)", request.max_tokens,
effective_max_tokens, n_ctx, prompt_tokens);
}
LOGI("Generation with timing: prompt_tokens=%d, max_tokens=%d, context=%d", prompt_tokens,
effective_max_tokens, n_ctx);

llama_batch batch = llama_batch_init(n_ctx, 0, 1);

batch.n_tokens = 0;
for (size_t i = 0; i < tokens_list.size(); i++) {
common_batch_add(batch, tokens_list[i], i, {0}, false);
}
batch.logits[batch.n_tokens - 1] = true;

// t2: Record prefill start (before llama_decode for prompt)
if (timing_out != nullptr) {
timing_out->t2_prefill_start_ms = rac_monotonic_now_ms();
}

if (llama_decode(context_, batch) != 0) {
LOGE("llama_decode failed for prompt");
if (timing_out != nullptr) {
int64_t now = rac_monotonic_now_ms();
timing_out->t3_prefill_end_ms = now;
timing_out->t5_last_token_ms = now;
}
llama_batch_free(batch);
return false;
}

// t3: Record prefill end (after llama_decode returns)
if (timing_out != nullptr) {
timing_out->t3_prefill_end_ms = rac_monotonic_now_ms();
}

llama_sampler_reset(sampler_);

const auto vocab = llama_model_get_vocab(model_);

static const std::vector<std::string> STOP_SEQUENCES = {
"<|im_end|>", "<|eot_id|>", "</s>", "<|end|>", "<|endoftext|>",
"\n\nUser:", "\n\nHuman:",
};

static const size_t MAX_STOP_LEN = []{
size_t m = 0;
for (const auto& s : STOP_SEQUENCES) m = std::max(m, s.size());
return m;
}();

std::string stop_window;
stop_window.reserve(MAX_STOP_LEN * 2);

std::string partial_utf8_buffer;
partial_utf8_buffer.reserve(8);

int n_cur = batch.n_tokens;
int tokens_generated = 0;
bool stop_sequence_hit = false;

while (tokens_generated < effective_max_tokens && !cancel_requested_.load()) {
const llama_token new_token_id = llama_sampler_sample(sampler_, context_, -1);

llama_sampler_accept(sampler_, new_token_id);

if (llama_vocab_is_eog(vocab, new_token_id)) {
LOGI("End of generation token received");
break;
}

const std::string new_token_chars =
common_token_to_piece(context_, new_token_id);

partial_utf8_buffer.append(new_token_chars);

Utf8State scanner_state;
size_t valid_upto = 0;
for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i]));
if (scanner_state.state == 0) {
valid_upto = i + 1;
}
}

if (valid_upto > 0) {
std::string valid_chunk = partial_utf8_buffer.substr(0, valid_upto);
stop_window.append(valid_chunk);
partial_utf8_buffer.erase(0, valid_upto);

size_t found_stop_pos = std::string::npos;
for (const auto& stop_seq : STOP_SEQUENCES) {
size_t pos = stop_window.find(stop_seq);
if (pos != std::string::npos) {
if (found_stop_pos == std::string::npos || pos < found_stop_pos) {
found_stop_pos = pos;
}
}
}

if (found_stop_pos != std::string::npos) {
LOGI("Stop sequence detected");
stop_sequence_hit = true;
if (found_stop_pos > 0) {
if (!callback(stop_window.substr(0, found_stop_pos))) {
cancel_requested_.store(true);
}
}
break;
}

if (stop_window.size() > MAX_STOP_LEN) {
size_t safe_len = stop_window.size() - MAX_STOP_LEN;
if (!callback(stop_window.substr(0, safe_len))) {
LOGI("Generation cancelled by callback");
cancel_requested_.store(true);
break;
}
stop_window.erase(0, safe_len);
}
}

batch.n_tokens = 0;
common_batch_add(batch, new_token_id, n_cur, {0}, true);

n_cur++;
tokens_generated++;

if (llama_decode(context_, batch) != 0) {
LOGE("llama_decode failed during generation");
break;
}
}

// t5: Record last token time (decode loop exit)
if (timing_out != nullptr) {
timing_out->t5_last_token_ms = rac_monotonic_now_ms();
timing_out->output_tokens = static_cast<int32_t>(tokens_generated);
}

if (!cancel_requested_.load() && !stop_sequence_hit && !stop_window.empty()) {
callback(stop_window);
}

llama_memory_clear(llama_get_memory(context_), true);

llama_batch_free(batch);

LOGI("Generation with timing complete: %d tokens", tokens_generated);
return !cancel_requested_.load();
}
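The loop above flushes only the longest prefix of the buffer that ends on a UTF-8 codepoint boundary, so multi-byte characters split across token pieces are never emitted half-formed. A minimal sketch of such a boundary scanner, assuming semantics similar to the `Utf8State` used here (the real struct lives in the backend; this is an illustrative reimplementation):

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of a UTF-8 boundary scanner with the same interface
// shape as the Utf8State above: state == 0 means "at a codepoint boundary",
// otherwise it counts the continuation bytes still expected.
struct Utf8ScannerSketch {
    int state = 0;  // remaining continuation bytes expected

    void process(uint8_t byte) {
        if (state > 0) {
            // Expecting a continuation byte (10xxxxxx); anything else resets.
            state = (byte & 0xC0) == 0x80 ? state - 1 : 0;
        } else if ((byte & 0x80) == 0x00) {
            state = 0;              // ASCII byte
        } else if ((byte & 0xE0) == 0xC0) {
            state = 1;              // lead byte of a 2-byte sequence
        } else if ((byte & 0xF0) == 0xE0) {
            state = 2;              // lead byte of a 3-byte sequence
        } else if ((byte & 0xF8) == 0xF0) {
            state = 3;              // lead byte of a 4-byte sequence
        }
        // Note: this sketch treats stray/invalid lead bytes as boundaries.
    }
};

// Mirrors the scan in the decode loop: length of the longest prefix of buf
// that ends on a codepoint boundary.
size_t valid_utf8_prefix(const std::string& buf) {
    Utf8ScannerSketch s;
    size_t valid_upto = 0;
    for (size_t i = 0; i < buf.size(); ++i) {
        s.process(static_cast<uint8_t>(buf[i]));
        if (s.state == 0) {
            valid_upto = i + 1;
        }
    }
    return valid_upto;
}
```

With this, a buffer ending in the first byte of a two-byte sequence (e.g. `"a\xC3"`) yields a valid prefix of 1, so the dangling lead byte stays buffered until its continuation arrives, exactly as the streaming loop does with `partial_utf8_buffer`.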

🛠️ Refactor suggestion | 🟠 Major

~180 lines of near-identical code duplicated from generate_stream.

generate_stream_with_timing is a copy of generate_stream with only timing instrumentation added (t2/t3/t5 null-checked writes). This violates DRY and creates a maintenance burden — any bug fix or feature change to the decode loop, stop-sequence handling, UTF-8 validation, etc., must be applied in both places.

Consider refactoring generate_stream to accept an optional rac_benchmark_timing_t* parameter (defaulting to nullptr), and guarding the timing writes with null checks. The existing generate_stream can then delegate to the unified implementation.

Sketch
-bool LlamaCppTextGeneration::generate_stream(const TextGenerationRequest& request,
-                                             TextStreamCallback callback,
-                                             int* out_prompt_tokens) {
-    // ... ~150 lines of logic ...
-}
-
-bool LlamaCppTextGeneration::generate_stream_with_timing(const TextGenerationRequest& request,
-                                                         TextStreamCallback callback,
-                                                         int* out_prompt_tokens,
-                                                         rac_benchmark_timing_t* timing_out) {
-    // ... ~180 lines of nearly identical logic ...
-}
+bool LlamaCppTextGeneration::generate_stream(const TextGenerationRequest& request,
+                                             TextStreamCallback callback,
+                                             int* out_prompt_tokens) {
+    return generate_stream_with_timing(request, callback, out_prompt_tokens, nullptr);
+}
+
+bool LlamaCppTextGeneration::generate_stream_with_timing(const TextGenerationRequest& request,
+                                                         TextStreamCallback callback,
+                                                         int* out_prompt_tokens,
+                                                         rac_benchmark_timing_t* timing_out) {
+    // Single unified implementation with null-guarded timing writes
+}
🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp` around
lines 587 - 770, The two functions generate_stream_with_timing and
generate_stream are duplicated; refactor by merging them into a single
implementation (e.g., keep generate_stream and add an optional
rac_benchmark_timing_t* timing_out = nullptr parameter) and move all shared
logic (prompt building via build_prompt, tokenization via common_tokenize,
decode loop using llama_sampler_sample/accept, llama_decode, stop-sequence
handling, UTF-8 buffering, callbacks, llama_batch management, and
llama_memory_clear) into that unified function; guard the timing writes
(t2_prefill_start_ms, t3_prefill_end_ms, t5_last_token_ms, output_tokens) with
null checks as currently done in generate_stream_with_timing, update callers
(including generate_stream_with_timing wrapper or remove it) to pass timing_out
when needed, and remove the duplicate generate_stream_with_timing body so only
the single unified function remains.

⚠️ Potential issue | 🟠 Major

timing_out->prompt_tokens is never set — backend's actual token count is lost.

Line 606 computes the real prompt_tokens from the tokenizer, and line 608-610 writes it to *out_prompt_tokens. However, timing_out->prompt_tokens is never populated in this method. At the component layer (llm_component.cpp:890), the code checks timing_out->prompt_tokens > 0 and falls back to estimate_tokens() when it's zero — so the actual tokenizer count is silently discarded for benchmarking.

Proposed fix
     if (out_prompt_tokens) {
         *out_prompt_tokens = prompt_tokens;
     }
+    if (timing_out != nullptr) {
+        timing_out->prompt_tokens = static_cast<int32_t>(prompt_tokens);
+    }
🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp` around
lines 587 - 770, The method LlamaCppTextGeneration::generate_stream_with_timing
computes prompt_tokens but never stores it into timing_out, so downstream
benchmarking falls back to estimates; fix by, after computing prompt_tokens (the
variable set from tokens_list.size()) and after checking timing_out != nullptr,
assign timing_out->prompt_tokens = prompt_tokens (ensure timing_out is non-null
before writing). Place this write early (where out_prompt_tokens is set) in
generate_stream_with_timing so the actual tokenizer count is preserved for later
checks.

Comment on lines +75 to +85
char buf[64];
snprintf(buf, sizeof(buf), "%.2f", ttft_ms);
json += "\"ttft_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", prefill_ms);
json += "\"prefill_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", decode_ms_val);
json += "\"decode_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", e2e_ms);
json += "\"e2e_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", tps);
json += "\"decode_tps\":" + std::string(buf);

⚠️ Potential issue | 🟠 Major

Locale-dependent snprintf %.2f can produce invalid JSON on non-English locales.

snprintf with %f is locale-sensitive in C/C++. On locales that use , as the decimal separator (e.g., de_DE), "ttft_ms":1,23 would be emitted — breaking JSON parsing. Since this targets cross-platform (iOS/Android), locale variation is realistic.

Consider either setting LC_NUMERIC to "C" before formatting (and restoring after), or using a locale-independent formatter.

Proposed fix: use a RAII locale guard
+#include <clocale>
+
+namespace {
+
+struct CLocaleGuard {
+    std::string saved;
+    CLocaleGuard() {
+        const char* prev = std::setlocale(LC_NUMERIC, nullptr);
+        if (prev) saved = prev;
+        std::setlocale(LC_NUMERIC, "C");
+    }
+    ~CLocaleGuard() {
+        std::setlocale(LC_NUMERIC, saved.c_str());
+    }
+};
+
+}  // namespace

Then at the top of rac_benchmark_timing_to_json:

 char* rac_benchmark_timing_to_json(const rac_benchmark_timing_t* timing) {
     if (timing == nullptr) {
         return nullptr;
     }
+    CLocaleGuard locale_guard;

Note: setlocale is not thread-safe in C/C++. For a thread-safe alternative, consider snprintf_l (POSIX), uselocale (POSIX), or std::to_chars (C++17) which is locale-independent.

🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp` around lines 75 - 85,
The JSON numeric formatting in rac_benchmark_timing_to_json is locale-dependent
because it uses snprintf with "%.2f" (for ttft_ms, prefill_ms, decode_ms_val,
e2e_ms, tps), which can emit commas as decimal separators on some locales;
change this to a locale-independent formatter: either wrap the formatting in a
RAII locale guard that sets LC_NUMERIC to "C" while formatting and restores the
previous locale, or (preferably) replace snprintf usage with a
locale-independent API such as std::to_chars (C++17) or platform-specific
snprintf_l/uselocale calls to produce a dot decimal separator for all locales;
ensure the replacement formats values to two decimal places and updates the same
JSON string assembly that currently references ttft_ms, prefill_ms,
decode_ms_val, e2e_ms, and tps.
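The note above suggests `std::to_chars` (C++17) as a locale-independent alternative to `snprintf`. A minimal sketch of such a formatter (the helper name `format_fixed2` is illustrative, not from the PR):

```cpp
#include <charconv>
#include <string>

// Locale-independent fixed-point formatting via std::to_chars (C++17).
// Unlike snprintf("%.2f", ...), this always emits '.' as the decimal
// separator, regardless of the current LC_NUMERIC setting.
static std::string format_fixed2(double value) {
    char buf[64];
    auto [ptr, ec] = std::to_chars(buf, buf + sizeof(buf), value,
                                   std::chars_format::fixed, 2);
    if (ec != std::errc{}) {
        return "0.00";  // conversion failed (e.g. buffer too small)
    }
    return std::string(buf, ptr);
}
```

`format_fixed2(123.456)` yields `"123.46"` under any locale, so the JSON assembly above could use it in place of each `snprintf`/`buf` pair without a locale guard.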

Comment on lines +890 to +900
if (timing_out != nullptr && timing_out->prompt_tokens > 0) {
final_result.prompt_tokens = timing_out->prompt_tokens;
} else {
final_result.prompt_tokens = ctx.prompt_tokens;
}

if (timing_out != nullptr && timing_out->output_tokens > 0) {
final_result.completion_tokens = timing_out->output_tokens;
} else {
final_result.completion_tokens = estimate_tokens(ctx.full_text.c_str());
}

⚠️ Potential issue | 🟠 Major

Fallback to estimate_tokens will always trigger until the backend bug is fixed.

This code correctly prefers backend-provided token counts from timing_out->prompt_tokens, but the LlamaCPP backend never sets timing_out->prompt_tokens (see review on llamacpp_backend.cpp). Until that's fixed, this will always use the estimate, undermining benchmark accuracy.

🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/features/llm/llm_component.cpp` around lines 890
- 900, The completion token fallback currently always calls estimate_tokens when
timing_out->output_tokens is not set by the backend; change the fallback to
prefer a locally-known token count before estimating: in the branch that sets
final_result.completion_tokens (checking timing_out and
timing_out->output_tokens), if timing_out->output_tokens is missing or zero,
assign final_result.completion_tokens from ctx.completion_tokens (or
ctx.prompt_tokens if ctx.completion_tokens is not available) and only call
estimate_tokens(ctx.full_text.c_str()) as the last resort; update the logic
around timing_out, final_result.completion_tokens, ctx.completion_tokens, and
estimate_tokens accordingly so LlamaCPP backend omissions don’t force
estimation.
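The preference order the review describes — backend-reported count, then a locally tracked count, then estimation as a last resort — can be sketched as a standalone helper. All names below are illustrative stand-ins for the PR's fields, not its actual API:

```cpp
#include <cstdint>
#include <string>

// Stand-in for the component's estimate_tokens(): a crude ~4-chars-per-token
// heuristic, purely so the sketch is self-contained.
static int32_t estimate_tokens_stub(const std::string& text) {
    return static_cast<int32_t>((text.size() + 3) / 4);
}

// Hedged sketch of the suggested fallback chain for completion tokens:
// prefer the backend-reported count (timing_out->output_tokens), then a
// locally tracked count (ctx.completion_tokens), and only estimate last.
static int32_t resolve_completion_tokens(int32_t backend_count,
                                         int32_t local_count,
                                         const std::string& full_text) {
    if (backend_count > 0) {
        return backend_count;                  // most accurate source
    }
    if (local_count > 0) {
        return local_count;                    // next-best, locally known
    }
    return estimate_tokens_stub(full_text);    // last resort: estimate
}
```

With this ordering, a backend that omits `output_tokens` degrades to the local count rather than jumping straight to estimation, which keeps benchmark numbers closer to ground truth.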

@Siddhesh2377
Collaborator

Hey @AmanSwar
Should this PR be merged, or is it still under construction?
