C++ Commons: Benchmark Timing Infrastructure — Complete Implementation #343
AmanSwar wants to merge 1297 commits into RunanywhereAI:main
Conversation
- RABackendLlamaCPP checksum updated for core-v0.1.4 - RABackendONNX checksum updated for core-v0.1.4 - Kotlin SDK version: 0.1.3 → 0.1.4
- runanywhere.coreVersion: 0.1.3 → 0.1.4 - runanywhere.commonsVersion: 0.1.3 → 0.1.4 This file was overriding the defaults in build.gradle.kts
…ce agent capabilities - Added base64 encoding/decoding utilities for audio data handling. - Implemented memory usage retrieval for iOS and Android platforms. - Introduced global component handles for LLM, STT, TTS, and VAD to streamline resource management. - Enhanced voice agent functionality with methods for processing voice turns and generating responses. - Updated TypeScript definitions for TTS output to include additional fields for backward compatibility. - Improved error handling and logging across various components for better debugging and user experience.
- Updated ModelSelectionSheet to check for valid CPU cores before rendering. - Added consumer rules for ArchiveUtility in Android to ensure proper JNI access. - Enhanced cpp-adapter.cpp with logging and exception handling for better debugging during JNI calls. - Refactored ArchiveUtility.kt to utilize Apache Commons Compress for efficient tar.gz extraction and improved error logging. - Cleaned up unused code and improved overall structure for better maintainability.
- Updated COMMONS_VERSION, LLAMACPP_VERSION, ONNX_BACKEND_VERSION, and CORE_VERSION to 0.1.4 in respective podspec and build.gradle files. - Adjusted version files in iOS and Android frameworks to reflect the new versioning. - Ensured consistency in versioning across all relevant components for better maintainability.
- Updated .gitignore to exclude downloaded XCFrameworks and native libraries for iOS and Android. - Removed obsolete version files and header files related to LlamaCpp and ONNX packages. - Ensured that the project structure is cleaner by eliminating unnecessary files, improving maintainability.
…d_rearch [Kotlin-SDK] [Kotlin-Example] KMP SDK refactor and aligning to iOS
- Downgraded 'record' dependency version from ^6.1.2 to ^6.1.0 in pubspec.yaml. - Enhanced error handling for plugin registration in GeneratedPluginRegistrant.java and GeneratedPluginRegistrant.m. - Removed unused ChangeNotifierProvider from runanywhere_ai_app.dart. - Updated model download progress handling in model_manager.dart and model_selection_sheet.dart. - Cleaned up unused code and improved comments across various files for better maintainability. - Added new files for ONNX and LlamaCpp plugins, including their respective podspecs and Gradle configurations. - Removed deprecated native libraries and related files from the project.
- Introduced SDKLogger for structured logging in both iOS and Android. - Replaced print and NSLog statements with SDKLogger methods for consistency. - Added SwiftLint rules to enforce SDKLogger usage in Swift files. - Updated various components to utilize the new logging system, enhancing error tracking and debugging capabilities. - Configured logging settings in package.json to disallow console.log usage, ensuring all logs are routed through SDKLogger.
- Introduced a new Lefthook configuration file for managing Git hooks. - Updated Podfile to use static linkage for RACommons symbols and improved comments for clarity. - Removed deprecated entries from project.pbxproj to streamline the build process. - Enhanced model management in runanywhere_ai_app.dart by simplifying model registration and unloading logic. - Cleaned up unused code and improved comments across various files for better maintainability. - Added new markdown files for Flutter SDK to Swift SDK migration analysis and parity migration plan.
…ed stability - Added a new isolate for LLM generation to prevent heap corruption from C++ background threads. - Introduced a helper class for isolate communication and error handling. - Updated the generate method to utilize the new isolate approach, enhancing performance and reliability.
- Enhanced the LLM generation process to support real-time streaming of tokens using isolates. - Introduced a new `LLMStreamingResult` class to encapsulate the streaming output, final result, and cancellation functionality. - Updated the `generateStream` method to return a stream of tokens and a future for final metrics, allowing for better user interaction and control. - Improved error handling and communication between isolates for robust streaming operations.
- Refactored ModelManager to reduce complexity by removing event listeners and internal state management. - Updated methods to directly interact with the RunAnywhere SDK for model loading, unloading, and state checks. - Enhanced chat and voice features to utilize SDK state directly, improving synchronization and reducing reliance on ModelManager. - Cleaned up unused code and improved comments for better maintainability across various files.
- Added TTS synthesis capabilities in the DartBridgeTTS class, allowing for speech generation from text with customizable rate, pitch, and volume. - Enhanced the TTSResult class to encapsulate audio samples, sample rate, and duration for synthesized speech. - Integrated TTS synthesis into the RunAnywhere class, providing a straightforward API for users to synthesize speech. - Implemented audio playback from synthesized samples, ensuring seamless user experience. - Updated comments and documentation for clarity and maintainability across the TTS-related files.
- Implemented checks for voice agent readiness in the SDK, ensuring all required components (STT, LLM, TTS) are loaded before starting a voice session. - Refactored voice session handling to utilize the new SDK API for starting sessions, improving integration and reliability. - Introduced detailed error handling for voice agent initialization failures, providing clearer feedback to users. - Added new types for voice agent component states, enhancing the API for checking model readiness. - Updated documentation and comments for clarity and maintainability across voice-related files.
- Added a check to determine if audio data is already in WAV format before conversion, enhancing efficiency. - Updated the DartBridgeVoiceAgent to reflect that synthesized audio is now in WAV format, eliminating unnecessary conversions. - Refactored the RunAnywhere class to directly use WAV audio data for playback, streamlining the audio processing workflow. - Improved documentation and comments for clarity regarding audio data handling.
…le files - Streamlined import statements in various files to use package imports for consistency and clarity. - Removed unnecessary export statements and organized the structure of the public API for better maintainability. - Enhanced code readability by applying consistent formatting and indentation in the conversation store and event publisher classes. - Updated comments and documentation to reflect changes and improve understanding of the codebase. - Ensured that fire-and-forget patterns are clearly indicated with unawaited calls in relevant sections.
…ations Optimise energy_vad
…uz/ios-ui-update iOS UI enhancements: chat stability and UX improvements
Enable running the SDK on Android Studio x86_64 emulators by adding x86_64 to the abiFilters in backend modules. The native libraries are already built for x86_64 in the CI/CD pipeline. Files updated: - runanywhere-core-llamacpp (Kotlin) - runanywhere-core-onnx (Kotlin) - @runanywhere/core (React Native) - @runanywhere/onnx (React Native) - @runanywhere/llamacpp (React Native) Fixes RunanywhereAI#325
Changes:
- Add ModelRequiredOverlay for onboarding experience - Add VoiceAssistantParticleView for visual effects - Add ModelLogoHelper utility and logo assets - Refactor all screens (Chat, STT, TTS, Voice, Settings) with UI improvements - Configure production API credentials - Bump version to 0.1.5 +1,897 insertions, -1,436 deletions
Optimised llamacpp_backend.cpp
* Optimise resampling and reduce heap thrashing * Improve integer-ratio paths further * Update whispercpp_backend.cpp
Autonomous Android agent that navigates phone UI to accomplish user goals. Uses accessibility service for screen reading/actions, GPT-4o Vision for visual understanding, unified tool calling for all UI actions, and on-device LLM fallback via RunAnywhere SDK. Includes voice mode with Whisper STT and TTS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-add files excluded by root .gitignore but required to build the android-use-agent project. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Block Bixby packages from app search results
- Add Samsung-specific package fallbacks
- Fix X/Twitter launch with explicit ComponentName intent
- Update prompts to use current app names ("X" not "Twitter")
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…4-emulator-support Add x86_64 ABI support for Android emulator development
- Bump swift-asn1 dependency version to 1.5.1 - Refactor ChatScreen to improve input handling and logging - Introduce ModelRequiredOverlay for model selection prompts - Add VoiceAssistantParticleView for enhanced visual effects - Implement utility functions for model logo retrieval - Add new drawable resources for model logos +1,200 insertions, -800 deletions
…rt and model loaded notifications - Introduce MarkdownText component for rendering markdown in chat messages. - Add ModelLoadedToast component to notify users when a model is loaded. - Update ChatScreen, SpeechToTextScreen, and TextToSpeechScreen to integrate model loaded notifications. - Refactor MessageBubbleView to conditionally render markdown content for assistant messages. - Improve UI responsiveness with animations and state management for model loading. +500 insertions, -200 deletions
…ywhereAI#331) * feat(android): major UI refactor and production setup (v0.1.5) - Add ModelRequiredOverlay for onboarding experience - Add VoiceAssistantParticleView for visual effects - Add ModelLogoHelper utility and logo assets - Refactor all screens (Chat, STT, TTS, Voice, Settings) with UI improvements - Configure production API credentials - Bump version to 0.1.5 +1,897 insertions, -1,436 deletions * feat(android): update dependencies and enhance chat UI - Bump swift-asn1 dependency version to 1.5.1 - Refactor ChatScreen to improve input handling and logging - Introduce ModelRequiredOverlay for model selection prompts - Add VoiceAssistantParticleView for enhanced visual effects - Implement utility functions for model logo retrieval - Add new drawable resources for model logos +1,200 insertions, -800 deletions * feat(android): enhance chat and speech-to-text UI with markdown support and model loaded notifications - Introduce MarkdownText component for rendering markdown in chat messages. - Add ModelLoadedToast component to notify users when a model is loaded. - Update ChatScreen, SpeechToTextScreen, and TextToSpeechScreen to integrate model loaded notifications. - Refactor MessageBubbleView to conditionally render markdown content for assistant messages. - Improve UI responsiveness with animations and state management for model loading. +500 insertions, -200 deletions --------- Co-authored-by: Sanchit Monga <sanchitmonga22@gmail.com>
…d/android-use-agent feat: Add Android Use Agent to Playground
…al UTF-8 decode) (RunanywhereAI#342) * Fix JNI streaming crash by decoding UTF-8 incrementally in Kotlin * Remove excessive test logs
…ng, stats, and tests Fix reviewer-flagged bugs: add error_code field to timing struct, set t6/error_code on all error paths, capture prompt tokens from backend, fix JNI local ref leak. Add extended metrics provider interface, JSON/CSV/log serialization, statistical analysis with percentiles and outlier detection. All 29 unit tests passing.
📝 Walkthrough

🚥 Pre-merge checks: 2 passed, 1 failed (1 warning)
Important
Looks good to me! 👍
Reviewed everything up to 148cf39 in 24 seconds.
- Reviewed 2791 lines of code in 25 files
- Skipped 0 files when reviewing
- Skipped posting 0 draft comments
Workflow ID: wflow_2ZDcSsTCNq9Ttggo
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 674-730: The timing-variant loop leaks stop-sequence text and may
drop pre-stop text; update the stop-sequence handling in this loop (symbols:
stop_sequences, accumulated_text, cached_token_chars, callback, is_valid_utf8)
to match the original generate_stream sliding-window behavior: scan only the
trailing window of accumulated_text (bounded by the longest stop sequence) to
detect a stop and compute the exact stop position, then flush callback with only
the text before the stop (including any pending valid UTF-8 prefix in
cached_token_chars) and discard the stop sequence before breaking; also avoid
rescanning the full accumulated_text each token to remove the O(n²) behavior.
Ensure the final flush after the loop never includes the stop-sequence bytes.
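The sliding-window behavior this prompt asks for can be sketched as follows. This is a minimal, hypothetical illustration (the names `find_stop` and `flush_before_stop` are not from the SDK): scan only a bounded trailing window of the accumulated text for a stop sequence, then flush only the text before the stop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the byte offset where a stop sequence begins in `text`, or
// std::string::npos if none is present. Scanning starts just before the
// previously seen end of text (prev_len), backed off by the longest stop
// sequence, so a stop that straddles the old/new boundary is still found
// without rescanning the whole buffer each token (avoids O(n^2) behavior).
size_t find_stop(const std::string& text,
                 const std::vector<std::string>& stops,
                 size_t prev_len) {
    size_t longest = 0;
    for (const auto& s : stops) longest = std::max(longest, s.size());
    if (longest == 0) return std::string::npos;
    size_t start = prev_len > longest ? prev_len - longest : 0;
    size_t best = std::string::npos;
    for (const auto& s : stops) {
        size_t pos = text.find(s, start);
        if (pos != std::string::npos && pos < best) best = pos;
    }
    return best;
}

// Emit only the text before the stop; the stop-sequence bytes are discarded
// and must never reach the caller, including in the final flush.
std::string flush_before_stop(const std::string& text, size_t stop_pos) {
    return stop_pos == std::string::npos ? text : text.substr(0, stop_pos);
}
```

The key invariant is the last line of the prompt: the callback only ever sees bytes strictly before the detected stop position.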
In `@sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp`:
- Around line 117-125: Replace the plain %d used for timing->error_code with the
portable int32 format macro to match the rac_result_t type and project
convention: change the format specifier in the snprintf call in
rac_benchmark_log.cpp to use PRId32 for timing->error_code (the same call that
already uses PRId64 for the int64_t fields), ensuring the corresponding format
string and argument order remain consistent.
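The `PRId32`/`PRId64` convention referenced here can be illustrated with a small self-contained example (the function and field names are hypothetical, not the actual `rac_benchmark_log.cpp` code):

```cpp
#include <cassert>
#include <cinttypes>  // PRId32 / PRId64 format macros
#include <cstdio>
#include <string>

// Mirrors the shapes in the review: int64_t timestamps, int32_t error code.
// PRId64 and PRId32 expand to the correct conversion specifier on every
// platform (e.g. "ld" vs "lld"), unlike a hard-coded %d.
std::string format_timing_line(int64_t t0_ms, int32_t error_code) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "t0=%" PRId64 " error_code=%" PRId32, t0_ms, error_code);
    return buf;
}
```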
In `@sdk/runanywhere-commons/src/core/rac_benchmark_metrics.cpp`:
- Around line 46-60: The write path in rac_benchmark_set_metrics_provider can
race when two threads read the same g_provider_index and both write to
g_provider_storage[next], causing torn provider structs; guard the assignment to
g_provider_storage[next].fn and .user_data, the store to g_provider, and the
g_provider_index update with a short-lived mutex (e.g., a static std::mutex) so
only one thread updates the double-buffer slot at a time; keep reader side
lock-free and only serialize the rare registration path to prevent mismatched
fn/user_data observations.
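The pattern this prompt describes — lock-free readers, mutex-serialized writers over a double buffer — can be sketched in a runnable, standalone form (names here are simplified stand-ins, not the SDK's actual symbols):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

using provider_fn = int (*)(void*);

struct Provider {
    provider_fn fn = nullptr;
    void* user_data = nullptr;
};

static std::atomic<Provider*> g_provider{nullptr};
static Provider g_storage[2];
static std::atomic<int> g_index{0};
static std::mutex g_write_mutex;

void set_provider(provider_fn fn, void* user_data) {
    if (fn == nullptr) {
        g_provider.store(nullptr, std::memory_order_release);
        return;
    }
    // Serialize the rare registration path so two writers never fill the
    // same slot; readers below stay lock-free.
    std::lock_guard<std::mutex> lock(g_write_mutex);
    int next = 1 - g_index.load(std::memory_order_relaxed);
    g_storage[next].fn = fn;
    g_storage[next].user_data = user_data;
    g_provider.store(&g_storage[next], std::memory_order_release);
    g_index.store(next, std::memory_order_relaxed);
}

// Lock-free reader: one acquire load yields a consistent fn/user_data pair,
// because the release store publishes a fully written slot.
int call_provider(int fallback) {
    Provider* p = g_provider.load(std::memory_order_acquire);
    return p ? p->fn(p->user_data) : fallback;
}

// Tiny provider used only for demonstration.
int demo_provider(void* ud) { return *static_cast<int*>(ud) + 1; }
static int g_demo_value = 41;
```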
In `@sdk/runanywhere-commons/src/features/llm/llm_component.cpp`:
- Around line 910-917: The code is recording estimated token counts
(final_result.prompt_tokens and final_result.completion_tokens) instead of the
backend's actual counts; update the data flow so timing_out->prompt_tokens and
timing_out->output_tokens use the actual tokenizer counts produced by the
backend (the out_prompt_tokens/out_completion_tokens set in
llamacpp_backend.cpp) rather than estimate_tokens: ensure the backend populates
these actual counts into the service-layer result object (final_result) and that
llm_component.cpp reads those actual fields when setting
timing_out->prompt_tokens and timing_out->output_tokens (or, if the fields
already exist under different names, map them into final_result before this
block) so benchmarking (e.g., prefill throughput) uses real token counts.
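The data-flow fix described here — prefer the backend's real tokenizer counts and fall back to an estimate only when none were reported — can be sketched as follows (struct and function names are hypothetical illustrations, not the actual `final_result` fields):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for the service-layer result: actual counts as reported by
// the backend tokenizer (0 means "not reported").
struct GenResult {
    int32_t prompt_tokens = 0;
    int32_t completion_tokens = 0;
};

// Crude character-based fallback (~4 chars per token), standing in for
// the estimate_tokens path mentioned in the review.
int32_t estimate_tokens(const std::string& text) {
    return static_cast<int32_t>((text.size() + 3) / 4);
}

// Benchmarking metrics such as prefill throughput should use the real
// counts whenever the backend produced them.
int32_t effective_prompt_tokens(const GenResult& r, const std::string& prompt) {
    return r.prompt_tokens > 0 ? r.prompt_tokens : estimate_tokens(prompt);
}
```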
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp`:
- Around line 1024-1031: The JNI registration for onToken in
racLlmComponentGenerateStreamWithTiming is using the wrong signature
"(Ljava/lang/String;)Z" while the callback handler llm_stream_callback_token
constructs a jbyteArray and calls CallBooleanMethod; update the GetMethodID call
that assigns onTokenMethod (using tokenCallback) to use the byte-array signature
"([B)Z" so the method signature matches llm_stream_callback_token and is
consistent with the non-timing variant (the code paths around tokenCallback,
onTokenMethod, and llm_stream_callback_token must use the same "([B)Z"
signature).
🧹 Nitpick comments (18)

sdk/runanywhere-commons/tests/benchmark/test_benchmark_stats.cpp (1)

19-43: Consider using `RAC_SUCCESS` instead of the literal `0` for `error_code`. Line 40 sets `timing.error_code = 0` — using `RAC_SUCCESS` would be more consistent with the rest of the codebase and self-documenting.

Suggested change:

```diff
- timing.error_code = 0;
+ timing.error_code = RAC_SUCCESS;
```

sdk/runanywhere-commons/tests/CMakeLists.txt (2)
36-39: Redundant include directory — already inherited from the `rac_commons` PUBLIC interface. `rac_commons` exposes `$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>` as a PUBLIC include directory (parent CMakeLists.txt, line 228). Since `rac_benchmark_tests` links against `rac_commons`, this `target_include_directories` block is unnecessary and can be removed.

Suggested removal:

```diff
-target_include_directories(rac_benchmark_tests
-    PRIVATE
-        ${CMAKE_SOURCE_DIR}/include
-)
```

7-11: Consider pinning GoogleTest with a commit hash for reproducible builds. Using `GIT_TAG v1.14.0` is fine for a release tag, but a full commit SHA would provide stronger supply-chain guarantees (tags can be force-pushed).

sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp (1)
228-278: Consider extracting request-building into a helper to reduce duplication. Lines 242–255 are a near-exact copy of the request-building logic in `rac_llm_llamacpp_generate_stream` (lines 200–213). A small static helper (e.g., `build_request_from_options`) would keep both paths in sync and reduce maintenance burden.

♻️ Example helper extraction:

```diff
+static runanywhere::TextGenerationRequest build_request(const char* prompt,
+                                                        const rac_llm_options_t* options) {
+    runanywhere::TextGenerationRequest request;
+    request.prompt = prompt;
+    if (options != nullptr) {
+        request.max_tokens = options->max_tokens;
+        request.temperature = options->temperature;
+        request.top_p = options->top_p;
+        if (options->stop_sequences != nullptr && options->num_stop_sequences > 0) {
+            for (int32_t i = 0; i < options->num_stop_sequences; i++) {
+                if (options->stop_sequences[i]) {
+                    request.stop_sequences.push_back(options->stop_sequences[i]);
+                }
+            }
+        }
+    }
+    return request;
+}
```

Then `rac_llm_llamacpp_generate_stream` and `rac_llm_llamacpp_generate_stream_with_timing` (and even `rac_llm_llamacpp_generate`) can all call `build_request(prompt, options)`.

sdk/runanywhere-commons/tests/benchmark/test_timing_struct.cpp (1)
56-73: The `TimestampOrdering` test only asserts its own test data, not production invariants. This test manually assigns ordered values and then checks they're ordered — it doesn't exercise any production code that enforces or validates timestamp ordering. It's effectively a no-op. Consider either removing it or turning it into a test for a validation function if one exists (or is planned).
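A validation function of the kind the comment suggests could look like this sketch (hypothetical, since no such helper exists in the PR yet): it checks that recorded timestamps are monotonically non-decreasing while skipping zeroed entries, which stay unset on early-error paths.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Returns true if the non-zero timestamps (t0 request start ... t6 end)
// never go backwards. A zero means "not stamped", e.g. after an error.
bool timestamps_ordered(std::initializer_list<int64_t> ts) {
    int64_t prev = 0;
    for (int64_t t : ts) {
        if (t == 0) continue;        // unset timestamp, skip
        if (t < prev) return false;  // went backwards
        prev = t;
    }
    return true;
}
```

A test against this function would exercise a real invariant instead of restating its own fixture.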
sdk/runanywhere-commons/src/core/rac_benchmark.cpp (1)

49-53: Add a defensive `static_assert` to guard against future struct evolution. `memset` is currently safe since `rac_benchmark_timing_t` contains only POD types (`int64_t`, `int32_t`). However, the struct could inadvertently acquire non-trivial members in the future (e.g., `std::string`, smart pointers), which would corrupt silently when `memset`'d. Adding a compile-time assertion prevents this.

🛡️ Proposed defensive assertion:

```diff
+#include <type_traits>
+
 void rac_benchmark_timing_init(rac_benchmark_timing_t* timing) {
     if (timing != nullptr) {
+        static_assert(std::is_trivially_copyable_v<rac_benchmark_timing_t>,
+                      "memset init requires trivially copyable type");
         std::memset(timing, 0, sizeof(rac_benchmark_timing_t));
     }
 }
```

sdk/runanywhere-commons/tests/benchmark/test_benchmark_log.cpp (1)
42-74: Good test coverage for JSON serialization. The test verifies field presence but not derived metric values. For example, with the test data: `ttft_ms` should be 65.00 (1065-1000), `prefill_ms` should be 50.00 (1060-1010), `decode_ms` should be 1005.00 (2065-1060), and `e2e_ms` should be 1070.00 (2070-1000). Adding value assertions would catch regressions in the computation logic.

💡 Optional: add value assertions for derived metrics:

```diff
     // Verify derived metrics exist
-    EXPECT_NE(s.find("\"ttft_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"prefill_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"decode_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"e2e_ms\":"), std::string::npos);
-    EXPECT_NE(s.find("\"decode_tps\":"), std::string::npos);
+    EXPECT_NE(s.find("\"ttft_ms\":65.00"), std::string::npos);
+    EXPECT_NE(s.find("\"prefill_ms\":50.00"), std::string::npos);
+    EXPECT_NE(s.find("\"decode_ms\":1005.00"), std::string::npos);
+    EXPECT_NE(s.find("\"e2e_ms\":1070.00"), std::string::npos);
+    // decode_tps = 100 / 1005.0 * 1000.0 ≈ 99.50
+    EXPECT_NE(s.find("\"decode_tps\":99.50"), std::string::npos);
```

sdk/runanywhere-commons/include/rac/features/llm/rac_llm_service.h (1)
42-55: Vtable documentation could mention error-path timing behavior. The doc describes which timestamps (t2, t3, t5) backends should capture on success, but doesn't specify what implementations should do with `timing_out` on error (e.g., should t3/t5 still be stamped?). The LlamaCPP backend does stamp t3/t5 on prefill decode failure — documenting this expectation helps future backend implementors. Based on learnings: "Public C API headers in include/rac/ must document vtable operations, error codes, and lifecycle requirements."

sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (2)
587-738: Consider extracting shared logic to reduce duplication with `generate_stream`. `generate_stream_with_timing` duplicates ~120 lines from `generate_stream` (prompt building, tokenization, batch setup, decode loop, cleanup). The only additions are ~15 lines of timing instrumentation and the (currently divergent) stop-sequence handling. This duplication is a maintenance risk — the two code paths can drift, as shown by the stop-sequence handling differences.

Consider refactoring to a single internal method that takes an optional `timing_out` parameter, similar to how `generate_stream(request, callback)` already delegates to `generate_stream(request, callback, nullptr)` for the `out_prompt_tokens` parameter.
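The shape of that refactor can be sketched in miniature (hypothetical names, with a trivial stand-in for the generation body): the timing struct is filled only when a non-null pointer is passed, so the non-timing entry point shrinks to a one-line delegate.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>

struct Timing {
    int64_t t0 = 0;  // start of request
    int64_t t6 = 0;  // end of request
};

static int64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Single implementation: timing instrumentation is conditional on the
// optional out-parameter, so both public entry points share one body.
std::string generate_impl(const std::string& prompt, Timing* timing_out) {
    if (timing_out) timing_out->t0 = now_ms();
    std::string result = "echo: " + prompt;  // stand-in for real generation
    if (timing_out) timing_out->t6 = now_ms();
    return result;
}

// Public non-timing entry point delegates with nullptr.
std::string generate(const std::string& prompt) {
    return generate_impl(prompt, nullptr);
}
```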
678-686: `static` local variable declared inside a loop body. The `static const std::vector<std::string> stop_sequences` at line 678 is declared inside the `while` loop. While functionally correct (initialized once due to `static`), this is unusual and inconsistent with the original `generate_stream`, which declares `STOP_SEQUENCES` before the loop (line 481). Moving it before the loop improves readability.

sdk/runanywhere-commons/include/rac/core/rac_benchmark.h (1)
88-102: Consider using an enum for benchmark status codes. The status codes are `#define` macros, which work fine in C but lose type safety. Since this is a C API header (with `extern "C"`), macros are acceptable. However, if you anticipate predominantly C++ consumers, a C enum with a fixed underlying type could provide better tooling support (autocomplete, switch exhaustiveness).

This is purely optional given the established pattern in the codebase.
sdk/runanywhere-commons/src/features/llm/llm_component.cpp (1)
721-951: Substantial code duplication withrac_llm_component_generate_stream.The new function (~230 lines) largely duplicates
rac_llm_component_generate_stream(lines 525-719). The timing-specific additions are ~30 lines (timing init, t0/t6 capture, error path timing, and timing finalization). This duplication mirrors the backend-level duplication noted inllamacpp_backend.cpp.Consider refactoring both functions into a single internal implementation that accepts an optional
timing_outparameter. The non-timing public function would delegate withtiming_out = nullptr.sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp (2)
1017-1018: Unused `config` variable. `config` is assigned at line 1018 via `getNullableCString` but never referenced. The same pattern exists in the other streaming functions (lines 803, 904). Consider either removing these or actually using them to parse options from `configJson`.

1076-1118: Consider extracting the JSON-escape-and-build logic into a helper. The JSON text-escaping switch-case block and result-JSON construction are duplicated five times in this file (lines 557–586, 841–873, 957–987, 1077–1118, and the STT transcribe path). A small helper like `escapeJsonString(const std::string&)` would reduce ~200 lines of repetition and centralize correctness (e.g., the current escaper doesn't handle control characters below 0x20 other than `\n`, `\r`, `\t`, which can produce invalid JSON).

sdk/runanywhere-commons/include/rac/core/rac_benchmark_metrics.h (1)
98-106: Document NULL-pointer behavior for `rac_benchmark_capture_metrics`. The doc comment says `out` "must not be NULL", but doesn't specify what happens if the caller violates this. For a public C API, consider documenting whether passing NULL is undefined behavior or a silent no-op (consistent with the implementation in `rac_benchmark_metrics.cpp`). As per coding guidelines, public C API headers must document error codes and lifecycle requirements.

sdk/runanywhere-commons/src/core/rac_benchmark_stats.cpp (2)
259-322: `snprintf` with `%.2f` can produce `nan` or `inf` for edge-case double values, yielding invalid JSON. If any summary field is NaN or infinity (e.g., from a degenerate stddev computation), `snprintf(buf, ..., "%.2f", val)` will output `"nan"` or `"inf"`, which are not valid JSON number literals. This is unlikely given the current guards but could surface if the code evolves.

💡 Defensive check example:

```diff
+    auto safe_double = [](char* buf, size_t sz, double val) {
+        if (std::isfinite(val)) {
+            snprintf(buf, sz, "%.2f", val);
+        } else {
+            snprintf(buf, sz, "0.00");
+        }
+    };
+
     // TTFT
-    snprintf(buf, sizeof(buf), "%.2f", summary->ttft_p50_ms);
+    safe_double(buf, sizeof(buf), summary->ttft_p50_ms);
     json += "\"ttft_p50_ms\":" + std::string(buf) + ",";
```

184-194: Population stddev (÷ n) vs. sample stddev (÷ n−1). The current implementation divides by `n` (population stddev). For benchmark analysis with small sample sizes, Bessel's correction (dividing by `n - 1`) is often preferred. This is a minor point since the size check on line 185 already handles `n <= 1`, and for large sample sizes the difference is negligible.

sdk/runanywhere-commons/include/rac/core/rac_benchmark_stats.h (1)
91-97: Document the allocation-failure error code for `rac_benchmark_stats_create`. The doc says it returns `RAC_SUCCESS or RAC_ERROR_NULL_POINTER`, but the implementation (line 215 of `rac_benchmark_stats.cpp`) also returns `RAC_ERROR_INITIALIZATION_FAILED` when `new (std::nothrow)` fails. As per coding guidelines, public C API headers should document error codes.

📝 Suggested doc update:

```diff
 /**
  * Creates a new benchmark stats collector.
  *
  * @param out_handle Output: collector handle
- * @return RAC_SUCCESS or RAC_ERROR_NULL_POINTER
+ * @return RAC_SUCCESS, RAC_ERROR_NULL_POINTER, or RAC_ERROR_INITIALIZATION_FAILED
  */
```
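The population-vs-sample stddev distinction flagged in the stats comments above can be made concrete with a short sketch (hypothetical helper, not the SDK's actual stats code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dividing the sum of squared deviations by n gives the population
// stddev; dividing by n-1 applies Bessel's correction (sample stddev),
// which compensates for small sample sizes in benchmark runs.
double stddev(const std::vector<double>& xs, bool sample) {
    const size_t n = xs.size();
    if (n <= 1) return 0.0;
    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= static_cast<double>(n);
    double ss = 0.0;
    for (double x : xs) ss += (x - mean) * (x - mean);
    return std::sqrt(ss / static_cast<double>(sample ? n - 1 : n));
}
```

For the classic example set {2, 4, 4, 4, 5, 5, 7, 9}, the population stddev is exactly 2.0, while the sample stddev is slightly larger (√(32/7) ≈ 2.138).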
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp — outdated comment (resolved)
```cpp
void rac_benchmark_set_metrics_provider(rac_benchmark_metrics_provider_fn provider,
                                        void* user_data) {
    if (provider == nullptr) {
        g_provider.store(nullptr, std::memory_order_release);
        return;
    }

    // Use double-buffering to avoid data races on the provider struct
    int idx = g_provider_index.load(std::memory_order_relaxed);
    int next = 1 - idx;
    g_provider_storage[next].fn = provider;
    g_provider_storage[next].user_data = user_data;
    g_provider.store(&g_provider_storage[next], std::memory_order_release);
    g_provider_index.store(next, std::memory_order_relaxed);
}
```
Concurrent calls to rac_benchmark_set_metrics_provider can race on the same storage slot.
If two threads call this function simultaneously and both read the same g_provider_index, they'll write to the same g_provider_storage[next] slot non-atomically (.fn and .user_data are plain fields). The reader could observe a torn MetricsProvider (mismatched fn/user_data).
Since the comment states registration is rare and happens during initialization, this is unlikely in practice but worth documenting or guarding with a simple mutex on the write path.
🔒 Suggested fix: add a mutex to the write path only
```diff
+#include <mutex>
+
 namespace {
 struct MetricsProvider {
     rac_benchmark_metrics_provider_fn fn = nullptr;
     void* user_data = nullptr;
 };
 std::atomic<MetricsProvider*> g_provider{nullptr};
 MetricsProvider g_provider_storage[2];
 std::atomic<int> g_provider_index{0};
+std::mutex g_provider_write_mutex;
 }  // namespace

 void rac_benchmark_set_metrics_provider(rac_benchmark_metrics_provider_fn provider,
                                         void* user_data) {
     if (provider == nullptr) {
         g_provider.store(nullptr, std::memory_order_release);
         return;
     }
+    std::lock_guard<std::mutex> lock(g_provider_write_mutex);
     int idx = g_provider_index.load(std::memory_order_relaxed);
     int next = 1 - idx;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
#include <mutex>

namespace {
struct MetricsProvider {
    rac_benchmark_metrics_provider_fn fn = nullptr;
    void* user_data = nullptr;
};
std::atomic<MetricsProvider*> g_provider{nullptr};
MetricsProvider g_provider_storage[2];
std::atomic<int> g_provider_index{0};
std::mutex g_provider_write_mutex;
}  // namespace

void rac_benchmark_set_metrics_provider(rac_benchmark_metrics_provider_fn provider,
                                        void* user_data) {
    if (provider == nullptr) {
        g_provider.store(nullptr, std::memory_order_release);
        return;
    }
    std::lock_guard<std::mutex> lock(g_provider_write_mutex);
    int idx = g_provider_index.load(std::memory_order_relaxed);
    int next = 1 - idx;
    g_provider_storage[next].fn = provider;
    g_provider_storage[next].user_data = user_data;
    g_provider.store(&g_provider_storage[next], std::memory_order_release);
    g_provider_index.store(next, std::memory_order_relaxed);
}
```
🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/core/rac_benchmark_metrics.cpp` around lines 46 -
60, The write path in rac_benchmark_set_metrics_provider can race when two
threads read the same g_provider_index and both write to
g_provider_storage[next], causing torn provider structs; guard the assignment to
g_provider_storage[next].fn and .user_data, the store to g_provider, and the
g_provider_index update with a short-lived mutex (e.g., a static std::mutex) so
only one thread updates the double-buffer slot at a time; keep reader side
lock-free and only serialize the rare registration path to prevent mismatched
fn/user_data observations.
```cpp
jclass callbackClass = env->GetObjectClass(tokenCallback);
jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "(Ljava/lang/String;)Z");
env->DeleteLocalRef(callbackClass);

if (!onTokenMethod) {
    LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method");
    return nullptr;
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's get the file and check the critical lines
if [ -f "sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp" ]; then
echo "=== Lines around 1025 (timing variant) ==="
sed -n '1020,1035p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
echo ""
echo "=== Lines around 911 (non-timing variant) ==="
sed -n '905,920p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
echo ""
echo "=== Lines around 718-726 (llm_stream_callback_token) ==="
sed -n '715,730p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
else
echo "File not found"
fi
🏁 Script executed:
#!/bin/bash
# Search for the callback interface definitions in Kotlin/Java
echo "=== Searching for onToken callback interface ==="
rg -n "onToken|fun onToken" --type=kt --type=java -B 2 -A 2
🏁 Script executed:
#!/bin/bash
# Check the context of llm_stream_callback_token to understand the shared context
echo "=== Looking at llm_stream_callback_token and callback context ==="
sed -n '690,750p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
🏁 Script executed:
#!/bin/bash
# Search for where callbacks are set up in both functions
echo "=== Timing variant setup (around function definition) ==="
sed -n '998,1050p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
echo ""
echo "=== Non-timing variant setup ==="
sed -n '884,920p' sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
🏁 Script executed:
#!/bin/bash
# Search for where the callbacks are passed to C API functions
echo "=== Searching for rac_llm_component_generate_stream calls ==="
rg -n "rac_llm_component_generate_stream" sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp -B 2 -A 2
🏁 Script executed:
#!/bin/bash
# Look for Java callback interface definitions
echo "=== Searching for callback interface or listener definitions ==="
fd "\.java" sdk/runanywhere-commons --exec grep -l "onToken" {} \;
fd "\.kt" sdk/runanywhere-commons --exec grep -l "onToken" {} \;
Critical: JNI method signature mismatch causes runtime failure.
Line 1025 registers onToken with signature (Ljava/lang/String;)Z (expects String), but both the timing and non-timing variants use the same callback handler, llm_stream_callback_token (lines 718–726), which creates a jbyteArray and passes it to CallBooleanMethod. This signature mismatch will cause a JNI method lookup failure or type error at runtime.
The non-timing variant at line 911 correctly uses ([B)Z (byte array). The timing variant must match:
jclass callbackClass = env->GetObjectClass(tokenCallback);
- jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "(Ljava/lang/String;)Z");
+ jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "([B)Z");
  env->DeleteLocalRef(callbackClass);

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
jclass callbackClass = env->GetObjectClass(tokenCallback);
jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "(Ljava/lang/String;)Z");
env->DeleteLocalRef(callbackClass);
if (!onTokenMethod) {
    LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method");
    return nullptr;
}

jclass callbackClass = env->GetObjectClass(tokenCallback);
jmethodID onTokenMethod = env->GetMethodID(callbackClass, "onToken", "([B)Z");
env->DeleteLocalRef(callbackClass);
if (!onTokenMethod) {
    LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method");
    return nullptr;
}
🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp` around lines
1024 - 1031, The JNI registration for onToken in
racLlmComponentGenerateStreamWithTiming is using the wrong signature
"(Ljava/lang/String;)Z" while the callback handler llm_stream_callback_token
constructs a jbyteArray and calls CallBooleanMethod; update the GetMethodID call
that assigns onTokenMethod (using tokenCallback) to use the byte-array signature
"([B)Z" so the method signature matches llm_stream_callback_token and is
consistent with the non-timing variant (the code paths around tokenCallback,
onTokenMethod, and llm_stream_callback_token must use the same "([B)Z"
signature).
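The two descriptor strings at issue follow the JVM type-descriptor grammar: `[B` is `byte[]`, `Z` is `boolean`, and `Ljava/lang/String;` is `java.lang.String`, so `([B)Z` matches a Kotlin `fun onToken(bytes: ByteArray): Boolean`. A tiny illustrative helper (not part of the SDK) makes the grammar concrete:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not SDK code): assembles a JNI method descriptor
// from parameter and return type codes, per the JVM type-descriptor grammar:
//   Z = boolean, [B = byte[], Ljava/lang/String; = java.lang.String
std::string jni_descriptor(const std::vector<std::string>& params,
                           const std::string& ret) {
    std::string d = "(";
    for (const auto& p : params) d += p;  // parameter descriptors, in order
    d += ")";
    d += ret;                             // return type descriptor
    return d;
}
```

GetMethodID matches on this exact string, which is why the String-vs-byte-array mismatch in the timing variant fails at lookup time.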
…ce condition, token counts

- Stop-sequence sliding window (llamacpp_backend.cpp): port the Utf8State/stop_window approach from generate_stream to the timing variant generate_stream_with_timing, matching the non-timing variant's behavior exactly.
- PRId32 format (rac_benchmark_log.cpp): change %d to PRId32 for all int32_t fields to match the PRId64 convention for int64_t fields.
- Mutex guard (rac_benchmark_metrics.cpp): add static std::mutex around the double-buffer write path in rac_benchmark_set_metrics_provider to prevent torn fn/user_data pairs. Reader side remains lock-free.
- Actual token counts (llamacpp_backend.cpp + llm_component.cpp): write tokens_generated to timing_out->output_tokens in the backend; read backend-populated prompt_tokens/output_tokens in the component layer instead of overwriting with estimate_tokens() heuristics.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp`:
- Around line 587-770: The two functions generate_stream_with_timing and
generate_stream are duplicated; refactor by merging them into a single
implementation (e.g., keep generate_stream and add an optional
rac_benchmark_timing_t* timing_out = nullptr parameter) and move all shared
logic (prompt building via build_prompt, tokenization via common_tokenize,
decode loop using llama_sampler_sample/accept, llama_decode, stop-sequence
handling, UTF-8 buffering, callbacks, llama_batch management, and
llama_memory_clear) into that unified function; guard the timing writes
(t2_prefill_start_ms, t3_prefill_end_ms, t5_last_token_ms, output_tokens) with
null checks as currently done in generate_stream_with_timing, update callers
(including generate_stream_with_timing wrapper or remove it) to pass timing_out
when needed, and remove the duplicate generate_stream_with_timing body so only
the single unified function remains.
- Around line 587-770: The method
LlamaCppTextGeneration::generate_stream_with_timing computes prompt_tokens but
never stores it into timing_out, so downstream benchmarking falls back to
estimates; fix by, after computing prompt_tokens (the variable set from
tokens_list.size()) and after checking timing_out != nullptr, assign
timing_out->prompt_tokens = prompt_tokens (ensure timing_out is non-null before
writing). Place this write early (where out_prompt_tokens is set) in
generate_stream_with_timing so the actual tokenizer count is preserved for later
checks.
In `@sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp`:
- Around line 75-85: The JSON numeric formatting in rac_benchmark_timing_to_json
is locale-dependent because it uses snprintf with "%.2f" (for ttft_ms,
prefill_ms, decode_ms_val, e2e_ms, tps), which can emit commas as decimal
separators on some locales; change this to a locale-independent formatter:
either wrap the formatting in a RAII locale guard that sets LC_NUMERIC to "C"
while formatting and restores the previous locale, or (preferably) replace
snprintf usage with a locale-independent API such as std::to_chars (C++17) or
platform-specific snprintf_l/uselocale calls to produce a dot decimal separator
for all locales; ensure the replacement formats values to two decimal places and
updates the same JSON string assembly that currently references ttft_ms,
prefill_ms, decode_ms_val, e2e_ms, and tps.
In `@sdk/runanywhere-commons/src/features/llm/llm_component.cpp`:
- Around line 890-900: The completion token fallback currently always calls
estimate_tokens when timing_out->output_tokens is not set by the backend; change
the fallback to prefer a locally-known token count before estimating: in the
branch that sets final_result.completion_tokens (checking timing_out and
timing_out->output_tokens), if timing_out->output_tokens is missing or zero,
assign final_result.completion_tokens from ctx.completion_tokens (or
ctx.prompt_tokens if ctx.completion_tokens is not available) and only call
estimate_tokens(ctx.full_text.c_str()) as the last resort; update the logic
around timing_out, final_result.completion_tokens, ctx.completion_tokens, and
estimate_tokens accordingly so LlamaCPP backend omissions don’t force
estimation.
🧹 Nitpick comments (2)
sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp (1)
149-153: Format specifiers for int32_t fields use %d — acceptable for logging but inconsistent with the CSV path.

The CSV serialization (line 120) correctly uses PRId32 for prompt_tokens, output_tokens, status, and error_code, but the log line here uses plain %d. This works in practice on all target platforms (where int32_t is int) but is technically non-portable.

sdk/runanywhere-commons/src/features/llm/llm_component.cpp (1)

721-962: Consider unifying with rac_llm_component_generate_stream to reduce duplication.

This ~240-line function is structurally identical to rac_llm_component_generate_stream (lines 525–719) with timing instrumentation added. The event-emission boilerplate and validation logic are repeated verbatim. A shared helper or a single implementation with an optional timing_out parameter would reduce the maintenance surface. Lower priority than the backend refactor since most of the duplicated code here is straightforward boilerplate.
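For reference, the PRId32/PRId64 macros flagged in the first nitpick come from <cinttypes>; they expand to the correct printf length modifier for the platform's int32_t/int64_t. A minimal self-contained sketch (format_counts is a hypothetical helper, not SDK code):

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdio>
#include <string>

// Portable formatting of fixed-width integers: PRId32/PRId64 expand to the
// right printf length modifier regardless of whether int32_t is int or
// int64_t is long vs. long long on the target platform.
std::string format_counts(int32_t prompt_tokens, int64_t t0_ms) {
    char buf[96];
    snprintf(buf, sizeof(buf), "prompt_tokens=%" PRId32 " t0=%" PRId64,
             prompt_tokens, t0_ms);
    return buf;
}
```

Plain %d happens to work for int32_t on today's mobile ABIs, but the macro form stays correct if the typedef ever differs from int.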
bool LlamaCppTextGeneration::generate_stream_with_timing(const TextGenerationRequest& request,
                                                         TextStreamCallback callback,
                                                         int* out_prompt_tokens,
                                                         rac_benchmark_timing_t* timing_out) {
    std::lock_guard<std::mutex> lock(mutex_);

    if (!is_ready()) {
        LOGE("Model not ready for generation");
        return false;
    }

    cancel_requested_.store(false);

    std::string prompt = build_prompt(request);
    LOGI("Generating with timing, prompt length: %zu", prompt.length());

    const auto tokens_list = common_tokenize(context_, prompt, true, true);

    int n_ctx = llama_n_ctx(context_);
    int prompt_tokens = static_cast<int>(tokens_list.size());

    if (out_prompt_tokens) {
        *out_prompt_tokens = prompt_tokens;
    }

    int available_tokens = n_ctx - prompt_tokens - 4;

    if (available_tokens <= 0) {
        LOGE("Prompt too long: %d tokens, context size: %d", prompt_tokens, n_ctx);
        return false;
    }

    int effective_max_tokens = std::min(request.max_tokens, available_tokens);
    if (effective_max_tokens < request.max_tokens) {
        LOGI("Capping max_tokens: %d → %d (context=%d, prompt=%d tokens)", request.max_tokens,
             effective_max_tokens, n_ctx, prompt_tokens);
    }
    LOGI("Generation with timing: prompt_tokens=%d, max_tokens=%d, context=%d", prompt_tokens,
         effective_max_tokens, n_ctx);

    llama_batch batch = llama_batch_init(n_ctx, 0, 1);

    batch.n_tokens = 0;
    for (size_t i = 0; i < tokens_list.size(); i++) {
        common_batch_add(batch, tokens_list[i], i, {0}, false);
    }
    batch.logits[batch.n_tokens - 1] = true;

    // t2: Record prefill start (before llama_decode for prompt)
    if (timing_out != nullptr) {
        timing_out->t2_prefill_start_ms = rac_monotonic_now_ms();
    }

    if (llama_decode(context_, batch) != 0) {
        LOGE("llama_decode failed for prompt");
        if (timing_out != nullptr) {
            int64_t now = rac_monotonic_now_ms();
            timing_out->t3_prefill_end_ms = now;
            timing_out->t5_last_token_ms = now;
        }
        llama_batch_free(batch);
        return false;
    }

    // t3: Record prefill end (after llama_decode returns)
    if (timing_out != nullptr) {
        timing_out->t3_prefill_end_ms = rac_monotonic_now_ms();
    }

    llama_sampler_reset(sampler_);

    const auto vocab = llama_model_get_vocab(model_);

    static const std::vector<std::string> STOP_SEQUENCES = {
        "<|im_end|>", "<|eot_id|>", "</s>", "<|end|>", "<|endoftext|>",
        "\n\nUser:", "\n\nHuman:",
    };

    static const size_t MAX_STOP_LEN = []{
        size_t m = 0;
        for (const auto& s : STOP_SEQUENCES) m = std::max(m, s.size());
        return m;
    }();

    std::string stop_window;
    stop_window.reserve(MAX_STOP_LEN * 2);

    std::string partial_utf8_buffer;
    partial_utf8_buffer.reserve(8);

    int n_cur = batch.n_tokens;
    int tokens_generated = 0;
    bool stop_sequence_hit = false;

    while (tokens_generated < effective_max_tokens && !cancel_requested_.load()) {
        const llama_token new_token_id = llama_sampler_sample(sampler_, context_, -1);

        llama_sampler_accept(sampler_, new_token_id);

        if (llama_vocab_is_eog(vocab, new_token_id)) {
            LOGI("End of generation token received");
            break;
        }

        const std::string new_token_chars =
            common_token_to_piece(context_, new_token_id);

        partial_utf8_buffer.append(new_token_chars);

        Utf8State scanner_state;
        size_t valid_upto = 0;
        for (size_t i = 0; i < partial_utf8_buffer.size(); ++i) {
            scanner_state.process(static_cast<uint8_t>(partial_utf8_buffer[i]));
            if (scanner_state.state == 0) {
                valid_upto = i + 1;
            }
        }

        if (valid_upto > 0) {
            std::string valid_chunk = partial_utf8_buffer.substr(0, valid_upto);
            stop_window.append(valid_chunk);
            partial_utf8_buffer.erase(0, valid_upto);

            size_t found_stop_pos = std::string::npos;
            for (const auto& stop_seq : STOP_SEQUENCES) {
                size_t pos = stop_window.find(stop_seq);
                if (pos != std::string::npos) {
                    if (found_stop_pos == std::string::npos || pos < found_stop_pos) {
                        found_stop_pos = pos;
                    }
                }
            }

            if (found_stop_pos != std::string::npos) {
                LOGI("Stop sequence detected");
                stop_sequence_hit = true;
                if (found_stop_pos > 0) {
                    if (!callback(stop_window.substr(0, found_stop_pos))) {
                        cancel_requested_.store(true);
                    }
                }
                break;
            }

            if (stop_window.size() > MAX_STOP_LEN) {
                size_t safe_len = stop_window.size() - MAX_STOP_LEN;
                if (!callback(stop_window.substr(0, safe_len))) {
                    LOGI("Generation cancelled by callback");
                    cancel_requested_.store(true);
                    break;
                }
                stop_window.erase(0, safe_len);
            }
        }

        batch.n_tokens = 0;
        common_batch_add(batch, new_token_id, n_cur, {0}, true);

        n_cur++;
        tokens_generated++;

        if (llama_decode(context_, batch) != 0) {
            LOGE("llama_decode failed during generation");
            break;
        }
    }

    // t5: Record last token time (decode loop exit)
    if (timing_out != nullptr) {
        timing_out->t5_last_token_ms = rac_monotonic_now_ms();
        timing_out->output_tokens = static_cast<int32_t>(tokens_generated);
    }

    if (!cancel_requested_.load() && !stop_sequence_hit && !stop_window.empty()) {
        callback(stop_window);
    }

    llama_memory_clear(llama_get_memory(context_), true);

    llama_batch_free(batch);

    LOGI("Generation with timing complete: %d tokens", tokens_generated);
    return !cancel_requested_.load();
}
🛠️ Refactor suggestion | 🟠 Major
~180 lines of near-identical code duplicated from generate_stream.
generate_stream_with_timing is a copy of generate_stream with only timing instrumentation added (t2/t3/t5 null-checked writes). This violates DRY and creates a maintenance burden — any bug fix or feature change to the decode loop, stop-sequence handling, UTF-8 validation, etc., must be applied in both places.
Consider refactoring generate_stream to accept an optional rac_benchmark_timing_t* parameter (defaulting to nullptr), and guarding the timing writes with null checks. The existing generate_stream can then delegate to the unified implementation.
Sketch
-bool LlamaCppTextGeneration::generate_stream(const TextGenerationRequest& request,
- TextStreamCallback callback,
- int* out_prompt_tokens) {
- // ... ~150 lines of logic ...
-}
-
-bool LlamaCppTextGeneration::generate_stream_with_timing(const TextGenerationRequest& request,
- TextStreamCallback callback,
- int* out_prompt_tokens,
- rac_benchmark_timing_t* timing_out) {
- // ... ~180 lines of nearly identical logic ...
-}
+bool LlamaCppTextGeneration::generate_stream(const TextGenerationRequest& request,
+ TextStreamCallback callback,
+ int* out_prompt_tokens) {
+ return generate_stream_with_timing(request, callback, out_prompt_tokens, nullptr);
+}
+
+bool LlamaCppTextGeneration::generate_stream_with_timing(const TextGenerationRequest& request,
+ TextStreamCallback callback,
+ int* out_prompt_tokens,
+ rac_benchmark_timing_t* timing_out) {
+ // Single unified implementation with null-guarded timing writes
+}

🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp` around
lines 587 - 770, The two functions generate_stream_with_timing and
generate_stream are duplicated; refactor by merging them into a single
implementation (e.g., keep generate_stream and add an optional
rac_benchmark_timing_t* timing_out = nullptr parameter) and move all shared
logic (prompt building via build_prompt, tokenization via common_tokenize,
decode loop using llama_sampler_sample/accept, llama_decode, stop-sequence
handling, UTF-8 buffering, callbacks, llama_batch management, and
llama_memory_clear) into that unified function; guard the timing writes
(t2_prefill_start_ms, t3_prefill_end_ms, t5_last_token_ms, output_tokens) with
null checks as currently done in generate_stream_with_timing, update callers
(including generate_stream_with_timing wrapper or remove it) to pass timing_out
when needed, and remove the duplicate generate_stream_with_timing body so only
the single unified function remains.
timing_out->prompt_tokens is never set — backend's actual token count is lost.
Line 606 computes the real prompt_tokens from the tokenizer, and lines 608–610 write it to *out_prompt_tokens. However, timing_out->prompt_tokens is never populated in this method. At the component layer (llm_component.cpp:890), the code checks timing_out->prompt_tokens > 0 and falls back to estimate_tokens() when it's zero — so the actual tokenizer count is silently discarded for benchmarking.
Proposed fix
if (out_prompt_tokens) {
*out_prompt_tokens = prompt_tokens;
}
+ if (timing_out != nullptr) {
+ timing_out->prompt_tokens = static_cast<int32_t>(prompt_tokens);
+ }

🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp` around
lines 587 - 770, The method LlamaCppTextGeneration::generate_stream_with_timing
computes prompt_tokens but never stores it into timing_out, so downstream
benchmarking falls back to estimates; fix by, after computing prompt_tokens (the
variable set from tokens_list.size()) and after checking timing_out != nullptr,
assign timing_out->prompt_tokens = prompt_tokens (ensure timing_out is non-null
before writing). Place this write early (where out_prompt_tokens is set) in
generate_stream_with_timing so the actual tokenizer count is preserved for later
checks.
char buf[64];
snprintf(buf, sizeof(buf), "%.2f", ttft_ms);
json += "\"ttft_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", prefill_ms);
json += "\"prefill_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", decode_ms_val);
json += "\"decode_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", e2e_ms);
json += "\"e2e_ms\":" + std::string(buf) + ",";
snprintf(buf, sizeof(buf), "%.2f", tps);
json += "\"decode_tps\":" + std::string(buf);
Locale-dependent snprintf %.2f can produce invalid JSON on non-English locales.
snprintf with %f is locale-sensitive in C/C++. On locales that use , as the decimal separator (e.g., de_DE), "ttft_ms":1,23 would be emitted — breaking JSON parsing. Since this targets cross-platform (iOS/Android), locale variation is realistic.
Consider either setting LC_NUMERIC to "C" before formatting (and restoring after), or using a locale-independent formatter.
Proposed fix: use a RAII locale guard
+#include <clocale>
+
+namespace {
+
+struct CLocaleGuard {
+ std::string saved;
+ CLocaleGuard() {
+ const char* prev = std::setlocale(LC_NUMERIC, nullptr);
+ if (prev) saved = prev;
+ std::setlocale(LC_NUMERIC, "C");
+ }
+ ~CLocaleGuard() {
+ std::setlocale(LC_NUMERIC, saved.c_str());
+ }
+};
+
+} // namespace

Then at the top of rac_benchmark_timing_to_json:
char* rac_benchmark_timing_to_json(const rac_benchmark_timing_t* timing) {
if (timing == nullptr) {
return nullptr;
}
+ CLocaleGuard locale_guard;

Note: setlocale is not thread-safe in C/C++. For a thread-safe alternative, consider snprintf_l (POSIX), uselocale (POSIX), or std::to_chars (C++17), which is locale-independent.
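Beyond the locale guard, the two-decimal formatting itself can be made locale-proof with plain integer math, which avoids both setlocale's thread-safety issue and the uneven toolchain support for floating-point std::to_chars. A sketch (format_fixed2 is a hypothetical helper, not the SDK's code), adequate for finite millisecond-scale values:

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// Locale-independent "%.2f" replacement: scales to hundredths and prints
// integer parts, so the decimal separator is always '.' regardless of
// LC_NUMERIC. Intended for finite, non-huge values such as timings.
std::string format_fixed2(double value) {
    bool neg = value < 0.0;
    long long scaled = static_cast<long long>(std::llround(std::fabs(value) * 100.0));
    char buf[64];
    snprintf(buf, sizeof(buf), "%s%lld.%02lld",
             neg ? "-" : "", scaled / 100, scaled % 100);
    return buf;
}
```

Because only %lld and string literals reach snprintf, no locale-sensitive conversion is involved, and the emitted JSON stays valid under any locale.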
🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/core/rac_benchmark_log.cpp` around lines 75 - 85,
The JSON numeric formatting in rac_benchmark_timing_to_json is locale-dependent
because it uses snprintf with "%.2f" (for ttft_ms, prefill_ms, decode_ms_val,
e2e_ms, tps), which can emit commas as decimal separators on some locales;
change this to a locale-independent formatter: either wrap the formatting in a
RAII locale guard that sets LC_NUMERIC to "C" while formatting and restores the
previous locale, or (preferably) replace snprintf usage with a
locale-independent API such as std::to_chars (C++17) or platform-specific
snprintf_l/uselocale calls to produce a dot decimal separator for all locales;
ensure the replacement formats values to two decimal places and updates the same
JSON string assembly that currently references ttft_ms, prefill_ms,
decode_ms_val, e2e_ms, and tps.
if (timing_out != nullptr && timing_out->prompt_tokens > 0) {
    final_result.prompt_tokens = timing_out->prompt_tokens;
} else {
    final_result.prompt_tokens = ctx.prompt_tokens;
}

if (timing_out != nullptr && timing_out->output_tokens > 0) {
    final_result.completion_tokens = timing_out->output_tokens;
} else {
    final_result.completion_tokens = estimate_tokens(ctx.full_text.c_str());
}
Fallback to estimate_tokens will always trigger until the backend bug is fixed.
This code correctly prefers backend-provided token counts from timing_out->prompt_tokens, but the LlamaCPP backend never sets timing_out->prompt_tokens (see review on llamacpp_backend.cpp). Until that's fixed, this will always use the estimate, undermining benchmark accuracy.
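The preference order the review suggests (backend-reported count, then a locally tracked count, then the heuristic) can be expressed as a small helper. Names here are illustrative stand-ins, not the SDK's actual types or fields:

```cpp
#include <cstdint>
#include <cstring>

// Illustrative stand-in for the timing struct (the real one is in rac_benchmark.h).
struct TimingStub {
    int32_t output_tokens = 0;  // 0 means "backend did not report"
};

// Rough heuristic mirroring the idea of estimate_tokens(): ~4 chars per token.
int32_t estimate_tokens_stub(const char* text) {
    return static_cast<int32_t>((std::strlen(text) + 3) / 4);
}

// Prefer the backend-reported count, then a locally tracked count,
// and fall back to the heuristic only as a last resort.
int32_t resolve_completion_tokens(const TimingStub* timing,
                                  int32_t local_count,
                                  const char* full_text) {
    if (timing != nullptr && timing->output_tokens > 0) return timing->output_tokens;
    if (local_count > 0) return local_count;
    return estimate_tokens_stub(full_text);
}
```

With this shape, a backend that omits output_tokens degrades to the component's own count instead of jumping straight to an estimate.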
🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/features/llm/llm_component.cpp` around lines 890
- 900, The completion token fallback currently always calls estimate_tokens when
timing_out->output_tokens is not set by the backend; change the fallback to
prefer a locally known token count before estimating: in the branch that sets
final_result.completion_tokens (checking timing_out and
timing_out->output_tokens), if timing_out->output_tokens is missing or zero,
assign final_result.completion_tokens from ctx.completion_tokens (or
ctx.prompt_tokens if ctx.completion_tokens is not available) and only call
estimate_tokens(ctx.full_text.c_str()) as the last resort; update the logic
around timing_out, final_result.completion_tokens, ctx.completion_tokens, and
estimate_tokens accordingly so LlamaCPP backend omissions don't force
estimation.
Hey @AmanSwar
3f2c8d2 to eaf9e08
Summary
Adds complete benchmark timing instrumentation to the RunAnywhere C++ Commons layer: 6 precise timestamps captured during LLM inference (t0/t2/t3/t4/t5/t6), extended device metrics interfaces, JSON/CSV logging, statistical analysis with percentiles, and 29 unit tests — all with zero overhead when benchmarking is disabled.
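The derived metrics reported elsewhere in this PR (TTFT, prefill, decode, E2E, TPS) fall out of those timestamps by simple subtraction. In the sketch below, t2/t3 (prefill start/end), t5 (last token), and t6 (request end) match the code in this PR; t0 = request start and t4 = first token are assumptions inferred from the field names, and the struct/function names are illustrative, not the SDK's API:

```cpp
#include <cstdint>

// Derived benchmark metrics, in milliseconds (decode_tps in tokens/second).
struct DerivedMs {
    double ttft, prefill, decode, e2e, decode_tps;
};

// Assumed semantics: t0 = request start, t2/t3 = prefill start/end,
// t4 = first token, t5 = last token, t6 = request end.
DerivedMs derive(int64_t t0, int64_t t2, int64_t t3, int64_t t4,
                 int64_t t5, int64_t t6, int32_t output_tokens) {
    DerivedMs d{};
    d.ttft = static_cast<double>(t4 - t0);     // time to first token
    d.prefill = static_cast<double>(t3 - t2);  // prompt processing
    d.decode = static_cast<double>(t5 - t4);   // token-generation window
    d.e2e = static_cast<double>(t6 - t0);      // end-to-end latency
    d.decode_tps = d.decode > 0.0
                       ? output_tokens * 1000.0 / d.decode  // tokens per second
                       : 0.0;
    return d;
}
```

This is also why accurate output_tokens from the backend matters: decode TPS is only as good as the token count feeding it.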
What Changed (This Commit)
Bug Fixes
- Added error_code field to rac_benchmark_timing_t — status holds RAC_BENCHMARK_STATUS_* (0-3), error_code holds the rac_result_t
- llm_component.cpp now sets t6_request_end_ms and error_code (previously only set status)
- rac_llm_llamacpp.cpp was passing nullptr for out_prompt_tokens — now captures the actual count from the backend
- DeleteLocalRef(callbackClass) after GetMethodID
- Added benchmark_error_code to JSON output
- Added #include <algorithm> in model_paths.cpp

New Features
Extended Metrics (interfaces only)

- rac_benchmark_metrics.h — rac_benchmark_extended_metrics_t struct (memory, CPU temp, battery, GPU, thermal state)

Automatic Logging

- rac_benchmark_timing_to_json() — Full timing + derived metrics (TTFT, prefill, decode, E2E, TPS)
- rac_benchmark_timing_to_csv() — CSV header/row serialization
- rac_benchmark_timing_log() — Log summary via RAC logging system

Statistical Analysis

- rac_benchmark_stats_create/destroy/record/reset/count — Opaque collector API
- rac_benchmark_stats_get_summary() — P50/P95/P99 percentiles, mean, stddev, min/max for TTFT, prefill, decode TPS, E2E
- rac_benchmark_stats_summary_to_json() — Export summary as JSON

Unit Tests — 29/29 Passing
Files Changed (25 files, +2504 lines)
New Files (11)
- include/rac/core/rac_benchmark_metrics.h
- include/rac/core/rac_benchmark_log.h
- include/rac/core/rac_benchmark_stats.h
- src/core/rac_benchmark_metrics.cpp
- src/core/rac_benchmark_log.cpp
- src/core/rac_benchmark_stats.cpp
- tests/CMakeLists.txt
- tests/benchmark/test_monotonic_clock.cpp
- tests/benchmark/test_timing_struct.cpp
- tests/benchmark/test_benchmark_log.cpp
- tests/benchmark/test_benchmark_stats.cpp

Modified Files (7)
- include/rac/core/rac_benchmark.h — error_code field
- src/features/llm/llm_component.cpp
- src/backends/llamacpp/rac_llm_llamacpp.cpp
- src/features/llm/rac_llm_service.cpp
- src/jni/runanywhere_commons_jni.cpp
- src/infrastructure/model_management/model_paths.cpp — #include <algorithm>
- CMakeLists.txt

Validation Checklist
Build & Test