LLM CPP Benchmark common metrics.#334
LLM CPP Benchmark common metrics.#334abhisekupadhyaya wants to merge 2 commits intoRunanywhereAI:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds benchmarking support and timing-aware streaming across RunAnywhere Commons: monotonic timing utilities, timing-enabled APIs at component/service/backend layers, LlamaCpp backend timing captures, and a JNI bridge exposing timing metrics for streamed LLM generations. Changes
Sequence DiagramsequenceDiagram
participant Java as Java/JNI Caller
participant JNI as JNI Bridge
participant Component as LLM Component
participant Service as LLM Service
participant Backend as LlamaCpp Backend
participant Llama as LlamaCpp Engine
Java->>JNI: racLlmComponentGenerateStreamWithTiming(...)
JNI->>Component: rac_llm_component_generate_stream_with_timing(...)
activate Component
Note over Component: t0 = rac_monotonic_now_ms()
Component->>Service: rac_llm_generate_stream_with_timing(...)
activate Service
Service->>Backend: generate_stream_with_timing(...)
activate Backend
Note over Backend: t2 = prefill start
Backend->>Llama: Prefill decode
Note over Backend: t3 = prefill end
Backend->>Llama: Token generation loop
Note over Backend: t4 = first token (streamed)
Backend->>Service: stream_callback(token)
Service->>Component: stream_callback(token)
Component->>JNI: token received
Note over Backend: t5 = last token
Backend->>Llama: Complete generation
deactivate Backend
Note over Service: return with timing_out filled
deactivate Service
Note over Component: t6 = request end
Component->>JNI: Return with timing metrics
deactivate Component
JNI->>Java: JSON result (text, timing, stats)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Important
Looks good to me! 👍
Reviewed everything up to afe1f80 in 40 seconds. Click for details.
- Reviewed
1098lines of code in13files - Skipped
0files when reviewing. - Skipped posting
0draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
Workflow ID: wflow_Xrf7GiYp3EfaLgpe
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
| if (timing_out != nullptr) { | ||
| rac_benchmark_timing_init(timing_out); | ||
| // Record t0 (request start) - first thing after validation | ||
| timing_out->t0_request_start_ms = rac_monotonic_now_ms(); | ||
| } |
There was a problem hiding this comment.
Incorrect status encoding
rac_benchmark_timing_t::status is documented as “non-zero = error code (from rac_result_t)” (include/rac/core/rac_benchmark.h:69-74), but this function writes RAC_BENCHMARK_STATUS_* enums (e.g. RAC_BENCHMARK_STATUS_ERROR) instead of the actual rac_result_t value. This makes status ambiguous/inconsistent for consumers that expect to read back the underlying error code from the timing struct.
Prompt To Fix With AI
This is a comment left during a code review.
Path: sdk/runanywhere-commons/src/features/llm/llm_component.cpp
Line: 736:740
Comment:
**Incorrect status encoding**
`rac_benchmark_timing_t::status` is documented as “non-zero = error code (from rac_result_t)” (`include/rac/core/rac_benchmark.h:69-74`), but this function writes `RAC_BENCHMARK_STATUS_*` enums (e.g. `RAC_BENCHMARK_STATUS_ERROR`) instead of the actual `rac_result_t` value. This makes status ambiguous/inconsistent for consumers that expect to read back the underlying error code from the timing struct.
How can I resolve this? If you propose a fix, please make it concise.| // t2: Record prefill start (before llama_decode for prompt) | ||
| if (timing_out != nullptr) { | ||
| timing_out->t2_prefill_start_ms = rac_monotonic_now_ms(); | ||
| } | ||
|
|
||
| if (llama_decode(context_, batch) != 0) { | ||
| LOGE("llama_decode failed for prompt"); | ||
| llama_batch_free(batch); | ||
| return false; |
There was a problem hiding this comment.
Timing left partially filled
In generate_stream_with_timing, t2_prefill_start_ms is written before the prompt llama_decode, but on the llama_decode failure path you return false without setting t3_prefill_end_ms / t5_last_token_ms. Callers that assume “timing_out non-NULL implies a complete timing record” will observe a partially-filled struct and may compute negative/incorrect durations.
Prompt To Fix With AI
This is a comment left during a code review.
Path: sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Line: 608:616
Comment:
**Timing left partially filled**
In `generate_stream_with_timing`, `t2_prefill_start_ms` is written before the prompt `llama_decode`, but on the `llama_decode` failure path you `return false` without setting `t3_prefill_end_ms` / `t5_last_token_ms`. Callers that assume “timing_out non-NULL implies a complete timing record” will observe a partially-filled struct and may compute negative/incorrect durations.
How can I resolve this? If you propose a fix, please make it concise.| // Create global ref to callback to ensure it survives across threads | ||
| jobject globalCallback = env->NewGlobalRef(tokenCallback); | ||
|
|
There was a problem hiding this comment.
GlobalRef leak on error
globalCallback = env->NewGlobalRef(tokenCallback) is created before validating subsequent JNI lookups/operations, but there are early returns (e.g. !onTokenMethod) that skip DeleteGlobalRef. That will leak a global reference on those error paths, which is especially painful for repeated calls from Kotlin.
Prompt To Fix With AI
This is a comment left during a code review.
Path: sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
Line: 1023:1025
Comment:
**GlobalRef leak on error**
`globalCallback = env->NewGlobalRef(tokenCallback)` is created before validating subsequent JNI lookups/operations, but there are early returns (e.g. `!onTokenMethod`) that skip `DeleteGlobalRef`. That will leak a global reference on those error paths, which is especially painful for repeated calls from Kotlin.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
I searched for NewGlobalRef, DeleteGlobalRef, and onTokenMethod in this file and found zero matches. This file (rac_backend_llamacpp_jni.cpp) contains only simple JNI wrappers that call the C API directly. It has no streaming callback logic and no GlobalRef handling whatsoever. This issue does not exist in this file. The reporter likely confused it with runanywhere_commons_jni.cpp.
Additional Comments (1)
Prompt To Fix With AIThis is a comment left during a code review.
Path: sdk/runanywhere-commons/src/jni/runanywhere_commons_jni.cpp
Line: 717:721
Comment:
**Missing local-ref cleanup**
`llm_stream_callback_token` creates a `jstring` with `NewStringUTF` each token but does not `DeleteLocalRef` it. In a long generation this will exhaust the local reference table and crash the JVM. (The other streaming path in this file does delete local refs, so this is an inconsistency.)
How can I resolve this? If you propose a fix, please make it concise. |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@sdk/runanywhere-commons/include/rac/backends/rac_llm_llamacpp.h`:
- Around line 167-186: Document that the timing_out parameter of
rac_llm_llamacpp_generate_stream_with_timing is an optional caller-provided
pointer (may be NULL) that the caller allocates and owns for the duration of the
call, and clarify whether the function will zero/initialize all fields of
rac_benchmark_timing_t on entry and on all return paths (including errors) or
requires the caller to zero it beforehand; update the function comment to state
the ownership, lifetime (must remain valid until function returns), nullability,
and initialization guarantees (e.g. "if non-NULL, the function fully initializes
timing_out before returning; callers need not pre-zero").
In `@sdk/runanywhere-commons/include/rac/core/rac_benchmark.h`:
- Around line 69-93: The comment above the field "int32_t status" in
rac_benchmark_timing_t is inconsistent with the defined macros; change the
documentation to explicitly state that status contains one of the
RAC_BENCHMARK_STATUS_* values (RAC_BENCHMARK_STATUS_SUCCESS, ERROR, TIMEOUT,
CANCELLED) rather than a rac_result_t, and confirm the field type remains
int32_t; alternatively, if you prefer to keep rac_result_t semantics, replace
the benchmark macros with values mapped to rac_result_t and change the field
type to rac_result_t (or typedef) and update all references—pick one scheme and
make the doc and type/macros consistent (refer to the field name "status", the
type "rac_benchmark_timing_t", the enum-like macros "RAC_BENCHMARK_STATUS_*",
and the existing type "rac_result_t").
In `@sdk/runanywhere-commons/include/rac/features/llm/rac_llm_component.h`:
- Around line 200-228: Document that
rac_llm_component_generate_stream_with_timing requires the caller to allocate
and own a valid rac_benchmark_timing_t pointed to by timing_out (caller must
keep it alive until the complete_callback or error_callback has been invoked),
and that the function will fully initialize the timing struct's client-side
fields (t0 set at API entry, t4 set at first-token callback, t6 set before
complete callback) while backend-only fields (t2, t3, t5) may be filled by the
backend if supported; specify that passing NULL disables timing (zero overhead),
that the function returns RAC_SUCCESS or the documented error codes, and update
the declaration comment for rac_llm_component_generate_stream_with_timing and
any related vtable documentation to reflect these ownership, initialization and
lifetime guarantees.
In `@sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp`:
- Around line 257-265: The timing struct rac_benchmark_timing_t is left with
zero token counts because the wrapper call to
h->text_gen->generate_stream_with_timing passes nullptr for out_prompt_tokens
and the backend never sets prompt/output token counts; update the wrapper around
generate_stream_with_timing to provide a local uint64_t prompt_tokens and
out_tokens (or similar) pointers instead of nullptr, pass those into
generate_stream_with_timing, then after the call populate
timing_out->prompt_tokens and timing_out->output_tokens from those local
counters; update any backend implementation of generate_stream_with_timing to
increment/return those counters so the wrapper can copy them into timing_out
(reference symbols: generate_stream_with_timing, timing_out, out_prompt_tokens,
rac_benchmark_timing_t).
In `@sdk/runanywhere-commons/src/features/llm/rac_llm_service.cpp`:
- Around line 125-149: The fallback path in rac_llm_generate_stream_with_timing
can leave timing_out->t2/t3/t5 with stale values; before calling the non-timing
generate_stream, check if timing_out is non-null and explicitly set
timing_out->t2 = timing_out->t3 = timing_out->t5 = 0 (preserving t0/t4/t6 that
callers may have set) so the header guarantee holds, then call
service->ops->generate_stream(...).
| // Stream using C++ class with timing | ||
| bool success = h->text_gen->generate_stream_with_timing( | ||
| request, | ||
| [callback, user_data](const std::string& token) -> bool { | ||
| return callback(token.c_str(), RAC_FALSE, user_data) == RAC_TRUE; | ||
| }, | ||
| nullptr, // out_prompt_tokens not needed, timing is captured internally | ||
| timing_out // Pass timing struct to backend | ||
| ); |
There was a problem hiding this comment.
Populate timing token counts for the timing-enabled C API.
rac_benchmark_timing_t includes prompt_tokens and output_tokens, but this wrapper passes nullptr for out_prompt_tokens and the backend path doesn’t set counts, so direct C API callers get zeros. Please propagate the counts so the timing struct is complete.
✅ Suggested fix (wrapper + backend)
--- a/sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp
+++ b/sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp
@@
- bool success = h->text_gen->generate_stream_with_timing(
+ int prompt_tokens = 0;
+ bool success = h->text_gen->generate_stream_with_timing(
request,
[callback, user_data](const std::string& token) -> bool {
return callback(token.c_str(), RAC_FALSE, user_data) == RAC_TRUE;
},
- nullptr, // out_prompt_tokens not needed, timing is captured internally
+ &prompt_tokens, // capture prompt token count
timing_out // Pass timing struct to backend
);
+ if (timing_out != nullptr) {
+ timing_out->prompt_tokens = prompt_tokens;
+ }--- a/sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
+++ b/sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
@@
// t5: Record last token time (decode loop exit)
if (timing_out != nullptr) {
timing_out->t5_last_token_ms = rac_monotonic_now_ms();
+ timing_out->prompt_tokens = prompt_tokens;
+ timing_out->output_tokens = tokens_generated;
}🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/backends/llamacpp/rac_llm_llamacpp.cpp` around
lines 257 - 265, The timing struct rac_benchmark_timing_t is left with zero
token counts because the wrapper call to
h->text_gen->generate_stream_with_timing passes nullptr for out_prompt_tokens
and the backend never sets prompt/output token counts; update the wrapper around
generate_stream_with_timing to provide a local uint64_t prompt_tokens and
out_tokens (or similar) pointers instead of nullptr, pass those into
generate_stream_with_timing, then after the call populate
timing_out->prompt_tokens and timing_out->output_tokens from those local
counters; update any backend implementation of generate_stream_with_timing to
increment/return those counters so the wrapper can copy them into timing_out
(reference symbols: generate_stream_with_timing, timing_out, out_prompt_tokens,
rac_benchmark_timing_t).
| rac_result_t rac_llm_generate_stream_with_timing(rac_handle_t handle, const char* prompt, | ||
| const rac_llm_options_t* options, | ||
| rac_llm_stream_callback_fn callback, | ||
| void* user_data, | ||
| rac_benchmark_timing_t* timing_out) { | ||
| if (!handle || !prompt || !callback) | ||
| return RAC_ERROR_NULL_POINTER; | ||
|
|
||
| auto* service = static_cast<rac_llm_service_t*>(handle); | ||
| if (!service->ops) { | ||
| return RAC_ERROR_NOT_SUPPORTED; | ||
| } | ||
|
|
||
| // If backend implements timing-aware streaming, use it | ||
| if (service->ops->generate_stream_with_timing) { | ||
| return service->ops->generate_stream_with_timing(service->impl, prompt, options, callback, | ||
| user_data, timing_out); | ||
| } | ||
|
|
||
| // Fallback to regular streaming (timing_out won't have t2/t3/t5) | ||
| if (service->ops->generate_stream) { | ||
| return service->ops->generate_stream(service->impl, prompt, options, callback, user_data); | ||
| } | ||
|
|
||
| return RAC_ERROR_NOT_SUPPORTED; |
There was a problem hiding this comment.
Initialize timing fields on the non-timing fallback path.
When the backend lacks generate_stream_with_timing, timing_out->t2/t3/t5 can retain stale data from prior calls, which contradicts the header guarantee that these are zeroed on fallback.
✅ Suggested fix (preserves t0/t4/t6 set by callers)
// Fallback to regular streaming (timing_out won't have t2/t3/t5)
if (service->ops->generate_stream) {
+ if (timing_out != nullptr) {
+ timing_out->t2_prefill_start_ms = 0;
+ timing_out->t3_prefill_end_ms = 0;
+ timing_out->t5_last_token_ms = 0;
+ }
return service->ops->generate_stream(service->impl, prompt, options, callback, user_data);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rac_result_t rac_llm_generate_stream_with_timing(rac_handle_t handle, const char* prompt, | |
| const rac_llm_options_t* options, | |
| rac_llm_stream_callback_fn callback, | |
| void* user_data, | |
| rac_benchmark_timing_t* timing_out) { | |
| if (!handle || !prompt || !callback) | |
| return RAC_ERROR_NULL_POINTER; | |
| auto* service = static_cast<rac_llm_service_t*>(handle); | |
| if (!service->ops) { | |
| return RAC_ERROR_NOT_SUPPORTED; | |
| } | |
| // If backend implements timing-aware streaming, use it | |
| if (service->ops->generate_stream_with_timing) { | |
| return service->ops->generate_stream_with_timing(service->impl, prompt, options, callback, | |
| user_data, timing_out); | |
| } | |
| // Fallback to regular streaming (timing_out won't have t2/t3/t5) | |
| if (service->ops->generate_stream) { | |
| return service->ops->generate_stream(service->impl, prompt, options, callback, user_data); | |
| } | |
| return RAC_ERROR_NOT_SUPPORTED; | |
| rac_result_t rac_llm_generate_stream_with_timing(rac_handle_t handle, const char* prompt, | |
| const rac_llm_options_t* options, | |
| rac_llm_stream_callback_fn callback, | |
| void* user_data, | |
| rac_benchmark_timing_t* timing_out) { | |
| if (!handle || !prompt || !callback) | |
| return RAC_ERROR_NULL_POINTER; | |
| auto* service = static_cast<rac_llm_service_t*>(handle); | |
| if (!service->ops) { | |
| return RAC_ERROR_NOT_SUPPORTED; | |
| } | |
| // If backend implements timing-aware streaming, use it | |
| if (service->ops->generate_stream_with_timing) { | |
| return service->ops->generate_stream_with_timing(service->impl, prompt, options, callback, | |
| user_data, timing_out); | |
| } | |
| // Fallback to regular streaming (timing_out won't have t2/t3/t5) | |
| if (service->ops->generate_stream) { | |
| if (timing_out != nullptr) { | |
| timing_out->t2_prefill_start_ms = 0; | |
| timing_out->t3_prefill_end_ms = 0; | |
| timing_out->t5_last_token_ms = 0; | |
| } | |
| return service->ops->generate_stream(service->impl, prompt, options, callback, user_data); | |
| } | |
| return RAC_ERROR_NOT_SUPPORTED; | |
| } |
🤖 Prompt for AI Agents
In `@sdk/runanywhere-commons/src/features/llm/rac_llm_service.cpp` around lines
125 - 149, The fallback path in rac_llm_generate_stream_with_timing can leave
timing_out->t2/t3/t5 with stale values; before calling the non-timing
generate_stream, check if timing_out is non-null and explicitly set
timing_out->t2 = timing_out->t3 = timing_out->t5 = 0 (preserving t0/t4/t6 that
callers may have set) so the header guarantee holds, then call
service->ops->generate_stream(...).
There was a problem hiding this comment.
Pull request overview
This PR implements comprehensive benchmark timing infrastructure for LLM inference in the RunAnywhere C++ Commons layer. It introduces a monotonic clock-based timing system that captures 6 key timestamps (t0, t2, t3, t4, t5, t6) throughout the inference pipeline, enabling detailed performance analysis with zero overhead when timing is disabled.
Changes:
- New benchmark timing infrastructure with monotonic clock using
std::chrono::steady_clock - Component-level timestamps (t0, t4, t6) captured for all backends
- Backend-level timestamps (t2, t3, t5) captured for LlamaCPP backend
- JNI bindings for Android/Kotlin integration with timing data exposed via JSON
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
include/rac/core/rac_benchmark.h |
New header defining benchmark timing struct and monotonic clock API |
src/core/rac_benchmark.cpp |
Implementation of monotonic clock using steady_clock with process-local epoch |
include/rac/features/llm/rac_llm_component.h |
Added generate_stream_with_timing API for component layer |
src/features/llm/llm_component.cpp |
Implements t0, t4, t6 timestamp capture at component boundaries |
include/rac/features/llm/rac_llm_service.h |
Extended service vtable with timing-aware method pointer |
src/features/llm/rac_llm_service.cpp |
Routes timing calls to backend with fallback to regular streaming |
include/rac/backends/rac_llm_llamacpp.h |
C API for LlamaCPP with timing support |
src/backends/llamacpp/rac_llm_llamacpp.cpp |
C API implementation bridging to C++ backend |
src/backends/llamacpp/llamacpp_backend.h |
Extended TextGeneration interface with timing method |
src/backends/llamacpp/llamacpp_backend.cpp |
Implements t2, t3, t5 timestamp capture around llama_decode calls |
src/backends/llamacpp/rac_backend_llamacpp_register.cpp |
Registers timing-aware vtable entry for LlamaCPP backend |
src/jni/runanywhere_commons_jni.cpp |
JNI method for Android exposing timing data as JSON |
CMakeLists.txt |
Added benchmark source file to build configuration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Java_com_runanywhere_sdk_native_bridge_RunAnywhereBridge_racLlmComponentGenerateStreamWithTiming( | ||
| JNIEnv* env, jclass clazz, jlong handle, jstring prompt, jstring configJson, | ||
| jobject tokenCallback) { |
There was a problem hiding this comment.
The JNI method signature doesn't match the PR description. According to the description, the method should have parameters including complete_callback, error_callback, and timing_callback, but the actual implementation only includes tokenCallback. This discrepancy means the method provides less functionality than documented, as there's no way to explicitly handle completion or errors through dedicated callbacks, and there's no separate timing callback mechanism as suggested in the description.
| if (llama_decode(context_, batch) != 0) { | ||
| LOGE("llama_decode failed for prompt"); | ||
| llama_batch_free(batch); | ||
| return false; |
There was a problem hiding this comment.
The batch resource is not freed on the early return path when llama_decode fails. If llama_decode returns non-zero at line 613, the function returns false at line 616 without calling llama_batch_free, which could cause a memory leak. The batch should be freed before returning.
| bool success = h->text_gen->generate_stream_with_timing( | ||
| request, | ||
| [callback, user_data](const std::string& token) -> bool { | ||
| return callback(token.c_str(), RAC_FALSE, user_data) == RAC_TRUE; | ||
| }, | ||
| nullptr, // out_prompt_tokens not needed, timing is captured internally | ||
| timing_out // Pass timing struct to backend |
There was a problem hiding this comment.
The comment at line 263 states "out_prompt_tokens not needed, timing is captured internally", but the method is being passed nullptr for this parameter. However, looking at the backend implementation, the out_prompt_tokens parameter is actually used - it's checked for NULL before being assigned at lines 581-583. If this parameter is intended to be used, passing nullptr here means prompt token information is lost and cannot be propagated back to the caller through this interface, though it is captured in the timing struct.
| bool success = h->text_gen->generate_stream_with_timing( | |
| request, | |
| [callback, user_data](const std::string& token) -> bool { | |
| return callback(token.c_str(), RAC_FALSE, user_data) == RAC_TRUE; | |
| }, | |
| nullptr, // out_prompt_tokens not needed, timing is captured internally | |
| timing_out // Pass timing struct to backend | |
| int32_t prompt_tokens = 0; | |
| bool success = h->text_gen->generate_stream_with_timing( | |
| request, | |
| [callback, user_data](const std::string& token) -> bool { | |
| return callback(token.c_str(), RAC_FALSE, user_data) == RAC_TRUE; | |
| }, | |
| &prompt_tokens, // capture prompt token count from backend | |
| timing_out // Pass timing struct to backend |
| LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method"); | ||
| return nullptr; | ||
| } | ||
|
|
There was a problem hiding this comment.
There's a missing call to env->DeleteLocalRef(callbackClass) after getting the method ID. The jclass reference obtained from GetObjectClass at line 1015 should be released to prevent a local reference leak. This is particularly important in JNI methods that may be called frequently, as local references accumulate until the method returns.
| LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method"); | |
| return nullptr; | |
| } | |
| LOGe("racLlmComponentGenerateStreamWithTiming: could not find onToken method"); | |
| env->DeleteLocalRef(callbackClass); | |
| return nullptr; | |
| } | |
| // callbackClass no longer needed; delete local reference to avoid leaks | |
| env->DeleteLocalRef(callbackClass); |
|
Seems like still WIP ? @abhisekupadhyaya ? Since I don't see any changes to run the benchmarks ? |
|
|
Any updates @abhisekupadhyaya ? |
|
Closing thsi for now, since it's covered in the fllowing opened PR : #343 |
C++ Commons: Benchmark Timing Infrastructure
Summary
This PR implements comprehensive benchmark timing instrumentation in the RunAnywhere C++ Commons layer. It adds the ability to capture precise timestamps at key points during LLM inference, enabling detailed performance analysis and optimization.
Key Feature: Zero-overhead opt-in timing via optional pointer parameter - when not benchmarking, there is no performance impact.
What's Implemented
1. Monotonic Time Infrastructure
New Files:
include/rac/core/rac_benchmark.h- Benchmark timing types and APIssrc/core/rac_benchmark.cpp- Monotonic clock implementationCore API:
Implementation Details:
std::chrono::steady_clockfor monotonic timing2. LLM Component Layer - t0, t4, t6 Capture
Modified Files:
include/rac/features/llm/rac_llm_component.hsrc/features/llm/llm_component.cppNew API:
Timestamp Capture Points:
Implementation:
llm_stream_contextstruct withtiming_outpointerfirst_token_recordedflag3. LLM Service Layer - Timing Propagation
Modified Files:
include/rac/features/llm/rac_llm_service.hsrc/features/llm/rac_llm_service.cppChanges:
rac_llm_service_ops_tvtable withgenerate_stream_with_timingentryrac_llm_generate_stream_with_timing()service functiontiming_outpointer through to backend implementationPurpose: Routes timing-aware calls from component to backend without modifying the timing data.
4. LlamaCPP Backend - t2, t3, t5 Capture
Modified Files:
src/backends/llamacpp/llamacpp_backend.hsrc/backends/llamacpp/llamacpp_backend.cppsrc/backends/llamacpp/rac_backend_llamacpp_register.cppNew Method:
Timestamp Capture Points:
llama_decode()for prompt (prefill start)llama_decode()returns (prefill end)Implementation Details:
llama_decode()call for prompt KV cache fill5. C API Layer for LlamaCPP
Modified Files:
include/rac/backends/rac_llm_llamacpp.hsrc/backends/llamacpp/rac_llm_llamacpp.cppNew C API:
Purpose: Provides C-compatible entry point for LlamaCPP backend with timing support.
6. JNI Bindings for Android/Kotlin
Modified Files:
src/jni/runanywhere_commons_jni.cppNew JNI Method:
JNIEXPORT jobject JNICALL Java_com_runanywhere_sdk_native_bridge_RunAnywhereBridge_racLlmComponentGenerateStreamWithTiming( JNIEnv* env, jobject thiz, jlong handle, jstring prompt, jlong options_handle, jobject token_callback, jobject complete_callback, jobject error_callback, jobject timing_callback );Features:
rac_benchmark_timing_tstruct on C++ sideArchitecture
Timestamp Capture Flow
sequenceDiagram participant App as Application participant Comp as LLM Component participant Svc as LLM Service participant BE as LlamaCPP Backend App->>Comp: generate_stream_with_timing(timing_out) Note over Comp: Capture t0 = now() Comp->>Svc: generate_stream_with_timing(timing_out) Svc->>BE: generate_stream_with_timing(timing_out) Note over BE: Capture t2 = now() BE->>BE: llama_decode(prompt) Note over BE: Capture t3 = now() BE->>Comp: token_callback(token) Note over Comp: First token?<br/>Capture t4 = now() Comp->>App: token_callback(token) BE->>BE: Decode loop... BE->>Comp: More token_callback() Comp->>App: More token_callback() Note over BE: Loop exits<br/>Capture t5 = now() BE-->>Svc: return result Svc-->>Comp: return result Note over Comp: Capture t6 = now()<br/>Fill prompt_tokens, output_tokens Comp->>App: complete_callback() Note over App: Read filled timing_out structLayer Responsibilities
graph TB subgraph Component [LLM Component Layer] T0[t0: API Entry] T4[t4: First Token] T6[t6: Request End] end subgraph Service [LLM Service Layer] Route[Route to Backend] end subgraph Backend [LlamaCPP Backend] T2[t2: Prefill Start] T3[t3: Prefill End] T5[t5: Last Token] end T0 --> Route Route --> T2 T2 --> Prefill[llama_decode] Prefill --> T3 T3 --> DecodeLoop[Token Generation Loop] DecodeLoop --> T4 DecodeLoop --> T5 T5 --> Route Route --> T6 style T0 fill:#e1f5ff style T4 fill:#e1f5ff style T6 fill:#e1f5ff style T2 fill:#fff4e1 style T3 fill:#fff4e1 style T5 fill:#fff4e1Design Rationale: Why Timestamps Are Captured Where They Are
llama_decode)llama_decodereturns)Key Insight: Each timestamp is captured at the only layer that can observe that moment in the call stack.
Implementation Details
Zero-Overhead Design
When
timing_out == NULL:Benchmark impact when timing disabled: <1ns per call (within measurement noise).
Monotonic Clock Selection
Why
std::chrono::steady_clock?Alternative considered:
std::chrono::system_clockProcess-Local Epoch
Instead of Unix epoch (1970), uses first call time as epoch:
Thread Safety
All timing functions are thread-safe without locks:
rac_monotonic_now_ms()- reads fromsteady_clock(atomic on all platforms)MonotonicEpoch- static initialization is thread-safe in C++11+timing_outstructTesting
Unit Test Coverage
The implementation should be tested with:
Monotonic clock tests
Timing capture tests
Integration tests
Manual Testing
Expected timing struct after successful run:
Performance Impact
Timing Collection Overhead
rac_monotonic_now_ms()steady_clock::now()callMemory Overhead
rac_benchmark_timing_tstructWhen Timing is Disabled
API Usage Examples
C API
Without Timing (Zero Overhead)
Metrics Derivation
From the 6 captured timestamps, these metrics can be derived:
Files Changed Summary
New Files (2)
include/rac/core/rac_benchmark.hsrc/core/rac_benchmark.cppModified Files (11)
CMakeLists.txtinclude/rac/backends/rac_llm_llamacpp.hinclude/rac/features/llm/rac_llm_component.hinclude/rac/features/llm/rac_llm_service.hsrc/backends/llamacpp/llamacpp_backend.hsrc/backends/llamacpp/llamacpp_backend.cppsrc/backends/llamacpp/rac_backend_llamacpp_register.cppsrc/backends/llamacpp/rac_llm_llamacpp.cppsrc/features/llm/llm_component.cppsrc/features/llm/rac_llm_service.cppsrc/jni/runanywhere_commons_jni.cppTotal: 13 files changed, 888 lines added, 0 lines deleted
Backend Support
Current Support
Note: t0, t4, t6 are captured at the component layer and work for all backends. Backend-specific timestamps (t2, t3, t5) are currently only implemented for LlamaCPP.
Adding Timing to Other Backends
To add t2/t3/t5 support to a backend:
generate_stream_with_timing()methodBreaking Changes
None. This PR is fully backward compatible:
generate_stream()API unchangedgenerate_stream_with_timing()is additional, not replacementTo Do
Potential Enhancements
Extended Metrics
Additional Backends
Automatic Logging
Statistical Analysis
To be commited
This PR provides the C++ foundation for benchmark timing. Future PRs will add:
Design Principles
Validation Checklist
Before merging, verify:
Summary
This PR implements a production-ready benchmark timing infrastructure for RunAnywhere Commons. It captures 6 key timestamps during LLM inference with zero overhead when not benchmarking.
Key Features:
std::chrono::steady_clockImpact:
Important
This PR adds a zero-overhead, opt-in benchmark timing infrastructure for LLM inference in the C++ Commons layer, capturing key timestamps during inference and integrating with JNI for Android support.
rac_benchmark_timing_tinrac_benchmark.h.t0(request start),t2(prefill start),t3(prefill end),t4(first token),t5(last token),t6(request end).llm_component.cppandllamacpp_backend.cpp.rac_llm_component_generate_stream_with_timing()inrac_llm_component.hfor streaming with timing.generate_stream_with_timing()method inLlamaCppTextGenerationinllamacpp_backend.cpp.rac_llm_llamacpp_generate_stream_with_timing()inrac_llm_llamacpp.cppfor C API timing support.racLlmComponentGenerateStreamWithTiming()inrunanywhere_commons_jni.cppfor Android integration.std::chrono::steady_clockinrac_benchmark.cpp.CMakeLists.txtto includerac_benchmark.cppin the build.This description was created by
for afe1f80. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
Greptile Overview
Greptile Summary
This PR adds a new benchmark timing facility (
rac_benchmark_timing_t+rac_monotonic_now_ms) and threads an optionaltiming_outpointer through the LLM component → service → backend stack. LlamaCPP implements backend-side capture (t2/t3/t5) while the component records request/first-token/end timestamps (t0/t4/t6). JNI adds an Android entry point that returns timing fields in a JSON payload.Key integration points are the new
generate_stream_with_timingvtable slot inrac_llm_service_ops_tand the fallback behavior inrac_llm_generate_stream_with_timing()when a backend doesn’t implement it.Issues to address before merge:
timing.statusis documented as arac_result_t-style error code, but current writers storeRAC_BENCHMARK_STATUS_*values.Confidence Score: 3/5
rac_benchmark_timing_t::status. Backend timing is also left partially populated on some failure paths, which can mislead metric derivations.Important Files Changed
src/core/rac_benchmark.cppto build; no functional risk beyond compilation linkage.rac_llm_llamacpp_generate_stream_with_timing; appears consistent with existing streaming API.statusconflict with current writers (component uses RAC_BENCHMARK_STATUS_*).rac_llm_component_generate_stream_with_timingAPI; header looks consistent with existing component streaming surface.rac_llm_generate_stream_with_timing; clean fallback behavior when backend doesn’t implement timing.generate_stream_with_timingdeclaration to LlamaCppTextGeneration; signature matches implementation.rac_llm_llamacpp_generate_stream_with_timingby calling C++ backend timing method; no obvious ABI issues.rac_monotonic_now_mswith steady_clock and a process-local epoch + struct init helper; straightforward.rac_llm_generate_stream_with_timingwith fallback to non-timing streaming; OK.Sequence Diagram
sequenceDiagram participant App as App/JNI caller participant JNI as runanywhere_commons_jni participant Comp as LLM Component participant Svc as LLM Service participant BE as LlamaCPP Backend App->>JNI: racLlmComponentGenerateStreamWithTiming() JNI->>Comp: rac_llm_component_generate_stream_with_timing(timing_out*) Note over Comp: t0 = rac_monotonic_now_ms() if timing_out != NULL Comp->>Svc: rac_llm_generate_stream_with_timing(timing_out*) alt backend supports timing Svc->>BE: generate_stream_with_timing(timing_out*) Note over BE: t2 before prompt decode BE->>BE: llama_decode(prompt) Note over BE: t3 after prompt decode loop token generation BE-->>Comp: stream callback(token) Note over Comp: first token => t4 now Comp-->>JNI: token callback end Note over BE: t5 at loop exit BE-->>Svc: return else fallback Svc->>BE: generate_stream() end Note over Comp: t6 before complete callback Comp-->>JNI: complete callback JNI-->>App: JSON with t0..t6(2/5) Greptile learns from your feedback when you react with thumbs up/down!
Context used:
dashboard- CLAUDE.md (source)