Conversation
Fixes model loading issue due to huge buffer allocation.
📝 Walkthrough

Replaced heuristic GPU memory planning with explicit llama_params_fit-based fitting, supporting user-provided GPU layer overrides and context sizing. Enhanced error handling for memory fit outcomes (success/failure/error). Expanded LoRA adapter management with new methods for applying, removing, and querying adapters while maintaining proper context recreation and memory tracking.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ❌ 3 failed checks (2 warnings, 1 inconclusive)
@Siddhesh2377 can you pls test for android?
Just a sec
@coderabbitai please do a detailed review!
@greptile please do a detailed review!
✅ Actions performed: Full review triggered.
```cpp
#if !defined(GGML_USE_METAL) && !defined(GGML_USE_CUDA) && !defined(GGML_USE_WEBGPU)
    if (fit_status == LLAMA_PARAMS_FIT_STATUS_SUCCESS) {
        LOGI("CPU-only build: llama_params_fit fitted to GPU memory but no GPU backend active. "
             "Applying conservative CPU defaults.");
    }
    model_params.n_gpu_layers = 0;
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 4096) {
        ctx_params.n_ctx = 4096;
        LOGI("CPU-only: capping context to %u", ctx_params.n_ctx);
    }
#endif
```
**CPU-only block unconditionally overwrites user `gpu_layers` override**

The `user_gpu_layers` override is applied at lines 283–287, but the CPU-only preprocessor block immediately after (lines 292–302) unconditionally sets `model_params.n_gpu_layers = 0`, silently discarding the user's explicit config. The comment at line 210 states "If llama_params_fits aborts, use the user provided value," but this intent is not honored in CPU-only builds.
Consider moving the user override application to after the CPU-only block, or at minimum logging a warning that the user override was ignored:

```cpp
#if !defined(GGML_USE_METAL) && !defined(GGML_USE_CUDA) && !defined(GGML_USE_WEBGPU)
    if (fit_status == LLAMA_PARAMS_FIT_STATUS_SUCCESS) {
        LOGI("CPU-only build: llama_params_fit fitted to GPU memory but no GPU backend active. "
             "Applying conservative CPU defaults.");
    }
    model_params.n_gpu_layers = 0;
    if (user_gpu_layers >= 0) {
        LOGI("CPU-only build: ignoring user-provided gpu_layers=%d (no GPU backend active)", user_gpu_layers);
    }
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 4096) {
        ctx_params.n_ctx = 4096;
        LOGI("CPU-only: capping context to %u", ctx_params.n_ctx);
    }
#else
    // Apply user gpu_layers override after fit (GPU builds only)
    if (user_gpu_layers >= 0) {
        model_params.n_gpu_layers = user_gpu_layers;
        LOGI("Applying user GPU layers override: %d", user_gpu_layers);
    }
#endif
```

```cpp
    // Most 7B models have 32 layers, offload ~24 to GPU, rest to CPU
    gpu_layers = 24;
    LOGI("Large model detected, limiting GPU layers to %d to prevent OOM", gpu_layers);
    uint32_t n_ctx_min = 2048; // Configurable parameter
```
**`n_ctx_min` marked configurable but not wired to config**

The comment says `// Configurable parameter`, but unlike `fit_margin_mib` (which has a corresponding `config.contains("fit_margin_mib")` lookup), `n_ctx_min` is always hardcoded to `2048`. If the intent is to allow callers to tune the minimum context threshold, the config read is missing:

```cpp
uint32_t n_ctx_min = 2048; // Configurable parameter
if (config.contains("fit_n_ctx_min")) {
    n_ctx_min = config["fit_n_ctx_min"].get<uint32_t>();
}
```
```cpp
        case LLAMA_PARAMS_FIT_STATUS_FAILURE:
            LOGI("llama_params_fit FAILURE: could not fit model to device memory. "
                 "Proceeding with conservative CPU-only defaults.");
            model_params.n_gpu_layers = 0;
            if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 2048) {
                ctx_params.n_ctx = 2048;
            }
            break;
        case LLAMA_PARAMS_FIT_STATUS_ERROR:
            LOGE("llama_params_fit ERROR for model: %s. "
                 "Falling back to conservative CPU-only defaults.", model_path.c_str());
            model_params.n_gpu_layers = 0;
            if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 2048) {
                ctx_params.n_ctx = 2048;
            }
            break;
```
**User-provided context size silently overridden on fit failure**

If the user explicitly sets `context_size` in their config (e.g., `user_context_size = 3000`), it is pre-populated into `ctx_params.n_ctx` at line 225. However, when `llama_params_fit` returns `FAILURE` or `ERROR`, the fallback guard `ctx_params.n_ctx > 2048` will silently cap it to 2048 without any log message specific to the user's request being overridden. The general failure message doesn't make it obvious that the requested context was discarded.
Consider adding an explicit log when the user-supplied context is reduced:

```cpp
case LLAMA_PARAMS_FIT_STATUS_FAILURE:
    LOGI("llama_params_fit FAILURE: could not fit model to device memory. "
         "Proceeding with conservative CPU-only defaults.");
    model_params.n_gpu_layers = 0;
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 2048) {
        if (user_context_size > 2048) {
            LOGI("Ignoring user-requested context size %d due to fit failure; capping at 2048", user_context_size);
        }
        ctx_params.n_ctx = 2048;
    }
    break;
```
Caution: some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (3) — sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
**Lines 887–889: ⚠️ Potential issue | 🟡 Minor — potential null pointer dereference: `llama_get_memory` may return `nullptr`.**

The call to `llama_memory_clear` doesn't check if `llama_get_memory(context_)` returns `nullptr`. Elsewhere in this file (lines 561–564 and 745–747), the result is checked before use.

🛡️ Proposed fix

```diff
 // Clear KV cache after adapter changes
-llama_memory_clear(llama_get_memory(context_), true);
+if (llama_memory_t mem = llama_get_memory(context_)) {
+    llama_memory_clear(mem, true);
+}
```
**Lines 936–939: ⚠️ Potential issue | 🟡 Minor — same null pointer issue: guard the `llama_memory_clear` call.**

Although `context_` is verified non-null, `llama_get_memory` can still return `nullptr`.

🛡️ Proposed fix

```diff
 if (context_) {
     llama_clear_adapter_lora(context_);
-    llama_memory_clear(llama_get_memory(context_), true);
+    if (llama_memory_t mem = llama_get_memory(context_)) {
+        llama_memory_clear(mem, true);
+    }
 }
```
**Lines 922–924: ⚠️ Potential issue | 🟡 Minor — same null pointer issue: guard the `llama_memory_clear` call.**

🛡️ Proposed fix

```diff
 // Clear KV cache after adapter changes
-llama_memory_clear(llama_get_memory(context_), true);
+if (llama_memory_t mem = llama_get_memory(context_)) {
+    llama_memory_clear(mem, true);
+}
```
🧹 Nitpick comments (1) — sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp

**Lines 326–331: consider extracting magic number `2048` to a named constant.**

The value `2048` for `max_safe_batch` appears multiple times in this file (here and in `recreate_context` at line 800). Consider defining a class constant like `kDefaultMaxBatchSize` for maintainability.

♻️ Suggested refactor

Add a constant at class scope (in the header) or at file scope:

```cpp
static constexpr int kDefaultMaxBatchSize = 2048;
```

Then use it in both locations:

```diff
-int max_safe_batch = 2048; // Configurable parameter
+int max_safe_batch = kDefaultMaxBatchSize; // Configurable parameter
```
ℹ️ Review info

⚙️ Run configuration
- Configuration used: defaults
- Review profile: CHILL
- Plan: Pro
- Run ID: 88fc32f8-658a-4eaa-ac5d-7e9ccad8b1c2

📒 Files selected for processing (1)
- sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
@Siddhesh2377 can u do a quick sanity test for this for Android

Fixes model loading issue due to huge buffer allocation.

Switched to using llama_params_fit, which queries the user's memory to ensure the context and the layers offloaded actually fit.

Currently it doesn't work for CPU-only devices, but there is an open PR for that; once it is merged, we should fix this as well.

https://github.com/ggml-org/llama.cpp/pull/19711
Greptile Summary

This PR replaces a filename-heuristic-based GPU layer limiting strategy with `llama_params_fit`, which queries actual device memory to determine how many layers can safely be offloaded to GPU and what context size fits. The batch size is also capped at 2048 to avoid oversized buffer allocations. This is a meaningful improvement over the previous approach, but has a few issues worth addressing:

- **CPU-only builds discard the user `gpu_layers` override:** The user override (`user_gpu_layers`) is applied at lines 283–287, but the immediately following CPU-only preprocessor block (`#if !defined(GGML_USE_METAL) && ...`) unconditionally sets `model_params.n_gpu_layers = 0`, silently throwing away the user's explicit config value with no log warning.
- **`n_ctx_min` annotated as configurable but not wired to config:** The variable on line 244 has the comment `// Configurable parameter` but, unlike `fit_margin_mib`, there is no corresponding `config.contains(...)` lookup, making the comment misleading.
- **Silent override on fit failure:** When `llama_params_fit` returns `FAILURE` or `ERROR`, the user's explicitly configured `context_size` is capped without a specific log message calling it out, which can make debugging confusing.

Confidence Score: 3/5

The `llama_params_fit` integration is sound and the approach is a clear improvement. However, the CPU-only preprocessor block unconditionally overwrites a user override that was just applied, which is a real correctness bug for CPU-only deployments. The missing config wire-up for `n_ctx_min` and the silent context override on failure are lower-severity issues. The score reflects that GPU builds are likely fine but CPU-only builds have a confirmed logic issue. The area needing closest attention is the user `gpu_layers` override relative to the CPU-only `#if` block.

Important Files Changed

- sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp — replaces the heuristic with `llama_params_fit`, which queries actual device memory. The CPU-only preprocessor block unconditionally overwrites the user `gpu_layers` override applied just above it, and `n_ctx_min` is annotated as configurable but never read from the config object.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[load_model called] --> B[Read user_gpu_layers & user_context_size from config]
    B --> C[Init llama_context_params\nSet n_ctx = user_context_size if provided]
    C --> D[Call llama_params_fit]
    D --> E{fit_status?}
    E -->|SUCCESS| F[Use fitted n_gpu_layers & n_ctx]
    E -->|FAILURE| G[n_gpu_layers=0\nCap n_ctx to 2048]
    E -->|ERROR| H[n_gpu_layers=0\nCap n_ctx to 2048]
    F --> I{user_gpu_layers >= 0?}
    G --> I
    H --> I
    I -->|Yes| J[Override n_gpu_layers with user value]
    I -->|No| K[Keep fitted/fallback value]
    J --> L{CPU-only build?\nno Metal/CUDA/WebGPU}
    K --> L
    L -->|Yes| M[⚠️ Unconditionally set n_gpu_layers=0\nOverwrites user override!]
    L -->|No GPU build| N[Proceed with fitted params]
    M --> O[llama_model_load_from_file]
    N --> O
    O --> P{model loaded?}
    P -->|No| Q[Return false]
    P -->|Yes| R[Resolve final context_size_\nmin of fitted/train/cap]
    R --> S[Set safe batch size\nmin of context_size_ and 2048]
    S --> T[llama_init_from_model]
    T --> U{context created?}
    U -->|No| V[Free model, return false]
    U -->|Yes| W[Model loaded successfully]
```

Last reviewed commit: 8508673
Summary by CodeRabbit
Release Notes
New Features
Improvements