
Quick Model Loading fix#444

Draft
VyasGuru wants to merge 2 commits into RunanywhereAI:main from VyasGuru:Fix-big-model-issue

Conversation

@VyasGuru
Collaborator

@VyasGuru VyasGuru commented Mar 4, 2026

Fixes a model-loading failure caused by a huge buffer allocation.

Switched to llama_params_fit, which queries the device's available memory to ensure that the context size and the number of offloaded layers actually fit.

Currently this doesn't work for CPU-only devices, but there is an open llama.cpp PR for that; once it is merged, we should fix this case as well.
[https://github.com/ggml-org/llama.cpp/pull/19711]

Greptile Summary

This PR replaces a filename-heuristic-based GPU layer limiting strategy with llama_params_fit, which queries actual device memory to determine how many layers can safely be offloaded to GPU and what context size fits. The batch size is also capped at 2048 to avoid oversized buffer allocations. This is a meaningful improvement over the previous approach, but has a few issues worth addressing:

  • CPU-only build silently discards user gpu_layers override: The user override (user_gpu_layers) is applied at lines 283–287, but the immediately following CPU-only preprocessor block (#if !defined(GGML_USE_METAL) && ...) unconditionally sets model_params.n_gpu_layers = 0, silently throwing away the user's explicit config value with no log warning.
  • n_ctx_min annotated as configurable but not wired to config: The variable on line 244 has the comment // Configurable parameter but, unlike fit_margin_mib, there is no corresponding config.contains(...) lookup — making the comment misleading.
  • User-requested context size silently capped on fit failure: When llama_params_fit returns FAILURE or ERROR, the user's explicitly configured context_size is capped without a specific log message calling it out, which can make debugging confusing.
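The status-based fallback described above can be condensed into a standalone sketch. The names below (`FitStatus`, `EffectiveParams`, `resolve_fit`) are illustrative stand-ins, not identifiers from the PR; the real code works directly with llama.cpp's `LLAMA_PARAMS_FIT_STATUS_*` values and the `llama_model_params`/`llama_context_params` fields quoted in the review comments:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for llama.cpp's fit-status values (hypothetical names).
enum FitStatus { FIT_SUCCESS, FIT_FAILURE, FIT_ERROR };

struct EffectiveParams {
    int32_t  n_gpu_layers;  // layers offloaded to GPU
    uint32_t n_ctx;         // context size actually used
};

// Sketch of the fallback policy: keep the fitted values on SUCCESS; on
// FAILURE/ERROR, fall back to CPU-only and cap the context at 2048.
EffectiveParams resolve_fit(FitStatus status,
                            int32_t fitted_gpu_layers,
                            uint32_t requested_ctx) {
    if (status == FIT_SUCCESS) {
        return {fitted_gpu_layers, requested_ctx};
    }
    const uint32_t ctx =
        (requested_ctx == 0 || requested_ctx > 2048) ? 2048 : requested_ctx;
    return {0, ctx};  // CPU-only fallback, context capped
}
```

Note that a user-requested context above 2048 is silently reduced here, which is exactly the debugging hazard the third bullet points out.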

Confidence Score: 3/5

  • Safe to merge for most GPU-enabled targets, but contains a logic bug that silently ignores user config in CPU-only builds.
  • The core llama_params_fit integration is sound and the approach is a clear improvement. However, the CPU-only preprocessor block unconditionally overwrites a user override that was just applied, which is a real correctness bug for CPU-only deployments. The missing config wire-up for n_ctx_min and silent context override on failure are lower-severity issues. Score reflects that GPU builds are likely fine but CPU-only builds have a confirmed logic issue.
  • sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp — specifically the ordering of the user gpu_layers override relative to the CPU-only #if block.

Important Files Changed

Filename: sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Overview: Replaces fragile filename-heuristic GPU layer limiting with llama_params_fit, which queries actual device memory. The CPU-only preprocessor block unconditionally overwrites the user gpu_layers override applied just above it, and n_ctx_min is annotated as configurable but never read from the config object.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[load_model called] --> B[Read user_gpu_layers & user_context_size from config]
    B --> C[Init llama_context_params\nSet n_ctx = user_context_size if provided]
    C --> D[Call llama_params_fit]
    D --> E{fit_status?}
    E -->|SUCCESS| F[Use fitted n_gpu_layers & n_ctx]
    E -->|FAILURE| G[n_gpu_layers=0\nCap n_ctx to 2048]
    E -->|ERROR| H[n_gpu_layers=0\nCap n_ctx to 2048]
    F --> I{user_gpu_layers >= 0?}
    G --> I
    H --> I
    I -->|Yes| J[Override n_gpu_layers with user value]
    I -->|No| K[Keep fitted/fallback value]
    J --> L{CPU-only build?\nno Metal/CUDA/WebGPU}
    K --> L
    L -->|Yes| M[⚠️ Unconditionally set n_gpu_layers=0\nOverwrites user override!]
    L -->|No GPU build| N[Proceed with fitted params]
    M --> O[llama_model_load_from_file]
    N --> O
    O --> P{model loaded?}
    P -->|No| Q[Return false]
    P -->|Yes| R[Resolve final context_size_\nmin of fitted/train/cap]
    R --> S[Set safe batch size\nmin of context_size_ and 2048]
    S --> T[llama_init_from_model]
    T --> U{context created?}
    U -->|No| V[Free model, return false]
    U -->|Yes| W[Model loaded successfully]
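The "safe batch size" step in the flowchart amounts to a one-line clamp. A minimal sketch (the function name is illustrative; per the summary, the PR applies the result to both n_batch and n_ubatch):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Clamp the batch size to the fitted context, capped at 2048, so a large
// context cannot trigger an oversized batch buffer allocation.
uint32_t safe_batch_size(uint32_t context_size) {
    const uint32_t max_safe_batch = 2048;  // cap noted in the review summary
    return std::min(context_size, max_safe_batch);
}
```

For example, a fitted context of 8192 yields a batch of 2048, while a 1024-token context keeps a batch of 1024.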

Last reviewed commit: 8508673

Greptile also left 3 inline comments on this PR.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configurable GPU layer override support
    • Enhanced LoRA adapter management with detailed per-adapter metadata tracking
  • Improvements

    • Replaced heuristic-based memory planning with explicit fitting algorithm for better reliability
    • Improved context sizing and batch calculation for model generation
    • Enhanced logging for improved troubleshooting and observability

Fixes model loading issue due to huge buffer allocation.
@coderabbitai

coderabbitai bot commented Mar 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a308ee9b-79d0-494a-b5b7-9f51500d84a8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Replaced heuristic GPU memory planning with explicit llama_params_fit-based fitting, supporting user-provided GPU layer overrides and context sizing. Enhanced error handling for memory fit outcomes (success/failure/error). Expanded LoRA adapter management with new methods for applying, removing, and querying adapters while maintaining proper context recreation and memory tracking.

Changes

Cohort / File(s) Summary
Memory Fitting & GPU Configuration
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Replaced heuristic-based memory checks with explicit llama_params_fit flow; added early llama_context_params initialization, memory fitting scaffolding (n_devices, tensor_split, margins), and status-based handling (SUCCESS/FAILURE/ERROR) with fallback to CPU-only defaults and context capping.
Context & Batch Sizing
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Refactored context setup during model loading and recreate_context paths; compute final context size using ctx_params and model_train_ctx; introduced safe_batch_size calculation applied to both n_batch and n_ubatch for consistent token handling.
LoRA Adapter Management
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Expanded LoRA handling with new methods: apply_lora_adapters, remove_lora_adapter, clear_lora_adapters, get_lora_info; added per-adapter metadata (path, scale, applied status), context recreation after adapter loading, and memory cache clearing.
Logging & Observability
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp
Reworked logging throughout memory fitting decisions, fit results, final context sizing, model loading, and per-request parameter handling to reflect fitted outcomes rather than heuristic guesses.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 Hopping through memory, fitting it tight,
GPU layers aligned, context just right,
LoRA adapters now dance in the flow,
Better batch sizing—watch the tokens go!
No more guessing games, just math precise—
Our llamacpp backend now runs like a bunny's paradise! 🐇✨

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description addresses the core issue and rationale but is largely missing the required template sections including Type of Change, Testing, Platform-Specific Testing, Labels, and Checklist items. Complete the PR description by filling in all required template sections: specify 'Bug fix' as the change type, document testing steps across platforms, add appropriate labels (Commons), and check off the provided checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Quick Model Loading fix' is vague and does not clearly convey the main change; it lacks specificity about the core improvement (llama_params_fit integration) and buffer allocation fix. Consider a more specific title such as 'Replace heuristic GPU memory planning with llama_params_fit' to better communicate the primary technical change.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@VyasGuru
Collaborator Author

VyasGuru commented Mar 4, 2026

@Siddhesh2377 can you pls test for android?

@Siddhesh2377
Collaborator

@Siddhesh2377 can you pls test for android?

Just a sec

@Siddhesh2377
Collaborator

[image] Working!

@shubhammalhotra28
Contributor

@coderabbitai please do a detailed review!

@shubhammalhotra28
Contributor

@greptile please do a detailed review!

@coderabbitai

coderabbitai bot commented Mar 5, 2026

@shubhammalhotra28 Sure! I'll kick off a detailed review of this PR right away. I'll analyze the model loading fix, the switch to llama_params_fit, memory handling implications, and the known CPU-only device limitation.

✅ Actions performed

Full review triggered.

Comment on lines +292 to +302
#if !defined(GGML_USE_METAL) && !defined(GGML_USE_CUDA) && !defined(GGML_USE_WEBGPU)
    if (fit_status == LLAMA_PARAMS_FIT_STATUS_SUCCESS) {
        LOGI("CPU-only build: llama_params_fit fitted to GPU memory but no GPU backend active. "
             "Applying conservative CPU defaults.");
    }
    model_params.n_gpu_layers = 0;
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 4096) {
        ctx_params.n_ctx = 4096;
        LOGI("CPU-only: capping context to %u", ctx_params.n_ctx);
    }
#endif
Contributor


CPU-only block unconditionally overwrites user gpu_layers override

The user_gpu_layers override is applied at lines 283–287, but the CPU-only preprocessor block immediately after (lines 292–302) unconditionally sets model_params.n_gpu_layers = 0, silently discarding the user's explicit config. The comment at line 210 states "If llama_params_fits aborts, use the user provided value," but this intent is not honored in CPU-only builds.

Consider moving the user override application to after the CPU-only block, or at minimum logging a warning that the user override was ignored:

#if !defined(GGML_USE_METAL) && !defined(GGML_USE_CUDA) && !defined(GGML_USE_WEBGPU)
    if (fit_status == LLAMA_PARAMS_FIT_STATUS_SUCCESS) {
        LOGI("CPU-only build: llama_params_fit fitted to GPU memory but no GPU backend active. "
             "Applying conservative CPU defaults.");
    }
    model_params.n_gpu_layers = 0;
    if (user_gpu_layers >= 0) {
        LOGI("CPU-only build: ignoring user-provided gpu_layers=%d (no GPU backend active)", user_gpu_layers);
    }
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 4096) {
        ctx_params.n_ctx = 4096;
        LOGI("CPU-only: capping context to %u", ctx_params.n_ctx);
    }
#else
    // Apply user gpu_layers override after fit (GPU builds only)
    if (user_gpu_layers >= 0) {
        model_params.n_gpu_layers = user_gpu_layers;
        LOGI("Applying user GPU layers override: %d", user_gpu_layers);
    }
#endif

// Most 7B models have 32 layers, offload ~24 to GPU, rest to CPU
gpu_layers = 24;
LOGI("Large model detected, limiting GPU layers to %d to prevent OOM", gpu_layers);
uint32_t n_ctx_min = 2048; // Configurable parameter
Contributor


n_ctx_min marked configurable but not wired to config

The comment says // Configurable parameter, but unlike fit_margin_mib (which has a corresponding config.contains("fit_margin_mib") lookup), n_ctx_min is always hardcoded to 2048. If the intent is to allow callers to tune the minimum context threshold, the config read is missing:

Suggested change
    uint32_t n_ctx_min = 2048;  // Configurable parameter
    if (config.contains("fit_n_ctx_min")) {
        n_ctx_min = config["fit_n_ctx_min"].get<uint32_t>();
    }

Comment on lines +265 to +280
case LLAMA_PARAMS_FIT_STATUS_FAILURE:
    LOGI("llama_params_fit FAILURE: could not fit model to device memory. "
         "Proceeding with conservative CPU-only defaults.");
    model_params.n_gpu_layers = 0;
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 2048) {
        ctx_params.n_ctx = 2048;
    }
    break;
case LLAMA_PARAMS_FIT_STATUS_ERROR:
    LOGE("llama_params_fit ERROR for model: %s. "
         "Falling back to conservative CPU-only defaults.", model_path.c_str());
    model_params.n_gpu_layers = 0;
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 2048) {
        ctx_params.n_ctx = 2048;
    }
    break;
Contributor


User-provided context size silently overridden on fit failure

If the user explicitly sets context_size in their config (e.g., user_context_size = 3000), it is pre-populated into ctx_params.n_ctx at line 225. However, when llama_params_fit returns FAILURE or ERROR, the fallback guard ctx_params.n_ctx > 2048 will silently cap it to 2048 without any log message specific to the user's request being overridden. The general failure message doesn't make it obvious that the requested context was discarded.

Consider adding an explicit log when the user-supplied context is reduced:

case LLAMA_PARAMS_FIT_STATUS_FAILURE:
    LOGI("llama_params_fit FAILURE: could not fit model to device memory. "
         "Proceeding with conservative CPU-only defaults.");
    model_params.n_gpu_layers = 0;
    if (ctx_params.n_ctx == 0 || ctx_params.n_ctx > 2048) {
        if (user_context_size > 2048) {
            LOGI("Ignoring user-requested context size %d due to fit failure; capping at 2048", user_context_size);
        }
        ctx_params.n_ctx = 2048;
    }
    break;


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (3)

887-889: ⚠️ Potential issue | 🟡 Minor

Potential null pointer dereference: llama_get_memory may return nullptr.

The call to llama_memory_clear doesn't check if llama_get_memory(context_) returns nullptr. Elsewhere in this file (lines 561-564 and 745-747), the result is checked before use.

🛡️ Proposed fix
     // Clear KV cache after adapter changes
-    llama_memory_clear(llama_get_memory(context_), true);
+    if (llama_memory_t mem = llama_get_memory(context_)) {
+        llama_memory_clear(mem, true);
+    }

936-939: ⚠️ Potential issue | 🟡 Minor

Same null pointer issue: guard llama_memory_clear call.

Although context_ is verified non-null, llama_get_memory can still return nullptr.

🛡️ Proposed fix
     if (context_) {
         llama_clear_adapter_lora(context_);
-        llama_memory_clear(llama_get_memory(context_), true);
+        if (llama_memory_t mem = llama_get_memory(context_)) {
+            llama_memory_clear(mem, true);
+        }
     }

922-924: ⚠️ Potential issue | 🟡 Minor

Same null pointer issue: guard llama_memory_clear call.

🛡️ Proposed fix
     // Clear KV cache after adapter changes
-    llama_memory_clear(llama_get_memory(context_), true);
+    if (llama_memory_t mem = llama_get_memory(context_)) {
+        llama_memory_clear(mem, true);
+    }
🧹 Nitpick comments (1)
sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp (1)

326-331: Consider extracting magic number 2048 to a named constant.

The value 2048 for max_safe_batch appears multiple times in this file (here and in recreate_context at line 800). Consider defining a class constant like kDefaultMaxBatchSize for maintainability.

♻️ Suggested refactor

Add a constant at class scope (in the header) or at file scope:

static constexpr int kDefaultMaxBatchSize = 2048;

Then use it in both locations:

-    int max_safe_batch = 2048; // Configurable parameter
+    int max_safe_batch = kDefaultMaxBatchSize; // Configurable parameter

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88fc32f8-658a-4eaa-ac5d-7e9ccad8b1c2

📥 Commits

Reviewing files that changed from the base of the PR and between 7ed42a7 and 8508673.

📒 Files selected for processing (1)
  • sdk/runanywhere-commons/src/backends/llamacpp/llamacpp_backend.cpp

@shubhammalhotra28
Contributor

@Siddhesh2377 can u do a quick sanity test for this for Android



3 participants