
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17504

Updated server-local-image-loading feature branch with latest master branch changes

This is the same changeset as ggml-org/llama.cpp#16874, updated to the latest changes in master; the branches had diverged too far, leaving the prior pull request stale.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp
PR #324: Server support for local image path loading
Comparison: Target version 34caec7c vs Baseline aab9b31c


Analysis Overview

This PR adds local file loading capability for multimodal models via file:// URLs with security controls. The implementation introduces two new command-line arguments (--allowed-local-media-path, --local-media-max-size-mb) and file validation logic in the server request processing path.


Key Findings

Performance-Critical Areas Impact

Argument Parsing Module (common/arg.cpp):
The observed performance regressions occur in lambda operators within common_params_parser_init, specifically in regex scanning operations. The most affected functions show response-time increases ranging from 4,159 µs to 72,070 µs, with per-call throughput deltas between 67 ns and 424 ns.

This PR adds two new argument parsing lambdas that perform filesystem operations (std::filesystem::canonical, std::filesystem::is_directory) rather than regex operations. These filesystem calls execute once during server initialization and are not on the critical inference path.
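
As a rough illustration, the one-time check such a lambda performs might look like the sketch below. This is not the PR's actual code — the function name and error handling are invented — but the std::filesystem::canonical / std::filesystem::is_directory sequence matches what the analysis describes:

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the one-time validation performed while parsing
// --allowed-local-media-path. Runs once at server startup, so its cost is
// off the inference critical path.
static std::filesystem::path resolve_allowed_media_root(const std::string & value) {
    std::error_code ec;
    // canonical() resolves symlinks and ".." segments and fails if the
    // path does not exist; this is what makes later prefix checks sound.
    const std::filesystem::path root = std::filesystem::canonical(value, ec);
    if (ec) {
        throw std::invalid_argument("allowed-local-media-path cannot be resolved: " + value);
    }
    if (!std::filesystem::is_directory(root)) {
        throw std::invalid_argument("allowed-local-media-path is not a directory: " + value);
    }
    return root;
}
```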

Inference and Tokenization Functions:
No changes were made to core inference functions (llama_decode, llama_encode, llama_tokenize, llama_model_load_from_file). The multimodal processing integration uses existing mtmd_default_marker() patterns without modifying token processing logic. Request-time file loading occurs during prompt preparation, before inference begins.
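
For context on where that request-time cost lands, a minimal sketch of the load step is shown below, assuming a plain scheme-strip-and-read approach; the function name and error strategy are hypothetical, not the PR's implementation:

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of request-time loading: strip the file:// scheme,
// then read the image bytes once, before tokenization and decode begin.
// Because this happens during prompt preparation, it adds latency to
// request setup (milliseconds per image) but not to the decode loop.
// NOTE: the real server validates the path against the allowed root
// (see the security checks later in this summary) before reading.
static std::vector<unsigned char> load_local_media(const std::string & url) {
    static const std::string scheme = "file://";
    if (url.rfind(scheme, 0) != 0) {
        throw std::runtime_error("not a file:// URL: " + url);
    }
    const std::string path = url.substr(scheme.size());
    std::ifstream f(path, std::ios::binary);
    if (!f) {
        throw std::runtime_error("failed to open local media: " + path);
    }
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(f),
                                      std::istreambuf_iterator<char>());
}
```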

Tokens Per Second Impact

No measurable impact on inference throughput. The affected functions are in argument parsing and request preprocessing, not in the token generation loop. File loading operations (2-12 milliseconds per image) occur during request initialization, outside the inference cycle. Core tokenization and decode functions remain unchanged.

Power Consumption Analysis

Power consumption changes across binaries are minimal:

  • build.bin.llama-cvector-generator: 516 nanojoules decrease (0.185% reduction)
  • build.bin.llama-tts: 204 nanojoules decrease (0.072% reduction)
  • build.bin.llama-quantize: 16 nanojoules increase (0.040% increase)
  • build.bin.llama-bench: 2 nanojoules increase (0.004% increase)

All changes are below 0.2%, indicating no meaningful power consumption impact from this PR.

Code Changes Context

The implementation adds security-focused file access controls: path canonicalization prevents directory traversal, prefix matching enforces directory whitelisting, file type validation restricts to regular files, and size limits prevent resource exhaustion. These are one-time validation operations during request processing, not repeated during inference.
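
A minimal sketch of those four checks, assuming the allowed root was canonicalized once at startup (names, signature, and error reporting are hypothetical):

```cpp
#include <cstdint>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical sketch of the per-request security checks described above.
// allowed_root is assumed to be an already-canonicalized directory.
static bool is_local_media_allowed(const fs::path & allowed_root,
                                   const std::string & requested,
                                   std::uintmax_t max_size_bytes,
                                   std::string & err) {
    std::error_code ec;

    // 1. Canonicalization: resolve symlinks and "..", defeating traversal.
    const fs::path canonical = fs::canonical(requested, ec);
    if (ec) { err = "path does not exist or cannot be resolved"; return false; }

    // 2. Prefix match (directory whitelisting): the canonical path must
    //    sit under the allowed root.
    const std::string rel = canonical.lexically_relative(allowed_root).generic_string();
    if (rel.empty() || rel == ".." || rel.rfind("../", 0) == 0) {
        err = "path is outside the allowed media directory"; return false;
    }

    // 3. File-type validation: only regular files, no devices or sockets.
    if (!fs::is_regular_file(canonical, ec) || ec) {
        err = "not a regular file"; return false;
    }

    // 4. Size limit: reject oversized files to prevent resource exhaustion.
    const std::uintmax_t size = fs::file_size(canonical, ec);
    if (ec || size > max_size_bytes) {
        err = "file exceeds the configured size limit"; return false;
    }
    return true;
}
```

In practice the --local-media-max-size-mb value would presumably be converted to bytes (value * 1024 * 1024) before a check like step 4.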

The extreme percentage increases observed in argument parsing functions (up to 2,876,615%) reflect changes in regex pattern complexity across the broader codebase, not the filesystem operations introduced by this PR. The absolute time increases (in the microsecond range) occur during initialization, which runs once per server start, or during file validation, which runs once per request.

@loci-dev force-pushed the main branch 7 times, most recently from 92ef8cd to 7dd50b8 on November 26, 2025 at 16:10.