Conversation

@DajanaV commented Nov 1, 2025

Mirrored from ggml-org/llama.cpp#16923

🧩 Summary

This PR adds a CI workflow for end-to-end embedding tests.
It is the first phase of an effort to move an abstraction of the existing examples/llama-embedding logic behind llama-server, so the server can produce embeddings with llama.cpp's own implementation instead of relying on external (OpenAI) APIs.

🎯 Motivation & Future

llama-server currently accepts OpenAI-compatible /embedding requests, but those are not backed by the native llama.cpp embedding logic in examples/llama-embedding.
This workflow establishes a reproducible test foundation before the embedding code is refactored, so that:

  • The server can generate embeddings locally.
  • --parallel N can support multiple concurrent embedding requests.
  • The standalone CLI will remain for lightweight workflows, while the server will use the same shared embedding path for persistent deployments.

⚙️ CI Implementation

  • Adds a GitHub Actions job to run embedding E2E tests with cached GGUF models (TinyLlama).
  • Verifies embedding output dimensions and deterministic behavior (see the sketch after this list).
  • Uses lightweight models for fast CI runs (with an optional large model test).
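
For illustration, a minimal sketch of the dimension and determinism checks mentioned above, assuming the test harness already has two embedding vectors from separate runs of the same prompt; the function and variable names here are hypothetical, not part of the workflow:

```cpp
// Minimal sketch of the dimension/determinism check (hypothetical names).
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

static bool check_embedding(const std::vector<float> & a,
                            const std::vector<float> & b,
                            std::size_t expected_dim,
                            float tol = 1e-6f) {
    if (a.size() != expected_dim || b.size() != expected_dim) {
        std::fprintf(stderr, "dimension mismatch: %zu / %zu (expected %zu)\n",
                     a.size(), b.size(), expected_dim);
        return false;
    }
    for (std::size_t i = 0; i < expected_dim; ++i) {
        if (std::fabs(a[i] - b[i]) > tol) {  // determinism: repeated runs must agree
            std::fprintf(stderr, "element %zu differs: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}
```

In the workflow, a check of this kind would typically be driven by running the embedding binary twice on the same input and comparing the two outputs.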

🧱 Embedding CPP Logic Flow Update

A small cleanup in print_raw_embeddings() improves readability, logic flow, and isolation.
Although minor, this change is modular alongside the CI workflow changes: it touches a vertical slice of the embedding flow without altering evaluation, model logic, or any interface. (Expecting changes to stay purely horizontal and small tends to ossify software and make it brittle.) A sketch of what such an isolated helper can look like follows.
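
Purely for illustration (the actual function in examples/embedding may differ in signature and use the project's LOG macros), an isolated raw-output helper in this spirit could look like:

```cpp
// Illustrative sketch only: print embeddings as raw space-separated floats,
// one row per sequence, mirroring the --embd-output-format raw behaviour.
#include <cstddef>
#include <cstdio>

static void print_raw_embeddings(const float * emb, int n_seq, int n_embd) {
    for (int s = 0; s < n_seq; ++s) {
        const float * row = emb + (std::size_t) s * n_embd;
        for (int i = 0; i < n_embd; ++i) {
            std::printf("%.6f%s", row[i], i + 1 < n_embd ? " " : "");
        }
        std::printf("\n");
    }
}
```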

🚀 Next Steps

  1. Extend CI test coverage for all embedding CLI flags.
  2. Abstract the core embedding code from the examples into a shared utility (e.g. common/embedding_utils.cpp; see the sketch after this list).
  3. Integrate that abstraction into llama-server for local "/embedding" requests (while maintaining CLI endpoints and backwards compatibility).
    a. Extend CI coverage for concurrent (--parallel) embedding tests.

(This list may well grow beyond three steps.)
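
To make step 2 concrete, a hypothetical shape for the shared utility; none of this exists in the repository yet, and the eventual header, types, and signatures may well differ:

```cpp
// Hypothetical common/embedding_utils.h sketch -- an assumption about the
// eventual shared API, not existing code.
#pragma once

#include <string>
#include <vector>

struct llama_context;  // from llama.h

struct common_embd_result {
    std::vector<float> values;  // flattened [n_seq x n_embd]
    int                n_embd = 0;
};

// One embedding path shared by the CLI and llama-server: tokenize, encode,
// and (optionally) normalize each input.
common_embd_result common_embd_batch(
        llama_context *                  ctx,
        const std::vector<std::string> & inputs,
        bool                             normalize = true);
```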

Note:
This PR adds a self-contained workflow (embeddings.yml) that runs embedding end-to-end tests on both feature branches and master.
It is safe to merge into upstream as-is — the workflow will automatically execute in the main CI environment once on master.

yeahdongcn and others added 30 commits September 28, 2025 16:38
…ired (#16264)

* common : fix reasoning before forced tool call via tool_choice = required

* common : improve reasoning and commentary handling when tool_choice is required

(cherry picked from commit c746984956d6882c2de73d53ae2bb3bdf889e475)

---------

Co-authored-by: Alde Rojas <[email protected]>
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32

* add test that fails on simd
Adds additional percentile data to the output of `llama-perplexity --kl-divergence`:
- Added the 95 percentile (mirroring the existing 5 percentile)
- Added the 0.1 percentile (mirroring the existing 99.9 percentile)
* tools/main: llama-cli: prevent spurious assistant token (#13402)

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

Fixes #13402.

Signed-off-by: Vinkal Chudgar <[email protected]>

* Update tools/main/main.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* tools/main: remove outdated comment

Signed-off-by: Vinkal Chudgar <[email protected]>

---------

Signed-off-by: Vinkal Chudgar <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
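
A condensed sketch of the guard described in the llama-cli fix above. It is not the literal diff in tools/main/main.cpp; the variable names (assistant_ss, smpl) follow the commit description, and the helper names are the common/ API as of recent llama.cpp, which may differ by version:

```cpp
// Condensed sketch of the fix above: only a token sampled in this step,
// and only if it is not end-of-generation, is appended to the assistant
// chat message buffer. Prompt-side tokens never reach assistant_ss.
#include <sstream>

#include "common.h"    // common_token_to_piece
#include "sampling.h"  // common_sampler_sample, common_sampler_accept
#include "llama.h"

static void append_sampled_token(llama_context * ctx, const llama_vocab * vocab,
                                 common_sampler * smpl, std::ostringstream & assistant_ss) {
    const llama_token id = common_sampler_sample(smpl, ctx, /*idx=*/-1);
    common_sampler_accept(smpl, id, /*accept_grammar=*/true);

    if (!llama_vocab_is_eog(vocab, id)) {
        assistant_ss << common_token_to_piece(ctx, id);
    }
}
```
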
…witching to nullish coalescing for field values and default placeholders (#16312)
* fix: Always show conversation item actions

* feat: Improve Alert Dialog and Dialog mobile UI

* feat: Add settings reset to default confirmation

* fix: Close Edit dialog on save

* chore: update webui build output

* webui: implement proper z-index system and scroll management

- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture

* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides

* feat: Use `dvh` instead of computed px height for dialogs max height on mobile

* chore: update webui build output

* feat: Improve Settings fields UI

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Pascal <[email protected]>
* check cuda argsort limits and add test

* add metal check
…rary fails (#16172)

This PR adds additional information to the error message shown when loading a backend library via ld_load_library() fails. This helps spot why the backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
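
A generic illustration of the idea (POSIX dlopen shown; the actual ggml loader also handles Windows and wraps this differently): surface the loader's own error string so the cause of the failure is visible in the log.

```cpp
// Generic sketch, not the actual ggml backend-registry code.
#include <cstdio>
#include <dlfcn.h>

static void * load_backend_lib(const char * path) {
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        // dlerror() reports *why* the load failed (missing file, missing
        // dependency, unresolved symbol, ...), which is the extra detail
        // this change adds to the error message.
        std::fprintf(stderr, "failed to load backend library %s: %s\n",
                     path, dlerror());
    }
    return handle;
}
```
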
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)

* sync : whisper.cpp
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixed for review comments, file renamed and format

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fixed for code format, after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: bug fixed for cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* bugfix for build-linux-cross.yml syntax error

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows

* ci : move AMD jobs to separate workflow

* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences

* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels

* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
ngxson and others added 10 commits October 27, 2025 23:12
* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite
* Add LFM2 tool handling

* fmt

* Apply suggestion from @ykhrustalev
* feat: Add SYCL backend support for SSM_CONV operator

* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification

* feat: Implement SYCL backend support for SSM_CONV operation

- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)

* Perfect SSM_CONV SYCL implementation - 100% CPU parity

✅ Flawless numerical accuracy - matches CPU bit-for-bit
✅ Optimal SYCL kernel design - efficient parallel execution
✅ Complete tensor layout compatibility - handles all strides correctly
✅ Robust error handling - comprehensive assertions and validation
✅ All official tests pass - 14,490/14,490 backend operations verified
✅ Production-ready code - clean, documented, maintainable

Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.

* Clean SSM_CONV code - remove all comments for production

Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.

* fix: Final formatting corrections for CI compliance

- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation

* sycl: fix trailing whitespace and minor safety casts in ssm_conv

* fix: Clean up duplicated content in ssm_conv.hpp header file

---------

Co-authored-by: tamarPal <[email protected]>
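
For context on the operation itself, here is a plain CPU reference of a causal, depthwise sliding-window 1D convolution of the kind SSM_CONV performs. This deliberately ignores ggml's real tensor layout, strides, and batching; it only illustrates the algorithm the SYCL kernel parallelizes:

```cpp
// Plain reference of a depthwise sliding-window 1D convolution
// (illustrative only; not ggml's actual SSM_CONV tensor layout).
#include <cstddef>
#include <vector>

// x: [n_channels][n_tokens + n_kernel - 1]  (left-padded with past state)
// w: [n_channels][n_kernel]
// y: [n_channels][n_tokens]
static void ssm_conv_ref(const std::vector<std::vector<float>> & x,
                         const std::vector<std::vector<float>> & w,
                         std::vector<std::vector<float>> & y) {
    const std::size_t n_channels = x.size();
    y.resize(n_channels);
    for (std::size_t c = 0; c < n_channels; ++c) {
        const std::size_t n_kernel = w[c].size();
        const std::size_t n_tokens = x[c].size() - (n_kernel - 1);
        y[c].assign(n_tokens, 0.0f);
        for (std::size_t t = 0; t < n_tokens; ++t) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n_kernel; ++k) {
                acc += w[c][k] * x[c][t + k];  // window slides one token at a time
            }
            y[c][t] = acc;
        }
    }
}
```
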
* cann: improve device ID handling and aclnnArange checks

- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.

* cann: use thread local var
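
A generic illustration of the pattern described in this commit (not the actual CANN backend code): the current device ID is cached in a thread-local variable instead of being retrieved from the runtime's internal state on every call.

```cpp
// Generic sketch of the thread-local device-ID pattern (not CANN code).
static thread_local int g_current_device = -1;

static void backend_set_device(int device) {
    if (g_current_device != device) {
        // the real backend would call the runtime's set-device API here,
        // e.g. an aclrtSetDevice-style function
        g_current_device = device;
    }
}

static int backend_get_device() {
    // per-thread view of the active device, no runtime query needed
    return g_current_device;
}
```
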
* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions - No Performance Impact

  • llama_decode(): Response Time: 49,003,724 ns (no change)
  • llama_encode(): Response Time: 12,329,178 ns (no change)
  • llama_tokenize(): Response Time: 834,828 ns (no change)
  • llama_batch_init(): Response Time: 257 ns (no change)
  • llama_memory_clear(): Response Time: 49 ns (no change)

Functions with Minimal Degradation

  • _ZNK27llama_kv_cache_iswa_context7get_swaEv (KV cache getter): +0.071% throughput degradation (97 ns vs 97 ns)
  • llama_sampler_init_typical: +0.110% bottleneck degradation (68 ns vs 68 ns)
  • ggml_get_max_tensor_size@plt: +0.045% response time degradation (8 ns vs 8 ns)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No changes detected in tokenization/inference critical path functions

  • llama_decode(): No performance change (49.0 million ns baseline)
  • llama_encode(): No performance change (12.3 million ns baseline)
  • llama_tokenize(): No performance change (834,828 ns baseline)

Reference Impact: Based on the provided benchmark (ollama://smollm:135m on a 12th Gen Intel i7-1255U), a 2 ms increase in llama_decode() results in a 7% tokens/second reduction. The current analysis shows no measurable change in llama_decode() response time.
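
As a back-of-the-envelope consistency check on that reference figure (not part of the measured data): tokens per second scale as the inverse of per-call decode latency, so an added latency Δt on a per-call baseline t reduces throughput by Δt / (t + Δt). For the quoted numbers:

$$\frac{\Delta t}{t + \Delta t} = 0.07,\quad \Delta t = 2\,\text{ms} \;\Rightarrow\; t = 2\,\text{ms}\cdot\frac{0.93}{0.07} \approx 27\,\text{ms}$$

i.e. a per-call decode baseline of roughly 27 ms for that benchmark setup, which is plausible for a small model on a laptop CPU.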

2. Power Consumption - Stable

Binary-Level Analysis:

  • build.bin.libllama.so: -0.0% change (306,977.93 nJ vs 306,978.59 nJ)
  • build.bin.libggml.so: 0.0% change (6,339.24 nJ)
  • build.bin.libggml-cpu.so: 0.0% change (151,692.17 nJ)
  • build.bin.libggml-base.so: 0.0% change (90,434.19 nJ)

3. Quantization Efficiency - No Impact

Status: No changes in quantization-related functions

  • llama_model_quantize(): Not measured (function not active in current analysis)
  • Quantization format handling: No performance changes detected

4. Memory Usage - Minimal Impact

Affected Functions:

  • _ZNK27llama_kv_cache_iswa_context7get_swaEv: +0.071% throughput degradation
    • Control Flow: Identical CFG structure between versions
    • Root Cause: Memory layout or cache alignment changes, not code modifications
    • Impact: Negligible effect on KV cache access patterns

5. Batch Processing - No Impact

Status: No changes in batch processing functions

  • llama_batch_init(): No performance change (257 ns baseline)
  • llama_batch_get_one(): Not measured in current analysis
  • llama_batch_free(): Not measured in current analysis

Control Flow Analysis

KV Cache Function (get_swa)

CFG Comparison: Identical control flow structure between versions

  • Same branching patterns and instruction sequences
  • Degradation attributed to external factors (memory layout, cache alignment)
  • No algorithmic or structural changes

Action Items

Immediate Actions

  1. Monitor KV Cache Access Patterns: Profile memory access in llama_kv_cache_iswa_context for cache line alignment optimization
  2. Verify Memory Layout: Check for changes in struct padding or member alignment affecting cache performance
  3. Validate Compiler Optimizations: Ensure consistent optimization flags between builds

Code-Focused Optimizations

  1. KV Cache Structure Alignment: Review llama_kv_cache_iswa_context struct layout for optimal memory alignment
  2. Function Inlining: Consider inlining small getter functions like get_swa() to eliminate call overhead (see the sketch after this list)
  3. Memory Prefetching: Add prefetch hints for frequently accessed KV cache members
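
Illustrative only, and assuming nothing about the real llama_kv_cache_iswa_context layout: a sketch of items 2 and 3 above, i.e. a header-defined getter the compiler can inline plus an explicit prefetch hint (GCC/Clang builtin) for a hot member.

```cpp
// Illustrative only -- not the actual llama_kv_cache_iswa_context layout.
struct kv_cache_iswa_context_sketch {
    void * swa  = nullptr;  // hot member, accessed on every token
    void * base = nullptr;

    // defined in the header so the compiler can inline the accessor
    inline void * get_swa() const noexcept { return swa; }
};

static inline void touch_kv_context(const kv_cache_iswa_context_sketch & ctx) {
    // GCC/Clang builtin: prefetch for read (0), high temporal locality (3)
    __builtin_prefetch(&ctx.swa, 0, 3);
}
```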

Build System Improvements

  1. Consistent Build Flags: Standardize optimization levels and alignment settings across builds
  2. Profile-Guided Optimization: Enable PGO for hot path functions in production builds
  3. Link-Time Optimization: Enable LTO for cross-module optimization of frequently called functions

Conclusion

The analysis reveals stable performance across all critical inference functions. The minimal degradations (< 0.11%) in auxiliary functions represent measurement variance rather than functional regressions. Core inference performance remains unaffected, with no impact on tokens per second throughput or power consumption efficiency.
