Conversation

@DajanaV commented Nov 1, 2025

Mirrored from ggml-org/llama.cpp#16923

🧩 Summary

This PR adds a CI workflow for end-to-end embedding tests.
It is the first phase of an effort to move an abstraction of the existing examples/llama-embedding logic behind llama-server, so the server can produce embeddings with llama.cpp's own implementation instead of relying on external (OpenAI) APIs.

🎯 Motivation & Future

llama-server currently accepts OpenAI-compatible /embedding requests, but those are not backed by the native llama.cpp embedding logic in examples/llama-embedding.
This workflow establishes a reproducible test foundation before the embedding code is refactored, so that:

  • The server can generate embeddings locally.
  • --parallel N can support multiple concurrent embedding requests.
  • The standalone CLI will remain for lightweight workflows, while the server will use the same shared embedding path for persistent deployments.

⚙️ CI Implementation

  • Adds a GitHub Actions job to run embedding E2E tests with cached GGUF models (TinyLlama).
  • Verifies embedding output dimensions and deterministic behavior (see the sketch after this list).
  • Uses lightweight models for fast CI runs (with an optional large model test).
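
For illustration, a minimal sketch of the dimension and determinism checks mentioned above, assuming the test harness already has two embedding vectors from separate runs of the same prompt; the function and variable names here are hypothetical, not part of the workflow:

```cpp
// Minimal sketch of the dimension/determinism check (hypothetical names).
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

static bool check_embedding(const std::vector<float> & a,
                            const std::vector<float> & b,
                            std::size_t expected_dim,
                            float tol = 1e-6f) {
    if (a.size() != expected_dim || b.size() != expected_dim) {
        std::fprintf(stderr, "dimension mismatch: %zu / %zu (expected %zu)\n",
                     a.size(), b.size(), expected_dim);
        return false;
    }
    for (std::size_t i = 0; i < expected_dim; ++i) {
        if (std::fabs(a[i] - b[i]) > tol) {  // determinism: repeated runs must agree
            std::fprintf(stderr, "element %zu differs: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}
```

In the workflow, a check of this kind would typically be driven by running the embedding binary twice on the same input and comparing the two outputs.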

🧱 Embedding CPP Logic Flow Update

A small cleanup in print_raw_embeddings() improves readability, logic flow, and isolation.
Although minor, this change is modular alongside the CI workflow changes: it touches a vertical slice of the embedding flow without altering evaluation, model logic, or any interface. (Expecting changes to stay purely horizontal and small tends to ossify software and make it brittle.) A sketch of what such an isolated helper can look like follows.
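
Purely for illustration (the actual function in examples/embedding may differ in signature and use the project's LOG macros), an isolated raw-output helper in this spirit could look like:

```cpp
// Illustrative sketch only: print embeddings as raw space-separated floats,
// one row per sequence, mirroring the --embd-output-format raw behaviour.
#include <cstddef>
#include <cstdio>

static void print_raw_embeddings(const float * emb, int n_seq, int n_embd) {
    for (int s = 0; s < n_seq; ++s) {
        const float * row = emb + (std::size_t) s * n_embd;
        for (int i = 0; i < n_embd; ++i) {
            std::printf("%.6f%s", row[i], i + 1 < n_embd ? " " : "");
        }
        std::printf("\n");
    }
}
```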

🚀 Next Steps

  1. Extend CI test coverage for all embedding CLI flags.
  2. Abstract the core embedding code from the examples into a shared utility (e.g. common/embedding_utils.cpp; see the sketch after this list).
  3. Integrate that abstraction into llama-server for local "/embedding" requests (while maintaining CLI endpoints and backwards compatibility).
    a. Extend CI coverage for concurrent (--parallel) embedding tests.

(This list may well grow beyond three steps.)
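
To make step 2 concrete, a hypothetical shape for the shared utility; none of this exists in the repository yet, and the eventual header, types, and signatures may well differ:

```cpp
// Hypothetical common/embedding_utils.h sketch -- an assumption about the
// eventual shared API, not existing code.
#pragma once

#include <string>
#include <vector>

struct llama_context;  // from llama.h

struct common_embd_result {
    std::vector<float> values;  // flattened [n_seq x n_embd]
    int                n_embd = 0;
};

// One embedding path shared by the CLI and llama-server: tokenize, encode,
// and (optionally) normalize each input.
common_embd_result common_embd_batch(
        llama_context *                  ctx,
        const std::vector<std::string> & inputs,
        bool                             normalize = true);
```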

Note:
This PR adds a self-contained workflow (embeddings.yml) that runs embedding end-to-end tests on both feature branches and master.
It is safe to merge into upstream as-is — the workflow will automatically execute in the main CI environment once on master.

yeahdongcn and others added 30 commits September 28, 2025 16:38
…ired (#16264)

* common : fix reasoning before forced tool call via tool_choice = required

* common : improve reasoning and commentary handling when tool_choice is required

(cherry picked from commit c746984956d6882c2de73d53ae2bb3bdf889e475)

---------

Co-authored-by: Alde Rojas <[email protected]>
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32

* add test that fails on simd
Adds additional percentile data to the output of `llama-perplexity --kl-divergence`:
- Added the 95 percentile (mirroring the existing 5 percentile)
- Added the 0.1 percentile (mirroring the existing 99.9 percentile)
* tools/main: llama-cli: prevent spurious assistant token (#13402)

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

Fixes #13402.

Signed-off-by: Vinkal Chudgar <[email protected]>

* Update tools/main/main.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* tools/main: remove outdated comment

Signed-off-by: Vinkal Chudgar <[email protected]>

---------

Signed-off-by: Vinkal Chudgar <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
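
A condensed sketch of the guard described in the llama-cli fix above. It is not the literal diff in tools/main/main.cpp; the variable names (assistant_ss, smpl) follow the commit description, and the helper names are the common/ API as of recent llama.cpp, which may differ by version:

```cpp
// Condensed sketch of the fix above: only a token sampled in this step,
// and only if it is not end-of-generation, is appended to the assistant
// chat message buffer. Prompt-side tokens never reach assistant_ss.
#include <sstream>

#include "common.h"    // common_token_to_piece
#include "sampling.h"  // common_sampler_sample, common_sampler_accept
#include "llama.h"

static void append_sampled_token(llama_context * ctx, const llama_vocab * vocab,
                                 common_sampler * smpl, std::ostringstream & assistant_ss) {
    const llama_token id = common_sampler_sample(smpl, ctx, /*idx=*/-1);
    common_sampler_accept(smpl, id, /*accept_grammar=*/true);

    if (!llama_vocab_is_eog(vocab, id)) {
        assistant_ss << common_token_to_piece(ctx, id);
    }
}
```
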
…witching to nullish coalescing for field values and default placeholders (#16312)
* fix: Always show conversation item actions

* feat: Improve Alert Dialog and Dialog mobile UI

* feat: Add settings reset to default confirmation

* fix: Close Edit dialog on save

* chore: update webui build output

* webui: implement proper z-index system and scroll management

- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture

* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides

* feat: Use `dvh` instead of computed px height for dialogs max height on mobile

* chore: update webui build output

* feat: Improve Settings fields UI

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Pascal <[email protected]>
* check cuda argsort limits and add test

* add metal check
…rary fails (#16172)

This PR adds additional information to the error message shown when loading a backend library via ld_load_library() fails. This helps spot why the backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
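
A generic illustration of the idea (POSIX dlopen shown; the actual ggml loader also handles Windows and wraps this differently): surface the loader's own error string so the cause of the failure is visible in the log.

```cpp
// Generic sketch, not the actual ggml backend-registry code.
#include <cstdio>
#include <dlfcn.h>

static void * load_backend_lib(const char * path) {
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        // dlerror() reports *why* the load failed (missing file, missing
        // dependency, unresolved symbol, ...), which is the extra detail
        // this change adds to the error message.
        std::fprintf(stderr, "failed to load backend library %s: %s\n",
                     path, dlerror());
    }
    return handle;
}
```
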
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)

* sync : whisper.cpp
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixed for review comments, file renamed and format

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fixed for code format, after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: bug fixed for cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* bugfix for build-linux-cross.yml syntax error

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows

* ci : move AMD jobs to separate workflow

* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences

* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels

* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
ngxson and others added 10 commits October 27, 2025 23:12
* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite
* Add LFM2 tool handling

* fmt

* Apply suggestion from @ykhrustalev
* feat: Add SYCL backend support for SSM_CONV operator

* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification

* feat: Implement SYCL backend support for SSM_CONV operation

- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)

* Perfect SSM_CONV SYCL implementation - 100% CPU parity

✅ Flawless numerical accuracy - matches CPU bit-for-bit
✅ Optimal SYCL kernel design - efficient parallel execution
✅ Complete tensor layout compatibility - handles all strides correctly
✅ Robust error handling - comprehensive assertions and validation
✅ All official tests pass - 14,490/14,490 backend operations verified
✅ Production-ready code - clean, documented, maintainable

Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.

* Clean SSM_CONV code - remove all comments for production

Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.

* fix: Final formatting corrections for CI compliance

- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation

* sycl: fix trailing whitespace and minor safety casts in ssm_conv

* fix: Clean up duplicated content in ssm_conv.hpp header file

---------

Co-authored-by: tamarPal <[email protected]>
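
For context on the operation itself, here is a plain CPU reference of a causal, depthwise sliding-window 1D convolution of the kind SSM_CONV performs. This deliberately ignores ggml's real tensor layout, strides, and batching; it only illustrates the algorithm the SYCL kernel parallelizes:

```cpp
// Plain reference of a depthwise sliding-window 1D convolution
// (illustrative only; not ggml's actual SSM_CONV tensor layout).
#include <cstddef>
#include <vector>

// x: [n_channels][n_tokens + n_kernel - 1]  (left-padded with past state)
// w: [n_channels][n_kernel]
// y: [n_channels][n_tokens]
static void ssm_conv_ref(const std::vector<std::vector<float>> & x,
                         const std::vector<std::vector<float>> & w,
                         std::vector<std::vector<float>> & y) {
    const std::size_t n_channels = x.size();
    y.resize(n_channels);
    for (std::size_t c = 0; c < n_channels; ++c) {
        const std::size_t n_kernel = w[c].size();
        const std::size_t n_tokens = x[c].size() - (n_kernel - 1);
        y[c].assign(n_tokens, 0.0f);
        for (std::size_t t = 0; t < n_tokens; ++t) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n_kernel; ++k) {
                acc += w[c][k] * x[c][t + k];  // window slides one token at a time
            }
            y[c][t] = acc;
        }
    }
}
```
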
* cann: improve device ID handling and aclnnArange checks

- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.

* cann: use thread local var
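
A generic illustration of the pattern described in this commit (not the actual CANN backend code): the current device ID is cached in a thread-local variable instead of being retrieved from the runtime's internal state on every call.

```cpp
// Generic sketch of the thread-local device-ID pattern (not CANN code).
static thread_local int g_current_device = -1;

static void backend_set_device(int device) {
    if (g_current_device != device) {
        // the real backend would call the runtime's set-device API here,
        // e.g. an aclrtSetDevice-style function
        g_current_device = device;
    }
}

static int backend_get_device() {
    // per-thread view of the active device, no runtime query needed
    return g_current_device;
}
```
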
* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions - No Performance Impact

  • llama_decode(): Response Time: 49,003,724 ns (no change)
  • llama_encode(): Response Time: 12,329,178 ns (no change)
  • llama_tokenize(): Response Time: 834,828 ns (no change)
  • llama_batch_init(): Response Time: 257 ns (no change)
  • llama_memory_clear(): Response Time: 49 ns (no change)

Functions with Minimal Degradation

  • _ZNK27llama_kv_cache_iswa_context7get_swaEv (KV cache getter): +0.071% throughput degradation (97 ns vs 97 ns)
  • llama_sampler_init_typical: +0.110% bottleneck degradation (68 ns vs 68 ns)
  • ggml_get_max_tensor_size@plt: +0.045% response time degradation (8 ns vs 8 ns)

KPI Impact Analysis

1. Tokens Per Second - No Impact

Status: No changes detected in tokenization/inference critical path functions

  • llama_decode(): No performance change (49.0 million ns baseline)
  • llama_encode(): No performance change (12.3 million ns baseline)
  • llama_tokenize(): No performance change (834,828 ns baseline)

Reference Impact: Based on the provided benchmark (ollama://smollm:135m on a 12th Gen Intel i7-1255U), a 2 ms increase in llama_decode() results in a 7% tokens/second reduction. The current analysis shows no measurable change in llama_decode() response time.
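
As a back-of-the-envelope consistency check on that reference figure (not part of the measured data): tokens per second scale as the inverse of per-call decode latency, so an added latency Δt on a per-call baseline t reduces throughput by Δt / (t + Δt). For the quoted numbers:

$$\frac{\Delta t}{t + \Delta t} = 0.07,\quad \Delta t = 2\,\text{ms} \;\Rightarrow\; t = 2\,\text{ms}\cdot\frac{0.93}{0.07} \approx 27\,\text{ms}$$

i.e. a per-call decode baseline of roughly 27 ms for that benchmark setup, which is plausible for a small model on a laptop CPU.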

2. Power Consumption - Stable

Binary-Level Analysis:

  • build.bin.libllama.so: -0.0% change (306,977.93 nJ vs 306,978.59 nJ)
  • build.bin.libggml.so: 0.0% change (6,339.24 nJ)
  • build.bin.libggml-cpu.so: 0.0% change (151,692.17 nJ)
  • build.bin.libggml-base.so: 0.0% change (90,434.19 nJ)

3. Quantization Efficiency - No Impact

Status: No changes in quantization-related functions

  • llama_model_quantize(): Not measured (function not active in current analysis)
  • Quantization format handling: No performance changes detected

4. Memory Usage - Minimal Impact

Affected Functions:

  • _ZNK27llama_kv_cache_iswa_context7get_swaEv: +0.071% throughput degradation
    • Control Flow: Identical CFG structure between versions
    • Root Cause: Memory layout or cache alignment changes, not code modifications
    • Impact: Negligible effect on KV cache access patterns

5. Batch Processing - No Impact

Status: No changes in batch processing functions

  • llama_batch_init(): No performance change (257 ns baseline)
  • llama_batch_get_one(): Not measured in current analysis
  • llama_batch_free(): Not measured in current analysis

Control Flow Analysis

KV Cache Function (get_swa)

CFG Comparison: Identical control flow structure between versions

  • Same branching patterns and instruction sequences
  • Degradation attributed to external factors (memory layout, cache alignment)
  • No algorithmic or structural changes

Action Items

Immediate Actions

  1. Monitor KV Cache Access Patterns: Profile memory access in llama_kv_cache_iswa_context for cache line alignment optimization
  2. Verify Memory Layout: Check for changes in struct padding or member alignment affecting cache performance
  3. Validate Compiler Optimizations: Ensure consistent optimization flags between builds

Code-Focused Optimizations

  1. KV Cache Structure Alignment: Review llama_kv_cache_iswa_context struct layout for optimal memory alignment
  2. Function Inlining: Consider inlining small getter functions like get_swa() to eliminate call overhead (see the sketch after this list)
  3. Memory Prefetching: Add prefetch hints for frequently accessed KV cache members
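
Illustrative only, and assuming nothing about the real llama_kv_cache_iswa_context layout: a sketch of items 2 and 3 above, i.e. a header-defined getter the compiler can inline plus an explicit prefetch hint (GCC/Clang builtin) for a hot member.

```cpp
// Illustrative only -- not the actual llama_kv_cache_iswa_context layout.
struct kv_cache_iswa_context_sketch {
    void * swa  = nullptr;  // hot member, accessed on every token
    void * base = nullptr;

    // defined in the header so the compiler can inline the accessor
    inline void * get_swa() const noexcept { return swa; }
};

static inline void touch_kv_context(const kv_cache_iswa_context_sketch & ctx) {
    // GCC/Clang builtin: prefetch for read (0), high temporal locality (3)
    __builtin_prefetch(&ctx.swa, 0, 3);
}
```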

Build System Improvements

  1. Consistent Build Flags: Standardize optimization levels and alignment settings across builds
  2. Profile-Guided Optimization: Enable PGO for hot path functions in production builds
  3. Link-Time Optimization: Enable LTO for cross-module optimization of frequently called functions

Conclusion

The analysis reveals stable performance across all critical inference functions. The minimal degradations (< 0.11%) in auxiliary functions represent measurement variance rather than functional regressions. Core inference performance remains unaffected, with no impact on tokens per second throughput or power consumption efficiency.
