UPSTREAM PR #16923: Add e2e tests for embedding raw flag #36
Conversation
Signed-off-by: Xiaodong Ye <[email protected]>
…ired (#16264)
* common : fix reasoning before forced tool call via tool_choice = required
* common : improve reasoning and commentary handling when tool_choice is required (cherry picked from commit c746984956d6882c2de73d53ae2bb3bdf889e475)
---------
Co-authored-by: Alde Rojas <[email protected]>
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32
* add test that fails on simd
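For context on why this class of bug slips through review: all three FMA operands share a type, so a swapped argument still compiles and only shows up numerically, which is why the accompanying SIMD test matters. A toy sketch using `std::fmaf`, not the actual `GGML_F32_VEC_FMA` macro:

```cpp
#include <cmath>
#include <cstdio>

// Toy illustration, not the ggml code: fma(a, b, c) computes a*b + c, so an
// operand in the wrong slot compiles cleanly but yields a different value.
int main() {
    const float x = 2.0f, scale = 3.0f, bias = 1.0f;
    const float correct = std::fmaf(x, scale, bias); // 2*3 + 1 = 7
    const float swapped = std::fmaf(x, bias, scale); // 2*1 + 3 = 5
    std::printf("correct = %g, swapped = %g\n", correct, swapped);
    return 0;
}
```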
Adds additional percentile data to the output of `llama-perplexity --kl-divergence`:
- Added the 95th percentile (mirroring the existing 5th percentile)
- Added the 0.1 percentile (mirroring the existing 99.9 percentile)
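For illustration, a hedged sketch of how such percentiles can be read off the sorted divergence samples (a hypothetical nearest-rank helper, not the actual `llama-perplexity` code, which may interpolate):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical nearest-rank percentile helper; assumes v is non-empty.
// Operates on a copy so the caller's sample order is preserved.
static float percentile(std::vector<float> v, float pct) {
    std::sort(v.begin(), v.end());
    const std::size_t idx = (std::size_t)(pct / 100.0f * (float)(v.size() - 1));
    return v[idx];
}

// Usage sketch: percentile(kld, 95.0f) mirrors the existing 5th percentile,
// and percentile(kld, 0.1f) mirrors the existing 99.9th percentile.
```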
* tools/main: llama-cli: prevent spurious assistant token (#13402)

  During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

  Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.

  Fixes #13402.

  Signed-off-by: Vinkal Chudgar <[email protected]>

* Update tools/main/main.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>

* tools/main: remove outdated comment
  Signed-off-by: Vinkal Chudgar <[email protected]>

---------
Signed-off-by: Vinkal Chudgar <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
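A minimal model of the fixed condition (names are stand-ins; the real control flow lives in `tools/main/main.cpp`):

```cpp
#include <sstream>
#include <string>

// Toy model of the fix: only a token sampled in the current step may be
// added to the assistant message buffer -- tokens accepted from the prompt
// during ingestion must not be.
struct Step {
    bool sampled_new_token; // true only after sampling in this iteration
    bool is_eog;            // end-of-generation token?
    std::string piece;      // detokenized text of the token
};

void accumulate_assistant(std::ostringstream & assistant_ss, const Step & s) {
    if (s.sampled_new_token && !s.is_eog) {
        assistant_ss << s.piece; // chat message assembly only; stdout unchanged
    }
}
```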
…witching to nullish coalescing for field values and default placeholders (#16312)
* fix: Always show conversation item actions
* feat: Improve Alert Dialog and Dialog mobile UI
* feat: Add settings reset to default confirmation
* fix: Close Edit dialog on save
* chore: update webui build output
* webui: implement proper z-index system and scroll management
  - Add CSS variable for centralized z-index control
  - Fix dropdown positioning with Settings dialog conflicts
  - Prevent external scroll interference with proper event handling
  - Clean up hardcoded z-index values for maintainable architecture
* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides
* feat: Use `dvh` instead of computed px height for dialogs max height on mobile
* chore: update webui build output
* feat: Improve Settings fields UI
* chore: update webui build output
* chore: update webui build output
---------
Co-authored-by: Pascal <[email protected]>
* check cuda argsort limits and add test
* add metal check
…rary fails (#16172)

This PR adds additional information to the error message shown when loading a backend library via ld_load_library() fails. This helps spot why the backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
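The idea, sketched with POSIX `dlopen`/`dlerror` for concreteness (the actual `ld_load_library()` loader is platform-abstracted, so treat this as an assumption-laden sketch):

```cpp
#include <dlfcn.h>
#include <cstdio>

// Sketch: surface the dynamic loader's own diagnostic instead of a bare
// failure, so the user can see *why* the backend did not load (missing
// library, missing dependency, unresolved symbol, ...).
void * load_backend_verbose(const char * path) {
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        std::fprintf(stderr, "failed to load backend from %s: %s\n",
                     path, dlerror());
    }
    return handle;
}
```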
This commit removes the `-dev` suffix from the version string in CMakeLists.txt and the release script. The version will now be formatted simply as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)
* sync : whisper.cpp
* ggml: add spacemit backend
  Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23
* add new line at end of file
  Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2
* add riscv zba extension limit
  Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce
* fixed for review comments, file renamed and format
  Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce
* fixed for code format, after clang-format
  Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2
* use _Float16 instead of __fp16
  Change-Id: I039fb02bb95270e641bc4442204e658735859d43
* add ci for riscv64-spacemit-ime-native
  Change-Id: I711c1033061df1a289ea77891b2997599dfe8279
* update debian-13-riscv64-spacemit-ime-native ci label
  Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a
* remove license comment for spacemit ime
  Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3
* upgrade binutils for gcc ime
  Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45
* add spacemit ime cross jobs
  Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6
* remove native compile for riscv64-spacemit-ime
  Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e
* ci : add caching for spacemit ime cross toolchain
  Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de
* ci: bug fixed for cache path and env
  Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a
* Update .github/workflows/build-linux-cross.yml for cache path
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* bugfixed for build-linux-cross.yml, syntax error
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
---------
Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows
* ci : move AMD jobs to separate workflow
* cont : fix paths
…locks (#16326)
* fix: prevent reasoning blocks with quotes from being truncated
* chore: update webui build output
* feat: Improve thinking content parsing
* test: Adds ChatMessage component stories for different thinking blocks
* chore: update webui build output
* fix: ChatMessage story fix
---------
Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)
* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences
* apply similar error bounds to test_cpy
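In `test-backend-ops`, per-test tolerance is declared by overriding a virtual on the test case. A self-contained model of the pattern (the base class here is a stand-in for the real one in `test-backend-ops.cpp`, and the tolerance values are illustrative, not the ones from the change):

```cpp
// Minimal model: each test case overrides max_nmse_err() to declare how much
// backend-vs-reference rounding difference it tolerates.
struct test_case {
    virtual double max_nmse_err() const { return 1e-7; } // default bound
    virtual ~test_case() = default;
};

struct test_set_rows : test_case {
    double max_nmse_err() const override {
        return 1e-6; // loosened: occasional rounding differences are expected
    }
};
```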
The JSON parser is temporarily kept only for backward compatibility. It reads the etag from old .json files to prevent unnecessary re-downloads for existing users. This legacy code can be removed in a future version. Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels
* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs
* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently, regardless of whether CURL or SSL support is available, and now always points to `ggml.ai`. The previous timeout test has been removed, but it can be added back by providing a dedicated URL under `ggml.ai`. Signed-off-by: Adrien Gallouët <[email protected]>
* mtmd : fix idefics3 preprocessing
* disable granite test
* fix test for granite
* Add LFM2 tool handling
* fmt
* Apply suggestion from @ykhrustalev
* feat: Add SYCL backend support for SSM_CONV operator
* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification
* feat: Implement SYCL backend support for SSM_CONV operation
  - Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
  - Implement SYCL kernel for state space model convolution
  - Ensure numerical correctness matches CPU implementation exactly
  - Add proper type checking for F32 tensors in backend support
  - All test-backend-ops SSM_CONV tests pass (14490/14490)
* Perfect SSM_CONV SYCL implementation - 100% CPU parity
  ✅ Flawless numerical accuracy - matches CPU bit-for-bit
  ✅ Optimal SYCL kernel design - efficient parallel execution
  ✅ Complete tensor layout compatibility - handles all strides correctly
  ✅ Robust error handling - comprehensive assertions and validation
  ✅ All official tests pass - 14,490/14,490 backend operations verified
  ✅ Production-ready code - clean, documented, maintainable
  Implements state-space model 1D convolution with sliding window algorithm. Eliminates blocking queue.wait() for better async performance.
* Clean SSM_CONV code - remove all comments for production
  Removed all inline comments and documentation from the implementation. Clean, minimal code ready for production merge.
* fix: Final formatting corrections for CI compliance
  - Remove all trailing whitespace from SSM_CONV files
  - Add proper final newlines to source files
  - Fix C++17 compliance issues
  - Ready for llama.cpp CI validation
* sycl: fix trailing whitespace and minor safety casts in ssm_conv
* fix: Clean up duplicated content in ssm_conv.hpp header file
---------
Co-authored-by: tamarPal <[email protected]>
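For orientation, a plain CPU reference of the sliding-window 1D convolution that the SYCL kernel parallelizes (layout and edge handling here are assumptions, not the actual `ggml-sycl/ssm_conv.cpp` code):

```cpp
#include <vector>

// CPU reference sketch of a causal sliding-window conv, one filter per
// channel, as used by state-space models. Assumes x is pre-padded to
// (n_tok + d_conv - 1) rows of n_ch channels; the real tensor layout differs.
void ssm_conv_ref(const std::vector<float> & x,  // [(n_tok + d_conv - 1) * n_ch]
                  const std::vector<float> & w,  // [d_conv * n_ch]
                  std::vector<float> & y,        // [n_tok * n_ch]
                  int n_tok, int n_ch, int d_conv) {
    for (int t = 0; t < n_tok; ++t) {
        for (int c = 0; c < n_ch; ++c) {
            float acc = 0.0f;
            // window of d_conv consecutive inputs ending at position t
            for (int k = 0; k < d_conv; ++k) {
                acc += w[k * n_ch + c] * x[(t + k) * n_ch + c];
            }
            y[t * n_ch + c] = acc;
        }
    }
}
```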
* cann: improve device ID handling and aclnnArange checks
  - Stop relying on CANN's internal device ID retrieval; use a global variable instead.
  - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.
* cann: use thread local var
* grammar : support array references in json schema
* Update json-schema-to-grammar.cpp
  Co-authored-by: Sigbjørn Skjæret <[email protected]>
* grammar : improve regex when naming ref derived rules
* grammar : replace non-conformant definitions array with anyOf test case
---------
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add --embd-output-format raw for plain numeric embedding output
  This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.
* Move raw output handling into format handling section
* Move raw output handling into else-if block with other format handlers
* Use LOG instead of printf for raw embedding output
* docs: document 'raw' embedding output format in arg.cpp and README
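A minimal sketch of what the `raw` format amounts to, based on the description above (illustrative only; the actual tool writes through `LOG` rather than `printf`):

```cpp
#include <cstdio>

// Sketch of the raw format: one embedding per line, space-separated floats,
// no JSON wrapper and no "embedding N:" prefix -- easy to pipe into
// downstream vector tooling.
void print_raw_embeddings(const float * emb, int n_embd_count, int n_embd) {
    for (int i = 0; i < n_embd_count; ++i) {
        for (int j = 0; j < n_embd; ++j) {
            std::printf(j == 0 ? "%.6f" : " %.6f", emb[i * n_embd + j]);
        }
        std::printf("\n");
    }
}
```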
Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status
- Core Inference Functions - No Performance Impact
- Functions with Minimal Degradation

KPI Impact Analysis

1. Tokens Per Second - No Impact
   Status: No changes detected in tokenization/inference critical path functions
   Reference Impact: Based on the provided benchmark (ollama://smollm:135m on 12th Gen Intel i7-1255U), a 2ms increase in

2. Power Consumption - Stable
   Binary-Level Analysis:

3. Quantization Efficiency - No Impact
   Status: No changes in quantization-related functions

4. Memory Usage - Minimal Impact
   Affected Functions:

5. Batch Processing - No Impact
   Status: No changes in batch processing functions

Control Flow Analysis
KV Cache Function (
Mirrored from ggml-org/llama.cpp#16923
🧩 Summary
This PR adds a CI workflow for end-to-end embedding tests.
It marks the first phase of an effort to move an abstraction of the existing `examples/llama-embedding` logic behind `llama-server`, so the server can use llama.cpp's own embedding implementation instead of external (OpenAI) APIs.

🎯 Motivation & Future
`llama-server` currently supports OpenAI-compatible `/embedding` requests, but those are not backed by native llama.cpp logic. This workflow establishes a reproducible test foundation before refactoring the embedding code so that:
- `--parallel N` can support multiple concurrent embedding requests.

⚙️ CI Implementation
🧱 Embedding CPP Logic Flow Update
A small cleanup in `print_raw_embeddings()` improves readability, logic flow, and isolation. Although minor, this change is modular alongside the CI workflow changes, touching a vertical slice of the embedding flow without altering evaluation, model logic, or any interface. Note that insisting on purely small, horizontal modularity ossifies software (makes it brittle).
🚀 Next Steps
a. Extend CI coverage for concurrent (--parallel) embedding tests.
(could actually become more than three steps)