server : return HTTP 400 if prompt exceeds context length #16486
base: master
Conversation
Hmm, that's strange, we have a specific error type for this: llama.cpp/tools/server/server.cpp, Lines 1268 to 1271 in 56b4795.
We also have this test case: llama.cpp/tools/server/tests/unit/test_chat_completion.py, Lines 393 to 408 in 56b4795.
I'm wondering which input leads to the 200 code that you mentioned?
The issue occurs only in streaming mode. In non-streaming mode it correctly returns 400.
In streaming mode, when the prompt exceeds the context length, the server returns an HTTP 200 status code with a JSON error in the body. This is very confusing and inconsistent with all other inference engines, which return an HTTP 4xx error in this case. This patch fixes the problem and makes the server return HTTP 400 in such cases.
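To make the difference concrete, here is a minimal, hypothetical client-side reproduction (not part of this PR). It assumes a llama.cpp server listening on localhost:8080 that was started with a deliberately small context window, and it uses the OpenAI-compatible /v1/chat/completions endpoint via the Python requests package:

```python
# Hypothetical reproduction sketch; assumes a llama.cpp server on localhost:8080
# started with a small context size (e.g. --ctx-size 256) and no context shift.
import requests

URL = "http://localhost:8080/v1/chat/completions"
# Oversized prompt: long enough to exceed the server's context window.
messages = [{"role": "user", "content": "word " * 100000}]

# Non-streaming: the request is rejected up front with HTTP 400.
r = requests.post(URL, json={"messages": messages, "stream": False})
print("non-streaming status:", r.status_code)  # expected: 400

# Streaming: before this patch the status is 200 and the error only appears
# as a JSON object inside the streamed response body; after the patch it is 400.
r = requests.post(URL, json={"messages": messages, "stream": True}, stream=True)
print("streaming status:", r.status_code)
for line in r.iter_lines():
    if line:
        print(line.decode())
```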
aac559d to 1d8b16c (Compare)
I have added a new test which covers exceeding the context in streaming mode.
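For reference, a self-contained sketch of what such a test asserts (the actual test in tools/server/tests uses the project's own fixtures and helpers, so this is only an approximation over plain HTTP):

```python
# Approximate, standalone version of the streaming exceed-context test.
# Assumes a llama.cpp server started with a small context size on localhost:8080.
import requests

def test_exceed_context_size_streaming():
    url = "http://localhost:8080/v1/chat/completions"
    oversized = [{"role": "user", "content": "token " * 10000}]

    res = requests.post(url, json={"messages": oversized, "stream": True}, stream=True)

    # With this patch the failure is reported via the HTTP status code,
    # matching the non-streaming behaviour (previously: 200 + JSON error body).
    assert res.status_code == 400
```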
if (!ctx_server.params_base.ctx_shift && n_prompt_tokens >= n_ctx_slot) {
    json error_data = format_error_response("the request exceeds the available context size. try increasing the context size or enable context shift", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
The prompt truncation functionality is being removed in #16391:

So there is no longer a need to check ctx_shift here, and accordingly no need to suggest enabling it in the error message.