Master secure ggml rpc #16281
Closed
Conversation
* fix(chat): fix streaming parser for Granite models
* tests: add test cases for Granite models chat parser
* llama-bench: add --devices support
  - Support --devices same as llama-server
  - Provide for benchmarking different device combinations
  - Include --list-devices like llama-server for convenience
* fix: field display ordering restored
* fix: integrated the RPC devices, aiming to mimic the server as much as possible
* cleanup: defaults for list-devices; handle duplicate device listing with RPC
* cleanup: remove duplicate device load calls
* docs: update llama-bench, adding the recently added n-cpu-moe option while in there
* llama-bench: RPC device simplification (see the sketch below)
  - RPC servers unify with other devices earlier, simplifying the code
  - --list-devices made stateless and simpler
  - various cleanup
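A minimal sketch of what splitting a --devices argument can look like; this is illustrative, not llama-bench's actual parser, and the device-name formats in the comment are assumptions:

```cpp
// Hedged sketch: split a comma-separated --devices value into names that a
// caller would then match against the registered backend devices.
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> parse_devices(const std::string & arg) {
    std::vector<std::string> devices;
    std::stringstream ss(arg);
    std::string name;
    while (std::getline(ss, name, ',')) {
        if (!name.empty()) {
            devices.push_back(name);   // e.g. "CUDA0" or an RPC device (assumed naming)
        }
    }
    return devices;
}
```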
…aming (ggml-org#16109)
* server: fix SSE and OpenAI compatibility for error messages when streaming
* server: remove obsolete event parameter and use required data fieldname instead
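A hedged illustration of the SSE framing this moves toward: errors go out on the standard `data:` field as an OpenAI-style JSON body rather than under a nonstandard `event:` name that OpenAI clients would ignore. The helper name and JSON shape here are illustrative assumptions:

```cpp
// Sketch only: build one SSE frame carrying an error payload on the "data:"
// field, terminated by the blank line that ends an SSE event.
#include <string>

static std::string sse_error_frame(const std::string & json_error) {
    // e.g. json_error = R"({"error":{"message":"...","type":"server_error"}})"
    return "data: " + json_error + "\n\n";
}
```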
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
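The real change is in GLSL shaders; this C++ sketch mirrors only the arithmetic of the fma-versus-dot choice, with an assumed vec2 type:

```cpp
// Hedged sketch: accumulate a 2-wide dot product with two fused
// multiply-adds instead of a dot() call, which maps well to FMA units.
#include <cmath>

struct vec2 { float x, y; };

static float accumulate(vec2 a, vec2 b, float acc) {
    // dot2 form would be: acc + a.x*b.x + a.y*b.y
    return std::fma(a.y, b.y, std::fma(a.x, b.x, acc));
}
```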
* ggml : introduce semantic versioning

This commit introduces semantic versioning for the GGML library. The motivation for this is that the current versioning, using build numbers, makes it difficult to track changes and releases for projects that use ggml.

The release steps are the following:

1. Sync the changes from llama.cpp using sync-llama-am.sh and, after the PR has been approved and merged, move to step 2.
2. Run scripts/release.sh and specify the type of release: major, minor, or patch. This script handles incrementing the version (major|minor|patch), creates a new commit with the version change, creates a tag for the version, and prepares for the next development iteration.
3. Inspect the commits/tag and push to master. This triggers the github release workflow, which runs on new tags and publishes a new release on github.

Example usage:

```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made
Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0
Step 2: Updating version in ggml/CMakeLists.txt...
[dry-run] Would update GGML_VERSION_MAJOR to 1
[dry-run] Would update GGML_VERSION_MINOR to 0
[dry-run] Would update GGML_VERSION_PATCH to 0
[dry-run] Would remove -dev suffix
Step 3: Committing version bump...
[dry-run] Would commit: 'ggml : bump version to 1.0.0'
Step 4: Creating git tag...
[dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'
Step 5: Preparing for next development cycle...
[dry-run] Would update GGML_VERSION_MINOR to 1
[dry-run] Would add -dev suffix back
Step 6: Committing development version...
[dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'
[dry-run] Summary (no changes were made):
• Would have released version: 1.0.0
• Would have created tag: v1.0.0
• Would have set next development version: 1.1.0-dev
```

Refs: ggml-org/ggml#1333
* ggml: create branch for release candidate and check master
* ggml : sign the git tag
…#16059)
* vulkan: optimize UMA buffer operations and fix driver hangs

The previous implementation was blocking the GPU for extended periods, causing the i915 driver to reset the context due to the hangcheck protection:

[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang

* vulkan: implement deferred_memset on UMA

---------
Signed-off-by: Giuseppe Scrivano <[email protected]>
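A hedged sketch of the deferred-memset idea only; names and structure are illustrative, not the ggml-vulkan API. On UMA the buffer is host-visible, so a fill can be recorded and applied by the CPU before the buffer's next use instead of submitting a blocking GPU fill:

```cpp
// Sketch: queue fill requests instead of submitting GPU work immediately,
// then apply them host-side (valid on UMA where memory is host-visible).
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct deferred_fill { void * dst; uint8_t value; size_t size; };

struct uma_buffer {
    std::vector<deferred_fill> pending;

    void memset_deferred(void * dst, uint8_t value, size_t size) {
        pending.push_back({dst, value, size});   // no GPU submission here
    }

    void flush_pending() {                       // called before the next use
        for (const auto & f : pending) {
            std::memset(f.dst, f.value, f.size);
        }
        pending.clear();
    }
};
```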
* ci : migrate ggml ci to self-hosted runners
* ci : add T4 runner
* ci : add instructions for adding self-hosted runners
* ci : disable test-backend-ops on debug builds due to slowness
* ci : add AMD V710 runner (vulkan)
* cont : add ROCm workflow
* ci : switch to qwen3 0.6b model
* cont : fix the context size
* vulkan: fix matrix multiplication index calculation for odd m/n and odd k in combination with batching
* add odd m/n + odd k test with batching
* ci : adjust params for less runtime
* ci : gate BF16 on some hardware
* ci : move extra tests to Arm runner
This fixes some failures on Turing where "round to zero" rounds to the max f16 value but the CPU reference value is infinite.
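A minimal sketch of the mismatch, assuming a C++23 compiler with `<stdfloat>`: a value far above the largest finite f16 (65504) converts to +inf under round-to-nearest, while a hardware round-toward-zero conversion (as on Turing) clamps to 65504 instead, so the GPU result and the CPU reference disagree.

```cpp
// Sketch: out-of-range float -> f16 conversion under the default
// round-to-nearest mode yields +inf; a "round to zero" conversion
// would instead clamp to 65504, the max finite f16 value.
#include <cmath>
#include <cstdio>
#include <stdfloat>

int main() {
    float x = 1.0e5f;                                   // well above the f16 range
    std::float16_t h = static_cast<std::float16_t>(x);  // round-to-nearest: +inf
    std::printf("isinf: %d\n", (int) std::isinf(static_cast<float>(h)));
}
```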
* ci : switch from gemma to qwen3 0.6b
* ci : use smaller model for some tests
* contrib : update roles
* contrib : merge PR sections + add link to CI instructions

Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
…ggml-org#16124)
* claim responsibility for ci, gguf-py and convert
* add myself to various src/llama- files
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader (s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add a check against the push constants size limit and reuse conv2d_mm.comp for the conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicitly check the push constants limit in supports_op() for the conv_transpose_2d operation
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader
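A hedged sketch of the supports_op() idea: reject the op when the dispatch parameters would not fit in the device's reported push-constant budget. The names here are illustrative, not the actual ggml-vulkan symbols:

```cpp
// Sketch: Vulkan guarantees only 128 bytes of push constants, so shaders
// with larger parameter blocks must be gated on the device limit
// (VkPhysicalDeviceLimits::maxPushConstantsSize).
#include <cstdint>

struct push_constants { uint32_t data[32]; };   // per-dispatch parameters (128 bytes)

static bool supports_conv_transpose_2d(uint32_t max_push_constant_size) {
    return sizeof(push_constants) <= max_push_constant_size;
}
```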
* ggml : add ggml_op_is_empty
* ggml : move to ggml-impl.h
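A hedged sketch of what such a predicate can look like; the real op list lives in ggml-impl.h and may differ from the set assumed here. The idea is that an "empty" op produces no data of its own, e.g. a no-op node or a pure view over another tensor:

```cpp
// Sketch with an illustrative op enum, not ggml's real one.
enum ggml_op_example { OP_NONE, OP_VIEW, OP_RESHAPE, OP_PERMUTE, OP_TRANSPOSE, OP_MUL_MAT /* ... */ };

static bool op_is_empty_example(ggml_op_example op) {
    switch (op) {
        case OP_NONE:
        case OP_VIEW:
        case OP_RESHAPE:
        case OP_PERMUTE:
        case OP_TRANSPOSE:
            return true;   // metadata-only: no computation writes new data
        default:
            return false;
    }
}
```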
…rg#16123)
* ggml : extend ggml_can_fuse to work with non-sequential nodes in the graph
* cont : fix wrong bounds check condition
* cont : remove unnecessary overload
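A hedged sketch of the non-sequential fusion check; the types and function names are illustrative, not the actual ggml_can_fuse signature. The point is that candidate nodes no longer need to be adjacent in the graph: it suffices that every intermediate result is consumed exactly once, by a member of the fused group:

```cpp
#include <algorithm>
#include <vector>

struct node { std::vector<int> srcs; };   // indices of this node's inputs

// All nodes in 'group' except the last must have exactly one consumer,
// and that consumer must itself belong to the group.
static bool can_fuse_group(const std::vector<node> & graph, const std::vector<int> & group) {
    for (size_t i = 0; i + 1 < group.size(); ++i) {
        int uses = 0;
        bool internal = true;
        for (int j = 0; j < (int) graph.size(); ++j) {
            for (int s : graph[j].srcs) {
                if (s != group[i]) continue;
                uses++;
                internal = internal && std::find(group.begin(), group.end(), j) != group.end();
            }
        }
        if (uses != 1 || !internal) {
            return false;   // an intermediate result escapes the group
        }
    }
    return true;
}
```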
These two local variables, 'arg' and 'arg_prefix', were shadowed by:

1. for (const auto & arg : opt.args)
2. for (int i = 1; i < argc; i++) {
       const std::string arg_prefix = "--";
       std::string arg = argv[i];
* common : use the json parser

Signed-off-by: Adrien Gallouët <[email protected]>

* common : enable --offline mode without CURL support

This change refactors the download logic to properly support offline mode even when the project is built without CURL. Without this commit, using `--offline` would give the following error:

    error: built without CURL, cannot download model from the internet

even if all the files are already cached.

Signed-off-by: Adrien Gallouët <[email protected]>

---------
Signed-off-by: Adrien Gallouët <[email protected]>
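A hedged sketch of the control-flow change, not common's real API; the function name is illustrative and the build macro is assumed. The key ordering is that the cache is consulted before any "built without CURL" error can fire:

```cpp
// Sketch: a cache hit satisfies the request outright, --offline fails with
// a cache-specific error, and only then does the no-CURL build path error.
#include <filesystem>
#include <stdexcept>
#include <string>

std::string resolve_model(const std::string & cached_path, bool offline) {
    if (std::filesystem::exists(cached_path)) {
        return cached_path;            // cache hit: no network needed at all
    }
    if (offline) {
        throw std::runtime_error("offline mode: model not found in cache");
    }
#ifndef LLAMA_USE_CURL                 // macro name assumed for illustration
    throw std::runtime_error("built without CURL, cannot download model");
#else
    // ... download path elided ...
    return cached_path;
#endif
}
```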
…gml-org#16157)
* Switched web UI to hash-based routing
* Added hash to missed goto function call
* Removed outdated SPA handling code
* Fixed broken sidebar home link
…16255)
* webui: allow viewing conversations and sending messages even if llama-server is down
  - Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail, so the chat UI continues to render while the backend is unavailable.
  - Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
  - Kept the original error-splash behavior when no cached props exist, so fresh installs still surface a clear failure state instead of rendering stale data.
* feat: Add UI for `props` endpoint unavailable + cleanup logic
* webui: extend cached props fallback to offline errors
  - Treat connection failures (refused, DNS, timeout, fetch) the same way as server 5xx, so the warning banner shows up when a cache is available instead of falling back to a full error screen.
* webui: left the chat form enabled when a server warning is present, so operators can keep sending messages (e.g. to restart the backend over llama-swap) even while cached /props data is in use
* chore: update webui build output

---------
Co-authored-by: Pascal <[email protected]>
* feat: Enhance text file detection logic
* chore: Build static `webui` output
* chore: update webui build output
* devops: move s390x and ppc64le ci builds; we have access to ubuntu-24.04-s390x and ppc64le images now
* devops: disable ppc64le for now since it has compiler errors
* devops: stop treating warnings as errors
* devops: switch to non-macro flag
* devops: go the llama macro route
* devops: add big-endian gguf test models
* devops: disable ppc64le to test s390x, check test build
* devops: duplicate .gguf.inp files for big-endian tests
* devops: duplicate .gguf.out files for big-endian too
* devops: add python setup and endian byteswap
* devops: poor thing does not have s390x python3
* devops: add missing rust compiler for s390x
* devops: try rust actions runner
* Revert "devops: try rust actions runner" (reverts commit 3f8db04)
* devops: try a different path for rust
* devops: dump home directory and user info
* devops: install gguf-py only
* devops: missed relative path
* devops: remove big-endian files since local swapping is working
* devops: revert test-tokenizer-0 cmakelists
* Fix unicode flags conversion from and to uint16_t; bitfields are allocated in a different order on s390x (see the sketch below)
* Simplify byteswap command
* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs
* Fix endianness detection in vocab loader
* Disable test-thread-safety on s390x: in this test a model is downloaded, then immediately loaded to check whether more downloads are needed, and then used for the test. There is no clean way to separate all those steps to add byteswapping between them, so just skip this test.
* Fix q8_0 test in test-quantize-fns: vec_signed uses an unexpected rounding mode, so explicitly use a different rounding function.
* devops: add big-endian stories260K
* devops: add s390x test-eval-callback
* devops: fix test that does not exist
* devops: fix model not found for llama-eval-callback
* Fix q3_K dot product error in test-quantize-fns on s390x: array q8bytes had only 4 elements allocated but 8 elements accessed. This led to out-of-bounds writes, later out-of-bounds reads of the overwritten values, and an incorrect result.
* devops: re-enable ppc64le for testing
* devops: activate test-thread-safety for s390x
* devops: disable ppc64le tests; for some reason it keeps failing test-thread-safety and I do not have a machine able to replicate the tests
* devops: LLAMA_FATAL_WARNINGS=ON
* Correct repository URL for the s390x test-thread-safety model
* Fix fs_get_cache_directory: ensure it works even if both XDG_CACHE_HOME and HOME are unset, which might happen in containers
* Re-enable CI for ppc64le
* Fortify ggml_rope_impl: only memcpy data from the sections argument if it is non-NULL
* Add TODO in struct unicode_cpt_flags to reimplement it in an endian-independent way
* Update URL for big-endian model
* Update .github/workflows/build.yml (Co-authored-by: Sigbjørn Skjæret <[email protected]>)
* Update remaining mentions of BE models to the ggml-org/models repo

---------
Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Aleksei Nikiforov <[email protected]>
Co-authored-by: Aleksei Nikiforov <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
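A minimal sketch of the bitfield pitfall behind the uint16_t fix: the in-memory order of bitfields is implementation-defined, and big-endian ABIs such as s390x typically allocate them starting from the most significant bit, so reinterpreting the struct as a uint16_t gives different values on x86 and s390x. The field names here are illustrative, not the real unicode_cpt_flags layout:

```cpp
// Sketch: the same bitfield assignment round-trips to different raw
// uint16_t values depending on the ABI's bitfield allocation order.
#include <cstdint>
#include <cstdio>
#include <cstring>

struct flags {
    uint16_t is_letter : 1;
    uint16_t is_digit  : 1;
    uint16_t unused    : 14;
};

int main() {
    flags f{};
    f.is_letter = 1;
    uint16_t raw;
    std::memcpy(&raw, &f, sizeof(raw));
    // Typically prints 0x0001 on little-endian x86 but 0x8000 on s390x,
    // which is why converting via uint16_t needed an explicit fix.
    std::printf("raw = 0x%04x\n", raw);
}
```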
Labels: Apple Metal, devops, documentation, examples, ggml, IBM zDNN, Nvidia GPU, OpenCL, python, script, server, SYCL, testing, Vulkan