forked from ggml-org/llama.cpp
Rebase temp-load-from-buffer and merge into master #7
Merged
olek-tether merged 24 commits into tetherto:master from jesusmb1995:temp-load-from-buffer-rebased-QVAC4552 on Aug 28, 2025
Conversation
Convert llama_file to a pure virtual class that can be overridden by multiple implementations (disk, single memory buffer, ...).
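A minimal sketch of what such an interface could look like (the name and member set here are assumptions for illustration, not the PR's actual declarations):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical pure virtual file interface; concrete subclasses read from
// disk, a single memory buffer, etc.
struct llama_file_interface {
    virtual ~llama_file_interface() = default;

    virtual size_t size() const = 0;                      // total file size
    virtual size_t tell() const = 0;                      // current read offset
    virtual void   seek(size_t offset) = 0;               // absolute seek
    virtual void   read_raw(void * dst, size_t len) = 0;  // copy len bytes to dst
};
```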
Define a new macro LLAMA_LOG_CMAKE_DEBUG that expands to a no-op in release builds. This allows good tracing and debugging capabilities that will be especially useful for the async loading of multiple model shards.
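A common way to wire up such a macro; only the macro name comes from the PR, the guard symbol and the forwarding to LLAMA_LOG_DEBUG are assumptions:

```cpp
// Assumed expansion: forwards to the regular debug logger when a debug build
// is configured via CMake, and compiles away entirely otherwise.
#ifdef LLAMA_DEBUG
    #define LLAMA_LOG_CMAKE_DEBUG(...) LLAMA_LOG_DEBUG(__VA_ARGS__)
#else
    #define LLAMA_LOG_CMAKE_DEBUG(...) ((void)0) // no-op in release builds
#endif
```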
This change adds an additional automated test that loads from disk, to ensure the existing functionality does not break.
The gguf-split utility now generates a `.txt` listing all tensors. Useful both for manual inspection/debugging and for incremental tensor loading, where it is not possible to know which tensors are present in other split files (this information is critical for handling optional tensors).
Add a flag to the tool to ensure certain tensor names are always followed by another tensor rather than sitting at the end of a shard. This ensures the shard will not be released while the tensor is being processed, and avoids missing-file failures for duplicate tensors that are re-referenced a few tensors later (typically token_embd.weight / output).
Show which shards each tensor belongs to.
- Ensures a char_traits implementation for uint8_t exists that can be used with std::basic_streambuf.
- Adds an implementation of std::basic_streambuf over a single vector. Will be used by llama.cpp and tests when loading from a single memory buffer.
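A self-contained sketch of those two pieces (class names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <streambuf>
#include <vector>

// Minimal char_traits for uint8_t so it can parameterize std::basic_streambuf.
struct uint8_char_traits {
    using char_type  = uint8_t;
    using int_type   = int;
    using off_type   = std::streamoff;
    using pos_type   = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type & a, const char_type & b) { a = b; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a <  b; }
    static int  compare(const char_type * a, const char_type * b, size_t n) {
        return std::memcmp(a, b, n);
    }
    static size_t length(const char_type * s) {
        size_t n = 0; while (s[n]) ++n; return n;
    }
    static const char_type * find(const char_type * s, size_t n, const char_type & c) {
        return static_cast<const char_type *>(std::memchr(s, c, n));
    }
    static char_type * move(char_type * d, const char_type * s, size_t n) {
        return static_cast<char_type *>(std::memmove(d, s, n));
    }
    static char_type * copy(char_type * d, const char_type * s, size_t n) {
        return static_cast<char_type *>(std::memcpy(d, s, n));
    }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static int_type  to_int_type(char_type c) { return c; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return -1; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};

// A read-only streambuf exposing a single vector as its controlled sequence.
class vector_streambuf : public std::basic_streambuf<uint8_t, uint8_char_traits> {
public:
    explicit vector_streambuf(std::vector<uint8_t> & buf) {
        uint8_t * base = buf.data();
        setg(base, base, base + buf.size()); // begin, current, end
    }
};
```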
Override the pure virtual interface with a class that can operate on a single memory buffer.
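Continuing the interface sketch above, a buffer-backed implementation might look roughly like this (again, illustrative names only):

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Illustrative buffer-backed file: reads advance a cursor over owned memory
// instead of touching the filesystem.
struct llama_file_from_buffer /* : llama_file_interface */ {
    std::vector<uint8_t> data;
    size_t pos = 0;

    size_t size() const { return data.size(); }
    size_t tell() const { return pos; }
    void   seek(size_t offset) {
        if (offset > data.size()) throw std::runtime_error("seek past end of buffer");
        pos = offset;
    }
    void   read_raw(void * dst, size_t len) {
        if (pos + len > data.size()) throw std::runtime_error("read past end of buffer");
        std::memcpy(dst, data.data() + pos, len);
        pos += len;
    }
};
```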
Auxiliary function to convert a list of C strings to a vector of C++ strings.
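For example (hypothetical name and signature):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Copies each C string into an owning std::string; assumes arr holds count
// valid entries.
static std::vector<std::string> to_string_vector(const char ** arr, size_t count) {
    std::vector<std::string> out;
    out.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        out.emplace_back(arr[i]);
    }
    return out;
}
```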
Add new GGUF reader implementation that can read metadata from a memory buffer.
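GGUF files begin with the 4-byte magic "GGUF" followed by a little-endian uint32 version, so a buffer-based reader starts by validating that header. A minimal sketch (the helper name is hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Checks the fixed GGUF header at the start of an in-memory buffer.
static bool gguf_buffer_has_valid_header(const uint8_t * data, size_t size) {
    if (size < 8) {
        return false; // too small to hold magic + version
    }
    if (std::memcmp(data, "GGUF", 4) != 0) {
        return false; // wrong magic
    }
    uint32_t version = 0;
    std::memcpy(&version, data + 4, sizeof(version)); // little-endian field
    return version != 0;
}
```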
- Add code to load a GGUF file from a variant source (memory or disk).
- Add structs that simplify loading a file and keeping track of the pointers (which now live in the same struct).
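A sketch of what that variant-based dispatch can look like (the type alias and function name are assumptions):

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Either a path on disk or the file's bytes already in memory.
using llama_file_source = std::variant<std::string, std::vector<uint8_t>>;

void open_source(const llama_file_source & src) {
    std::visit([](const auto & s) {
        using T = std::decay_t<decltype(s)>;
        if constexpr (std::is_same_v<T, std::string>) {
            // construct the disk-backed file implementation from the path
        } else {
            // wrap the bytes with the buffer-backed implementation
        }
    }, src);
}
```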
Move the loader code that processes an already-in-memory file and populates the loader's own attributes into a reusable method.
Add a new C++ function to the main llama header to load from a single memory buffer, and propagate the changes to internal calls/constructors.
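The shape of such an entry point might be along these lines (purely illustrative; the PR's actual declaration may differ):

```cpp
#include <cstddef>
#include <cstdint>

struct llama_model;        // opaque handle, as in llama.h
struct llama_model_params; // as in llama.h

// Hypothetical counterpart to llama_model_load_from_file that consumes an
// in-memory GGUF image instead of a path.
llama_model * llama_model_load_from_buffer(const uint8_t * data, size_t size,
                                           const llama_model_params & params);
```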
A file buffer that can be fulfilled using string keys. The extract method waits until the file is provided.
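A condensed sketch of that synchronization pattern (class and method names are illustrative):

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Producers call provide() with a file's contents under a string key;
// extract() blocks until the requested key has been fulfilled.
class keyed_file_buffer {
public:
    void provide(const std::string & key, std::vector<uint8_t> data) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            files[key] = std::move(data);
        }
        cv.notify_all();
    }

    std::vector<uint8_t> extract(const std::string & key) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [&] { return files.count(key) != 0; });
        std::vector<uint8_t> out = std::move(files[key]);
        files.erase(key);
        return out;
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    std::map<std::string, std::vector<uint8_t>> files;
};
```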
Handles the logic for incrementally loading files and tensors in model shards.
Refactor backend buffer creation (for model loading) into functions.
- The function now takes size_data as a parameter instead of reading the member attribute.
- Adds sanity checks of the file pointer handles.
These two changes will be useful when calling `load_all_data` multiple times during incremental shard loading.
Adapt the loader and model load to incrementally load files and upload tensors.
Add functions to the llama.cpp public headers to asynchronously load shards.
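One possible shape for such an API, shown only to illustrate the flow (every name below is an assumption, not the PR's actual interface):

```cpp
#include <cstddef>
#include <cstdint>

struct llama_model;        // opaque handle, as in llama.h
struct llama_model_params; // as in llama.h
struct llama_async_load;   // opaque loader state (hypothetical)

// Announce how many splits to expect, push each split's bytes as it becomes
// available, then block until every tensor has been uploaded.
llama_async_load * llama_async_load_begin(size_t n_splits,
                                          const llama_model_params & params);
bool               llama_async_load_push (llama_async_load * st,
                                          const char * split_name,
                                          const uint8_t * data, size_t size);
llama_model *      llama_async_load_end  (llama_async_load * st);
```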
Split out some common loading functionality. This will help with the memory loading tests.
Add a submodule with re-usable code for tests.
Adapt embedding example to showcase how to load from memory. Can be configured through environment variables.
Adapt simple example to showcase how to load from memory. Can be configured with environment variables. Qwen3, for example, can be used with the simple example.
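For instance, an example binary can switch to the memory path roughly like this (the environment variable name is an assumption):

```cpp
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads the whole model file into memory when the (assumed) env var is set,
// so the example exercises the load-from-memory path.
static std::vector<uint8_t> maybe_read_model_into_memory(const std::string & path) {
    std::vector<uint8_t> buf;
    const char * env = std::getenv("LLAMA_EXAMPLE_LOAD_FROM_MEMORY");
    if (env != nullptr && std::string(env) == "1") {
        std::ifstream f(path, std::ios::binary);
        buf.assign(std::istreambuf_iterator<char>(f),
                   std::istreambuf_iterator<char>());
    }
    return buf; // empty => caller falls back to loading from disk
}
```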
Add some automated tests that load from memory (single buffer or multiple async splits).
bbd1b71 to b6d441b
jpgaribotti approved these changes Aug 28, 2025
yuranich approved these changes Aug 28, 2025
olek-tether approved these changes Aug 28, 2025
Builds the changes from #1 on top of the Tether synced fork, after #4 was merged to sync with upstream and add the Tether fork changes.
git diff can be used to check that the new rebased branch includes the same changes as temp-load-from-buffer; the changes shown should be those from the rebase:
The first commit of this PR is based on: