forked from ggml-org/llama.cpp
Rebase temp-load-from-buffer and merge into master #7
Merged
olek-tether merged 24 commits into tetherto:master from jesusmb1995:temp-load-from-buffer-rebased-QVAC4552 on Aug 28, 2025
Conversation
Convert llama_file to a pure virtual class that can be overridden by multiple implementations (disk, single memory buffer, ...).
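A minimal sketch of what such an interface could look like (the name and member set here are assumptions for illustration, not the PR's actual declarations):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical pure virtual file interface; concrete subclasses read from
// disk, a single memory buffer, etc.
struct llama_file_interface {
    virtual ~llama_file_interface() = default;

    virtual size_t size() const = 0;                      // total file size
    virtual size_t tell() const = 0;                      // current read offset
    virtual void   seek(size_t offset) = 0;               // absolute seek
    virtual void   read_raw(void * dst, size_t len) = 0;  // copy len bytes to dst
};
```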
Define a new macro LLAMA_LOG_CMAKE_DEBUG that expands to a no-op in release builds. This allows good tracing and debugging capabilities that will be especially useful for the async loading of multiple model shards.
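A common way to wire up such a macro; only the macro name comes from the PR, the guard symbol and the forwarding to LLAMA_LOG_DEBUG are assumptions:

```cpp
// Assumed expansion: forwards to the regular debug logger when a debug build
// is configured via CMake, and compiles away entirely otherwise.
#ifdef LLAMA_DEBUG
    #define LLAMA_LOG_CMAKE_DEBUG(...) LLAMA_LOG_DEBUG(__VA_ARGS__)
#else
    #define LLAMA_LOG_CMAKE_DEBUG(...) ((void)0) // no-op in release builds
#endif
```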
This change adds an additional automated test that loads from disk, to ensure the existing functionality does not break.
The gguf-split utility now generates a `.txt` listing all tensors. Useful both for manual inspection/debugging and for incremental tensor loading, where it is not possible to know which tensors are present in other split files (this information is critical for handling optional tensors).
Add a flag to the tool to ensure certain tensor names are always followed by another tensor rather than sitting at the end of a shard. This ensures the shard will not be released while the tensor is being processed, and avoids missing-file failures for duplicate tensors that are re-referenced a few tensors later (typically token_embd.weight / output).
Show which shards each tensor belongs to.
- Ensures a char_traits implementation for uint8_t exists that can be used with std::basic_streambuf.
- Adds an implementation of std::basic_streambuf over a single vector. Will be used by llama.cpp and tests when loading from a single memory buffer.
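A self-contained sketch of those two pieces (class names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <ios>
#include <streambuf>
#include <vector>

// Minimal char_traits for uint8_t so it can parameterize std::basic_streambuf.
struct uint8_char_traits {
    using char_type  = uint8_t;
    using int_type   = int;
    using off_type   = std::streamoff;
    using pos_type   = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type & a, const char_type & b) { a = b; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a <  b; }
    static int  compare(const char_type * a, const char_type * b, size_t n) {
        return std::memcmp(a, b, n);
    }
    static size_t length(const char_type * s) {
        size_t n = 0; while (s[n]) ++n; return n;
    }
    static const char_type * find(const char_type * s, size_t n, const char_type & c) {
        return static_cast<const char_type *>(std::memchr(s, c, n));
    }
    static char_type * move(char_type * d, const char_type * s, size_t n) {
        return static_cast<char_type *>(std::memmove(d, s, n));
    }
    static char_type * copy(char_type * d, const char_type * s, size_t n) {
        return static_cast<char_type *>(std::memcpy(d, s, n));
    }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static int_type  to_int_type(char_type c) { return c; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return -1; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};

// A read-only streambuf exposing a single vector as its controlled sequence.
class vector_streambuf : public std::basic_streambuf<uint8_t, uint8_char_traits> {
public:
    explicit vector_streambuf(std::vector<uint8_t> & buf) {
        uint8_t * base = buf.data();
        setg(base, base, base + buf.size()); // begin, current, end
    }
};
```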
Override the pure virtual interface with a class that can operate on a single memory buffer.
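Continuing the interface sketch above, a buffer-backed implementation might look roughly like this (again, illustrative names only):

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Illustrative buffer-backed file: reads advance a cursor over owned memory
// instead of touching the filesystem.
struct llama_file_from_buffer /* : llama_file_interface */ {
    std::vector<uint8_t> data;
    size_t pos = 0;

    size_t size() const { return data.size(); }
    size_t tell() const { return pos; }
    void   seek(size_t offset) {
        if (offset > data.size()) throw std::runtime_error("seek past end of buffer");
        pos = offset;
    }
    void   read_raw(void * dst, size_t len) {
        if (pos + len > data.size()) throw std::runtime_error("read past end of buffer");
        std::memcpy(dst, data.data() + pos, len);
        pos += len;
    }
};
```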
Auxiliary function to convert a list of C strings to a vector of C++ strings.
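For example (hypothetical name and signature):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Copies each C string into an owning std::string; assumes arr holds count
// valid entries.
static std::vector<std::string> to_string_vector(const char ** arr, size_t count) {
    std::vector<std::string> out;
    out.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        out.emplace_back(arr[i]);
    }
    return out;
}
```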
Add new GGUF reader implementation that can read metadata from a memory buffer.
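GGUF files begin with the 4-byte magic "GGUF" followed by a little-endian uint32 version, so a buffer-based reader starts by validating that header. A minimal sketch (the helper name is hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Checks the fixed GGUF header at the start of an in-memory buffer.
static bool gguf_buffer_has_valid_header(const uint8_t * data, size_t size) {
    if (size < 8) {
        return false; // too small to hold magic + version
    }
    if (std::memcmp(data, "GGUF", 4) != 0) {
        return false; // wrong magic
    }
    uint32_t version = 0;
    std::memcpy(&version, data + 4, sizeof(version)); // little-endian field
    return version != 0;
}
```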
- Add code to load a GGUF file from a variant source (memory or disk).
- Add structs that simplify loading a file and keeping track of the pointers (which now live in the same struct).
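A sketch of what that variant-based dispatch can look like (the type alias and function name are assumptions):

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Either a path on disk or the file's bytes already in memory.
using llama_file_source = std::variant<std::string, std::vector<uint8_t>>;

void open_source(const llama_file_source & src) {
    std::visit([](const auto & s) {
        using T = std::decay_t<decltype(s)>;
        if constexpr (std::is_same_v<T, std::string>) {
            // construct the disk-backed file implementation from the path
        } else {
            // wrap the bytes with the buffer-backed implementation
        }
    }, src);
}
```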
Move the loader code that processes an already-in-memory file and populates the loader's own attributes into a reusable method.
Add a new C++ function to the main llama header to load from a single memory buffer, and propagate the changes to internal calls/constructors.
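The shape of such an entry point might be along these lines (purely illustrative; the PR's actual declaration may differ):

```cpp
#include <cstddef>
#include <cstdint>

struct llama_model;        // opaque handle, as in llama.h
struct llama_model_params; // as in llama.h

// Hypothetical counterpart to llama_model_load_from_file that consumes an
// in-memory GGUF image instead of a path.
llama_model * llama_model_load_from_buffer(const uint8_t * data, size_t size,
                                           const llama_model_params & params);
```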
A file buffer that can be fulfilled using string keys. The extract method waits until the file is provided.
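A condensed sketch of that synchronization pattern (class and method names are illustrative):

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Producers call provide() with a file's contents under a string key;
// extract() blocks until the requested key has been fulfilled.
class keyed_file_buffer {
public:
    void provide(const std::string & key, std::vector<uint8_t> data) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            files[key] = std::move(data);
        }
        cv.notify_all();
    }

    std::vector<uint8_t> extract(const std::string & key) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [&] { return files.count(key) != 0; });
        std::vector<uint8_t> out = std::move(files[key]);
        files.erase(key);
        return out;
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    std::map<std::string, std::vector<uint8_t>> files;
};
```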
Handles the logic for incrementally loading files and tensors in model shards.
Refactor backend buffer creation (for model loading) into functions.
- The function now takes size_data as a parameter instead of reading the member attribute.
- Adds sanity checks of the file pointer handles.
These two changes will be useful when calling `load_all_data` multiple times during incremental shard loading.
Adapt the loader and model load to incrementally load files and upload tensors.
Add functions to the llama.cpp public headers to asynchronously load shards.
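One possible shape for such an API, shown only to illustrate the flow (every name below is an assumption, not the PR's actual interface):

```cpp
#include <cstddef>
#include <cstdint>

struct llama_model;        // opaque handle, as in llama.h
struct llama_model_params; // as in llama.h
struct llama_async_load;   // opaque loader state (hypothetical)

// Announce how many splits to expect, push each split's bytes as it becomes
// available, then block until every tensor has been uploaded.
llama_async_load * llama_async_load_begin(size_t n_splits,
                                          const llama_model_params & params);
bool               llama_async_load_push (llama_async_load * st,
                                          const char * split_name,
                                          const uint8_t * data, size_t size);
llama_model *      llama_async_load_end  (llama_async_load * st);
```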
Split out some common loading functionality. This will help with the memory loading tests.
Add a submodule with re-usable code for tests.
Adapt embedding example to showcase how to load from memory. Can be configured through environment variables.
Adapt simple example to showcase how to load from memory. Can be configured with environment variables. Qwen3, for example, can be used with the simple example.
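For instance, an example binary can switch to the memory path roughly like this (the environment variable name is an assumption):

```cpp
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads the whole model file into memory when the (assumed) env var is set,
// so the example exercises the load-from-memory path.
static std::vector<uint8_t> maybe_read_model_into_memory(const std::string & path) {
    std::vector<uint8_t> buf;
    const char * env = std::getenv("LLAMA_EXAMPLE_LOAD_FROM_MEMORY");
    if (env != nullptr && std::string(env) == "1") {
        std::ifstream f(path, std::ios::binary);
        buf.assign(std::istreambuf_iterator<char>(f),
                   std::istreambuf_iterator<char>());
    }
    return buf; // empty => caller falls back to loading from disk
}
```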
Add some automated tests that load from memory (single buffer or multiple async splits).
bbd1b71 to b6d441b
jpgaribotti approved these changes Aug 28, 2025
yuranich approved these changes Aug 28, 2025
olek-tether approved these changes Aug 28, 2025
Builds the changes from #1 on top of the Tether synced fork, after #4 was merged to sync with upstream and add the Tether fork changes.
git diff can be used to check that the new rebased branch includes the same changes as temp-load-from-buffer; the changes shown should be those from the rebase:
The first commit of this PR is based on: