Conversation

@l3utterfly
Contributor

It seems the size parameter is currently ignored when getting/setting tensors. This crashes when attempting to save/load state data, because only the filled KV cells are saved/loaded in llama_kv_cache::state_read_data and llama_kv_cache::state_write_data. The cache save/load therefore needs to read partial tensors, which fails the asserts in ggml_backend_hexagon_buffer_get_tensor and ggml_backend_hexagon_buffer_set_tensor.

This PR updates the get/set tensor paths to read and repack partial rows based on the passed-in size. Tested: KV caches save and load successfully on an S25+ Ultra.
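
To illustrate the idea, here is a minimal standalone sketch, not the actual ggml-hexagon code (fake_tensor and set_rows_partial are hypothetical names), of honoring the caller-supplied size and copying/repacking only the rows it covers instead of asserting that the whole tensor is requested:

// Minimal sketch (not the actual ggml-hexagon implementation): illustrates
// copying/repacking only the rows covered by the caller-supplied `size`,
// which is what a partial KV-cache save/restore needs.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct fake_tensor {
    size_t row_bytes;            // bytes per (quantized) row in host layout
    size_t n_rows;               // total number of rows
    std::vector<uint8_t> packed; // device-side buffer; here just a flat copy
};

// set_tensor-style helper: copy `size` bytes from `data` into the tensor,
// row by row. Only the rows actually covered by `size` are touched.
static void set_rows_partial(fake_tensor & t, const void * data, size_t size) {
    assert(size <= t.row_bytes * t.n_rows);      // may be smaller than the full tensor
    const size_t full_rows = size / t.row_bytes; // whole rows covered by the request
    const size_t tail      = size % t.row_bytes; // trailing partial row, if any
    const uint8_t * src = static_cast<const uint8_t *>(data);
    for (size_t r = 0; r < full_rows; ++r) {
        // a real backend would repack the row into its device layout here;
        // this sketch just copies the bytes
        std::memcpy(t.packed.data() + r * t.row_bytes, src + r * t.row_bytes, t.row_bytes);
    }
    if (tail) {
        std::memcpy(t.packed.data() + full_rows * t.row_bytes,
                    src + full_rows * t.row_bytes, tail);
    }
}

int main() {
    fake_tensor t { 32, 8, std::vector<uint8_t>(32 * 8, 0) };
    std::vector<uint8_t> host(32 * 8, 0xAB);
    set_rows_partial(t, host.data(), 3 * 32 + 5); // partial request: 3 rows + 5 bytes
    std::printf("copied %zu bytes\n", (size_t)(3 * 32 + 5));
    return 0;
}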

allows partial repacking/copying when get tensor size is smaller than the actual tensor
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Oct 29, 2025
@max-krasnyansky
Collaborator

@l3utterfly sorry for the delayed ack. I want to test it out on my setup but keep getting sidetracked.
I'm a bit surprised that we get set_tensor calls for the quantized tensors during save/restore.
btw, what's the easiest way to trigger/test this with the CLI tools, llama-cli ... --prompt-cache?

@l3utterfly
Contributor Author

@max-krasnyansky yeah, that would be the best way to trigger save/load cache.

I believe the run-cli.sh currently sets kv cache quants to Q8_0, so it reads and sets quantised tensors.
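
For reference, an invocation along these lines should exercise both the save and the load path (the exact flags in run-cli.sh may differ; the model path and prompt here are placeholders):

 ./llama-cli -m model.gguf -p "Hello" -n 32 \
     --prompt-cache session.bin --prompt-cache-all \
     -ctk q8_0 -ctv q8_0
 # run it a second time with the same --prompt-cache file to hit the load path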

@max-krasnyansky
Collaborator

> @max-krasnyansky yeah, that would be the best way to trigger save/load cache.
>
> I believe the run-cli.sh currently sets kv cache quants to Q8_0, so it reads and sets quantised tensors.

Ah, that makes sense. Thanks, will test ASAP and merge. Thank you thank you.

@max-krasnyansky
Collaborator

@l3utterfly
Everything looks good in my tests as well.
Tested --prompt-cache with GPT-OSS-20B (MXFP4) and Q4/Q8 models.

Somehow the commit ends up with a duplicate copy of the repack_mxfp4_mxfp4x4x2 function.
I had to remove it by hand to compile. Perhaps a rebase glitch?
Can you please rebase again? Then we'll merge it.

 ~/src/llama.cpp-hexagon$ git grep repack_mxfp4
convert_hf_to_gguf.py:    def repack_mxfp4(self, new_name: str, blocks: Tensor, scales: Tensor):
convert_hf_to_gguf.py:                self.repack_mxfp4(new_name, blocks0, data_torch)
convert_hf_to_gguf.py:                self.repack_mxfp4(new_name_gate, blocks0, scales0)
convert_hf_to_gguf.py:                self.repack_mxfp4(new_name_up, blocks1, scales1)
ggml/src/ggml-hexagon/ggml-hexagon.cpp:static void repack_mxfp4_mxfp4x4x2(ggml_tensor * t, const void * data, size_t size) { <<<
ggml/src/ggml-hexagon/ggml-hexagon.cpp:static void repack_mxfp4_mxfp4x4x2(ggml_tensor * t, const void * data, size_t size) { <<<
ggml/src/ggml-hexagon/ggml-hexagon.cpp:static void repack_mxfp4x4x2_mxfp4(void * data, const ggml_tensor * t, size_t size) {
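
For what it's worth, one way to redo the rebase and confirm the duplicate definition is gone (remote/branch names here are assumptions about the local setup):

 git fetch origin
 git rebase origin/master
 git grep -n "static void repack_mxfp4_mxfp4x4x2" ggml/src/ggml-hexagon/ggml-hexagon.cpp
 # the definition should now appear exactly once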

@l3utterfly l3utterfly requested a review from lhez as a code owner October 31, 2025 02:49
@l3utterfly
Contributor Author

@max-krasnyansky fixed! Thanks for testing!

@max-krasnyansky max-krasnyansky merged commit 13002a0 into ggml-org:master Oct 31, 2025
72 checks passed