Conversation

@firecoperana (Collaborator)

This fixes the crash when doing a context shift with a quantized KV cache. Before this PR, I could only use fp16 for the KV cache. Tested with both q8_0 and q4_0 and no longer see the crash.

ggml-org/llama.cpp#9571 — cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.
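For context, the upstream change works by adding a CUDA copy kernel that dequantizes q8_0 blocks back to f32 during the KV-cache copy, which is what the context shift needs. Below is a minimal sketch of that idea, assuming the ggml q8_0 layout (one fp16 scale per block of 32 int8 quants); the kernel name and launch shape are illustrative, not the actual ggml-cuda implementation.

```cuda
#include <cuda_fp16.h>

#define QK8_0 32

// Assumed q8_0 block layout: one fp16 scale followed by 32 int8 quants.
struct block_q8_0 {
    half   d;          // block scale
    int8_t qs[QK8_0];  // quantized values
};

// One thread per output element: y[i] = d * qs[i]
__global__ void cpy_q8_0_to_f32(const block_q8_0 *x, float *y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const block_q8_0 *b = &x[i / QK8_0];
    y[i] = __half2float(b->d) * (float) b->qs[i % QK8_0];
}
```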
@ikawrakow merged commit cec8b70 into main on Sep 5, 2025