Conversation

@firecoperana (Collaborator)

This fixes the crash when doing a context shift with a quantized KV cache. Before this PR, I could only use fp16 for the KV cache. Tested with both q8_0 and q4_0 and no longer see the crash.

ggml-org/llama.cpp#9571 — cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.
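For context, the upstream change works by adding a CUDA copy kernel that dequantizes q8_0 blocks back to f32 during the KV-cache copy, which is what the context shift needs. Below is a minimal sketch of that idea, assuming the ggml q8_0 layout (one fp16 scale per block of 32 int8 quants); the kernel name and launch shape are illustrative, not the actual ggml-cuda implementation.

```cuda
#include <cuda_fp16.h>

#define QK8_0 32

// Assumed q8_0 block layout: one fp16 scale followed by 32 int8 quants.
struct block_q8_0 {
    half   d;          // block scale
    int8_t qs[QK8_0];  // quantized values
};

// One thread per output element: y[i] = d * qs[i]
__global__ void cpy_q8_0_to_f32(const block_q8_0 *x, float *y, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const block_q8_0 *b = &x[i / QK8_0];
    y[i] = __half2float(b->d) * (float) b->qs[i % QK8_0];
}
```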
@ikawrakow merged commit cec8b70 into main on Sep 5, 2025