15 changes: 12 additions & 3 deletions ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
@@ -87,6 +87,15 @@ static inline int64_t ggml_ne(const ggml_tensor * tensor, int dim) {
    return tensor->ne[dim];
}

+static inline bool is_whisper_model(const struct ggml_tensor* op) {
+    const int64_t n_dims = op->src[0]->ne[0];
+    if (n_dims == 384 || n_dims == 512 || n_dims == 768 ||
+        n_dims == 1024 || n_dims == 1280) {
+        return true;
+    }
+    return false;
+}

Member

This seems very hacky - what is the reason to add this check?

Collaborator Author
@chaxu01 Aug 28, 2025

This check is added to ensure the KleidiAI backend only handles GET_ROWS operations for Whisper models, not for LLaMA models, because of Whisper's specific usage pattern (GET_ROWS -> MUL_MAT in the decoder, which can be accelerated by the KleidiAI kernels). The dimensions correspond to the decoder embedding sizes of the different Whisper model variants:
- 384: Whisper tiny
- 512: Whisper base
- 768: Whisper small
- 1024: Whisper medium
- 1280: Whisper large

Member

I'm still not sure why we need to do the check. What is a failure case that you observe? For example if we have a non-Whisper LLM that calls ggml_get_rows with n_dims = 256 why would we not want to use this path?

Collaborator Author

The failure case we observed occurs in whisper.cpp after PR #15118 was merged. Specifically, a GGML_ASSERT is triggered when the newly added GET_ROWS support check fails for the KleidiAI backend.

The main reason for enabling GET_ROWS support for Whisper in KleidiAI is that Whisper reuses the same weight tensor for both GET_ROWS and MUL_MAT in the decoder. If GET_ROWS isn't supported, it falls back to CPU — which in turn forces the shared tensor to fall back as well, disabling KleidiAI acceleration for MUL_MAT.
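For reference, a minimal sketch of that sharing pattern, assuming illustrative tensor names and shapes (this is not the actual whisper.cpp code): the decoder's token-embedding weights feed both GET_ROWS (embedding lookup) and MUL_MAT (output logits).

```c
#include "ggml.h"

// Sketch only: names and shapes are assumptions, not whisper.cpp internals.
static void build_decoder_sketch(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,  // only build the graph structure
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_embd  = 384;   // e.g. Whisper tiny decoder width
    const int64_t n_vocab = 51865; // assumed vocabulary size
    const int64_t n_tok   = 8;

    // shared token-embedding weights
    struct ggml_tensor * d_te   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_vocab);
    struct ggml_tensor * tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_tok);
    struct ggml_tensor * cur    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tok);

    // 1) GET_ROWS: embedding lookup at the start of the decoder
    struct ggml_tensor * inp    = ggml_get_rows(ctx, d_te, tokens);

    // 2) MUL_MAT: the same weights produce the logits at the end of the decoder
    struct ggml_tensor * logits = ggml_mul_mat(ctx, d_te, cur);

    (void) inp; (void) logits;
    ggml_free(ctx);
}
```

As described above, if the backend rejects GET_ROWS the shared tensor ends up in a plain host buffer for the CPU fallback, so the MUL_MAT on that same tensor loses the KleidiAI repacked layout as well.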

In contrast, LLaMA and similar models don’t reuse weights across these ops, so enabling GET_ROWS support for them would introduce unnecessary repacking without any performance gain.

The n_dims check is a lightweight way to detect Whisper models based on known decoder embedding sizes (384–1280). Happy to revisit or generalize this approach if other models benefit from the same pattern in the future.

Collaborator Author

@ggerganov, just wanted to follow up on this one.

The main goal here is to ensure Whisper models benefit from KleidiAI acceleration by keeping GET_ROWS + MUL_MAT on the backend, while avoiding unnecessary repacking for LLaMA and similar models. The n_dims check was meant as a lightweight heuristic for Whisper’s known decoder embedding sizes. Happy to adjust the approach if you prefer a different way of handling Whisper-specific GET_ROWS acceleration.

Member

I see, I understand what the root issue is now - the same tensor being used both for GET_ROWS and MUL_MAT. I don't think this is the right solution. We should not make such assumptions about the tensor shapes in the underlying implementation, as it is very fragile.

> In contrast, LLaMA and similar models don’t reuse weights across these ops, so enabling GET_ROWS support for them would introduce unnecessary repacking without any performance gain.

Do you have an estimate of how large the negative impact is in such cases?

In hindsight, the logic for repacking tensors used with GET_ROWS was probably premature. The repacking mechanism was designed with a single op in mind, i.e. if the single op that uses the tensor would benefit from repacking, then we allow repacking.

In contrast, for this use case, the condition to repack now depends also on the next ops in the graph. This is something we overlooked when implementing the GET_ROWS repacking support and probably have to take a step back on it and think of a better way to support these use cases.
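Roughly, the condition would have to look at the tensor's consumers in the graph rather than at a single op. Below is a rough sketch of that kind of check using only the public graph API (hypothetical, not existing code); the practical difficulty is that repacking currently happens at weight load time, before any graph is available to scan.

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical helper: only consider repacking a weight tensor used by
// GET_ROWS if the same tensor also feeds a MUL_MAT somewhere in the graph.
static bool tensor_feeds_mul_mat(struct ggml_cgraph * gf, const struct ggml_tensor * w) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        const struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op != GGML_OP_MUL_MAT) {
            continue;
        }
        for (int j = 0; j < GGML_MAX_SRC; ++j) {
            if (node->src[j] == w) {
                return true; // repacking benefits this graph
            }
        }
    }
    return false; // GET_ROWS-only consumer: repacking has no upside
}
```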

Collaborator Author

I agree that the shape-based check is fragile and not the right direction long term.

I have done some benchmarks. Since the repacking process happens during model load, I wasn’t able to observe any negative impact on an M4 Pro when enabling GET_ROWS for LLaMA and similar models. Even the model load time didn’t show any noticeable change in llama.cpp.

Given that, I think it’s better not to special-case whisper.cpp at all for now. I’ll update the patch accordingly unless you have other thoughts, and I’ll be happy to revisit this once the repack support is updated to better handle cases like Whisper’s shared tensor pattern.

./bin/llama-bench -m ./models/llama-2-7b-q4_0.gguf -ngl 0 -t 4

W/O GET_ROWS
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         85.01 ± 0.52 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         40.08 ± 0.92 |

build: d93dcc27 (6287)

W/ GET_ROWS
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         85.60 ± 0.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         42.49 ± 0.63 |

build: d93dcc27 (6287)

Member

> Given that, I think it’s better not to special-case whisper.cpp at all for now. I’ll update the patch accordingly unless you have other thoughts, and I’ll be happy to revisit this once the repack support is updated to better handle cases like Whisper’s shared tensor pattern.

Ok, let's avoid the special cases for now since the impact does not seem to be significant. If we don't find cases in which the impact is measurable, then I think it would be OK to keep the GET_ROWS repacking as it is.

template<typename Ret, typename Variant, typename... Args>
static Ret variant_call(const Variant & var, Args&&... args) {
    return std::visit([&](auto&& func) -> Ret {
@@ -510,12 +519,12 @@ class extra_buffer_type : ggml::cpu::extra_buffer_type {
            op->src[0]->buffer &&
            (ggml_n_dims(op->src[0]) == 2) &&
            op->src[0]->buffer->buft == ggml_backend_cpu_kleidiai_buffer_type() && ctx.kernels) {
-            if (op->op == GGML_OP_GET_ROWS && op->src[1]->ne[0] != 8) {
-                return false;
-            }
            if (op->src[1]->buffer && !ggml_backend_buft_is_host(op->src[1]->buffer->buft)) {
                return false;
            }
+            if (op->op == GGML_OP_GET_ROWS) {
+                return is_whisper_model(op);
+            }
            if ((op->src[1]->type == GGML_TYPE_F32 || op->src[1]->type == GGML_TYPE_I32) &&
                ggml_ne(op->src[1], 2) == 1 && ggml_ne(op->src[1], 3) == 1) {
                return true;