kleidiai: fix GGML_ASSERT(*cur_backend_id != -1) failed #15614
Conversation
```cpp
// Heuristic: treat the op as belonging to a Whisper model if the embedding
// dimension of its first source matches one of the known Whisper sizes
// (tiny/base/small/medium/large).
static inline bool is_whisper_model(const struct ggml_tensor* op) {
    const int64_t n_dims = op->src[0]->ne[0];
    if (n_dims == 384  || n_dims == 512  || n_dims == 768 ||
        n_dims == 1024 || n_dims == 1280) {
        return true;
    }
    return false;
}
```
---
This seems very hacky - what is the reason to add this check?
---
This check is added to ensure the KleidiAI backend only handles GET_ROWS operations for Whisper models, not for LLaMA models, because of Whisper's specific usage pattern (GET_ROWS -> MUL_MAT in the decoder, which can be accelerated by the KleidiAI kernels). Those dimensions are the decoder embedding sizes used across the different Whisper model sizes:
- 384: Whisper tiny
- 512: Whisper base
- 768: Whisper small
- 1024: Whisper medium
- 1280: Whisper large
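For context, here is a minimal sketch of the usage pattern described above, assuming the standard ggml API; the tensor names and the helper function are illustrative and not taken from whisper.cpp:

```cpp
#include "ggml.h"

// Sketch of a tied-embedding decoder: the same weight tensor feeds both
// GET_ROWS (token embedding lookup) and MUL_MAT (output projection).
static struct ggml_tensor * decoder_logits(
        struct ggml_context * ctx,
        struct ggml_tensor  * tok_emb,  // [n_embd, n_vocab] shared weights
        struct ggml_tensor  * tokens,   // [n_tokens] token ids (I32)
        struct ggml_tensor  * hidden) { // [n_embd, n_tokens] decoder states
    // GET_ROWS: look up the embeddings of the input tokens
    struct ggml_tensor * emb = ggml_get_rows(ctx, tok_emb, tokens);
    (void) emb; // ... the transformer layers would run in between ...

    // MUL_MAT: project hidden states to vocabulary logits,
    // reusing the very same tok_emb tensor
    return ggml_mul_mat(ctx, tok_emb, hidden);
}
```

Because both ops read `tok_emb`, the buffer type chosen for that single tensor decides whether either op can use the repacked KleidiAI kernels.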
---
I'm still not sure why we need to do the check. What is a failure case that you observe? For example, if we have a non-Whisper LLM that calls `ggml_get_rows` with `n_dims = 256`, why would we not want to use this path?
---
The failure case we observed occurs in whisper.cpp after PR #15118 was merged. Specifically, a GGML_ASSERT is triggered when the newly added GET_ROWS support check fails for the KleidiAI backend.
The main reason for enabling GET_ROWS support for Whisper in KleidiAI is that Whisper reuses the same weight tensor for both GET_ROWS and MUL_MAT in the decoder. If GET_ROWS isn't supported, it falls back to the CPU, which in turn forces the shared tensor to fall back as well, disabling KleidiAI acceleration for MUL_MAT.
In contrast, LLaMA and similar models don’t reuse weights across these ops, so enabling GET_ROWS support for them would introduce unnecessary repacking without any performance gain.
The n_dims check is a lightweight way to detect Whisper models based on known decoder embedding sizes (384–1280). Happy to revisit or generalize this approach if other models benefit from the same pattern in the future.
---
@ggerganov, just wanted to follow up on this one.
The main goal here is to ensure Whisper models benefit from KleidiAI acceleration by keeping GET_ROWS + MUL_MAT on the backend, while avoiding unnecessary repacking for LLaMA and similar models. The n_dims check was meant as a lightweight heuristic for Whisper’s known decoder embedding sizes. Happy to adjust the approach if you prefer a different way of handling Whisper-specific GET_ROWS acceleration.
---
I see, I understand what the root issue is now: the same tensor being used both for GET_ROWS and MUL_MAT. I don't think this is the right solution. We should not make such assumptions about the tensor shapes in the underlying implementation, as it is very fragile.
> In contrast, LLaMA and similar models don't reuse weights across these ops, so enabling GET_ROWS support for them would introduce unnecessary repacking without any performance gain.
Do you have an estimate of how large the negative impact is in such cases?
In hindsight the logic for repacking tensors used with GET_ROWS was probably premature. The repacking mechanism was designed with a single op in mind. I.e. if the single op that uses the tensor would benefit from repacking then we allow repacking.
In contrast, for this use case, the condition to repack now also depends on the next ops in the graph. This is something we overlooked when implementing the GET_ROWS repacking support, and we probably have to take a step back and think of a better way to support these use cases.
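To make the distinction concrete, here is a hypothetical sketch of what a graph-aware repack decision could look like; the helper itself does not exist in ggml, only the graph accessors and tensor fields used here are real API:

```cpp
#include "ggml.h"

// Hypothetical helper (not in ggml): decide whether repacking a weight
// tensor pays off by looking at every op in the graph that consumes it,
// instead of only the single op seen at repack time.
static bool repack_is_profitable(struct ggml_cgraph * gf,
                                 const struct ggml_tensor * weight) {
    bool any_mul_mat = false;
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        const struct ggml_tensor * node = ggml_graph_node(gf, i);
        for (int s = 0; s < GGML_MAX_SRC; ++s) {
            if (node->src[s] != weight) {
                continue;
            }
            switch (node->op) {
                case GGML_OP_MUL_MAT:
                    any_mul_mat = true; // the repacked kernels help here
                    break;
                case GGML_OP_GET_ROWS:
                    break; // fine only if a repacked GET_ROWS path exists
                default:
                    return false; // a consumer with no repacked path
                                  // would force a fallback
            }
        }
    }
    return any_mul_mat;
}
```

The catch, as noted above, is that repacking currently happens while the model is being loaded, before any compute graph exists, so a check like this cannot simply be dropped in; the mechanism itself would have to change.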
---
I agree that the shape-based check is fragile and not the right direction long term.
I have done some benchmarks. Since the repacking process happens during model load, I wasn't able to observe any negative impact on an M4 Pro when enabling GET_ROWS for LLaMA and similar models. Even the model load time didn't show any noticeable change in llama.cpp.
Given that, I think it’s better not to special-case whisper.cpp at all for now. I’ll update the patch accordingly unless you have other thoughts, and I’ll be happy to revisit this once the repack support is updated to better handle cases like Whisper’s shared tensor pattern.
```
./bin/llama-bench -m ./models/llama-2-7b-q4_0.gguf -ngl 0 -t 4
```

W/O GET_ROWS:

| model                          |       size |     params | backend    | threads |             test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |            pp512 |         85.01 ± 0.52 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |            tg128 |         40.08 ± 0.92 |

build: d93dcc27 (6287)

W/ GET_ROWS:

| model                          |       size |     params | backend    | threads |             test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |            pp512 |         85.60 ± 0.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |            tg128 |         42.49 ± 0.63 |

build: d93dcc27 (6287)
---
> Given that, I think it's better not to special-case whisper.cpp at all for now. I'll update the patch accordingly unless you have other thoughts, and I'll be happy to revisit this once the repack support is updated to better handle cases like Whisper's shared tensor pattern.
Ok, let's avoid the special cases for now since the impact does not seem to be significant. If we don't find cases in which the impact is measurable, then I think it would be OK to keep the GET_ROWS repacking as it is.
---

This patch fixes the assert issue for the latest whisper.cpp when KleidiAI is enabled: `ggml/src/ggml-backend.cpp:1088: GGML_ASSERT(*cur_backend_id != -1) failed`.
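For reference, the failing assert comes from the backend scheduler's op-to-backend assignment. Roughly paraphrased (this is not the literal ggml-backend.cpp code), the failure mode looks like this:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Rough paraphrase of the scheduler's op-to-backend assignment: every
// node must be claimed by some backend, otherwise the id stays -1 and
// the assert fires.
static int pick_backend_for_node(ggml_backend_t * backends, int n_backends,
                                 const struct ggml_tensor * node) {
    int cur_backend_id = -1;
    for (int b = 0; b < n_backends; ++b) {
        if (ggml_backend_supports_op(backends[b], node)) {
            cur_backend_id = b;
            break;
        }
    }
    // In whisper.cpp's case this fired for a GET_ROWS node whose weight
    // lived in a repacked (KleidiAI) buffer that no backend would accept.
    GGML_ASSERT(cur_backend_id != -1);
    return cur_backend_id;
}
```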