Description
Name and Version
$ sources/llama.cpp/build/bin/llama-server --version
version: 5435 (a4090d1)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
Intel(R) Core(TM) Ultra 5 245KF + Radeon RX 7900 XTX, gfx1100 (0x1100)
Models
gemma-3-27b-it-qat-UD-Q4_K_XL.gguf + gemma-3-27b-it-qat-GGUF/mmproj-F16.gguf
Problem description & steps to reproduce
Built with (corrected command):
cmake -S . -B build -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 16
Command used to start:
sources/llama.cpp/build/bin/llama-server --port 9001 -c 65536 -ctv q8_0 -ctk q8_0 --no-warmup -ngl 99 -fa -m models/unsloth/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-UD-Q4_K_XL.gguf --mmproj models/unsloth/gemma-3-27b-it-qat-GGUF/mmproj-F16.gguf --jinja
This is the first time I have used this model. I accessed the API through Open WebUI hosted in a Docker container; it was a normal text-only chat, nothing special. The server crashes every time I try to regenerate the last message (a sketch of the request shape is given below); I can share the chat contents privately. I recompiled with -DCMAKE_BUILD_TYPE=Debug and the crash was still reproducible, so the full debug log is attached.
GPU memory utilization is 22052M out of 24560M; nothing else is using the GPU.
The error looks similar to #12045.
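As far as I can tell, regenerating in Open WebUI simply resends the conversation to the server's OpenAI-compatible endpoint, so a plain chat completion request of roughly this shape should hit the same code path (the message content here is a placeholder; the real conversation, about 4k prompt tokens according to the log, can be shared privately):
curl http://localhost:9001/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "<conversation omitted, ~4k tokens>"}], "stream": false}'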
First Bad Commit
No response
Relevant log output
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4076, n_tokens = 2028, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 4076, n_tokens = 2028
/home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:750: pre-allocated tensor (cache_k_l32 (view) (copy of cache_k_l32 (view))) in a buffer (Vulkan0) that cannot run the operation (CPY)
[New LWP 1580837]
[New LWP 1580836]
[New LWP 1580835]
[New LWP 1580834]
[New LWP 1580833]
[New LWP 1580832]
[New LWP 1580831]
[New LWP 1580830]
[New LWP 1580829]
[New LWP 1580828]
[New LWP 1580827]
[New LWP 1580826]
[New LWP 1580825]
[New LWP 1580824]
[New LWP 1580823]
[New LWP 1580822]
[New LWP 1580821]
[New LWP 1580819]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/liblber.so.2
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlidec.so.1
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlicommon.so.1
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libtinfo.so.6
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_intel.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_intel_hasvk.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_virtio.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_nouveau.so
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007587595107e3 in __GI___wait4 (pid=1580930, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x00007587595107e3 in __GI___wait4 (pid=1580930, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000075875b3c1682 in ggml_print_backtrace () at /home/wizz/sources/llama.cpp/ggml/src/ggml.c:194
194 waitpid(child_pid, NULL, 0);
#2 0x000075875b3c17b5 in ggml_abort (file=0x75875b4410a0 "/home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp", line=750, fmt=0x75875b4414f0 "pre-allocated tensor (%s) in a buffer (%s) that cannot run the operation (%s)") at /home/wizz/sources/llama.cpp/ggml/src/ggml.c:215
215 ggml_print_backtrace();
#3 0x000075875b3da02f in ggml_backend_sched_backend_id_from_cur (sched=0x5fb9db9024d0, tensor=0x5fb9dbcebd00) at /home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:750
750 GGML_ABORT("pre-allocated tensor (%s) in a buffer (%s) that cannot run the operation (%s)", tensor->name, ggml_backend_buffer_name(buffer), ggml_op_name(tensor->op));
#4 0x000075875b3da9c4 in ggml_backend_sched_split_graph (sched=0x5fb9db9024d0, graph=0x5fb9dbad8d00) at /home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:899
899 *node_backend_id = ggml_backend_sched_backend_id_from_cur(sched, node);
#5 0x000075875b3dde02 in ggml_backend_sched_alloc_graph (sched=0x5fb9db9024d0, graph=0x5fb9dbad8d00) at /home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:1565
1565 ggml_backend_sched_split_graph(sched, graph);
#6 0x000075875b9e6308 in llama_kv_cache_unified::update (this=0x5fb9db909230, lctx=...) at /home/wizz/sources/llama.cpp/src/llama-kv-cache.cpp:427
427 ggml_backend_sched_alloc_graph(sched, gf);
#7 0x000075875b9eb5ee in llama_kv_cache_unified_iswa::update (this=0x5fb9db9071e0, lctx=...) at /home/wizz/sources/llama.cpp/src/llama-kv-cache.cpp:1761
1761 res = res & kv_swa ->update(lctx);
#8 0x000075875b976c4d in llama_context::kv_self_update (this=0x5fb9d4dfc2b0) at /home/wizz/sources/llama.cpp/src/llama-context.cpp:457
457 need_reserve = kv_self->update(*this);
#9 0x000075875b97931e in llama_context::decode (this=0x5fb9d4dfc2b0, inp_batch=...) at /home/wizz/sources/llama.cpp/src/llama-context.cpp:926
926 kv_self_update();
#10 0x000075875b97fb11 in llama_decode (ctx=0x5fb9d4dfc2b0, batch=...) at /home/wizz/sources/llama.cpp/src/llama-context.cpp:2545
2545 int ret = ctx->decode(batch);
#11 0x00005fb99b780b7b in server_context::update_slots (this=0x7ffdc33254d0) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:3358
3358 ret = llama_decode(ctx, batch_view);
#12 0x00005fb99b725035 in operator() (__closure=0x7ffdc3326aa8) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:4849
4849 ctx_server.update_slots();
#13 0x00005fb99b7335c4 in std::__invoke_impl<void, main(int, char**)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
61 { return std::forward<_Fn>(__f)(std::forward<_Args>(__args)...); }
#14 0x00005fb99b731494 in std::__invoke_r<void, main(int, char**)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/13/bits/invoke.h:111
111 std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
#15 0x00005fb99b72d6bb in std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/13/bits/std_function.h:290
290 return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
#16 0x00005fb99b786d36 in std::function<void ()>::operator()() const (this=0x7ffdc3326aa8) at /usr/include/c++/13/bits/std_function.h:591
591 return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#17 0x00005fb99b772b83 in server_queue::start_loop (this=0x7ffdc3326988) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:1681
1681 callback_update_slots();
#18 0x00005fb99b7278a0 in main (argc=22, argv=0x7ffdc3326d48) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:4874
4874 ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 1580786) detached]