Description
Name and Version
$ sources/llama.cpp/build/bin/llama-server --version
version: 5435 (a4090d1)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
Intel(R) Core(TM) Ultra 5 245KF + Radeon RX 7900 XTX, gfx1100 (0x1100)
Models
gemma-3-27b-it-qat-UD-Q4_K_XL.gguf + gemma-3-27b-it-qat-GGUF/mmproj-F16.gguf
Problem description & steps to reproduce
Built with (corrected command):
cmake -S . -B build -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 16
Command used to start:
sources/llama.cpp/build/bin/llama-server --port 9001 -c 65536 -ctv q8_0 -ctk q8_0 --no-warmup -ngl 99 -fa -m models/unsloth/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-UD-Q4_K_XL.gguf --mmproj models/unsloth/gemma-3-27b-it-qat-GGUF/mmproj-F16.gguf --jinja
This is the first time I have used this model. I accessed the API through Open WebUI hosted in a Docker container; it was a normal text-only chat, nothing special. The server crashes every time I try to regenerate the last message (a sketch of the request shape is given below); I can share the chat contents privately. I recompiled with -DCMAKE_BUILD_TYPE=Debug and the crash was still reproducible, so the full debug log is attached.
GPU memory utilization is 22052M out of 24560M; nothing else is using the GPU.
The error looks similar to #12045.
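As far as I can tell, regenerating in Open WebUI simply resends the conversation to the server's OpenAI-compatible endpoint, so a plain chat completion request of roughly this shape should hit the same code path (the message content here is a placeholder; the real conversation, about 4k prompt tokens according to the log, can be shared privately):
curl http://localhost:9001/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "<conversation omitted, ~4k tokens>"}], "stream": false}'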
First Bad Commit
No response
Relevant log output
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 4076, n_tokens = 2028, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 4076, n_tokens = 2028
/home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:750: pre-allocated tensor (cache_k_l32 (view) (copy of cache_k_l32 (view))) in a buffer (Vulkan0) that cannot run the operation (CPY)
[New LWP 1580837]
[New LWP 1580836]
[New LWP 1580835]
[New LWP 1580834]
[New LWP 1580833]
[New LWP 1580832]
[New LWP 1580831]
[New LWP 1580830]
[New LWP 1580829]
[New LWP 1580828]
[New LWP 1580827]
[New LWP 1580826]
[New LWP 1580825]
[New LWP 1580824]
[New LWP 1580823]
[New LWP 1580822]
[New LWP 1580821]
[New LWP 1580819]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/liblber.so.2
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlidec.so.1
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libbrotlicommon.so.1
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libtinfo.so.6
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_intel.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_intel_hasvk.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_virtio.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/x86_64-linux-gnu/libvulkan_nouveau.so
warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007587595107e3 in __GI___wait4 (pid=1580930, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x00007587595107e3 in __GI___wait4 (pid=1580930, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x000075875b3c1682 in ggml_print_backtrace () at /home/wizz/sources/llama.cpp/ggml/src/ggml.c:194
194 waitpid(child_pid, NULL, 0);
#2 0x000075875b3c17b5 in ggml_abort (file=0x75875b4410a0 "/home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp", line=750, fmt=0x75875b4414f0 "pre-allocated tensor (%s) in a buffer (%s) that cannot run the operation (%s)") at /home/wizz/sources/llama.cpp/ggml/src/ggml.c:215
215 ggml_print_backtrace();
#3 0x000075875b3da02f in ggml_backend_sched_backend_id_from_cur (sched=0x5fb9db9024d0, tensor=0x5fb9dbcebd00) at /home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:750
750 GGML_ABORT("pre-allocated tensor (%s) in a buffer (%s) that cannot run the operation (%s)", tensor->name, ggml_backend_buffer_name(buffer), ggml_op_name(tensor->op));
#4 0x000075875b3da9c4 in ggml_backend_sched_split_graph (sched=0x5fb9db9024d0, graph=0x5fb9dbad8d00) at /home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:899
899 *node_backend_id = ggml_backend_sched_backend_id_from_cur(sched, node);
#5 0x000075875b3dde02 in ggml_backend_sched_alloc_graph (sched=0x5fb9db9024d0, graph=0x5fb9dbad8d00) at /home/wizz/sources/llama.cpp/ggml/src/ggml-backend.cpp:1565
1565 ggml_backend_sched_split_graph(sched, graph);
#6 0x000075875b9e6308 in llama_kv_cache_unified::update (this=0x5fb9db909230, lctx=...) at /home/wizz/sources/llama.cpp/src/llama-kv-cache.cpp:427
427 ggml_backend_sched_alloc_graph(sched, gf);
#7 0x000075875b9eb5ee in llama_kv_cache_unified_iswa::update (this=0x5fb9db9071e0, lctx=...) at /home/wizz/sources/llama.cpp/src/llama-kv-cache.cpp:1761
1761 res = res & kv_swa ->update(lctx);
#8 0x000075875b976c4d in llama_context::kv_self_update (this=0x5fb9d4dfc2b0) at /home/wizz/sources/llama.cpp/src/llama-context.cpp:457
457 need_reserve = kv_self->update(*this);
#9 0x000075875b97931e in llama_context::decode (this=0x5fb9d4dfc2b0, inp_batch=...) at /home/wizz/sources/llama.cpp/src/llama-context.cpp:926
926 kv_self_update();
#10 0x000075875b97fb11 in llama_decode (ctx=0x5fb9d4dfc2b0, batch=...) at /home/wizz/sources/llama.cpp/src/llama-context.cpp:2545
2545 int ret = ctx->decode(batch);
#11 0x00005fb99b780b7b in server_context::update_slots (this=0x7ffdc33254d0) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:3358
3358 ret = llama_decode(ctx, batch_view);
#12 0x00005fb99b725035 in operator() (__closure=0x7ffdc3326aa8) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:4849
4849 ctx_server.update_slots();
#13 0x00005fb99b7335c4 in std::__invoke_impl<void, main(int, char**)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
61 { return std::forward<_Fn>(__f)(std::forward<_Args>(__args)...); }
#14 0x00005fb99b731494 in std::__invoke_r<void, main(int, char**)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/13/bits/invoke.h:111
111 std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
#15 0x00005fb99b72d6bb in std::_Function_handler<void(), main(int, char**)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/13/bits/std_function.h:290
290 return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
#16 0x00005fb99b786d36 in std::function<void ()>::operator()() const (this=0x7ffdc3326aa8) at /usr/include/c++/13/bits/std_function.h:591
591 return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#17 0x00005fb99b772b83 in server_queue::start_loop (this=0x7ffdc3326988) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:1681
1681 callback_update_slots();
#18 0x00005fb99b7278a0 in main (argc=22, argv=0x7ffdc3326d48) at /home/wizz/sources/llama.cpp/tools/server/server.cpp:4874
4874 ctx_server.queue_tasks.start_loop();
[Inferior 1 (process 1580786) detached]