### Name and Version

```
$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
version: 6145 (04e1626)
built with gcc-13 (SUSE Linux) 13.3.1 20250313 [revision 4ef1d8c84faeebffeb0cc01ee22e891b41e5c4e0] for x86_64-suse-linux
```
### Operating systems

Linux
### GGML backends

CUDA
### Hardware

13th Gen Intel(R) Core(TM) i5-13500 & NVIDIA RTX 4000 SFF Ada Generation
### Models

gpt-20B.mxfp4.gguf (downloaded from https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-mxfp4.gguf)
### Problem description & steps to reproduce

The server is started with the following arguments:

```sh
$ llama-server -m gpt-20B.mxfp4.gguf \
    --no-webui --numa numactl --threads 32 --ctx-size 131072 \
    --n-gpu-layers 37 --no-op-offload \
    --temp 0 \
    --flash-attn \
    --jinja --chat-template-kwargs '{"reasoning_effort": "high"}' --reasoning-format none \
    --repeat-penalty 1.2
```
Then pass in the following prompt:

```sh
curl -s -N -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{ \"messages\": [{\"role\": \"user\", \"content\": \"# apache forward a request to two backend in order an cgi program then a php fcgi call\"}], \"stream\":true }"
```
Then, after about two minutes on this system (generating at around 70 tokens/second), the server crashes. GPU memory stays steady during that time at around 14 GB of the 20 GB available (watched in nvidia-smi).
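For reference, the memory figure above came from polling nvidia-smi; a one-liner along these lines (the exact invocation is assumed, not recorded):

```sh
# Refresh GPU memory usage once per second while the request streams.
watch -n 1 nvidia-smi
```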
### First Bad Commit

8b2aac0 is the first bad commit. (The commit sha reported under Name and Version above is the version I was running when I started drafting this report.)
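The "first bad commit" phrasing above is git bisect's output; a minimal sketch of that workflow, with a placeholder for the known-good commit (the actual good commit used is not recorded here):

```sh
# Sketch of a bisect run; <known-good-sha> is a placeholder.
git bisect start
git bisect bad 04e1626            # version reported above crashes
git bisect good <known-good-sha>
# At each step: rebuild, rerun the reproduction above, then mark
#   git bisect good    # no crash
#   git bisect bad     # crash
# until git prints "... is the first bad commit", then:
git bisect reset
```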
### Relevant log output

```
systemd-coredump[22807]: Process 22624 (llama-server) of user 1000 dumped core.

Stack trace of thread 22624:
#0  0x00000000005bd7ec n/a (/home/developer/llama.cpp/source/build/bin/llama-server + 0x1bd7ec)

ELF object binary architecture: AMD x86-64
```
### in gdb

```
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000005bd7ec in std::__detail::_Executor<std::reverse_iterator<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<std::reverse_iterator<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >, std::__cxx11::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<std::reverse_iterator<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::allocator<std::__cxx11::sub_match<std::reverse_iterator<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, long) ()
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.38-150600.14.32.1.x86_64 krb5-debuginfo-1.20.1-150600.11.11.2.x86_64 libcom_err2-debuginfo-1.47.0-150600.4.6.2.x86_64 libcurl4-debuginfo-8.6.0-150600.4.21.1.x86_64 libgcc_s1-debuginfo-14.2.0+git10526-150000.1.6.1.x86_64 libgomp1-debuginfo-14.2.0+git10526-150000.1.6.1.x86_64 libopenssl3-debuginfo-3.1.4-150600.5.33.1.x86_64 libsasl2-3-debuginfo-2.1.28-150600.7.6.2.x86_64 libssh4-debuginfo-0.9.8-150600.11.3.1.x86_64 libstdc++6-debuginfo-14.2.0+git10526-150000.1.6.1.x86_64 nvidia-compute-G06-debuginfo-580.65.06-1.x86_64
[Current thread is 1 (Thread 0x7f395d63d000 (LWP 22624))]
```
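The gdb session above was opened against the systemd core dump; one typical way to do that (assuming coredumpctl is available, with the PID from the log) is:

```sh
coredumpctl gdb 22624    # open the matching core dump in gdb
# then inside gdb:
#   (gdb) bt             # full backtrace of the crashing thread
```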