Name and Version
build/bin/llama-cli --version
version: 6344 (02c1813)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux
Operating systems
Linux
GGML backends
CPU, BLAS
Hardware
IBM z17, 40 IFLs, 128 GB Memory, NNPA, zDNN
Models
Granite 3.3 2B Instruct Big-Endian F32
Problem description & steps to reproduce
Starting from commit e81b8e4, ggml_cpu_fp32_to_fp16
and ggml_cpu_fp16_to_fp32
fail the FP32<->FP16 conversion when it is accelerated by the IBM NNPA co-processor, and generation consistently outputs 4444444...
All commits prior to e81b8e4 worked as intended. I'm unsure what in that commit caused the failure.
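To check the conversion path in isolation (outside of llama-cli), a minimal round-trip test along these lines can be used. This is only a sketch: it assumes the public declarations in ggml.h / ggml-cpu.h, a build linked against the CPU backend, and that ggml_cpu_init() sets up any FP16 lookup tables.

```c
// Minimal FP32 -> FP16 -> FP32 round-trip check (sketch, not part of the repo).
#include <stdio.h>
#include "ggml.h"
#include "ggml-cpu.h"

int main(void) {
    ggml_cpu_init();  // assumed to initialize any FP16 conversion tables

    const float src[4] = { 0.0f, 1.0f, -2.5f, 3.14159f };
    ggml_fp16_t half[4];
    float       dst[4];

    ggml_cpu_fp32_to_fp16(src, half, 4);   // takes the NNPA path on s390x when enabled
    ggml_cpu_fp16_to_fp32(half, dst, 4);

    for (int i = 0; i < 4; i++) {
        // on a broken build the round-tripped values would come back corrupted
        printf("%f -> %f\n", src[i], dst[i]);
    }
    return 0;
}
```

On an NNPA-enabled build both calls go through the co-processor conversion routines, so mismatching round-trip values here would point at the same functions named above.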
First Bad Commit
$ git bisect good
e81b8e4b7f5ab870836fad26d154a7507b341b36 is the first bad commit
commit e81b8e4b7f5ab870836fad26d154a7507b341b36 (tag: b6325)
Author: Johannes Gäßler <[email protected]>
Date: Sat Aug 30 16:32:10 2025 +0200
llama: use FA + max. GPU layers by default (#15434)
* llama: use max. GPU layers by default, auto -fa
* ggml-backend: abort instead of segfault
common/arg.cpp | 30 ++++++++++++------------------
common/common.cpp | 8 +++++---
common/common.h | 2 +-
examples/diffusion/diffusion-cli.cpp | 2 +-
ggml/src/ggml-backend.cpp | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/llama.h | 10 +++++++++-
scripts/server-bench.py | 6 ------
scripts/tool_bench.py | 2 +-
src/llama-context.cpp | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
src/llama-graph.cpp | 24 +++++++++++++++++++-----
src/llama-graph.h | 3 ++-
src/llama-impl.h | 2 ++
src/llama-model.cpp | 7 +------
src/llama.cpp | 12 ++++++++++++
tools/batched-bench/batched-bench.cpp | 4 ++--
tools/llama-bench/llama-bench.cpp | 20 ++++++++++----------
tools/server/tests/unit/test_ctx_shift.py | 15 ++++++++-------
tools/server/tests/unit/test_speculative.py | 1 +
tools/server/tests/utils.py | 6 +++---
19 files changed, 235 insertions(+), 72 deletions(-)
Relevant log output
Tag #b6324 (Working)
$ build/bin/llama-cli -m granite-3.3-2b-instruct-be.F32.gguf -t 40 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384; build/bin/llama-cli --version
Write me a dog walking business idea 1.
What is the name of the business?
2. What services does it offer?
3. Who are the target
llama_perf_sampler_print: sampling time = 0.99 ms / 36 runs ( 0.03 ms per token, 36326.94 tokens per second)
llama_perf_context_print: load time = 760.19 ms
llama_perf_context_print: prompt eval time = 466.20 ms / 11 tokens ( 42.38 ms per token, 23.60 tokens per second)
llama_perf_context_print: eval time = 4114.43 ms / 24 runs ( 171.43 ms per token, 5.83 tokens per second)
llama_perf_context_print: total time = 4607.71 ms / 35 tokens
llama_perf_context_print: graphs reused = 22
version: 6324 (38ad381f9)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux
Tag #b6325 (Not Working)
$ build/bin/llama-cli -m granite-3.3-2b-instruct-be.F32.gguf -t 40 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384; build/bin/llama-cli --version
Write me a dog walking business idea 1. 4444444444444444444444444
llama_perf_sampler_print: sampling time = 0.86 ms / 36 runs ( 0.02 ms per token, 41811.85 tokens per second)
llama_perf_context_print: load time = 834.84 ms
llama_perf_context_print: prompt eval time = 195.36 ms / 11 tokens ( 17.76 ms per token, 56.31 tokens per second)
llama_perf_context_print: eval time = 3889.00 ms / 24 runs ( 162.04 ms per token, 6.17 tokens per second)
llama_perf_context_print: total time = 4111.33 ms / 35 tokens
llama_perf_context_print: graphs reused = 23
version: 6325 (e81b8e4b7)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux
$ gdb --args build/bin/llama-cli -m granite-3.3-2b-instruct-be.F32.gguf -t 40 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384
llama-cli: llama.cpp/ggml/src/ggml-cpu/ops.cpp:4181: void ggml_compute_forward_rms_norm_f32(const ggml_compute_params*, ggml_tensor*): Assertion `scale > 0.0f' failed.
llama-cli: llama.cpp/ggml/src/ggml-cpu/ops.cpp:4181: void ggml_compute_forward_rms_norm_f32(const ggml_compute_params*, ggml_tensor*): Assertion `scale > 0.0f' failed.
Thread 42 "llama-cli" received signal SIGABRT, Aborted.
[Switching to Thread 0x3fe92948840 (LWP 2528089)]
0x000003fff6b98c26 in __pthread_kill_implementation () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-168.el9_6.23.s390x
(gdb) bt
#0 0x000003fff6b98c26 in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x000003fff6b49010 in raise () from /lib64/libc.so.6
#2 0x000003fff6b2a390 in abort () from /lib64/libc.so.6
#3 0x000003fff6b40522 in __assert_fail_base () from /lib64/libc.so.6
#4 0x000003fff6b4083e in __assert_fail () from /lib64/libc.so.6
#5 0x000003fff7588730 in ggml_compute_forward_rms_norm_f32 (params=0x3fe92938cb0, dst=0x16d2110)
at llama.cpp/ggml/src/ggml-cpu/ops.cpp:4181
#6 0x000003fff7588800 in ggml_compute_forward_rms_norm (params=0x3fe92938cb0, dst=0x16d2110) at llama.cpp/ggml/src/ggml-cpu/ops.cpp:4198
#7 0x000003fff751c7dc in ggml_compute_forward (params=0x3fe92938cb0, tensor=0x16d2110) at llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1765
#8 0x000003fff751ecfc in ggml_graph_compute_thread (data=0x1495010) at llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2871
#9 0x000003fff752000a in ggml_graph_compute._omp_fn.0 () at llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3146
#10 0x000003fff6da33ee in gomp_thread_start (xdata=<optimized out>) at ../../../gcc-15.1.0-src/libgomp/team.c:129
#11 0x000003fff6b96da6 in start_thread () from /lib64/libc.so.6
#12 0x000003fff6c1008e in thread_start () from /lib64/libc.so.6
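The assertion at ops.cpp:4181 is the `scale > 0.0f` check inside the RMS norm. A simplified sketch of that check (not the verbatim ops.cpp code) shows why corrupted FP16<->FP32 values would abort there: a NaN or inf in the row makes the sum of squares NaN, so the scale is NaN and the comparison fails.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>

// Simplified from ggml_compute_forward_rms_norm_f32 (sketch only).
static void rms_norm_row(float * y, const float * x, int n, float eps) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += x[i] * x[i];            // a NaN/inf in x poisons the whole sum
    }
    const float mean  = sum / n;
    const float scale = 1.0f / sqrtf(mean + eps);
    assert(scale > 0.0f);              // false when scale is NaN -> abort, as in the trace above
    for (int i = 0; i < n; i++) {
        y[i] = x[i] * scale;
    }
}

int main(void) {
    float good[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float bad [4] = { 1.0f, NAN,  3.0f, 4.0f };   // stand-in for corrupted conversion output
    float out [4];

    rms_norm_row(out, good, 4, 1e-6f);            // fine
    printf("good row: %f %f %f %f\n", out[0], out[1], out[2], out[3]);

    rms_norm_row(out, bad, 4, 1e-6f);             // trips the assertion
    return 0;
}
```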