Eval bug: ggml-cpu FP32<->FP16 Conversion Using GGML_NNPA Stops Inferencing Correctly After b6324 #15721

@taronaeo

Description

Name and Version

build/bin/llama-cli --version
version: 6344 (02c1813)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux

Operating systems

Linux

GGML backends

CPU, BLAS

Hardware

IBM z17, 40 IFLs, 128 GB Memory, NNPA, zDNN

Models

Granite 3.3 2B Instruct Big-Endian F32

Problem description & steps to reproduce

Starting from commit e81b8e4, ggml_cpu_fp32_to_fp16 and ggml_cpu_fp16_to_fp32 fail the FP32<->FP16 conversion when the IBM NNPA co-processor is used, and inference consistently outputs 4444444...

All commits prior to e81b8e4 worked as intended. I'm unsure what in that commit caused the failure.
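
A minimal round-trip sketch for isolating the conversion path, assuming the ggml_cpu_fp32_to_fp16 / ggml_cpu_fp16_to_fp32 signatures declared in ggml-cpu.h (source pointer, destination pointer, element count); build it against the same tree and flags (e.g. GGML_NNPA=ON) as the failing binary:

// Minimal round-trip check for the CPU FP32<->FP16 conversion kernels.
// Assumed signatures (from ggml-cpu.h): (const src *, dst *, int64_t n).
#include <stdio.h>
#include <math.h>

#include "ggml.h"
#include "ggml-cpu.h"

int main(void) {
    float       src[8] = { -2.0f, -1.0f, -0.5f, 0.0f, 0.5f, 1.0f, 2.0f, 3.0f };
    ggml_fp16_t tmp[8];
    float       dst[8];

    ggml_cpu_fp32_to_fp16(src, tmp, 8);  // FP32 -> FP16 (NNPA path when enabled)
    ggml_cpu_fp16_to_fp32(tmp, dst, 8);  // FP16 -> FP32

    for (int i = 0; i < 8; i++) {
        // these values are exactly representable in FP16, so any
        // mismatch points at the conversion kernels themselves
        printf("%8.4f -> %8.4f%s\n", src[i], dst[i],
               fabsf(src[i] - dst[i]) < 1e-3f ? "" : "  <-- mismatch");
    }
    return 0;
}

On an unaffected build the two columns should match exactly; on the broken path the round-tripped values are expected to come back as garbage.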

First Bad Commit

$ git bisect good

e81b8e4b7f5ab870836fad26d154a7507b341b36 is the first bad commit
commit e81b8e4b7f5ab870836fad26d154a7507b341b36 (tag: b6325)
Author: Johannes Gäßler <[email protected]>
Date:   Sat Aug 30 16:32:10 2025 +0200

    llama: use FA + max. GPU layers by default (#15434)
    
    * llama: use max. GPU layers by default, auto -fa
    
    * ggml-backend: abort instead of segfault

 common/arg.cpp                              | 30 ++++++++++++------------------
 common/common.cpp                           |  8 +++++---
 common/common.h                             |  2 +-
 examples/diffusion/diffusion-cli.cpp        |  2 +-
 ggml/src/ggml-backend.cpp                   | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/llama.h                             | 10 +++++++++-
 scripts/server-bench.py                     |  6 ------
 scripts/tool_bench.py                       |  2 +-
 src/llama-context.cpp                       | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 src/llama-graph.cpp                         | 24 +++++++++++++++++++-----
 src/llama-graph.h                           |  3 ++-
 src/llama-impl.h                            |  2 ++
 src/llama-model.cpp                         |  7 +------
 src/llama.cpp                               | 12 ++++++++++++
 tools/batched-bench/batched-bench.cpp       |  4 ++--
 tools/llama-bench/llama-bench.cpp           | 20 ++++++++++----------
 tools/server/tests/unit/test_ctx_shift.py   | 15 ++++++++-------
 tools/server/tests/unit/test_speculative.py |  1 +
 tools/server/tests/utils.py                 |  6 +++---
 19 files changed, 235 insertions(+), 72 deletions(-)

Relevant log output

Tag #b6324 (Working)

$ build/bin/llama-cli -m granite-3.3-2b-instruct-be.F32.gguf -t 40 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384; build/bin/llama-cli --version

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target

llama_perf_sampler_print:    sampling time =       0.99 ms /    36 runs   (    0.03 ms per token, 36326.94 tokens per second)
llama_perf_context_print:        load time =     760.19 ms
llama_perf_context_print: prompt eval time =     466.20 ms /    11 tokens (   42.38 ms per token,    23.60 tokens per second)
llama_perf_context_print:        eval time =    4114.43 ms /    24 runs   (  171.43 ms per token,     5.83 tokens per second)
llama_perf_context_print:       total time =    4607.71 ms /    35 tokens
llama_perf_context_print:    graphs reused =         22
version: 6324 (38ad381f9)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux


Tag #b6325 (Not Working)

$ build/bin/llama-cli -m granite-3.3-2b-instruct-be.F32.gguf -t 40 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384; build/bin/llama-cli --version

Write me a dog walking business idea 1. 4444444444444444444444444

llama_perf_sampler_print:    sampling time =       0.86 ms /    36 runs   (    0.02 ms per token, 41811.85 tokens per second)
llama_perf_context_print:        load time =     834.84 ms
llama_perf_context_print: prompt eval time =     195.36 ms /    11 tokens (   17.76 ms per token,    56.31 tokens per second)
llama_perf_context_print:        eval time =    3889.00 ms /    24 runs   (  162.04 ms per token,     6.17 tokens per second)
llama_perf_context_print:       total time =    4111.33 ms /    35 tokens
llama_perf_context_print:    graphs reused =         23
version: 6325 (e81b8e4b7)
built with gcc (GCC) 15.1.0 for s390x-redhat-linux


$ gdb --args build/bin/llama-cli -m granite-3.3-2b-instruct-be.F32.gguf -t 40 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

llama-cli: llama.cpp/ggml/src/ggml-cpu/ops.cpp:4181: void ggml_compute_forward_rms_norm_f32(const ggml_compute_params*, ggml_tensor*): Assertion `scale > 0.0f' failed.
llama-cli: llama.cpp/ggml/src/ggml-cpu/ops.cpp:4181: void ggml_compute_forward_rms_norm_f32(const ggml_compute_params*, ggml_tensor*): Assertion `scale > 0.0f' failed.

Thread 42 "llama-cli" received signal SIGABRT, Aborted.
[Switching to Thread 0x3fe92948840 (LWP 2528089)]
0x000003fff6b98c26 in __pthread_kill_implementation () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-168.el9_6.23.s390x

(gdb) bt
#0  0x000003fff6b98c26 in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x000003fff6b49010 in raise () from /lib64/libc.so.6
#2  0x000003fff6b2a390 in abort () from /lib64/libc.so.6
#3  0x000003fff6b40522 in __assert_fail_base () from /lib64/libc.so.6
#4  0x000003fff6b4083e in __assert_fail () from /lib64/libc.so.6
#5  0x000003fff7588730 in ggml_compute_forward_rms_norm_f32 (params=0x3fe92938cb0, dst=0x16d2110)
    at llama.cpp/ggml/src/ggml-cpu/ops.cpp:4181
#6  0x000003fff7588800 in ggml_compute_forward_rms_norm (params=0x3fe92938cb0, dst=0x16d2110) at llama.cpp/ggml/src/ggml-cpu/ops.cpp:4198
#7  0x000003fff751c7dc in ggml_compute_forward (params=0x3fe92938cb0, tensor=0x16d2110) at llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1765
#8  0x000003fff751ecfc in ggml_graph_compute_thread (data=0x1495010) at llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2871
#9  0x000003fff752000a in ggml_graph_compute._omp_fn.0 () at llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3146
#10 0x000003fff6da33ee in gomp_thread_start (xdata=<optimized out>) at ../../../gcc-15.1.0-src/libgomp/team.c:129
#11 0x000003fff6b96da6 in start_thread () from /lib64/libc.so.6
#12 0x000003fff6c1008e in thread_start () from /lib64/libc.so.6
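
For context on how the conversion bug surfaces as this assertion: ggml's RMS norm computes something like scale = 1.0f/sqrtf(mean_of_squares + eps), so if the FP16->FP32 path feeds NaN or garbage activations into that sum, the scale itself becomes NaN and the `scale > 0.0f` check fails. A small illustration (not the actual ops.cpp code):

#include <math.h>
#include <stdio.h>

int main(void) {
    float sum   = NAN;                         // corrupted activations -> NaN row sum
    float mean  = sum / 2048.0f;               // 2048: hypothetical row width
    float scale = 1.0f / sqrtf(mean + 1e-6f);  // still NaN
    printf("scale = %f, scale > 0.0f ? %d\n", scale, scale > 0.0f);  // prints 0
    return 0;
}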
