Skip to content

Conversation

@taronaeo
Copy link
Collaborator

@taronaeo taronaeo commented Sep 2, 2025

This Pull Request fixes #14877 and #15721 whereby compiling with -DGGML_NNPA causes gibberish output with more than 4 threads. This code change also include automatic disabling of Flash Attention when building with -DGGML_NNPA.

Verification

To ensure that the inference is now correct with -DGGML_NNPA, the following tests have been done:

  1. threads=1 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 1 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  1. threads=2 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 2 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  1. threads=4 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 4 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  1. threads=8 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 8 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target
  1. threads=16 (OK)
$ build/bin/llama-cli -m /devfield/taronaeo/hf_models/granite-3.3-2b-instruct-be.F32.gguf -t 16 -n 25 -p "Write me a dog walking business idea 1. " -no-cnv -ngl -1 --seed 1568795874 --ctx-size 16384

Write me a dog walking business idea 1. 
What is the name of the business?
2. What services does it offer?
3. Who are the target

Performance Results

model size params backend threads test t/s master t/s PR speedup
granite 3B all F32 9.44 GiB 2.53 B BLAS 4 pp256 29.87 28.5 0.95
granite 3B all F32 9.44 GiB 2.53 B BLAS 4 tg64 1.84 1.84 1.00
granite 3B F16 4.72 GiB 2.53 B BLAS 4 pp256 25.15 24.97 0.99
granite 3B F16 4.72 GiB 2.53 B BLAS 4 tg64 1.47 1.92 1.31

Note

Tests were conducted on an IBM z16 Mainframe with 2 IFLs and 64 GB Memory on a shared z/VM environment.

@github-actions github-actions bot added documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning labels Sep 2, 2025
Comment on lines 3527 to 3535
int ggml_cpu_support_fattn(void) {
#if defined(GGML_NNPA) || defined(__NNPA__)
// disable Flash Attention when using NNPA
// see: https://github.com/ggml-org/llama.cpp/issues/15721
return 0;
#else
return 1;
#endif
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not going to work. Either the problem with NNPA should be fixed, or NNPA support removed.

Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
Signed-off-by: Aaron Teo <[email protected]>
@taronaeo
Copy link
Collaborator Author

taronaeo commented Sep 5, 2025

I'm a little bit at a lost here. I've double checked the NNPA FP32 to FP16 code and verified that it is actually correct. The thing causing the failure in FP32->FP16 conversion is actually because ggml_compute_forward_dup_f32 is calling ggml_cpu_fp32_to_fp16 with -inf or nan data for some reason with Flash Attention turned on.

That's as far as I have gotten to identify the failure point and I think it would be easier to disable FP32->FP16 NNPA conversion for now, while we leave FP16->FP32 as-is as it is still correct and functioning.

-fa off vs -fa on -inf
image

-fa off vs -fa on nan
image

@taronaeo
Copy link
Collaborator Author

taronaeo commented Sep 5, 2025

Superseded by #15821

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation ggml changes relating to the ggml tensor library for machine learning

Projects

None yet

2 participants