Bug: Perplexity returns NaN with IQ4_KSS quantisation #245

@davidsyoung

Description

What happened?

I said I would open a separate issue for this instead of discussing it under an unrelated pull request - let me know if you'd rather I continue over there @ikawrakow.

So I have tracked down the bug with `llama-perplexity` returning NaNs. To be clear, this is with IQ4_KSS quantisation. I have run `llama-perplexity` with IQ3_M without any issues, and that quant was also made with the same imatrix.dat.

The command that works under IQ3_M is as follows:

./llama-perplexity -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ3_M.gguf -f /models/wiki.test.raw -fmoe -fa -c 2048 -ub 2048 --n-gpu-layers 100

I initially tried to replicate this with IQ4_KSS, but it started to produce NaNs. From there, I tested without flash attention, with MLA, and various other combinations, to no avail. Here are some of the combinations that were tested and produced NaNs:


-fa -ub 1024 -ot ... = NaN

root@887d1e7c1690:/app# ./llama-perplexity \
  -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
  -f /models/wiki.test.raw \
  -fa \
  -c 2048 \
  -ub 1024 \
  -ngl 100 \
  -ot ...

...

perplexity: tokenizing the input ..
perplexity: tokenization took 1252.89 ms
perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 15.37 seconds per pass - ETA 35.85 minutes
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan

-mla 2 -ub 512 --seed --temp --amb -ot ... = NaN

./llama-perplexity \
  -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
  -f /models/wiki.test.raw \
  -mla 2 \
  -c 2048 \
  -ub 512 \
  -ngl 100 \
  --seed 3407 \
  --temp 0.5 \
  -amb 64 \
  -ot ... \
  -ts 24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 1231.71 ms
perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 22.04 seconds per pass - ETA 51.43 minutes
[1]nan,[2]nan,^C^C

-fa -ub 8 --seed --temp --amb 64 -ot = Works!

./llama-perplexity \
  -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
  -f /models/wiki.test.raw \
  -fa \
  -c 2048 \
  -ub 8 \
  -ngl 100 \
  --seed 3407 \
  --temp 0.5 \
  -amb 64 \
  -ot ... \
  -ts 24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 1211.1 ms
perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 69.34 seconds per pass - ETA 2 hours 41.78 minutes
[1]1.5140,[2]1.2829,[3]1.2362,[4]1.6902,[5]1.7468,[6]1.7194,[7]1.8258,[8]1.9479,[9]2.1370,[10]2.3270,[11]2.4503,[12]2.3282,[13]2.4525,[14]2.5484,[15]2.6761,[16]2.7952,[17]2.7793,[18]2.8372,[19]2.7767,[20]2.6981,[21]2.6288,[22]2.5562,[23]2.4682,[24]2.4149
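
Given that the only real difference in the working run is the tiny -ub, a sweep along the lines of the sketch below could pin down the micro-batch size at which the NaNs first appear. This is only a rough sketch, not something I have run as-is: it assumes --chunks is available to cut each pass short, and it leaves out the same -ot / -ts arguments as in the commands above.

# sweep -ub and check the first couple of chunk values for each run
for ub in 8 16 32 64 128 256 512 1024; do
  echo "=== -ub $ub ==="
  ./llama-perplexity \
    -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
    -f /models/wiki.test.raw \
    -fa -c 2048 -ngl 100 \
    -ub $ub --chunks 2 2>&1 | tail -n 3
done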

I figured it out when I read your comment here: #103 (comment)

This quant was created with the following command (I re-quantised from the BF16 GGUF to this IQ4_KSS a second time to be certain it wasn't a one-off quantisation issue, but it could still be the quant types used here, namely IQ4_KSS):

./llama-quantize --imatrix /models/deepseek-config/imatrix.dat  --token-embedding-type q8_0 /storage/DeepSeek-R1-GGUF/unsloth_DeepSeek-R1-BF16-256x21B-F16-00001-of-00059.gguf /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf IQ4_KSS 64

The imatrix.dat is from https://huggingface.co/mradermacher/DeepSeek-R1-i1-GGUF from @schmorp.
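
In case it helps narrow things down, the per-tensor quant types in the file could be checked with the gguf-dump helper that ships in gguf-py. This is only a sketch: the script name/path varies between revisions, and the copy bundled with this repo would be needed so that it recognises the IQ4_KSS type id.

# list which tensors actually came out as IQ4_KSS (vs the Q8_0 token embedding)
# note: upstream gguf-py may not know ik_llama-specific quant type ids
python gguf-py/scripts/gguf_dump.py \
  /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf | grep -i "iq4_kss"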


I then decided to rebuild with GGML_CUDA_FORCE_MMQ / LLAMA_CUDA_FORCE_MMQ set, and then run again to see if that would resolve the NaNs at a higher -ub size.
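
For reference, a rebuild with MMQ forced on would look roughly like this (sketch only; the exact option spellings depend on the revision and build system):

# force the MMQ kernels (CMake shown; older build files spell these
# LLAMA_CUDA_FORCE_MMQ / LLAMA_CUBLAS rather than the GGML_* names)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j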

Unfortunately, no: it still produced NaNs.

Hopefully this is enough information for you to see what the issue might be!

Name and Version

main: build = 0 (unknown)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 3407
