Bug: Perplexity returns NaN with IQ4_KSS quantisation #245

@davidsyoung

Description

What happened?

I said I would open a separate issue for this instead of discussing it under an unrelated pull request - let me know if you'd rather I continue over there @ikawrakow.

So I have tracked down the bug with `llama-perplexity` returning NaNs. To be clear, this is with IQ4_KSS quantisation. I have run `llama-perplexity` with IQ3_M without any issues, and that quant was also made with the same imatrix.dat.

The command that works under IQ3_M is as follows:

./llama-perplexity -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ3_M.gguf -f /models/wiki.test.raw -fmoe -fa -c 2048 -ub 2048 --n-gpu-layers 100

I initially tried to replicate this with IQ4_KSS, but it started to produce NaNs. From there, I tested without flash attention, with MLA, and various other combinations, to no avail. Here are some of the combinations that were tested and produced NaNs:


-fa -ub 1024 -ot ... = NaN

root@887d1e7c1690:/app# ./llama-perplexity \
  -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
  -f /models/wiki.test.raw \
  -fa \
  -c 2048 \
  -ub 1024 \
  -ngl 100 \
  -ot ...

...

perplexity: tokenizing the input ..
perplexity: tokenization took 1252.89 ms
perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 15.37 seconds per pass - ETA 35.85 minutes
[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan

-mla 2 -ub 512 --seed --temp --amb -ot ... = NaN

./llama-perplexity \
  -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
  -f /models/wiki.test.raw \
  -mla 2 \
  -c 2048 \
  -ub 512 \
  -ngl 100 \
  --seed 3407 \
  --temp 0.5 \
  -amb 64 \
  -ot ... \
  -ts 24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 1231.71 ms
perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 22.04 seconds per pass - ETA 51.43 minutes
[1]nan,[2]nan,^C^C

-fa -ub 8 --seed --temp --amb 64 -ot = Works!

./llama-perplexity \
  -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
  -f /models/wiki.test.raw \
  -fa \
  -c 2048 \
  -ub 8 \
  -ngl 100 \
  --seed 3407 \
  --temp 0.5 \
  -amb 64 \
  -ot ... \
  -ts 24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 1211.1 ms
perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 69.34 seconds per pass - ETA 2 hours 41.78 minutes
[1]1.5140,[2]1.2829,[3]1.2362,[4]1.6902,[5]1.7468,[6]1.7194,[7]1.8258,[8]1.9479,[9]2.1370,[10]2.3270,[11]2.4503,[12]2.3282,[13]2.4525,[14]2.5484,[15]2.6761,[16]2.7952,[17]2.7793,[18]2.8372,[19]2.7767,[20]2.6981,[21]2.6288,[22]2.5562,[23]2.4682,[24]2.4149
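
Given that the only real difference in the working run is the tiny -ub, a sweep along the lines of the sketch below could pin down the micro-batch size at which the NaNs first appear. This is only a rough sketch, not something I have run as-is: it assumes --chunks is available to cut each pass short, and it leaves out the same -ot / -ts arguments as in the commands above.

# sweep -ub and check the first couple of chunk values for each run
for ub in 8 16 32 64 128 256 512 1024; do
  echo "=== -ub $ub ==="
  ./llama-perplexity \
    -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf \
    -f /models/wiki.test.raw \
    -fa -c 2048 -ngl 100 \
    -ub $ub --chunks 2 2>&1 | tail -n 3
done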

I figured it out when I read your comment here: #103 (comment)

This quant was created with the following command (I re-quantised from the BF16 GGUF to this IQ4_KSS a second time to be certain it wasn't a one-off quantisation issue, but it could still be the quant types used here, namely IQ4_KSS):

./llama-quantize --imatrix /models/deepseek-config/imatrix.dat  --token-embedding-type q8_0 /storage/DeepSeek-R1-GGUF/unsloth_DeepSeek-R1-BF16-256x21B-F16-00001-of-00059.gguf /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf IQ4_KSS 64

The imatrix.dat is from https://huggingface.co/mradermacher/DeepSeek-R1-i1-GGUF from @schmorp.
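
In case it helps narrow things down, the per-tensor quant types in the file could be checked with the gguf-dump helper that ships in gguf-py. This is only a sketch: the script name/path varies between revisions, and the copy bundled with this repo would be needed so that it recognises the IQ4_KSS type id.

# list which tensors actually came out as IQ4_KSS (vs the Q8_0 token embedding)
# note: upstream gguf-py may not know ik_llama-specific quant type ids
python gguf-py/scripts/gguf_dump.py \
  /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS-v2.gguf | grep -i "iq4_kss"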


I then decided to rebuild with GGML_CUDA_FORCE_MMQ / LLAMA_CUDA_FORCE_MMQ set, and then run again to see if that would resolve the NaNs at a higher -ub size.
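
For reference, a rebuild with MMQ forced on would look roughly like this (sketch only; the exact option spellings depend on the revision and build system):

# force the MMQ kernels (CMake shown; older build files spell these
# LLAMA_CUDA_FORCE_MMQ / LLAMA_CUBLAS rather than the GGML_* names)
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j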

Unfortunately, no: it still produced NaNs.

Hopefully this is enough information for you to see what the issue might be!

Name and Version

main: build = 0 (unknown)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 3407
