Bug: RoPE Cache lobotomizes GLM on my setup #893

@kooshi

What happened?

Problem: GLM-4.6 output is complete gibberish, mostly spaces and punctuation. GLM-4.5-Air outputs words but quickly devolves into repetition.
Adding --no-rope-cache fixes this.
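
For context on what --no-rope-cache toggles: a RoPE cache precomputes the sin/cos rotation table once per context instead of recomputing angles every token. Below is a minimal sketch of that idea (hypothetical names and layout, not ik_llama.cpp's actual code); if such a table were built with the wrong frequency base, position offset, or dtype for a given model, attention would degrade into exactly this kind of gibberish.

// Illustrative sketch of a precomputed RoPE sin/cos cache (hypothetical
// names, not the project's real implementation). For every position p and
// rotary dimension pair i, the cache stores sin/cos of p * base^(-2i/d).
#include <cmath>
#include <cstddef>
#include <vector>

struct RopeCache {
    int   head_dim;              // rotary dimension d
    float freq_base;             // e.g. 10000.0f, model-dependent
    std::vector<float> sin_tab;  // [n_pos * head_dim / 2]
    std::vector<float> cos_tab;

    void build(int n_pos) {
        const int half = head_dim / 2;
        sin_tab.resize((size_t) n_pos * half);
        cos_tab.resize((size_t) n_pos * half);
        for (int p = 0; p < n_pos; ++p) {
            for (int i = 0; i < half; ++i) {
                // angle = p / freq_base^(2i/d)
                const float angle = p * std::pow(freq_base, -2.0f * i / head_dim);
                sin_tab[(size_t) p * half + i] = std::sin(angle);
                cos_tab[(size_t) p * half + i] = std::cos(angle);
            }
        }
    }

    // Rotate one (x0, x1) pair for position p, dimension pair i.
    void apply(int p, int i, float & x0, float & x1) const {
        const int   half = head_dim / 2;
        const float s = sin_tab[(size_t) p * half + i];
        const float c = cos_tab[(size_t) p * half + i];
        const float r0 = x0 * c - x1 * s;
        const float r1 = x0 * s + x1 * c;
        x0 = r0;
        x1 = r1;
    }
};

With --no-rope-cache the angles would presumably be recomputed on the fly each token, which sidesteps any mis-built or mis-indexed table.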

4.6 Model: https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ3_KS
4.5 Air Model: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ4_KSS

nvidia-smi:

$ nvidia-smi 
Mon Nov  3 14:47:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:04:00.0  On |                  N/A |
|  0%   51C    P5             42W /  420W |   22531MiB /  24576MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        Off |   00000000:2B:00.0 Off |                  N/A |
|  0%   43C    P8             22W /  450W |   28411MiB /  32607MiB |     16%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:2C:00.0 Off |                  N/A |
|  0%   29C    P8             23W /  420W |   18631MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

build config:

cmake -B build \
  -DCMAKE_CXX_FLAGS="-march=native -mtune=native -O3" \
  -DCMAKE_C_FLAGS="-march=native -mtune=native -O3" \
  -DGGML_NATIVE=ON \
  -DGGML_SCHED_MAX_COPIES=1 \
  -DGGML_CUDA=ON \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 \
  -DCMAKE_CUDA_ARCHITECTURES="86;120"
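
-DCMAKE_CUDA_ARCHITECTURES="86;120" should cover both the Ampere 3090s (sm_86) and the Blackwell 5090 (sm_120). Since the problem appears on this mixed-arch box, it may be worth confirming each device reports the expected compute capability; a throwaway check using the CUDA runtime API (assumptions: nvcc on PATH, file saved as cc_check.cu):

// Standalone sanity check: print each CUDA device's compute capability
// so it can be compared against the arch list passed to CMake.
// Build with: nvcc -o cc_check cc_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major/minor map to "sm_<major><minor>", e.g. 8.6 -> 86, 12.0 -> 120
        std::printf("device %d: %s (sm_%d%d)\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}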

4.6 command:

llama-server \
  --no-display-prompt \
  --verbosity 0 \
  -mla 2 \
  -amb 512 \
  --port 9999 \
  --predict -1 \
  --n-gpu-layers 1000 \
  --main-gpu 0 \
  --parallel 1 \
  --no-warmup \
  --jinja \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 \
  --min-p 0.1 \
  --presence-penalty 1.5 \
  --alias glm \
  -rtr \
  -b 512 \
  -c 100000 \
  --override-tensor "\.(([13467][0-9])|(2[0-7])|(5[012])|(8[01]))\..*exps=CPU" \
  --override-tensor "blk\.(1?|2|9)[0-9]\.=CUDA0" \
  --override-tensor "blk\.(3|4|5)[0-9]\.=CUDA1" \
  --override-tensor "blk\.(6|7|8)[0-9]\.=CUDA2" \
  --model /mnt/store/ai/GLM4.6/GLM-4.6-IQ3_KS-00001-of-00004.gguf
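
The --override-tensor patterns above decide which expert tensors stay on CPU and how blocks split across the three GPUs. A quick harness to check which block indices each pattern actually captures (a sketch assuming the part before '=' is applied with std::regex_search against tensor names, as in llama.cpp-style override-tensor handling; the tensor name used here is illustrative):

// Print the block indices matched by each --override-tensor pattern,
// scanning indices 0-99 against an illustrative expert-tensor name.
#include <cstdio>
#include <regex>
#include <string>

int main() {
    const std::string patterns[] = {
        R"(\.(([13467][0-9])|(2[0-7])|(5[012])|(8[01]))\..*exps)",  // -> CPU
        R"(blk\.(1?|2|9)[0-9]\.)",                                  // -> CUDA0
        R"(blk\.(3|4|5)[0-9]\.)",                                   // -> CUDA1
        R"(blk\.(6|7|8)[0-9]\.)",                                   // -> CUDA2
    };
    const char * targets[] = { "CPU", "CUDA0", "CUDA1", "CUDA2" };

    for (int p = 0; p < 4; ++p) {
        const std::regex re(patterns[p]);
        std::printf("%-5s:", targets[p]);
        for (int blk = 0; blk < 100; ++blk) {
            const std::string name = "blk." + std::to_string(blk) + ".ffn_gate_exps.weight";
            if (std::regex_search(name, re)) {
                std::printf(" %d", blk);
            }
        }
        std::printf("\n");
    }
    return 0;
}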

Name and Version

$ build/bin/llama-server --version
version: 3946 (1cfd198)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output
