What happened?
Problem: GLM-4.6 outputs complete gibberish, mostly spaces and punctuation. GLM-4.5-Air produces words at first but quickly devolves into repetition.
Adding --no-rope-cache fixes both models.
4.6 Model: https://huggingface.co/ubergarm/GLM-4.6-GGUF/tree/main/IQ3_KS
4.5 Air Model: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF/tree/main/IQ4_KSS
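For anyone reproducing, one way to fetch the 4.6 quant referenced above (assuming the huggingface_hub CLI is installed; the local directory is illustrative):
huggingface-cli download ubergarm/GLM-4.6-GGUF --include "IQ3_KS/*" --local-dir ./GLM-4.6-GGUF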
nvidia-smi:
$ nvidia-smi
Mon Nov  3 14:47:11 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:04:00.0  On |                  N/A |
|  0%   51C    P5             42W /  420W |   22531MiB /  24576MiB |     11%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090        Off |   00000000:2B:00.0 Off |                  N/A |
|  0%   43C    P8             22W /  450W |   28411MiB /  32607MiB |     16%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:2C:00.0 Off |                  N/A |
|  0%   29C    P8             23W /  420W |   18631MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
build config:
cmake -B build \
-DCMAKE_CXX_FLAGS="-march=native -mtune=native -O3" \
-DCMAKE_C_FLAGS="-march=native -mtune=native -O3" \
-DGGML_NATIVE=ON \
-DGGML_SCHED_MAX_COPIES=1 \
-DGGML_CUDA=ON \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 \
-DCMAKE_CUDA_ARCHITECTURES="86;120"
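followed by the usual CMake build step (shown for completeness):
cmake --build build -j "$(nproc)"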
4.6 command:
llama-server \
--no-display-prompt \
--verbosity 0 \
-mla 2 \
-amb 512 \
--port 9999 \
--predict -1 \
--n-gpu-layers 1000 \
--main-gpu 0 \
--parallel 1 \
--no-warmup \
--jinja \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--min-p 0.1 \
--presence-penalty 1.5 \
--alias glm \
-rtr \
-b 512 \
-c 100000 \
--override-tensor "\.(([13467][0-9])|(2[0-7])|(5[012])|(8[01]))\..*exps=CPU" \
--override-tensor "blk\.(1?|2|9)[0-9]\.=CUDA0" \
--override-tensor "blk\.(3|4|5)[0-9]\.=CUDA1" \
--override-tensor "blk\.(6|7|8)[0-9]\.=CUDA2" \
--model /mnt/store/ai/GLM4.6/GLM-4.6-IQ3_KS-00001-of-00004.gguf
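As a side note, the first --override-tensor rule can be sanity-checked offline. A minimal sketch, assuming block indices 0-92 and ffn_up_exps as a representative expert tensor name (both illustrative, not read from the GGUF):
# prints the tensor names whose blocks the exps=CPU rule would pin to CPU
for i in $(seq 0 92); do
  echo "blk.$i.ffn_up_exps.weight"
done | grep -E '\.(([13467][0-9])|(2[0-7])|(5[012])|(8[01]))\..*exps'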
Name and Version
$ build/bin/llama-server --version
version: 3946 (1cfd198)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output