Bug: Qwen3.5-35B-A3B-UD-Q6_K_XL - Unexpected empty grammar stack #1420

@milpster

Description

What happened?

I keep getting the following crash with Qwen3.5-35B-A3B-UD-Q6_K_XL. This is the command I use:

GGML_CUDA_GRAPH_OPT=1 USE_MLOCK=true /mnt2/srcds/ai/ik_llama.cpp/build/bin/llama-server --port 8009 -m /mnt2/srcds/ai/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf --ctx-size 262144 --threads-batch 11 --threads-draft 8 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence_penalty 0.0 --repeat-penalty 1.0 --jinja --no-mmap -fa on -khad -rtr -gr -ger -ngl 333 -b 1024 -ub 1024 -ot .ffn_.*_exps.=CPU -amb 256 --ctx-checkpoints 500 -mqkv -cram -1 --cache-type-k q8_0

Disabling context checkpoints avoids the crash, but then the full prompt is reprocessed on every request, which of course makes the server unusable for long contexts.
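For context, the crash message suggests the grammar sampler's parse stack and the restored context checkpoint go out of sync: the checkpoint rewinds the KV/context state, but if the grammar state is not rewound with it, the stack can already be exhausted when the next piece is accepted. The sketch below is a hypothetical illustration of that failure mode, not ik_llama.cpp's actual code; the class, method, and rule names are made up for the example.

```python
# Hypothetical sketch (NOT ik_llama.cpp code): a grammar matcher keeps a
# stack of pending symbols, and each accepted piece consumes from it.
# If generation is rewound via a checkpoint but the grammar stack is not
# restored to the matching position, the stack can be empty when the
# next piece arrives - reproducing the reported error.

class GrammarState:
    def __init__(self, rules):
        self.stack = list(rules)  # pending grammar symbols, top at end

    def accept_piece(self, piece):
        if not self.stack:
            raise RuntimeError(
                f"Unexpected empty grammar stack after accepting piece: {piece}")
        self.stack.pop()  # consume one pending symbol per accepted piece


g = GrammarState(["expr"])
g.accept_piece("=G")  # consumes the last pending symbol
# A checkpoint restore would need to reset g.stack here; skipping that
# and accepting the replayed piece again reproduces the crash:
try:
    g.accept_piece("=G")
except RuntimeError as e:
    print(e)
```

This matches the timeline in the log: the crash happens right after a checkpoint is restored and invalidated checkpoints are erased, i.e. while tokens are being replayed.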

Here is the log:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Laptop GPU, compute capability 8.6, VMM: yes, VRAM: 7840 MiB
INFO [ main] build info | tid="140609950765056" timestamp=1773409944 build=4283 commit="714329f4"
INFO [ main] system info | tid="140609950765056" timestamp=1773409944 n_threads=8 n_threads_batch=11 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
CUDA0: using device CUDA0 - 7610 MiB free
llama_model_loader: loaded meta data with 52 key-value pairs and 733 tensors from /mnt2/srcds/ai/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Qwen3.5-35B-A3B
llama_model_loader: - kv 6: general.basename str = Qwen3.5-35B-A3B
llama_model_loader: - kv 7: general.quantized_by str = Unsloth
llama_model_loader: - kv 8: general.size_label str = 35B-A3B
llama_model_loader: - kv 9: general.license str = apache-2.0
llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 12: general.base_model.count u32 = 1
llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.5 35B A3B
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 17: qwen35moe.block_count u32 = 40
llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen35moe.embedding_length u32 = 2048
llama_model_loader: - kv 20: qwen35moe.attention.head_count u32 = 16
llama_model_loader: - kv 21: qwen35moe.attention.head_count_kv u32 = 2
llama_model_loader: - kv 22: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 23: qwen35moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 24: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 256
llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 8
llama_model_loader: - kv 27: qwen35moe.attention.key_length u32 = 256
llama_model_loader: - kv 28: qwen35moe.attention.value_length u32 = 256
llama_model_loader: - kv 29: qwen35moe.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 30: qwen35moe.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 31: qwen35moe.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 32: qwen35moe.ssm.state_size u32 = 128
llama_model_loader: - kv 33: qwen35moe.ssm.group_count u32 = 16
llama_model_loader: - kv 34: qwen35moe.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 35: qwen35moe.ssm.inner_size u32 = 4096
llama_model_loader: - kv 36: qwen35moe.full_attention_interval u32 = 4
llama_model_loader: - kv 37: qwen35moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 39: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 42: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 248055
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 18
llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3.5-35B-A3B-GGUF/imatrix_unsloth....
llama_model_loader: - kv 49: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.5-35B-A3B.txt
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 510
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 76
llama_model_loader: - type f32: 301 tensors
llama_model_loader: - type f16: 90 tensors
llama_model_loader: - type q8_0: 264 tensors
llama_model_loader: - type q6_K: 78 tensors
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen35moe
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 0
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 40
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: mrope sections = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 4096
llm_load_print_meta: ssm_d_state = 128
llm_load_print_meta: ssm_dt_rank = 32
llm_load_print_meta: ssm_n_group = 16
llm_load_print_meta: model type = 35B.A3B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 34.661 B
llm_load_print_meta: model size = 29.859 GiB (7.400 BPW)
llm_load_print_meta: repeating layers = 28.853 GiB (7.367 BPW, 33.643 B parameters)
llm_load_print_meta: general.name = Qwen3.5-35B-A3B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248055 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 0.63 MiB
Tensor blk.0.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.0.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.0.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.1.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.1.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.1.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.2.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.2.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.2.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.3.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 27804.00 MiB
llm_load_tensors: CUDA_Host buffer size = 515.31 MiB
llm_load_tensors: CUDA0 buffer size = 2256.30 MiB
.................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
============ Repacked 120 tensors
llama_init_from_model: n_ctx = 262144
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 1024
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 256
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 1
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3982.82 MiB
llama_init_from_model: KV self size = 3920.00 MiB, K (q8_0): 1360.00 MiB, V (f16): 2560.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 978.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 601.03 MiB
llama_init_from_model: graph nodes = 2785
llama_init_from_model: graph splits = 82
llama_init_from_model: enabling only_active_experts scheduling
INFO [ init] initializing slots | tid="140609950765056" timestamp=1773409953 n_slots=1
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: , end:
use --reasoning-tokens none to disable.
INFO [ init] new slot | tid="140609950765056" timestamp=1773409953 id_slot=0 n_ctx_slot=262144
no implementations specified for speculative decoding
slot init: id 0 | task -1 | speculative decoding context not initialized
prompt cache is enabled, size limit: no limit
use --cache-ram 0 to disable the prompt cache
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

'
srv init: init: chat template, thinking = 1
INFO [ main] model loaded | tid="140609950765056" timestamp=1773409953
INFO [ main] HTTP server listening | tid="140609950765056" timestamp=1773409953 n_threads_http="15" port="8009" hostname="127.0.0.1"
INFO [ slots_idle] all slots are idle | tid="140609950765056" timestamp=1773409953
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
Recurrent model does not support banned strings.
INFO [ launch_slot_with_task] slot is processing task | tid="140609950765056" timestamp=1773409953 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409953 id_slot=0 id_task=0 p0=0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 500 (pos_min = 1023, pos_max = 1023, size = 62.822 MiB, took 20.97 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409956 id_slot=0 id_task=0 p0=1024
slot create_check: id 0 | task 0 | created context checkpoint 2 of 500 (pos_min = 2047, pos_max = 2047, size = 62.830 MiB, took 20.71 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409960 id_slot=0 id_task=0 p0=2048
slot create_check: id 0 | task 0 | created context checkpoint 3 of 500 (pos_min = 3071, pos_max = 3071, size = 62.837 MiB, took 20.54 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409963 id_slot=0 id_task=0 p0=3072
slot create_check: id 0 | task 0 | created context checkpoint 4 of 500 (pos_min = 4095, pos_max = 4095, size = 62.845 MiB, took 18.85 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409967 id_slot=0 id_task=0 p0=4096
slot create_check: id 0 | task 0 | created context checkpoint 5 of 500 (pos_min = 5119, pos_max = 5119, size = 62.853 MiB, took 18.84 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409970 id_slot=0 id_task=0 p0=5120
slot create_check: id 0 | task 0 | created context checkpoint 6 of 500 (pos_min = 6143, pos_max = 6143, size = 62.861 MiB, took 18.99 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409973 id_slot=0 id_task=0 p0=6144
slot create_check: id 0 | task 0 | created context checkpoint 7 of 500 (pos_min = 7167, pos_max = 7167, size = 62.869 MiB, took 23.34 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409977 id_slot=0 id_task=0 p0=7168
slot create_check: id 0 | task 0 | created context checkpoint 8 of 500 (pos_min = 8191, pos_max = 8191, size = 62.877 MiB, took 20.82 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409980 id_slot=0 id_task=0 p0=8192
slot create_check: id 0 | task 0 | created context checkpoint 9 of 500 (pos_min = 9215, pos_max = 9215, size = 62.884 MiB, took 19.23 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409984 id_slot=0 id_task=0 p0=9216
slot create_check: id 0 | task 0 | created context checkpoint 10 of 500 (pos_min = 10239, pos_max = 10239, size = 62.892 MiB, took 21.16 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409987 id_slot=0 id_task=0 p0=10240
slot create_check: id 0 | task 0 | created context checkpoint 11 of 500 (pos_min = 11263, pos_max = 11263, size = 62.900 MiB, took 20.07 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409991 id_slot=0 id_task=0 p0=11264
slot create_check: id 0 | task 0 | created context checkpoint 12 of 500 (pos_min = 12287, pos_max = 12287, size = 62.908 MiB, took 18.45 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409994 id_slot=0 id_task=0 p0=12288
slot create_check: id 0 | task 0 | created context checkpoint 13 of 500 (pos_min = 13311, pos_max = 13311, size = 62.916 MiB, took 18.93 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409998 id_slot=0 id_task=0 p0=13312
slot create_check: id 0 | task 0 | created context checkpoint 14 of 500 (pos_min = 14335, pos_max = 14335, size = 62.923 MiB, took 19.62 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410001 id_slot=0 id_task=0 p0=14336
slot create_check: id 0 | task 0 | created context checkpoint 15 of 500 (pos_min = 15359, pos_max = 15359, size = 62.931 MiB, took 19.80 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410004 id_slot=0 id_task=0 p0=15360
slot create_check: id 0 | task 0 | created context checkpoint 16 of 500 (pos_min = 16383, pos_max = 16383, size = 62.939 MiB, took 19.59 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410008 id_slot=0 id_task=0 p0=16384
slot create_check: id 0 | task 0 | created context checkpoint 17 of 500 (pos_min = 17407, pos_max = 17407, size = 62.947 MiB, took 19.07 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410011 id_slot=0 id_task=0 p0=17408
slot create_check: id 0 | task 0 | created context checkpoint 18 of 500 (pos_min = 18431, pos_max = 18431, size = 62.955 MiB, took 19.20 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410015 id_slot=0 id_task=0 p0=18432
slot create_check: id 0 | task 0 | created context checkpoint 19 of 500 (pos_min = 19455, pos_max = 19455, size = 62.962 MiB, took 19.80 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410018 id_slot=0 id_task=0 p0=19456
slot create_check: id 0 | task 0 | created context checkpoint 20 of 500 (pos_min = 20417, pos_max = 20417, size = 62.970 MiB, took 22.72 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410022 id_slot=0 id_task=0 p0=20418
slot create_check: id 0 | task 0 | created context checkpoint 21 of 500 (pos_min = 20423, pos_max = 20423, size = 62.970 MiB, took 19.37 ms)
slot print_timing: id 0 | task 0 |
prompt eval time = 69234.02 ms / 20423 tokens ( 3.39 ms per token, 294.99 tokens per second)
eval time = 6899.52 ms / 149 tokens ( 46.31 ms per token, 21.60 tokens per second)
total time = 76133.54 ms / 20572 tokens
INFO [ log_server_request] request | tid="140608396042240" timestamp=1773410029 remote_addr="127.0.0.1" remote_port=54182 status=200 method="POST" path="/v1/messages" params={"beta":"true"}
slot create_check: id 0 | task 0 | created context checkpoint 22 of 500 (pos_min = 20570, pos_max = 20570, size = 62.971 MiB, took 27.48 ms)
INFO [ release_slots] slot released | tid="140609950765056" timestamp=1773410029 id_slot=0 id_task=0 n_ctx=262144 n_past=20571 n_system_tokens=0 n_cache_tokens=20571 truncated=false
INFO [ slots_idle] all slots are idle | tid="140609950765056" timestamp=1773410029
======== Prompt cache: cache size: 20571, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
Recurrent model does not support banned strings.
INFO [ launch_slot_with_task] slot is processing task | tid="140609950765056" timestamp=1773410029 id_slot=0 id_task=170
======== Cache: cache_size = 20571, n_past0 = 20422, n_past1 = 20422, n_past_prompt1 = 20422, n_past2 = 20423, n_past_prompt2 = 20423
Common part does not match fully
cache : <|im_start|>assistant

I need to explore the codebase structure first to understand how ports are currently configured and used, so I'll start by examining
prompt: <|im_start|>assistant

I'll explore the codebase to understand how RTSP and ONVIF ports are currently handled, then plan how
slot apply_checkp: id 0 | task 170 | n_past = 20422, slot.prompt.tokens.size() = 20571, seq_id = 0, pos_min = 20570
slot apply_checkp: id 0 | task 170 | restored context checkpoint took 15.80 ms (pos_min = 20417, pos_max = 20417, size = 62.970 MiB)
slot apply_checkp: id 0 | task 170 | erased invalidated context checkpoint (pos_min = 20423, pos_max = 20423, size = 62.970 MiB)
slot apply_checkp: id 0 | task 170 | erased invalidated context checkpoint (pos_min = 20570, pos_max = 20570, size = 62.971 MiB)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410029 id_slot=0 id_task=170 p0=20418
slot create_check: id 0 | task 170 | created context checkpoint 21 of 500 (pos_min = 20560, pos_max = 20560, size = 62.971 MiB, took 20.65 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410030 id_slot=0 id_task=170 p0=20561
slot create_check: id 0 | task 170 | created context checkpoint 22 of 500 (pos_min = 20566, pos_max = 20566, size = 62.971 MiB, took 19.90 ms)
terminate called after throwing an instance of 'std::runtime_error'
what(): Unexpected empty grammar stack after accepting piece: =G (88838)
Aborted (core dumped)

Name and Version

version: 4283 (714329f)
built with cc (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

No response
