Bug: Qwen3.5-35B-A3B-UD-Q6_K_XL - Unexpected empty grammar stack #1420
Description
What happened?
I keep getting the following crash with Qwen3.5-35B-A3B-UD-Q6_K_XL. This is the command I use:
GGML_CUDA_GRAPH_OPT=1 USE_MLOCK=true /mnt2/srcds/ai/ik_llama.cpp/build/bin/llama-server --port 8009 -m /mnt2/srcds/ai/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf --ctx-size 262144 --threads-batch 11 --threads-draft 8 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence_penalty 0.0 --repeat-penalty 1.0 --jinja --no-mmap -fa on -khad -rtr -gr -ger -ngl 333 -b 1024 -ub 1024 -ot .ffn_.*_exps.=CPU -amb 256 --ctx-checkpoints 500 -mqkv -cram -1 --cache-type-k q8_0
Disabling context checkpoints avoids the crash, but of course the full prompt is then reprocessed on every request, which makes the server unusable.
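For reference, this is the workaround launch I mean: the same command with checkpoints turned off (I'm assuming `--ctx-checkpoints 0` is what disables them; the only change from the command above is that flag's value):

```shell
# Identical to the crashing invocation, except context checkpoints are
# disabled (assumed: a value of 0 turns the feature off). This avoids the
# "Unexpected empty grammar stack" abort, at the cost of reprocessing the
# prompt from scratch every time.
GGML_CUDA_GRAPH_OPT=1 USE_MLOCK=true \
/mnt2/srcds/ai/ik_llama.cpp/build/bin/llama-server \
  --port 8009 \
  -m /mnt2/srcds/ai/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf \
  --ctx-size 262144 --threads-batch 11 --threads-draft 8 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence_penalty 0.0 --repeat-penalty 1.0 \
  --jinja --no-mmap -fa on -khad -rtr -gr -ger \
  -ngl 333 -b 1024 -ub 1024 -ot .ffn_.*_exps.=CPU -amb 256 \
  --ctx-checkpoints 0 -mqkv -cram -1 --cache-type-k q8_0
```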
Here is the log:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Laptop GPU, compute capability 8.6, VMM: yes, VRAM: 7840 MiB
INFO [ main] build info | tid="140609950765056" timestamp=1773409944 build=4283 commit="714329f4"
INFO [ main] system info | tid="140609950765056" timestamp=1773409944 n_threads=8 n_threads_batch=11 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
CUDA0: using device CUDA0 - 7610 MiB free
llama_model_loader: loaded meta data with 52 key-value pairs and 733 tensors from /mnt2/srcds/ai/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Qwen3.5-35B-A3B
llama_model_loader: - kv 6: general.basename str = Qwen3.5-35B-A3B
llama_model_loader: - kv 7: general.quantized_by str = Unsloth
llama_model_loader: - kv 8: general.size_label str = 35B-A3B
llama_model_loader: - kv 9: general.license str = apache-2.0
llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 12: general.base_model.count u32 = 1
llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.5 35B A3B
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 17: qwen35moe.block_count u32 = 40
llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen35moe.embedding_length u32 = 2048
llama_model_loader: - kv 20: qwen35moe.attention.head_count u32 = 16
llama_model_loader: - kv 21: qwen35moe.attention.head_count_kv u32 = 2
llama_model_loader: - kv 22: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 23: qwen35moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 24: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 256
llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 8
llama_model_loader: - kv 27: qwen35moe.attention.key_length u32 = 256
llama_model_loader: - kv 28: qwen35moe.attention.value_length u32 = 256
llama_model_loader: - kv 29: qwen35moe.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 30: qwen35moe.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 31: qwen35moe.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 32: qwen35moe.ssm.state_size u32 = 128
llama_model_loader: - kv 33: qwen35moe.ssm.group_count u32 = 16
llama_model_loader: - kv 34: qwen35moe.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 35: qwen35moe.ssm.inner_size u32 = 4096
llama_model_loader: - kv 36: qwen35moe.full_attention_interval u32 = 4
llama_model_loader: - kv 37: qwen35moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 39: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 42: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 248055
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 18
llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3.5-35B-A3B-GGUF/imatrix_unsloth....
llama_model_loader: - kv 49: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.5-35B-A3B.txt
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 510
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 76
llama_model_loader: - type f32: 301 tensors
llama_model_loader: - type f16: 90 tensors
llama_model_loader: - type q8_0: 264 tensors
llama_model_loader: - type q6_K: 78 tensors
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen35moe
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 0
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 40
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: mrope sections = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 4096
llm_load_print_meta: ssm_d_state = 128
llm_load_print_meta: ssm_dt_rank = 32
llm_load_print_meta: ssm_n_group = 16
llm_load_print_meta: model type = 35B.A3B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 34.661 B
llm_load_print_meta: model size = 29.859 GiB (7.400 BPW)
llm_load_print_meta: repeating layers = 28.853 GiB (7.367 BPW, 33.643 B parameters)
llm_load_print_meta: general.name = Qwen3.5-35B-A3B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248055 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 0.63 MiB
Tensor blk.0.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.0.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.0.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.1.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.1.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.1.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.2.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.2.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.2.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.3.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.3.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.3.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight (size = 210.00 MiB) buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight (size = 272.00 MiB) buffer type overriden to CPU
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 27804.00 MiB
llm_load_tensors: CUDA_Host buffer size = 515.31 MiB
llm_load_tensors: CUDA0 buffer size = 2256.30 MiB
.................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
============ Repacked 120 tensors
llama_init_from_model: n_ctx = 262144
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 1024
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 256
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 1
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 1
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3982.82 MiB
llama_init_from_model: KV self size = 3920.00 MiB, K (q8_0): 1360.00 MiB, V (f16): 2560.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 978.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 601.03 MiB
llama_init_from_model: graph nodes = 2785
llama_init_from_model: graph splits = 82
llama_init_from_model: enabling only_active_experts scheduling
INFO [ init] initializing slots | tid="140609950765056" timestamp=1773409953 n_slots=1
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: , end:
use --reasoning-tokens none to disable.
INFO [ init] new slot | tid="140609950765056" timestamp=1773409953 id_slot=0 n_ctx_slot=262144
no implementations specified for speculative decoding
slot init: id 0 | task -1 | speculative decoding context not initialized
prompt cache is enabled, size limit: no limit
use --cache-ram 0 to disable the prompt cache
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
srv init: init: chat template, thinking = 1
INFO [ main] model loaded | tid="140609950765056" timestamp=1773409953
INFO [ main] HTTP server listening | tid="140609950765056" timestamp=1773409953 n_threads_http="15" port="8009" hostname="127.0.0.1"
INFO [ slots_idle] all slots are idle | tid="140609950765056" timestamp=1773409953
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
Recurrent model does not support banned strings.
INFO [ launch_slot_with_task] slot is processing task | tid="140609950765056" timestamp=1773409953 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409953 id_slot=0 id_task=0 p0=0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 500 (pos_min = 1023, pos_max = 1023, size = 62.822 MiB, took 20.97 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409956 id_slot=0 id_task=0 p0=1024
slot create_check: id 0 | task 0 | created context checkpoint 2 of 500 (pos_min = 2047, pos_max = 2047, size = 62.830 MiB, took 20.71 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409960 id_slot=0 id_task=0 p0=2048
slot create_check: id 0 | task 0 | created context checkpoint 3 of 500 (pos_min = 3071, pos_max = 3071, size = 62.837 MiB, took 20.54 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409963 id_slot=0 id_task=0 p0=3072
slot create_check: id 0 | task 0 | created context checkpoint 4 of 500 (pos_min = 4095, pos_max = 4095, size = 62.845 MiB, took 18.85 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409967 id_slot=0 id_task=0 p0=4096
slot create_check: id 0 | task 0 | created context checkpoint 5 of 500 (pos_min = 5119, pos_max = 5119, size = 62.853 MiB, took 18.84 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409970 id_slot=0 id_task=0 p0=5120
slot create_check: id 0 | task 0 | created context checkpoint 6 of 500 (pos_min = 6143, pos_max = 6143, size = 62.861 MiB, took 18.99 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409973 id_slot=0 id_task=0 p0=6144
slot create_check: id 0 | task 0 | created context checkpoint 7 of 500 (pos_min = 7167, pos_max = 7167, size = 62.869 MiB, took 23.34 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409977 id_slot=0 id_task=0 p0=7168
slot create_check: id 0 | task 0 | created context checkpoint 8 of 500 (pos_min = 8191, pos_max = 8191, size = 62.877 MiB, took 20.82 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409980 id_slot=0 id_task=0 p0=8192
slot create_check: id 0 | task 0 | created context checkpoint 9 of 500 (pos_min = 9215, pos_max = 9215, size = 62.884 MiB, took 19.23 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409984 id_slot=0 id_task=0 p0=9216
slot create_check: id 0 | task 0 | created context checkpoint 10 of 500 (pos_min = 10239, pos_max = 10239, size = 62.892 MiB, took 21.16 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409987 id_slot=0 id_task=0 p0=10240
slot create_check: id 0 | task 0 | created context checkpoint 11 of 500 (pos_min = 11263, pos_max = 11263, size = 62.900 MiB, took 20.07 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409991 id_slot=0 id_task=0 p0=11264
slot create_check: id 0 | task 0 | created context checkpoint 12 of 500 (pos_min = 12287, pos_max = 12287, size = 62.908 MiB, took 18.45 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409994 id_slot=0 id_task=0 p0=12288
slot create_check: id 0 | task 0 | created context checkpoint 13 of 500 (pos_min = 13311, pos_max = 13311, size = 62.916 MiB, took 18.93 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773409998 id_slot=0 id_task=0 p0=13312
slot create_check: id 0 | task 0 | created context checkpoint 14 of 500 (pos_min = 14335, pos_max = 14335, size = 62.923 MiB, took 19.62 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410001 id_slot=0 id_task=0 p0=14336
slot create_check: id 0 | task 0 | created context checkpoint 15 of 500 (pos_min = 15359, pos_max = 15359, size = 62.931 MiB, took 19.80 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410004 id_slot=0 id_task=0 p0=15360
slot create_check: id 0 | task 0 | created context checkpoint 16 of 500 (pos_min = 16383, pos_max = 16383, size = 62.939 MiB, took 19.59 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410008 id_slot=0 id_task=0 p0=16384
slot create_check: id 0 | task 0 | created context checkpoint 17 of 500 (pos_min = 17407, pos_max = 17407, size = 62.947 MiB, took 19.07 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410011 id_slot=0 id_task=0 p0=17408
slot create_check: id 0 | task 0 | created context checkpoint 18 of 500 (pos_min = 18431, pos_max = 18431, size = 62.955 MiB, took 19.20 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410015 id_slot=0 id_task=0 p0=18432
slot create_check: id 0 | task 0 | created context checkpoint 19 of 500 (pos_min = 19455, pos_max = 19455, size = 62.962 MiB, took 19.80 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410018 id_slot=0 id_task=0 p0=19456
slot create_check: id 0 | task 0 | created context checkpoint 20 of 500 (pos_min = 20417, pos_max = 20417, size = 62.970 MiB, took 22.72 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410022 id_slot=0 id_task=0 p0=20418
slot create_check: id 0 | task 0 | created context checkpoint 21 of 500 (pos_min = 20423, pos_max = 20423, size = 62.970 MiB, took 19.37 ms)
slot print_timing: id 0 | task 0 |
prompt eval time = 69234.02 ms / 20423 tokens ( 3.39 ms per token, 294.99 tokens per second)
eval time = 6899.52 ms / 149 tokens ( 46.31 ms per token, 21.60 tokens per second)
total time = 76133.54 ms / 20572 tokens
INFO [ log_server_request] request | tid="140608396042240" timestamp=1773410029 remote_addr="127.0.0.1" remote_port=54182 status=200 method="POST" path="/v1/messages" params={"beta":"true"}
slot create_check: id 0 | task 0 | created context checkpoint 22 of 500 (pos_min = 20570, pos_max = 20570, size = 62.971 MiB, took 27.48 ms)
INFO [ release_slots] slot released | tid="140609950765056" timestamp=1773410029 id_slot=0 id_task=0 n_ctx=262144 n_past=20571 n_system_tokens=0 n_cache_tokens=20571 truncated=false
INFO [ slots_idle] all slots are idle | tid="140609950765056" timestamp=1773410029
======== Prompt cache: cache size: 20571, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
Recurrent model does not support banned strings.
INFO [ launch_slot_with_task] slot is processing task | tid="140609950765056" timestamp=1773410029 id_slot=0 id_task=170
======== Cache: cache_size = 20571, n_past0 = 20422, n_past1 = 20422, n_past_prompt1 = 20422, n_past2 = 20423, n_past_prompt2 = 20423
Common part does not match fully
cache : <|im_start|>assistant
I need to explore the codebase structure first to understand how ports are currently configured and used, so I'll start by examining
prompt: <|im_start|>assistant
I'll explore the codebase to understand how RTSP and ONVIF ports are currently handled, then plan how
slot apply_checkp: id 0 | task 170 | n_past = 20422, slot.prompt.tokens.size() = 20571, seq_id = 0, pos_min = 20570
slot apply_checkp: id 0 | task 170 | restored context checkpoint took 15.80 ms (pos_min = 20417, pos_max = 20417, size = 62.970 MiB)
slot apply_checkp: id 0 | task 170 | erased invalidated context checkpoint (pos_min = 20423, pos_max = 20423, size = 62.970 MiB)
slot apply_checkp: id 0 | task 170 | erased invalidated context checkpoint (pos_min = 20570, pos_max = 20570, size = 62.971 MiB)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410029 id_slot=0 id_task=170 p0=20418
slot create_check: id 0 | task 170 | created context checkpoint 21 of 500 (pos_min = 20560, pos_max = 20560, size = 62.971 MiB, took 20.65 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140609950765056" timestamp=1773410030 id_slot=0 id_task=170 p0=20561
slot create_check: id 0 | task 170 | created context checkpoint 22 of 500 (pos_min = 20566, pos_max = 20566, size = 62.971 MiB, took 19.90 ms)
terminate called after throwing an instance of 'std::runtime_error'
what(): Unexpected empty grammar stack after accepting piece: =G (88838)
Aborted (core dumped)
Name and Version
version: 4283 (714329f)
built with cc (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
No response