Bug: Qwen3.5 tool calls not working #1363

@chulucninh09

Description

What happened?

While serving Qwen3.5, when I send a request with `tool_choice=required`, the completion contains only a single output token.

slot print_timing: id  1 | task 141 | 
prompt eval time =    1013.42 ms /   914 tokens (    1.11 ms per token,   901.90 tokens per second)
       eval time =       0.00 ms /     1 tokens (    0.00 ms per token, 1000000.00 tokens per second)

My request pattern is an agent loop: I append the AI message to the message history and send it back.
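For reference, a minimal sketch of that loop against the server's OpenAI-compatible `/v1/chat/completions` endpoint. The tool schema, `send()` transport, and tool result content are illustrative, not taken from my actual client; the point is only the history-append-and-resend pattern with `tool_choice="required"`.

```python
def build_request(messages, tools):
    """Payload for one agent-loop turn; tool_choice=required is the
    setting that reportedly yields a 1-token completion."""
    return {
        "model": "kCode",           # matches the --alias in the server args
        "messages": messages,
        "tools": tools,
        "tool_choice": "required",
    }

def agent_loop(user_prompt, tools, send, max_turns=4):
    """Append each assistant message to the history and re-send.
    `send` posts the payload to /v1/chat/completions and returns the
    parsed JSON response (transport left abstract here)."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        reply = send(build_request(messages, tools))["choices"][0]["message"]
        messages.append(reply)      # AI message goes back into the history
        if not reply.get("tool_calls"):
            break                   # no tool call requested; loop ends
        for call in reply["tool_calls"]:
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": "stub result",  # real tool output would go here
            })
    return messages
```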

These are the arguments passed to ik_llama:

-m /mnt/llm-data/huggingface/hub/Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf
-np 4
-c 360000
--temp 0.6
--top-p 0.8
--top-k 20
--presence-penalty 1.5
--min-p 0.00
--chat-template-kwargs '{"enable_thinking": false}'
-ngl 999

# Common config
--color
-sm layer
--mlock
--scheduler_async
-cram 32768
--ctx-checkpoints 256
--ctx-checkpoints-interval 2048
--host 0.0.0.0
--port 8000
--jinja
-fa on
--numa numactl
--alias kCode
-b 4096
-ub 2048
-cb
--no-context-shift
--defrag-thold 0.2
--slot-save-path ./slots
--reasoning-tokens auto

Name and Version

INFO [                    main] build info | tid="127061612023808" timestamp=1772656387 build=4257 commit="a903409a"

What operating system are you seeing the problem on?

No response

Relevant log output

INFO [                    main] build info | tid="127061612023808" timestamp=1772656387 build=4257 commit="a903409a"
INFO [                    main] system info | tid="127061612023808" timestamp=1772656387 n_threads=10 n_threads_batch=-1 total_threads=20 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
INFO [                    main] Running without SSL | tid="127061612023808" timestamp=1772656387
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
=============================== NCCL main communicator initialized
CUDA0: using device CUDA0 - 23739 MiB free
CUDA1: using device CUDA1 - 23739 MiB free
llama_model_loader: loaded meta data with 52 key-value pairs and 733 tensors from /mnt/llm-data/huggingface/hub/Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.5-35B-A3B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.5-35B-A3B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 35B-A3B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.5 35B A3B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                      qwen35moe.block_count u32              = 40
llama_model_loader: - kv  18:                   qwen35moe.context_length u32              = 262144
llama_model_loader: - kv  19:                 qwen35moe.embedding_length u32              = 2048
llama_model_loader: - kv  20:             qwen35moe.attention.head_count u32              = 16
llama_model_loader: - kv  21:          qwen35moe.attention.head_count_kv u32              = 2
llama_model_loader: - kv  22:          qwen35moe.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  23:                   qwen35moe.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  24: qwen35moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                     qwen35moe.expert_count u32              = 256
llama_model_loader: - kv  26:                qwen35moe.expert_used_count u32              = 8
llama_model_loader: - kv  27:             qwen35moe.attention.key_length u32              = 256
llama_model_loader: - kv  28:           qwen35moe.attention.value_length u32              = 256
llama_model_loader: - kv  29:       qwen35moe.expert_feed_forward_length u32              = 512
llama_model_loader: - kv  30: qwen35moe.expert_shared_feed_forward_length u32              = 512
llama_model_loader: - kv  31:                  qwen35moe.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  32:                   qwen35moe.ssm.state_size u32              = 128
llama_model_loader: - kv  33:                  qwen35moe.ssm.group_count u32              = 16
llama_model_loader: - kv  34:               qwen35moe.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  35:                   qwen35moe.ssm.inner_size u32              = 4096
llama_model_loader: - kv  36:          qwen35moe.full_attention_interval u32              = 4
llama_model_loader: - kv  37:             qwen35moe.rope.dimension_count u32              = 64
llama_model_loader: - kv  38:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  39:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  40:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  41:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  42:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  43:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  44:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  45:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  46:               general.quantization_version u32              = 2
llama_model_loader: - kv  47:                          general.file_type u32              = 7
llama_model_loader: - kv  48:                      quantize.imatrix.file str              = Qwen3.5-35B-A3B-GGUF/Qwen_Qwen3.5-35B...
llama_model_loader: - kv  49:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav5.txt
llama_model_loader: - kv  50:             quantize.imatrix.entries_count u32              = 510
llama_model_loader: - kv  51:              quantize.imatrix.chunks_count u32              = 802
llama_model_loader: - type  f32:  301 tensors
llama_model_loader: - type q8_0:   30 tensors
llama_model_loader: - type q5_K:  120 tensors
llama_model_loader: - type q6_K:   42 tensors
llama_model_loader: - type bf16:  240 tensors
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen35moe
llm_load_print_meta: n_ctx_train      = 262144
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 0
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 40
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 262144
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: mrope sections   = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv       = 4
llm_load_print_meta: ssm_d_inner      = 4096
llm_load_print_meta: ssm_d_state      = 128
llm_load_print_meta: ssm_dt_rank      = 32
llm_load_print_meta: ssm_n_group      = 16
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 34.661 B
llm_load_print_meta: model size       = 23.209 GiB (5.752 BPW) 
llm_load_print_meta: repeating layers = 22.432 GiB (5.727 BPW, 33.643 B parameters)
llm_load_print_meta: general.name     = Qwen3.5-35B-A3B
print_info: vocab type       = BPE
print_info: n_vocab          = 248320
print_info: n_merges         = 247587
print_info: BOS token        = 11 ','
print_info: EOS token        = 248046 '<|im_end|>'
print_info: EOT token        = 248046 '<|im_end|>'
print_info: PAD token        = 248055 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 248060 '<|fim_prefix|>'
print_info: FIM SUF token    = 248062 '<|fim_suffix|>'
print_info: FIM MID token    = 248061 '<|fim_middle|>'
print_info: FIM PAD token    = 248063 '<|fim_pad|>'
print_info: FIM REP token    = 248064 '<|repo_name|>'
print_info: FIM SEP token    = 248065 '<|file_sep|>'
print_info: EOG token        = 248044 '<|endoftext|>'
print_info: EOG token        = 248046 '<|im_end|>'
print_info: EOG token        = 248063 '<|fim_pad|>'
print_info: EOG token        = 248064 '<|repo_name|>'
print_info: EOG token        = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size =    3.20 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =   397.85 MiB
llm_load_tensors:      CUDA0 buffer size = 12061.14 MiB
llm_load_tensors:      CUDA1 buffer size = 11307.07 MiB
...................................................................................................
llama_init_from_model: n_ctx         = 360192
llama_init_from_model: n_batch       = 4096
llama_init_from_model: n_ubatch      = 2048
llama_init_from_model: flash_attn    = 1
llama_init_from_model: attn_max_b    = 0
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: rope_cache    = 0
llama_init_from_model: graph_reuse   = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type   = f16
llama_init_from_model: sched_async   = 1
llama_init_from_model: ser           = -1, 0
llama_init_from_model: freq_base     = 10000000.0
llama_init_from_model: freq_scale    = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3651.50 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  3634.75 MiB
llama_init_from_model: KV self size  = 7035.00 MiB, K (f16): 3517.50 MiB, V (f16): 3517.50 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     3.79 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=1)
llama_init_from_model:      CUDA0 compute buffer size =  1663.48 MiB
llama_init_from_model:      CUDA1 compute buffer size =  1956.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =  1423.05 MiB
llama_init_from_model: graph nodes  = 2905
llama_init_from_model: graph splits = 3
llama_init_from_model: enabling only_active_experts scheduling
fragmentation: 1.00
INFO [                    init] initializing slots | tid="127061612023808" timestamp=1772656404 n_slots=4
INFO [                    init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=0 n_ctx_slot=90048
srv          init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
no implementations specified for speculative decoding
INFO [                    init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=1 n_ctx_slot=90048
slot         init: id  0 | task -1 | speculative decoding context not initialized
srv          init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
INFO [                    init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=2 n_ctx_slot=90048
no implementations specified for speculative decoding
slot         init: id  1 | task -1 | speculative decoding context not initialized
srv          init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
no implementations specified for speculative decoding
INFO [                    init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=3 n_ctx_slot=90048
slot         init: id  2 | task -1 | speculative decoding context not initialized
srv          init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
no implementations specified for speculative decoding
slot         init: id  3 | task -1 | speculative decoding context not initialized
prompt cache is enabled, size limit: 32768 MiB
use `--cache-ram 0` to disable the prompt cache
INFO [                    main] model loaded | tid="127061612023808" timestamp=1772656405
INFO [                    main] chat template | tid="127061612023808" timestamp=1772656405 
INFO [                    main] HTTP server listening | tid="127061612023808" timestamp=1772656405 n_threads_http="19" port="8000" hostname="0.0.0.0"
INFO [              slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656405
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656420 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656420 id_slot=0 id_task=0 p0=0
slot create_check: id  0 | task 0 | created context checkpoint 1 of 256 (pos_min = 4095, pos_max = 4095, size = 62.845 MiB, took 447.53 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656422 id_slot=0 id_task=0 p0=4096
slot create_check: id  0 | task 0 | created context checkpoint 2 of 256 (pos_min = 8191, pos_max = 8191, size = 62.877 MiB, took 462.97 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656424 id_slot=0 id_task=0 p0=8192
slot create_check: id  0 | task 0 | created context checkpoint 3 of 256 (pos_min = 11911, pos_max = 11911, size = 62.905 MiB, took 421.51 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656425 id_slot=0 id_task=0 p0=11912
slot create_check: id  0 | task 0 | created context checkpoint 4 of 256 (pos_min = 11917, pos_max = 11917, size = 62.905 MiB, took 92.36 ms)
slot print_timing: id  0 | task 0 | 
prompt eval time =    5683.00 ms / 11917 tokens (    0.48 ms per token,  2096.95 tokens per second)
       eval time =    1314.94 ms /    51 tokens (   25.78 ms per token,    38.79 tokens per second)
      total time =    6997.94 ms / 11968 tokens
INFO [      log_server_request] request | tid="127045075435520" timestamp=1772656427 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  0 | task 0 | created context checkpoint 5 of 256 (pos_min = 11966, pos_max = 11966, size = 62.905 MiB, took 92.95 ms)
INFO [           release_slots] slot released | tid="127061612023808" timestamp=1772656427 id_slot=0 id_task=0 n_ctx=360192 n_past=11967 n_system_tokens=0 n_cache_tokens=11967 truncated=false
INFO [              slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656427
======== Prompt cache: cache size: 11967, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656428 id_slot=0 id_task=55
======== Cache: cache_size = 11967, n_past0 =  11967, n_past1 =  11967, n_past_prompt1 = 11967,  n_past2 =  11967, n_past_prompt2 =  11967
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656428 id_slot=0 id_task=55 p0=11967
slot create_check: id  0 | task 55 | created context checkpoint 6 of 256 (pos_min = 11989, pos_max = 11989, size = 62.906 MiB, took 100.06 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656428 id_slot=0 id_task=55 p0=11990
slot create_check: id  0 | task 55 | created context checkpoint 7 of 256 (pos_min = 11995, pos_max = 11995, size = 62.906 MiB, took 91.79 ms)
slot print_timing: id  0 | task 55 | 
prompt eval time =     210.66 ms /    28 tokens (    7.52 ms per token,   132.91 tokens per second)
       eval time =     665.15 ms /    25 tokens (   26.61 ms per token,    37.59 tokens per second)
      total time =     875.81 ms /    53 tokens
INFO [      log_server_request] request | tid="127045075435520" timestamp=1772656429 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  0 | task 55 | created context checkpoint 8 of 256 (pos_min = 12018, pos_max = 12018, size = 62.906 MiB, took 91.90 ms)
INFO [           release_slots] slot released | tid="127061612023808" timestamp=1772656429 id_slot=0 id_task=55 n_ctx=360192 n_past=12019 n_system_tokens=0 n_cache_tokens=12019 truncated=false
INFO [              slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656429
======== Prompt cache: cache size: 12019, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656431 id_slot=0 id_task=82
======== Cache: cache_size = 12019, n_past0 =  12019, n_past1 =  12019, n_past_prompt1 = 12019,  n_past2 =  12019, n_past_prompt2 =  12019
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656431 id_slot=0 id_task=82 p0=12019
slot create_check: id  0 | task 82 | created context checkpoint 9 of 256 (pos_min = 12142, pos_max = 12142, size = 62.907 MiB, took 153.42 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656431 id_slot=0 id_task=82 p0=12143
slot create_check: id  0 | task 82 | created context checkpoint 10 of 256 (pos_min = 12148, pos_max = 12148, size = 62.907 MiB, took 91.87 ms)
slot print_timing: id  0 | task 82 | 
prompt eval time =     337.83 ms /   129 tokens (    2.62 ms per token,   381.84 tokens per second)
       eval time =    1443.86 ms /    57 tokens (   25.33 ms per token,    39.48 tokens per second)
      total time =    1781.70 ms /   186 tokens
INFO [      log_server_request] request | tid="127045075435520" timestamp=1772656432 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  0 | task 82 | created context checkpoint 11 of 256 (pos_min = 12203, pos_max = 12203, size = 62.907 MiB, took 95.55 ms)
INFO [           release_slots] slot released | tid="127061612023808" timestamp=1772656432 id_slot=0 id_task=82 n_ctx=360192 n_past=12204 n_system_tokens=0 n_cache_tokens=12204 truncated=false
INFO [              slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656432
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656433 id_slot=1 id_task=141
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656433 id_slot=1 id_task=141 p0=0
slot create_check: id  1 | task 141 | created context checkpoint 1 of 256 (pos_min = 908, pos_max = 908, size = 62.821 MiB, took 322.89 ms)
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656433 id_slot=1 id_task=141 p0=909
slot print_timing: id  1 | task 141 | 
prompt eval time =    1013.42 ms /   914 tokens (    1.11 ms per token,   901.90 tokens per second)
       eval time =       0.00 ms /     1 tokens (    0.00 ms per token, 1000000.00 tokens per second)
      total time =    1013.42 ms /   915 tokens
INFO [      log_server_request] request | tid="127045075435520" timestamp=1772656434 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id  1 | task 141 | created context checkpoint 2 of 256 (pos_min = 913, pos_max = 913, size = 62.821 MiB, took 96.66 ms)
INFO [           release_slots] slot released | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=141 n_ctx=360192 n_past=914 n_system_tokens=0 n_cache_tokens=914 truncated=false
INFO [              slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656434
======== Prompt cache: cache size: 914, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144
======== Cache: cache_size = 914, n_past0 =  914, n_past1 =  914, n_past_prompt1 = 914,  n_past2 =  914, n_past_prompt2 =  914
INFO [    batch_pending_prompt] we have to evaluate at least 1 token to generate logits | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144 p0=913
slot print_timing: id  1 | task 144 | 
prompt eval time =     259.32 ms /     1 tokens (  259.32 ms per token,     3.86 tokens per second)
       eval time =       0.00 ms /     1 tokens (    0.00 ms per token, 1000000.00 tokens per second)
      total time =     259.32 ms /     2 tokens
INFO [      log_server_request] request | tid="127045075435520" timestamp=1772656434 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [           release_slots] slot released | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144 n_ctx=360192 n_past=914 n_system_tokens=0 n_cache_tokens=914 truncated=false
INFO [              slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656434
INFO [      log_server_request] request | tid="127045075435520" timestamp=1772656435 remote_addr="127.0.0.1" remote_port=58734 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="127045067042816" timestamp=1772656444 remote_addr="127.0.0.1" remote_port=50978 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="127045058650112" timestamp=1772656452 remote_addr="127.0.0.1" remote_port=54740 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="127045050257408" timestamp=1772656461 remote_addr="127.0.0.1" remote_port=54752 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [      log_server_request] request | tid="127045041864704" timestamp=1772656469 remote_addr="127.0.0.1" remote_port=55406 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [              slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656469
