Bug: Qwen3.5 tool calls not working #1363
Description
What happened?
While serving Qwen3.5, requests sent with tool_choice=required return only a single output token:
slot print_timing: id 1 | task 141 |
prompt eval time = 1013.42 ms / 914 tokens ( 1.11 ms per token, 901.90 tokens per second)
eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, 1000000.00 tokens per second)
My request pattern is an agent loop: I append the assistant message to the message history and send the conversation back.
These are the arguments passed to ik_llama:
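To make the failing request pattern concrete, here is a minimal sketch of the payload an agent loop like this would send. The tool schema (`get_weather`), the model alias, and the exact message shapes are illustrative assumptions, not taken from the report; only the `tool_choice="required"` field and the append-and-resend loop come from the description above.

```python
import json

# Hypothetical tool schema -- an assumption for illustration, not from the report.
TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload(history):
    """Build an OpenAI-style /v1/chat/completions body with tool_choice=required."""
    return {
        "model": "kCode",           # matches the --alias below
        "messages": history,
        "tools": [TOOL],
        "tool_choice": "required",  # the setting that triggers the 1-token output
        "temperature": 0.6,
    }


history = [{"role": "user", "content": "What's the weather in Tokyo?"}]
payload = build_payload(history)

# In the agent loop, the assistant reply (with any tool_calls) is appended
# to the history and the same payload shape is POSTed again:
history.append({"role": "assistant", "content": "", "tool_calls": []})
next_payload = build_payload(history)
print(json.dumps(next_payload, indent=2))
```

Each iteration POSTs the body to the server's /v1/chat/completions endpoint; the second-and-later requests (with the appended assistant message) are the ones that return eval time = 0.00 ms / 1 tokens and eventually HTTP 500, per the logs below.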
-m /mnt/llm-data/huggingface/hub/Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf
-np 4
-c 360000
--temp 0.6
--top-p 0.8
--top-k 20
--presence-penalty 1.5
--min-p 0.00
--chat-template-kwargs '{"enable_thinking": false}'
-ngl 999
# Common config
--color
-sm layer
--mlock
--scheduler_async
-cram 32768
--ctx-checkpoints 256
--ctx-checkpoints-interval 2048
--host 0.0.0.0
--port 8000
--jinja
-fa on
--numa numactl
--alias kCode
-b 4096
-ub 2048
-cb
--no-context-shift
--defrag-thold 0.2
--slot-save-path ./slots
--reasoning-tokens auto
Name and Version
INFO [ main] build info | tid="127061612023808" timestamp=1772656387 build=4257 commit="a903409a"
What operating system are you seeing the problem on?
No response
Relevant log output
INFO [ main] build info | tid="127061612023808" timestamp=1772656387 build=4257 commit="a903409a"
INFO [ main] system info | tid="127061612023808" timestamp=1772656387 n_threads=10 n_threads_batch=-1 total_threads=20 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
INFO [ main] Running without SSL | tid="127061612023808" timestamp=1772656387
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
=============================== NCCL main communicator initialized
CUDA0: using device CUDA0 - 23739 MiB free
CUDA1: using device CUDA1 - 23739 MiB free
llama_model_loader: loaded meta data with 52 key-value pairs and 733 tensors from /mnt/llm-data/huggingface/hub/Qwen3.5-35B-A3B-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Qwen3.5-35B-A3B
llama_model_loader: - kv 6: general.basename str = Qwen3.5-35B-A3B
llama_model_loader: - kv 7: general.quantized_by str = Unsloth
llama_model_loader: - kv 8: general.size_label str = 35B-A3B
llama_model_loader: - kv 9: general.license str = apache-2.0
llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 12: general.base_model.count u32 = 1
llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.5 35B A3B
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.5-3...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv 17: qwen35moe.block_count u32 = 40
llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen35moe.embedding_length u32 = 2048
llama_model_loader: - kv 20: qwen35moe.attention.head_count u32 = 16
llama_model_loader: - kv 21: qwen35moe.attention.head_count_kv u32 = 2
llama_model_loader: - kv 22: qwen35moe.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 23: qwen35moe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 24: qwen35moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 256
llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 8
llama_model_loader: - kv 27: qwen35moe.attention.key_length u32 = 256
llama_model_loader: - kv 28: qwen35moe.attention.value_length u32 = 256
llama_model_loader: - kv 29: qwen35moe.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 30: qwen35moe.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 31: qwen35moe.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 32: qwen35moe.ssm.state_size u32 = 128
llama_model_loader: - kv 33: qwen35moe.ssm.group_count u32 = 16
llama_model_loader: - kv 34: qwen35moe.ssm.time_step_rank u32 = 32
llama_model_loader: - kv 35: qwen35moe.ssm.inner_size u32 = 4096
llama_model_loader: - kv 36: qwen35moe.full_attention_interval u32 = 4
llama_model_loader: - kv 37: qwen35moe.rope.dimension_count u32 = 64
llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 39: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,248320] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 42: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 248055
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 7
llama_model_loader: - kv 48: quantize.imatrix.file str = Qwen3.5-35B-A3B-GGUF/Qwen_Qwen3.5-35B...
llama_model_loader: - kv 49: quantize.imatrix.dataset str = /training_dir/calibration_datav5.txt
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 510
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 802
llama_model_loader: - type f32: 301 tensors
llama_model_loader: - type q8_0: 30 tensors
llama_model_loader: - type q5_K: 120 tensors
llama_model_loader: - type q6_K: 42 tensors
llama_model_loader: - type bf16: 240 tensors
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen35moe
llm_load_print_meta: n_ctx_train = 262144
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 2
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 0
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 40
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 262144
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: mrope sections = [11, 11, 10, 0]
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 4096
llm_load_print_meta: ssm_d_state = 128
llm_load_print_meta: ssm_dt_rank = 32
llm_load_print_meta: ssm_n_group = 16
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 34.661 B
llm_load_print_meta: model size = 23.209 GiB (5.752 BPW)
llm_load_print_meta: repeating layers = 22.432 GiB (5.727 BPW, 33.643 B parameters)
llm_load_print_meta: general.name = Qwen3.5-35B-A3B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248055 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 3.20 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 397.85 MiB
llm_load_tensors: CUDA0 buffer size = 12061.14 MiB
llm_load_tensors: CUDA1 buffer size = 11307.07 MiB
...................................................................................................
llama_init_from_model: n_ctx = 360192
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 2048
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 1
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3651.50 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 3634.75 MiB
llama_init_from_model: KV self size = 7035.00 MiB, K (f16): 3517.50 MiB, V (f16): 3517.50 MiB
llama_init_from_model: CUDA_Host output buffer size = 3.79 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=1)
llama_init_from_model: CUDA0 compute buffer size = 1663.48 MiB
llama_init_from_model: CUDA1 compute buffer size = 1956.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1423.05 MiB
llama_init_from_model: graph nodes = 2905
llama_init_from_model: graph splits = 3
llama_init_from_model: enabling only_active_experts scheduling
fragmentation: 1.00
INFO [ init] initializing slots | tid="127061612023808" timestamp=1772656404 n_slots=4
INFO [ init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=0 n_ctx_slot=90048
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
no implementations specified for speculative decoding
INFO [ init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=1 n_ctx_slot=90048
slot init: id 0 | task -1 | speculative decoding context not initialized
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
INFO [ init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=2 n_ctx_slot=90048
no implementations specified for speculative decoding
slot init: id 1 | task -1 | speculative decoding context not initialized
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
no implementations specified for speculative decoding
INFO [ init] new slot | tid="127061612023808" timestamp=1772656404 id_slot=3 n_ctx_slot=90048
slot init: id 2 | task -1 | speculative decoding context not initialized
srv init: Exclude reasoning tokens when selecting slot based on similarity: start: <think>, end: </think>
use `--reasoning-tokens none` to disable.
fragmentation: 0.99
no implementations specified for speculative decoding
slot init: id 3 | task -1 | speculative decoding context not initialized
prompt cache is enabled, size limit: 32768 MiB
use `--cache-ram 0` to disable the prompt cache
INFO [ main] model loaded | tid="127061612023808" timestamp=1772656405
INFO [ main] chat template | tid="127061612023808" timestamp=1772656405
INFO [ main] HTTP server listening | tid="127061612023808" timestamp=1772656405 n_threads_http="19" port="8000" hostname="0.0.0.0"
INFO [ slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656405
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656420 id_slot=0 id_task=0
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656420 id_slot=0 id_task=0 p0=0
slot create_check: id 0 | task 0 | created context checkpoint 1 of 256 (pos_min = 4095, pos_max = 4095, size = 62.845 MiB, took 447.53 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656422 id_slot=0 id_task=0 p0=4096
slot create_check: id 0 | task 0 | created context checkpoint 2 of 256 (pos_min = 8191, pos_max = 8191, size = 62.877 MiB, took 462.97 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656424 id_slot=0 id_task=0 p0=8192
slot create_check: id 0 | task 0 | created context checkpoint 3 of 256 (pos_min = 11911, pos_max = 11911, size = 62.905 MiB, took 421.51 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656425 id_slot=0 id_task=0 p0=11912
slot create_check: id 0 | task 0 | created context checkpoint 4 of 256 (pos_min = 11917, pos_max = 11917, size = 62.905 MiB, took 92.36 ms)
slot print_timing: id 0 | task 0 |
prompt eval time = 5683.00 ms / 11917 tokens ( 0.48 ms per token, 2096.95 tokens per second)
eval time = 1314.94 ms / 51 tokens ( 25.78 ms per token, 38.79 tokens per second)
total time = 6997.94 ms / 11968 tokens
INFO [ log_server_request] request | tid="127045075435520" timestamp=1772656427 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 0 | task 0 | created context checkpoint 5 of 256 (pos_min = 11966, pos_max = 11966, size = 62.905 MiB, took 92.95 ms)
INFO [ release_slots] slot released | tid="127061612023808" timestamp=1772656427 id_slot=0 id_task=0 n_ctx=360192 n_past=11967 n_system_tokens=0 n_cache_tokens=11967 truncated=false
INFO [ slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656427
======== Prompt cache: cache size: 11967, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656428 id_slot=0 id_task=55
======== Cache: cache_size = 11967, n_past0 = 11967, n_past1 = 11967, n_past_prompt1 = 11967, n_past2 = 11967, n_past_prompt2 = 11967
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656428 id_slot=0 id_task=55 p0=11967
slot create_check: id 0 | task 55 | created context checkpoint 6 of 256 (pos_min = 11989, pos_max = 11989, size = 62.906 MiB, took 100.06 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656428 id_slot=0 id_task=55 p0=11990
slot create_check: id 0 | task 55 | created context checkpoint 7 of 256 (pos_min = 11995, pos_max = 11995, size = 62.906 MiB, took 91.79 ms)
slot print_timing: id 0 | task 55 |
prompt eval time = 210.66 ms / 28 tokens ( 7.52 ms per token, 132.91 tokens per second)
eval time = 665.15 ms / 25 tokens ( 26.61 ms per token, 37.59 tokens per second)
total time = 875.81 ms / 53 tokens
INFO [ log_server_request] request | tid="127045075435520" timestamp=1772656429 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 0 | task 55 | created context checkpoint 8 of 256 (pos_min = 12018, pos_max = 12018, size = 62.906 MiB, took 91.90 ms)
INFO [ release_slots] slot released | tid="127061612023808" timestamp=1772656429 id_slot=0 id_task=55 n_ctx=360192 n_past=12019 n_system_tokens=0 n_cache_tokens=12019 truncated=false
INFO [ slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656429
======== Prompt cache: cache size: 12019, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656431 id_slot=0 id_task=82
======== Cache: cache_size = 12019, n_past0 = 12019, n_past1 = 12019, n_past_prompt1 = 12019, n_past2 = 12019, n_past_prompt2 = 12019
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656431 id_slot=0 id_task=82 p0=12019
slot create_check: id 0 | task 82 | created context checkpoint 9 of 256 (pos_min = 12142, pos_max = 12142, size = 62.907 MiB, took 153.42 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656431 id_slot=0 id_task=82 p0=12143
slot create_check: id 0 | task 82 | created context checkpoint 10 of 256 (pos_min = 12148, pos_max = 12148, size = 62.907 MiB, took 91.87 ms)
slot print_timing: id 0 | task 82 |
prompt eval time = 337.83 ms / 129 tokens ( 2.62 ms per token, 381.84 tokens per second)
eval time = 1443.86 ms / 57 tokens ( 25.33 ms per token, 39.48 tokens per second)
total time = 1781.70 ms / 186 tokens
INFO [ log_server_request] request | tid="127045075435520" timestamp=1772656432 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 0 | task 82 | created context checkpoint 11 of 256 (pos_min = 12203, pos_max = 12203, size = 62.907 MiB, took 95.55 ms)
INFO [ release_slots] slot released | tid="127061612023808" timestamp=1772656432 id_slot=0 id_task=82 n_ctx=360192 n_past=12204 n_system_tokens=0 n_cache_tokens=12204 truncated=false
INFO [ slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656432
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656433 id_slot=1 id_task=141
======== Cache: cache_size = 0, n_past0 = 0, n_past1 = 0, n_past_prompt1 = 0, n_past2 = 0, n_past_prompt2 = 0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656433 id_slot=1 id_task=141 p0=0
slot create_check: id 1 | task 141 | created context checkpoint 1 of 256 (pos_min = 908, pos_max = 908, size = 62.821 MiB, took 322.89 ms)
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656433 id_slot=1 id_task=141 p0=909
slot print_timing: id 1 | task 141 |
prompt eval time = 1013.42 ms / 914 tokens ( 1.11 ms per token, 901.90 tokens per second)
eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, 1000000.00 tokens per second)
total time = 1013.42 ms / 915 tokens
INFO [ log_server_request] request | tid="127045075435520" timestamp=1772656434 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
slot create_check: id 1 | task 141 | created context checkpoint 2 of 256 (pos_min = 913, pos_max = 913, size = 62.821 MiB, took 96.66 ms)
INFO [ release_slots] slot released | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=141 n_ctx=360192 n_past=914 n_system_tokens=0 n_cache_tokens=914 truncated=false
INFO [ slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656434
======== Prompt cache: cache size: 914, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
INFO [ launch_slot_with_task] slot is processing task | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144
======== Cache: cache_size = 914, n_past0 = 914, n_past1 = 914, n_past_prompt1 = 914, n_past2 = 914, n_past_prompt2 = 914
INFO [ batch_pending_prompt] we have to evaluate at least 1 token to generate logits | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144 p0=913
slot print_timing: id 1 | task 144 |
prompt eval time = 259.32 ms / 1 tokens ( 259.32 ms per token, 3.86 tokens per second)
eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, 1000000.00 tokens per second)
total time = 259.32 ms / 2 tokens
INFO [ log_server_request] request | tid="127045075435520" timestamp=1772656434 remote_addr="127.0.0.1" remote_port=58734 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [ release_slots] slot released | tid="127061612023808" timestamp=1772656434 id_slot=1 id_task=144 n_ctx=360192 n_past=914 n_system_tokens=0 n_cache_tokens=914 truncated=false
INFO [ slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656434
INFO [ log_server_request] request | tid="127045075435520" timestamp=1772656435 remote_addr="127.0.0.1" remote_port=58734 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [ log_server_request] request | tid="127045067042816" timestamp=1772656444 remote_addr="127.0.0.1" remote_port=50978 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [ log_server_request] request | tid="127045058650112" timestamp=1772656452 remote_addr="127.0.0.1" remote_port=54740 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [ log_server_request] request | tid="127045050257408" timestamp=1772656461 remote_addr="127.0.0.1" remote_port=54752 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [ log_server_request] request | tid="127045041864704" timestamp=1772656469 remote_addr="127.0.0.1" remote_port=55406 status=500 method="POST" path="/v1/chat/completions" params={}
INFO [ slots_idle] all slots are idle | tid="127061612023808" timestamp=1772656469