Name and Version
$ llama-server --version
load_backend: loaded RPC backend from /usr/local/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /usr/local/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /usr/local/bin/libggml-cpu-icelake.so
version: 6393 (408ff52)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
Radeon 8060S Graphics (RADV GFX1151) on an AMD Ryzen™ AI Max+ 395 board with 128 GB RAM
Models
Tried with several models: Gemma 3 and Qwen3-Coder (which also had other problems with tool calls); currently running unsloth/gpt-oss-120b-GGUF:Q8_K_XL from Hugging Face.
Problem description & steps to reproduce
When trying to use agentic tool calls in my IDE (Zed), the "edit_file" tool call always fails because "mode" is always set to {}.
I confirmed with Wireshark that the "mode" parameter is being transferred as "{}":
{
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "YcwIK2t03bauo1ZfGOg2UZJqbhmE07pY",
"type": "function",
"function": {
"name": "edit_file",
"arguments": "{\"display_description\":\"Remove all <h3> tags and their contents\",\"path\":\"agboost/frontend/app/templates/help.hbs\",\"mode\":{}}"
}
}]
}
The schema being sent for this tool is:
"function": {
"name": "edit_file",
"description": "This is a tool for creating a new file or editing an existing file. For moving or renaming files, you should generally use the `terminal` tool with the 'mv' command instead.\n\nBefore using this tool:\n\n1. Use the `read_file` tool to understand the file's contents and context\n\n2. Verify the directory path is correct (only applicable when creating new files):\n - Use the `list_directory` tool to verify the parent directory exists and is the correct location\n",
"parameters": {
"definitions": {
"EditFileMode": {
"type": "string",
"enum": ["edit", "create", "overwrite"]
}
},
"required": ["display_description", "path", "mode"],
"type": "object",
"properties": {
"display_description": {
"description": "A one-line, user-friendly markdown description of the edit. This will be\nshown in the UI and also passed to another model to perform the edit.\n\nBe terse, but also descriptive in what you want to achieve with this\nedit. Avoid generic instructions.\n\nNEVER mention the file path in this description.\n\n<example>Fix API endpoint URLs</example>\n<example>Update copyright year in `page_footer`</example>\n\nMake sure to include this field before all the others in the input object\nso that we can display it immediately.",
"type": "string"
},
"path": {
"description": "The full path of the file to create or modify in the project.\n\nWARNING: When specifying which file path need changing, you MUST\nstart each path with one of the project's root directories.\n\nThe following examples assume we have two root directories in the project:\n- /a/b/backend\n- /c/d/frontend\n\n<example>\n`backend/src/main.rs`\n\nNotice how the file path starts with `backend`. Without that, the path\nwould be ambiguous and the call would fail!\n</example>\n\n<example>\n`frontend/db.js`\n</example>",
"type": "string"
},
"mode": {
"description": "The mode of operation on the file. Possible values:\n- 'edit': Make granular edits to an existing file.\n- 'create': Create a new file if it doesn't exist.\n- 'overwrite': Replace the entire contents of an existing file.\n\nWhen a file already exists or you just created it, prefer editing\nit as opposed to recreating it from scratch.",
"allOf": [{
"$ref": "#/definitions/EditFileMode"
}]
}
},
"additionalProperties": false
}
}
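For anyone trying to reproduce this without Zed, a request like the following against the server's OpenAI-compatible endpoint should trigger the same tool call (this is a sketch, not something I have captured verbatim: the user prompt is made up, the schema is the one above trimmed to the relevant parts, and it assumes the server from the log is reachable on port 8080):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Create notes.txt containing the word hello"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "edit_file",
        "description": "Create a new file or edit an existing file.",
        "parameters": {
          "definitions": {
            "EditFileMode": {"type": "string", "enum": ["edit", "create", "overwrite"]}
          },
          "type": "object",
          "required": ["display_description", "path", "mode"],
          "properties": {
            "display_description": {"type": "string"},
            "path": {"type": "string"},
            "mode": {"allOf": [{"$ref": "#/definitions/EditFileMode"}]}
          },
          "additionalProperties": false
        }
      }
    }]
  }'

Note that "mode" is the only parameter declared indirectly, via allOf with a $ref into definitions, rather than with an inline "type".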
First Bad Commit
No response
Relevant log output
$ /usr/local/bin/llama-server --jinja --host 0.0.0.0 --port 8080 -c 0 -hf unsloth/gpt-oss-120b-GGUF:Q8_K_XL
load_backend: loaded RPC backend from /usr/local/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /usr/local/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /usr/local/bin/libggml-cpu-icelake.so
curl_perform_with_retry: HEAD https://huggingface.co/unsloth/gpt-oss-120b-GGUF/resolve/main/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /usr/share/ollama/.cache/llama.cpp/unsloth_gpt-oss-120b-GGUF_UD-Q8_K_XL_gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf
curl_perform_with_retry: HEAD https://huggingface.co/unsloth/gpt-oss-120b-GGUF/resolve/main/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00002-of-00002.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /usr/share/ollama/.cache/llama.cpp/unsloth_gpt-oss-120b-GGUF_UD-Q8_K_XL_gpt-oss-120b-UD-Q8_K_XL-00002-of-00002.gguf
build: 6393 (408ff524) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 31
main: loading model
srv load_model: loading model '/usr/share/ollama/.cache/llama.cpp/unsloth_gpt-oss-120b-GGUF_UD-Q8_K_XL_gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (Radeon 8060S Graphics (RADV GFX1151)) - 74597 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 40 key-value pairs and 687 tensors from /usr/share/ollama/.cache/llama.cpp/unsloth_gpt-oss-120b-GGUF_UD-Q8_K_XL_gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gpt-Oss-120B
llama_model_loader: - kv 3: general.basename str = Gpt-Oss-120B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 120B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.tags arr[str,2] = ["vllm", "text-generation"]
llama_model_loader: - kv 9: gpt-oss.block_count u32 = 36
llama_model_loader: - kv 10: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 11: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 12: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 13: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 14: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 16: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: gpt-oss.expert_count u32 = 128
llama_model_loader: - kv 18: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 19: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 20: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 21: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 22: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 23: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 24: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 25: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 200017
llama_model_loader: - kv 34: tokenizer.chat_template str = {# Chat template fixes by Unsloth #}\n...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.file_type u32 = 7
llama_model_loader: - kv 37: split.no u16 = 0
llama_model_loader: - kv 38: split.tensors.count i32 = 687
llama_model_loader: - kv 39: split.count u16 = 2
llama_model_loader: - type f32: 433 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type q8_0: 144 tensors
llama_model_loader: - type mxfp4: 108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 60.03 GiB (4.41 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 36
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 128
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 120B
print_info: model params = 116.83 B
print_info: general.name = Gpt-Oss-120B
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 200017 '<|reserved_200017|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: Vulkan0 model buffer size = 60369.43 MiB
load_tensors: CPU_Mapped model buffer size = 1104.61 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: Vulkan_Host output buffer size = 0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
llama_kv_cache: Vulkan0 KV buffer size = 4608.00 MiB
llama_kv_cache: size = 4608.00 MiB (131072 cells, 18 layers, 1/1 seqs), K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 768 cells
llama_kv_cache: Vulkan0 KV buffer size = 27.00 MiB
llama_kv_cache: size = 27.00 MiB ( 768 cells, 18 layers, 1/1 seqs), K (f16): 13.50 MiB, V (f16): 13.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: Vulkan0 compute buffer size = 413.52 MiB
llama_context: Vulkan_Host compute buffer size = 263.15 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 131072
main: model loaded
main: chat template, chat_template: {# Chat template fixes by Unsloth #}
{#-
In addition to the normal inputs of `messages` and `tools`, this template also accepts the
following kwargs:
- "builtin_tools": A list, can contain "browser" and/or "python".
- "model_identity": A string that optionally describes the model identity.
- "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
#}
{#- Tool Definition Rendering ============================================== #}
{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
{%- if param_spec.type == "array" -%}
{%- if param_spec['items'] -%}
{%- if param_spec['items']['type'] == "string" -%}
{{- "string[]" }}
{%- elif param_spec['items']['type'] == "number" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "integer" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "boolean" -%}
{{- "boolean[]" }}
{%- else -%}
{%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
{%- if inner_type == "object | object" or inner_type|length > 50 -%}
{{- "any[]" }}
{%- else -%}
{{- inner_type + "[]" }}
{%- endif -%}
{%- endif -%}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- else -%}
{{- "any[]" }}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
{#- Handle array of types like ["object", "object"] from Union[dict, list] #}
{%- if param_spec.type | length > 1 -%}
{{- param_spec.type | join(" | ") }}
{%- else -%}
{{- param_spec.type[0] }}
{%- endif -%}
{%- elif param_spec.oneOf -%}
{#- Handle oneOf schemas - check for complex unions and fallback to any #}
{%- set has_object_variants = false -%}
{%- for variant in param_spec.oneOf -%}
{%- if variant.type == "object" -%}
{%- set has_object_variants = true -%}
{%- endif -%}
{%- endfor -%}
{%- if has_object_variants and param_spec.oneOf|length > 1 -%}
{{- "any" }}
{%- else -%}
{%- for variant in param_spec.oneOf -%}
{{- render_typescript_type(variant, required_params) -}}
{%- if variant.description %}
{{- "// " + variant.description }}
{%- endif -%}
{%- if variant.default is defined %}
{{ "// default: " + variant.default|tojson }}
{%- endif -%}
{%- if not loop.last %}
{{- " | " }}
{% endif -%}
{%- endfor -%}
{%- endif -%}
{%- elif param_spec.type == "string" -%}
{%- if param_spec.enum -%}
{{- '"' + param_spec.enum|join('" | "') + '"' -}}
{%- else -%}
{{- "string" }}
{%- if param_spec.nullable %}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type == "number" -%}
{{- "number" }}
{%- elif param_spec.type == "integer" -%}
{{- "number" }}
{%- elif param_spec.type == "boolean" -%}
{{- "boolean" }}
{%- elif param_spec.type == "object" -%}
{%- if param_spec.properties -%}
{{- "{\n" }}
{%- for prop_name, prop_spec in param_spec.properties.items() -%}
{{- prop_name -}}
{%- if prop_name not in (param_spec.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{ render_typescript_type(prop_spec, param_spec.required or []) }}
{%- if not loop.last -%}
{{-", " }}
{%- endif -%}
{%- endfor -%}
{{- "}" }}
{%- else -%}
{{- "object" }}
{%- endif -%}
{%- else -%}
{{- "any" }}
{%- endif -%}
{%- endmacro -%}
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
{%- if param_name not in (tool.parameters.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{- render_typescript_type(param_spec, tool.parameters.required or []) }}
{%- if param_spec.default is defined -%}
{%- if param_spec.enum %}
{{- ", // default: " + param_spec.default }}
{%- elif param_spec.oneOf %}
{{- "// default: " + param_spec.default }}
{%- else %}
{{- ", // default: " + param_spec.default|tojson }}
{%- endif -%}
{%- endif -%}
{%- if not loop.last %}
{{- ",\n" }}
{%- else %}
{{- ",\n" }}
{%- endif -%}
{%- endfor %}
{{- "}) => any;\n\n" }}
{%- else -%}
{{- "() => any;\n\n" }}
{%- endif -%}
{%- endfor %}
{{- "} // namespace " + namespace_name }}
{%- endmacro -%}
{%- macro render_builtin_tools(browser_tool, python_tool) -%}
{%- if browser_tool %}
{{- "## browser\n\n" }}
{{- "// Tool for browsing.\n" }}
{{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
{{- "// Cite information from the tool using the following format:\n" }}
{{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
{{- "// Do not quote more than 10 words directly from the tool output.\n" }}
{{- "// sources=web (default: web)\n" }}
{{- "namespace browser {\n\n" }}
{{- "// Searches for information related to `query` and displays `topn` results.\n" }}
{{- "type search = (_: {\n" }}
{{- "query: string,\n" }}
{{- "topn?: number, // default: 10\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
{{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
{{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
{{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
{{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
{{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
{{- "type open = (_: {\n" }}
{{- "id?: number | string, // default: -1\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "loc?: number, // default: -1\n" }}
{{- "num_lines?: number, // default: -1\n" }}
{{- "view_source?: boolean, // default: false\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
{{- "type find = (_: {\n" }}
{{- "pattern: string,\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "}) => any;\n\n" }}
{{- "} // namespace browser\n\n" }}
{%- endif -%}
{%- if python_tool %}
{{- "## python\n\n" }}
{{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
{{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
{%- endif -%}
{%- endmacro -%}
{#- System Message Construction ============================================ #}
{%- macro build_system_message() -%}
{%- if model_identity is not defined %}
{%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
{%- endif %}
{{- model_identity + "\n" }}
{{- "Knowledge cutoff: 2024-06\n" }}
{{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
{%- if reasoning_effort is not defined %}
{%- set reasoning_effort = "medium" %}
{%- endif %}
{{- "Reasoning: " + reasoning_effort + "\n\n" }}
{%- if builtin_tools is defined and builtin_tools is not none %}
{{- "# Tools\n\n" }}
{%- set available_builtin_tools = namespace(browser=false, python=false) %}
{%- for tool in builtin_tools %}
{%- if tool == "browser" %}
{%- set available_builtin_tools.browser = true %}
{%- elif tool == "python" %}
{%- set available_builtin_tools.python = true %}
{%- endif %}
{%- endfor %}
{{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
{%- endif -%}
{{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
{%- if tools -%}
{{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
{%- endif -%}
{%- endmacro -%}
{#- Main Template Logic ================================================= #}
{#- Set defaults #}
{#- Render system message #}
{{- "<|start|>system<|message|>" }}
{{- build_system_message() }}
{{- "<|end|>" }}
{#- Extract developer message #}
{%- if developer_instructions is defined and developer_instructions is not none %}
{%- set developer_message = developer_instructions %}
{%- set loop_messages = messages %}
{%- elif messages[0].role == "developer" or messages[0].role == "system" %}
{%- set developer_message = messages[0].content %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set developer_message = "" %}
{%- set loop_messages = messages %}
{%- endif %}
{#- Render developer message #}
{%- if developer_message or tools %}
{{- "<|start|>developer<|message|>" }}
{%- if developer_message %}
{{- "# Instructions\n\n" }}
{{- developer_message }}
{%- endif %}
{%- if tools -%}
{%- if developer_message %}
{{- "\n\n" }}
{%- endif %}
{{- "# Tools\n\n" }}
{{- render_tool_namespace("functions", tools) }}
{%- endif -%}
{{- "<|end|>" }}
{%- endif %}
{#- Render messages #}
{%- set last_tool_call = namespace(name=none) %}
{%- for message in loop_messages -%}
{#- At this point only assistant/user/tool messages should remain #}
{%- if message.role == 'assistant' -%}
{#- Checks to ensure the messages are being passed in the format we expect #}
{%- if "content" in message %}
{%- if false %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "thinking" in message %}
{%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "tool_calls" in message %}
{#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
{#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
{#- when we render CoT/analysis messages in inference. #}
{%- set future_final_message = namespace(found=false) %}
{%- for future_message in loop_messages[loop.index:] %}
{%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
{%- set future_final_message.found = true %}
{%- endif %}
{%- endfor %}
{#- We assume max 1 tool call per message, and so we infer the tool call name #}
{#- in "tool" messages from the most recent assistant tool call name #}
{%- set tool_call = message.tool_calls[0] %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{%- if message.content and message.thinking %}
{{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
{%- elif message.content and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
{%- elif message.thinking and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{{- "<|start|>assistant to=" }}
{{- "functions." + tool_call.name + "<|channel|>commentary " }}
{{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments|tojson }}
{%- endif %}
{{- "<|call|>" }}
{%- set last_tool_call.name = tool_call.name %}
{%- elif loop.last and not add_generation_prompt %}
{#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
{#- This is a situation that should only occur in training, never in inference. #}
{%- if "thinking" in message %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{#- <|return|> indicates the end of generation, but <|end|> does not #}
{#- <|return|> should never be an input to the model, but we include it as the final token #}
{#- when training, so the model learns to emit it. #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
{%- elif "thinking" in message %}
{#- CoT is dropped during all previous turns, so we never render it for inference #}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- else %}
{#- CoT is dropped during all previous turns, so we never render it for inference #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- endif %}
{%- elif message.role == 'tool' -%}
{%- if last_tool_call.name is none %}
{{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
{%- endif %}
{{- "<|start|>functions." + last_tool_call.name }}
{%- if message.content is string %}
{{- " to=assistant<|channel|>commentary<|message|>" + message.content + "<|end|>" }}
{%- else %}
{{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
{%- endif %}
{%- elif message.role == 'user' -%}
{{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
{%- endif -%}
{%- endfor -%}
{#- Generation prompt #}
{%- if add_generation_prompt -%}
<|start|>assistant
{%- endif -%}
{# Copyright 2025-present Unsloth. Apache 2.0 License. Unsloth chat template fixes. Edited from ggml-org & OpenAI #}, example_format: '<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
You are a helpful assistant<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET / 192.168.1.183 200
srv log_server_r: request: GET /props 192.168.1.183 200
srv log_server_r: request: GET /favicon.ico 192.168.1.183 404
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 72
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 72, n_tokens = 72, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 72, n_tokens = 72
slot update_slots: id 0 | task 0 | SWA checkpoint create, pos_min = 0, pos_max = 71, size = 2.533 MiB, total = 1/3 (2.533 MiB)
slot release: id 0 | task 0 | stop processing: n_past = 2204, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 707.42 ms / 72 tokens ( 9.83 ms per token, 101.78 tokens per second)
eval time = 51866.70 ms / 2133 tokens ( 24.32 ms per token, 41.12 tokens per second)
total time = 52574.12 ms / 2205 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 2134 | processing task
slot update_slots: id 0 | task 2134 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 5198
slot update_slots: id 0 | task 2134 | n_past = 59, cache_tokens.size() = 2204, seq_id = 0, pos_min = 1436, n_swa = 128
slot update_slots: id 0 | task 2134 | SWA checkpoint restore, pos_min = 0, pos_max = 71, size = 2.533 MiB
slot update_slots: id 0 | task 2134 | kv cache rm [59, end)
slot update_slots: id 0 | task 2134 | prompt processing progress, n_past = 2107, n_tokens = 2048, progress = 0.393998
slot update_slots: id 0 | task 2134 | kv cache rm [2107, end)
slot update_slots: id 0 | task 2134 | prompt processing progress, n_past = 4155, n_tokens = 2048, progress = 0.787995
slot update_slots: id 0 | task 2134 | kv cache rm [4155, end)
slot update_slots: id 0 | task 2134 | prompt processing progress, n_past = 5198, n_tokens = 1043, progress = 0.988649
slot update_slots: id 0 | task 2134 | prompt done, n_past = 5198, n_tokens = 1043
slot update_slots: id 0 | task 2134 | SWA checkpoint create, pos_min = 4430, pos_max = 5197, size = 27.009 MiB, total = 2/3 (29.542 MiB)
slot release: id 0 | task 2134 | stop processing: n_past = 5350, truncated = 0
slot print_timing: id 0 | task 2134 |
prompt eval time = 13729.82 ms / 5139 tokens ( 2.67 ms per token, 374.29 tokens per second)
eval time = 3759.67 ms / 153 tokens ( 24.57 ms per token, 40.70 tokens per second)
total time = 17489.49 ms / 5292 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 2290 | processing task
slot update_slots: id 0 | task 2290 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6206
slot update_slots: id 0 | task 2290 | kv cache rm [5198, end)
slot update_slots: id 0 | task 2290 | prompt processing progress, n_past = 6206, n_tokens = 1008, progress = 0.162423
slot update_slots: id 0 | task 2290 | prompt done, n_past = 6206, n_tokens = 1008
slot update_slots: id 0 | task 2290 | SWA checkpoint create, pos_min = 5438, pos_max = 6205, size = 27.009 MiB, total = 3/3 (56.551 MiB)
slot release: id 0 | task 2290 | stop processing: n_past = 6374, truncated = 0
slot print_timing: id 0 | task 2290 |
prompt eval time = 2737.34 ms / 1008 tokens ( 2.72 ms per token, 368.24 tokens per second)
eval time = 4294.63 ms / 169 tokens ( 25.41 ms per token, 39.35 tokens per second)
total time = 7031.97 ms / 1177 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 2460 | processing task
slot update_slots: id 0 | task 2460 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6272
slot update_slots: id 0 | task 2460 | kv cache rm [6206, end)
slot update_slots: id 0 | task 2460 | prompt processing progress, n_past = 6272, n_tokens = 66, progress = 0.010523
slot update_slots: id 0 | task 2460 | prompt done, n_past = 6272, n_tokens = 66
slot update_slots: id 0 | task 2460 | SWA checkpoint erase, pos_min = 5438, pos_max = 6205, size = 27.009 MiB
slot update_slots: id 0 | task 2460 | SWA checkpoint create, pos_min = 5606, pos_max = 6271, size = 23.422 MiB, total = 3/3 (77.441 MiB)
slot release: id 0 | task 2460 | stop processing: n_past = 6416, truncated = 0
slot print_timing: id 0 | task 2460 |
prompt eval time = 652.21 ms / 66 tokens ( 9.88 ms per token, 101.20 tokens per second)
eval time = 3604.97 ms / 145 tokens ( 24.86 ms per token, 40.22 tokens per second)
total time = 4257.17 ms / 211 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 2606 | processing task
slot update_slots: id 0 | task 2606 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6338
slot update_slots: id 0 | task 2606 | kv cache rm [6272, end)
slot update_slots: id 0 | task 2606 | prompt processing progress, n_past = 6338, n_tokens = 66, progress = 0.010413
slot update_slots: id 0 | task 2606 | prompt done, n_past = 6338, n_tokens = 66
slot update_slots: id 0 | task 2606 | SWA checkpoint erase, pos_min = 5606, pos_max = 6271, size = 23.422 MiB
slot update_slots: id 0 | task 2606 | SWA checkpoint create, pos_min = 5648, pos_max = 6337, size = 24.266 MiB, total = 3/3 (74.697 MiB)
slot release: id 0 | task 2606 | stop processing: n_past = 6429, truncated = 0
slot print_timing: id 0 | task 2606 |
prompt eval time = 623.96 ms / 66 tokens ( 9.45 ms per token, 105.78 tokens per second)
eval time = 2279.47 ms / 92 tokens ( 24.78 ms per token, 40.36 tokens per second)
total time = 2903.43 ms / 158 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 2699 | processing task
slot update_slots: id 0 | task 2699 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6404
slot update_slots: id 0 | task 2699 | kv cache rm [6338, end)
slot update_slots: id 0 | task 2699 | prompt processing progress, n_past = 6404, n_tokens = 66, progress = 0.010306
slot update_slots: id 0 | task 2699 | prompt done, n_past = 6404, n_tokens = 66
slot update_slots: id 0 | task 2699 | SWA checkpoint erase, pos_min = 5648, pos_max = 6337, size = 24.266 MiB
slot update_slots: id 0 | task 2699 | SWA checkpoint create, pos_min = 5661, pos_max = 6403, size = 26.130 MiB, total = 3/3 (73.818 MiB)
slot release: id 0 | task 2699 | stop processing: n_past = 6598, truncated = 0
slot print_timing: id 0 | task 2699 |
prompt eval time = 636.11 ms / 66 tokens ( 9.64 ms per token, 103.76 tokens per second)
eval time = 4869.78 ms / 195 tokens ( 24.97 ms per token, 40.04 tokens per second)
total time = 5505.89 ms / 261 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 2895 | processing task
slot update_slots: id 0 | task 2895 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 148
slot update_slots: id 0 | task 2895 | n_past = 59, cache_tokens.size() = 6598, seq_id = 0, pos_min = 5830, n_swa = 128
slot update_slots: id 0 | task 2895 | forcing full prompt re-processing due to lack of cache data (likely due to SWA, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 2895 | kv cache rm [0, end)
slot update_slots: id 0 | task 2895 | prompt processing progress, n_past = 148, n_tokens = 148, progress = 1.000000
slot update_slots: id 0 | task 2895 | prompt done, n_past = 148, n_tokens = 148
srv params_from_: Chat format: GPT-OSS
slot update_slots: id 0 | task 2895 | SWA checkpoint create, pos_min = 0, pos_max = 147, size = 5.205 MiB, total = 1/3 (5.205 MiB)
slot release: id 0 | task 2895 | stop processing: n_past = 239, truncated = 0
slot print_timing: id 0 | task 2895 |
prompt eval time = 827.83 ms / 148 tokens ( 5.59 ms per token, 178.78 tokens per second)
eval time = 2105.72 ms / 92 tokens ( 22.89 ms per token, 43.69 tokens per second)
total time = 2933.55 ms / 240 tokens
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
slot launch_slot_: id 0 | task 2897 | processing task
slot update_slots: id 0 | task 2897 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6470
slot update_slots: id 0 | task 2897 | kv cache rm [59, end)
slot update_slots: id 0 | task 2897 | prompt processing progress, n_past = 2107, n_tokens = 2048, progress = 0.316538
slot update_slots: id 0 | task 2897 | kv cache rm [2107, end)
slot update_slots: id 0 | task 2897 | prompt processing progress, n_past = 4155, n_tokens = 2048, progress = 0.633076
srv cancel_tasks: cancel task, id_task = 2897
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
slot release: id 0 | task 2897 | stop processing: n_past = 4155, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 2992 | processing task
slot update_slots: id 0 | task 2992 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6484
slot update_slots: id 0 | task 2992 | kv cache rm [4155, end)
slot update_slots: id 0 | task 2992 | prompt processing progress, n_past = 6203, n_tokens = 2048, progress = 0.315854
slot update_slots: id 0 | task 2992 | kv cache rm [6203, end)
slot update_slots: id 0 | task 2992 | prompt processing progress, n_past = 6484, n_tokens = 281, progress = 0.359192
slot update_slots: id 0 | task 2992 | prompt done, n_past = 6484, n_tokens = 281
slot update_slots: id 0 | task 2992 | SWA checkpoint create, pos_min = 5716, pos_max = 6483, size = 27.009 MiB, total = 2/3 (32.214 MiB)
slot release: id 0 | task 2992 | stop processing: n_past = 6702, truncated = 0
slot print_timing: id 0 | task 2992 |
prompt eval time = 6479.42 ms / 2329 tokens ( 2.78 ms per token, 359.45 tokens per second)
eval time = 5397.34 ms / 219 tokens ( 24.65 ms per token, 40.58 tokens per second)
total time = 11876.76 ms / 2548 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 3213 | processing task
slot update_slots: id 0 | task 3213 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6550
slot update_slots: id 0 | task 3213 | kv cache rm [6484, end)
slot update_slots: id 0 | task 3213 | prompt processing progress, n_past = 6550, n_tokens = 66, progress = 0.010076
slot update_slots: id 0 | task 3213 | prompt done, n_past = 6550, n_tokens = 66
slot update_slots: id 0 | task 3213 | SWA checkpoint create, pos_min = 5934, pos_max = 6549, size = 21.664 MiB, total = 3/3 (53.878 MiB)
slot release: id 0 | task 3213 | stop processing: n_past = 6675, truncated = 0
slot print_timing: id 0 | task 3213 |
prompt eval time = 624.29 ms / 66 tokens ( 9.46 ms per token, 105.72 tokens per second)
eval time = 3122.27 ms / 126 tokens ( 24.78 ms per token, 40.36 tokens per second)
total time = 3746.56 ms / 192 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 3340 | processing task
slot update_slots: id 0 | task 3340 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6616
slot update_slots: id 0 | task 3340 | kv cache rm [6550, end)
slot update_slots: id 0 | task 3340 | prompt processing progress, n_past = 6616, n_tokens = 66, progress = 0.009976
slot update_slots: id 0 | task 3340 | prompt done, n_past = 6616, n_tokens = 66
slot update_slots: id 0 | task 3340 | SWA checkpoint erase, pos_min = 5934, pos_max = 6549, size = 21.664 MiB
slot update_slots: id 0 | task 3340 | SWA checkpoint create, pos_min = 5934, pos_max = 6615, size = 23.985 MiB, total = 3/3 (72.658 MiB)
srv cancel_tasks: cancel task, id_task = 3340
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
slot release: id 0 | task 3340 | stop processing: n_past = 6758, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 3485 | processing task
slot update_slots: id 0 | task 3485 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 6484
slot update_slots: id 0 | task 3485 | kv cache rm [6477, end)
slot update_slots: id 0 | task 3485 | prompt processing progress, n_past = 6484, n_tokens = 7, progress = 0.001080
slot update_slots: id 0 | task 3485 | prompt done, n_past = 6484, n_tokens = 7
slot update_slots: id 0 | task 3485 | SWA checkpoint erase, pos_min = 5934, pos_max = 6615, size = 23.985 MiB
slot update_slots: id 0 | task 3485 | SWA checkpoint create, pos_min = 5990, pos_max = 6483, size = 17.373 MiB, total = 3/3 (63.022 MiB)
slot release: id 0 | task 3485 | stop processing: n_past = 8778, truncated = 0
slot print_timing: id 0 | task 3485 |
prompt eval time = 188.20 ms / 7 tokens ( 26.89 ms per token, 37.20 tokens per second)
eval time = 57333.79 ms / 2295 tokens ( 24.98 ms per token, 40.03 tokens per second)
total time = 57521.98 ms / 2302 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 5781 | processing task
slot update_slots: id 0 | task 5781 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 7287
slot update_slots: id 0 | task 5781 | n_past = 6485, cache_tokens.size() = 8778, seq_id = 0, pos_min = 8010, n_swa = 128
slot update_slots: id 0 | task 5781 | SWA checkpoint restore, pos_min = 5990, pos_max = 6483, size = 17.373 MiB
slot update_slots: id 0 | task 5781 | kv cache rm [6483, end)
slot update_slots: id 0 | task 5781 | prompt processing progress, n_past = 7287, n_tokens = 804, progress = 0.110333
slot update_slots: id 0 | task 5781 | prompt done, n_past = 7287, n_tokens = 804
slot update_slots: id 0 | task 5781 | SWA checkpoint erase, pos_min = 5990, pos_max = 6483, size = 17.373 MiB
slot update_slots: id 0 | task 5781 | SWA checkpoint create, pos_min = 6646, pos_max = 7286, size = 22.543 MiB, total = 3/3 (63.901 MiB)
slot release: id 0 | task 5781 | stop processing: n_past = 7514, truncated = 0
slot print_timing: id 0 | task 5781 |
prompt eval time = 2412.80 ms / 804 tokens ( 3.00 ms per token, 333.22 tokens per second)
eval time = 5697.24 ms / 228 tokens ( 24.99 ms per token, 40.02 tokens per second)
total time = 8110.04 ms / 1032 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 6010 | processing task
slot update_slots: id 0 | task 6010 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 7353
slot update_slots: id 0 | task 6010 | kv cache rm [7287, end)
slot update_slots: id 0 | task 6010 | prompt processing progress, n_past = 7353, n_tokens = 66, progress = 0.008976
slot update_slots: id 0 | task 6010 | prompt done, n_past = 7353, n_tokens = 66
slot update_slots: id 0 | task 6010 | SWA checkpoint erase, pos_min = 6646, pos_max = 7286, size = 22.543 MiB
slot update_slots: id 0 | task 6010 | SWA checkpoint create, pos_min = 6746, pos_max = 7352, size = 21.347 MiB, total = 3/3 (61.263 MiB)
slot release: id 0 | task 6010 | stop processing: n_past = 7511, truncated = 0
slot print_timing: id 0 | task 6010 |
prompt eval time = 633.15 ms / 66 tokens ( 9.59 ms per token, 104.24 tokens per second)
eval time = 3963.12 ms / 159 tokens ( 24.93 ms per token, 40.12 tokens per second)
total time = 4596.28 ms / 225 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 6170 | processing task
slot update_slots: id 0 | task 6170 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 7419
slot update_slots: id 0 | task 6170 | kv cache rm [7353, end)
slot update_slots: id 0 | task 6170 | prompt processing progress, n_past = 7419, n_tokens = 66, progress = 0.008896
slot update_slots: id 0 | task 6170 | prompt done, n_past = 7419, n_tokens = 66
slot update_slots: id 0 | task 6170 | SWA checkpoint erase, pos_min = 6746, pos_max = 7352, size = 21.347 MiB
slot update_slots: id 0 | task 6170 | SWA checkpoint create, pos_min = 6746, pos_max = 7418, size = 23.668 MiB, total = 3/3 (67.558 MiB)
srv cancel_tasks: cancel task, id_task = 6170
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200
slot release: id 0 | task 6170 | stop processing: n_past = 7507, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot launch_slot_: id 0 | task 6261 | processing task
slot update_slots: id 0 | task 6261 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 7443
slot update_slots: id 0 | task 6261 | kv cache rm [7418, end)
slot update_slots: id 0 | task 6261 | prompt processing progress, n_past = 7443, n_tokens = 25, progress = 0.003359
slot update_slots: id 0 | task 6261 | prompt done, n_past = 7443, n_tokens = 25
slot update_slots: id 0 | task 6261 | SWA checkpoint erase, pos_min = 6746, pos_max = 7418, size = 23.668 MiB
slot update_slots: id 0 | task 6261 | SWA checkpoint create, pos_min = 6746, pos_max = 7442, size = 24.512 MiB, total = 3/3 (69.528 MiB)
slot release: id 0 | task 6261 | stop processing: n_past = 7594, truncated = 0
slot print_timing: id 0 | task 6261 |
prompt eval time = 294.81 ms / 25 tokens ( 11.79 ms per token, 84.80 tokens per second)
eval time = 3731.59 ms / 152 tokens ( 24.55 ms per token, 40.73 tokens per second)
total time = 4026.40 ms / 177 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 192.168.1.183 200