
Eval bug: Dense model with draft model causes crash #16980

@Kaspur2012

Description


Name and Version

b6924 works fine.
b6927 through b6940 crash when a draft model is loaded.

If I unload the draft model, it works just fine.


Operating systems

Windows

GGML backends

CUDA

Hardware

AMD Ryzen 7 3700X 8 core

Models

Llama-3.3-70B-Instruct-UD-IQ3_XXS
nvidia_Llama-3_3-Nemotron-Super-49B-v1_5-Q4_K_S
Qwen3-VL-32B-Instruct.Q6_K (when tested on text only)

Problem description & steps to reproduce

I always download the latest binary, and recently noticed that Llama 3.3 70B with a draft model crashes right after I send a prompt from the llama webui. I went back and downloaded every build from my last working one (b6924) to the latest (b6940): the crash starts at b6927 and persists through b6940. If I remove all draft-model-related parameters, the later builds work just fine. The setup looks roughly like the command below.
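
For reference, a sketch of the invocation (model file names are placeholders for my local GGUF files; the draft model file in particular is illustrative, but the flags are llama-server's standard speculative-decoding options):

```sh
# Sketch of the llama-server invocation that crashes on b6927+.
# Model file names are placeholders; -md attaches the draft model.
llama-server \
  -m  Llama-3.3-70B-Instruct-UD-IQ3_XXS.gguf \
  -md Llama-3.2-1B-Instruct-Q4_K_M.gguf \
  -c 12000 -np 2 \
  -ngl 99 -ngld 99 \
  --host 127.0.0.1 --port 8080
# Dropping the -md/-ngld lines makes the same build work again.
```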

First Bad Commit

b6927
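
In case someone wants to narrow this down to a single commit, the small gap between release tags should make a bisect straightforward (a sketch, assuming the repo's usual bXXXX release tags):

```sh
# Bisect between the last good and first bad release tags.
# Build and test each candidate commit manually, then mark it.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git bisect start b6927 b6924   # bad tag first, then good tag
# after testing each candidate:
#   git bisect good    # or: git bisect bad
git bisect reset               # when finished
```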

Relevant log output

main: server is listening on http://127.0.0.1:8080 - starting the main loop
que    start_loop: processing new tasks
que    start_loop: update slots
srv  update_slots: all slots are idle
srv  kv_cache_cle: clearing KV cache
que    start_loop: waiting for new tasks

srv  add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 0, front = 1
que    start_loop: processing new tasks
que    start_loop: processing task, id = 0
srv  process_sing: n_idle_slots = 2, n_processing_slots = 0
srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
que    start_loop: update slots
srv  update_slots: all slots are idle
que    start_loop: waiting for new tasks
srv  remove_waiti: remove task 0 from waiting list. current waiting = 1 (before remove)
srv  log_server_r: request: GET /slots 127.0.0.1 200
srv  log_server_r: request:  
srv  log_server_r: response: [{"id":0,"n_ctx":12000,"speculative":true,"is_processing":false},{"id":1,"n_ctx":12000,"speculative":true,"is_processing":false}]
request: {"messages":[{"role":"user","content":[{"type":"text","text":"\n\n--- File: Pasted ---\n**Task:**  \r\nCreate a ranked list of the provided LLM models from **fastest to slowest** based on their reported **tokens per second (t/s)**.  \r\n\r\n**Persona:**  \r\nYou are an expert LLM performance analyst.  \r\n\r\n**Output Format:**  \r\nPresent the results as a **markdown table** with the following columns, preserving **all characteristics** exactly as given:\r\n\r\n| Rank | Model Name | Storage Size | CUDA Version | Context Length | Offload Type | CPU Thread Pool Size | Evaluation Batch Size | Offload KV Cache | Flash Attention | K‑Cache Quantized | V‑Cache Quantized | Tokens per Second |\r\n|------|------------|--------------|--------------|----------------|--------------|----------------------|-----------------------|------------------|-----------------|-------------------|-------------------|--------------------|\r\n\r\n- **Rank 1** should be the model with the highest t/s, **Rank N** the lowest.  \r\n- If two models have identical t/s, order them alphabetically by model name.  \r\n- For models lacking a specific attribute (e.g., K‑Cache or V‑Cache Quantized, Offload KV Cache), the corresponding table cell should be left blank.\r\n\r\n\r\n**Constraints & Guidance:**  \r\n- Include **every** model and **all** listed attributes; do not omit or modify any values.  \r\n- Explicitly state “Verified: all entries sorted correctly by tokens per second.” at the end of the table.  \r\n- If any required information is ambiguous or missing, note it in a separate “Assumptions / Issues” section below the table.  \r\n\r\n---  \r\n\r\n*Paste the raw model list (as in the original prompt) below this line before running the LLM.* \r\n \r\n```\r\nmistral-small-3.2-24b-instruct-2506 Q6_K_XL - 22.54 GB - cuda 12 - 45k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 41t/s\r\nQwen3-32B-UD-Q4_K_XL - 20.02 GB - cuda 12 - 35k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 39t/s\r\nQwen3-Nemotron-32B-RLBFF.i1-Q4_K_M - 19.76 GB - cuda 12 - 30k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 30t/s\r\nQwen3-VL-32B-Instruct.Q4_K_M_TEXT - 20.96 GB - cuda 12 - 30k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 35t/s\r\nQwen3-VL-32B-Instruct.Q4_K_M_TEXT_DRAFT - 20.96 GB - cuda 12 - 30k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 50t/s\r\nQwen3-VL-32B-Thinking.Q4_K_M_TEXT_DRAFT - 20.96 GB - cuda 12 - 15k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 35t/s\r\nQwen3-VL-32B-Instruct.Q6_K_TEXT_DRAFT - 28.08 GB - cuda 12 - 15k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 37t/s\r\nQwen3-VL-32B-Instruct.Q4_K_M_VL - 20.96 GB - cuda 12 - 17k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 37t/s\r\nLlama-3.3-70B-Instruct-UD-IQ3_XXS - 27.66 GB - cuda 12 - 10k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - speculative decoding 1b - 20t/s\r\ngemma-3-27b-it-UD-Q6_K_XL - 25.41 GB - cuda - 130k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 25t/s\r\nQwen3-14B-128K-UD-Q8_K_XL - 18.75 GB - cuda 12 - 80k ctx - full offload - 8 - 2048 - offload kv checked - flash - kcache - vcache - 48t/s\r\nnvidia_llama-3_3-nemotron-super-49b-v1_5 - 28.63 GB - cuda 12 - 31k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - 
vcache - 21t/s\r\nqwen3-30b-a3b-instruct-2507@q4_k_m - 18.56 GB - cuda 12 - 110k ctx - full offload - 8 - 2048 - offload kv checked - flash - 136t/s\r\nGLM-Z1-32B-0414-Q4_K_M - 19.68 GB - cuda 12 - 32k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 40t/s\r\nUIGEN-X-32B-0727.Q4_K_M - 19.76 GB - cuda 12 - 19k ctx - full offload - 8 - 512- offload kv checked - flash - 39t/s\r\ngemma-3-27b-it-qat-UD-Q4_K_XL - 18.51 GB - cuda 12 - 131k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 42t/s\r\ngpt-oss-20b-MXFP4 - 12.11 GB - cuda 12 - 131k ctx - full offload - 8 - 2048 - offload kv checked - flash -  170t/s\r\nDevstral-Small-2507-Q6_K - 19.35 GB - cuda 12 - 64k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 39t/s\r\nGLM-4.5-Air-UD-Q3_K_XL - 57.73 GB - cuda 12 - 27k ctx - 19/47 layer offload to gpu - 8 - 512 - flash - kcache - vcache - 7t/s\r\nQwen3-Coder-30B-A3B-Instruct-UD-Q6_K_XL - 26.34 GB - cuda 12 - 75k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache- 79t/s\r\nSeed-OSS-36B-Instruct-UD-Q4_K_XL - 22.03 GB - cuda 12 - 21k ctx - full offload - 8 - 512 - offload kv checked - flash  - kcache - vcache - 35t/s\r\nMagistral-Small-2509-UD-Q5_K_XL - 18.52 GB - cuda 12 - 85k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache - 42t/s\r\ngpt-oss-120b-MXFP4 - 63.39 GB - cuda 12 - 45k ctx - 16/36 layer offload to gpu - 8 - 512 - flash - kcache - vcache - 9t/s\r\nQwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL_TEXT - 19.87 GB - cuda 12 - 70k ctx - full offload - 8 - 512 - offload kv checked - flash - kcache - vcache- 133t/s\r\n\r\n```"}]}],"stream":true,"reasoning_format":"auto","temperature":0.6,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1.1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":12000,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"timings_per_token":true}


================================================================================
--- Process Finished ---
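
The request above should also be reproducible without the webui by posting directly to llama-server's OpenAI-compatible endpoint (a minimal sketch with the body trimmed to essentials):

```sh
# Replay a minimal chat request against the running server
# via the standard OpenAI-compatible endpoint llama-server exposes.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hello"}],"stream":false}'
```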
