What happened?
I managed to reproduce the bug mentioned in #645 without using a draft model at all, on my 3090 on Windows:
.\llama-server.exe -m "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf" -t 6 -ngl 99 -ctk q8_0
Name and Version
version: 3766 (cac763f)
built with MSVC 19.28.29335.0 for x64
and
version: 3852 (a694d7d), i.e. the build from #645
built with MSVC 19.28.29335.0 for x64
What operating system are you seeing the problem on?
Windows
Relevant log output
ik_llama.cpp\ggml\src\ggml-cuda\mmvq.cu:595: GGML_ASSERT(src0->ne[2] == src1->ne[2] && src0->ne[2] == dst->ne[2]) failed
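For context, the failing GGML_ASSERT requires the third tensor dimension (ne[2]) to match on both operands and the output of the quantized mat-vec kernel; presumably the q8_0 K cache (-ctk q8_0) routes the attention matmul through that path and one of the shapes no longer agrees. Below is a minimal sketch of just the invariant being checked, assuming nothing beyond the assert itself; the struct and function names are illustrative stand-ins, not the actual mmvq.cu code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-dimension sizes in ggml's ne[] array.
struct tensor_shape {
    int64_t ne[4];
};

// The quantized mat-vec path does not broadcast over ne[2], so it expects
// src0, src1 and dst to agree on that dimension -- the same condition the
// GGML_ASSERT at mmvq.cu:595 enforces.
static void check_mmvq_shapes(const tensor_shape & src0,
                              const tensor_shape & src1,
                              const tensor_shape & dst) {
    assert(src0.ne[2] == src1.ne[2] && src0.ne[2] == dst.ne[2]);
}

int main() {
    // Made-up shapes for illustration: a mismatched ne[2] trips the assert.
    tensor_shape src0 = { {4096, 4096, 1, 1} };
    tensor_shape src1 = { {4096,    1, 8, 1} };
    tensor_shape dst  = { {4096,    1, 8, 1} };
    check_mmvq_shapes(src0, src1, dst); // aborts here, like the server does
    return 0;
}
```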