Bug: GGML_ASSERT when running quantized K cache on CUDA with no fa #679

Description

@saood06

What happened?

I managed to reproduce the bug mentioned in #645 without a draft model at all. On my 3090 on Windows (note that -fa is not passed, so flash attention is disabled):

.\llama-server.exe -m "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf" -t 6 -ngl 99 -ctk q8_0

Name and Version

version: 3766 (cac763f)
built with MSVC 19.28.29335.0 for x64

and

version: 3852 (a694d7d) AKA #645
built with MSVC 19.28.29335.0 for x64

What operating system are you seeing the problem on?

Windows

Relevant log output

ik_llama.cpp\ggml\src\ggml-cuda\mmvq.cu:595: GGML_ASSERT(src0->ne[2] == src1->ne[2] && src0->ne[2] == dst->ne[2]) failed
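For context, this assertion guards the batched quantized mat-vec path: src0 (here the quantized K cache), src1, and dst must all have the same third dimension ne[2], because the kernel does not broadcast over it. One plausible way for ne[2] to diverge on this path is grouped-query attention, where the K cache holds fewer heads than Q; that is an inference from the shapes, not confirmed from the logs. The sketch below is a simplified, stand-alone illustration of the check; the tensor struct, function name, and head counts are assumptions for illustration, not code from ik_llama.cpp.

#include <cassert>
#include <cstdint>

// Simplified stand-in for ggml_tensor; only the dimension array ne[] matters here.
struct tensor {
    int64_t ne[4]; // ne[0] = row length, ne[1] = rows, ne[2]/ne[3] = batch dims
};

// Mirrors the failing guard in mmvq.cu: the quantized mat-vec kernel assumes
// matching batch counts and does not broadcast src0 over ne[2].
static void mul_mat_vec_q_check(const tensor *src0, const tensor *src1, const tensor *dst) {
    assert(src0->ne[2] == src1->ne[2] && src0->ne[2] == dst->ne[2]);
}

int main() {
    // Hypothetical GQA-style K*Q shapes: 4 KV heads in the quantized K cache
    // vs. 32 query heads, so ne[2] differs and the check fails.
    tensor k  = {{64, 512,  4, 1}}; // quantized K cache (src0)
    tensor q  = {{64,   1, 32, 1}}; // current query vector (src1)
    tensor kq = {{512,  1, 32, 1}}; // attention scores (dst)
    mul_mat_vec_q_check(&k, &q, &kq); // aborts here
    return 0;
}

Built with assertions enabled (no -DNDEBUG), this aborts at the check, which is the same shape mismatch the log above reports.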
