
Conversation

@Zijie-Tian
Owner

A segmentation fault in the mixed KV cache implementation was resolved by fixing tensor allocation during the graph-building process.

  • The llm_graph_input_attn_kv_mixed class in src/llama-graph.h was extended to include attn_state and attn_result ggml_tensor members.
  • In llm_graph_context::build_attn_inp_kv_mixed() within src/llama-graph.cpp, these tensors are now pre-allocated and marked as inputs using ggml_set_input(). This prevents issues with unmanaged tensor buffers.
  • The build_attn_mha_with_state() function was modified to accept and utilize these pre-allocated state and result tensors, removing their dynamic creation within the function.
  • A dimension mismatch in the pre-allocated attn_result tensor was corrected from [head_dim, seq_len, n_heads, n_batch] to [head_dim, n_heads, seq_len, n_batch] to match the output of ggml_flash_attn_ext_with_state.
  • The ggml_reshape_2d operation in build_attn_mha_with_state() was updated to cur->ne[0] * cur->ne[1], cur->ne[2] to correctly flatten the attn_result tensor based on its new dimensions.
  • The call to build_attn_mha_with_state in build_attn_mixed_with_state was updated to pass the pre-allocated tensors.

These changes resolved the initial segmentation fault and a subsequent dimension mismatch, allowing the mixed KV cache with stateful flash attention to function correctly; a sketch of the pre-allocation pattern is shown below.
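A minimal sketch of the pre-allocation pattern, not the exact patch: the variable names (`inp`, `ctx0`, `head_dim`, `n_heads`, `seq_len`, `n_batch`) and the `attn_state` shape are assumptions for illustration; only the `attn_result` layout comes from the description above.

```cpp
// Sketch only. Creating the tensors while building the graph inputs and
// calling ggml_set_input() lets the backend scheduler manage their buffers,
// instead of build_attn_mha_with_state() creating them ad hoc.
inp->attn_result = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32,
                                      head_dim, n_heads, seq_len, n_batch); // matches the ggml_flash_attn_ext_with_state output order
inp->attn_state  = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32,
                                      head_dim, n_heads, seq_len, n_batch); // placeholder shape -- the real state layout may differ
ggml_set_input(inp->attn_result);
ggml_set_input(inp->attn_state);
```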

@Zijie-Tian requested a review from Copilot on June 25, 2025 07:53

Copilot AI left a comment


Pull Request Overview

This PR resolves a segfault and dimension mismatch in the mixed KV cache’s flash attention by pre-allocating state/result tensors and correcting their layout and reshape.

  • Added attn_state and attn_result members (with getters) to llm_graph_input_attn_kv_mixed
  • Pre-allocated and registered attn_state/attn_result in build_attn_inp_kv_mixed
  • Updated build_attn_mha_with_state to use pre-allocated tensors and fixed the reshape logic

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/llama-graph.h | Added attn_state/attn_result members and getters in llm_graph_input_attn_kv_mixed. |
| src/llama-graph.cpp | Pre-allocated tensors in build_attn_inp_kv_mixed, updated the build_attn_mha_with_state signature, replaced dynamic allocation, and corrected the reshape logic. |
Comments suppressed due to low confidence (1)

src/llama-graph.h:326

  • The comment for attn_result lists dimensions as [head_dim, seq_len, n_heads, n_batch] but the tensor is created with order [head_dim, n_heads, seq_len, n_batch]. Please update the comment to match the actual dimension order.
    ggml_tensor * attn_result = nullptr; // F32 [head_dim, seq_len, n_heads, n_batch] output tensor

Comment on lines +1922 to +1923
// cur has dimensions [head_dim, n_heads, seq_len, n_batch], so we flatten the first two dimensions
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2]);

Copilot AI Jun 25, 2025


This reshape only accounts for head_dim * n_heads by seq_len, dropping the n_batch dimension. The reshape dimensions must multiply to the total element count (including batch). Consider using cur->ne[0] * cur->ne[1] and cur->ne[2] * cur->ne[3] to preserve the batch size.

Suggested change
// cur has dimensions [head_dim, n_heads, seq_len, n_batch], so we flatten the first two dimensions
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2]);
// cur has dimensions [head_dim, n_heads, seq_len, n_batch], so we flatten the first two dimensions while preserving the batch size
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2] * cur->ne[3]);
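For reference, a small sketch of the invariant behind this suggestion (the helper and its name are illustrative, not part of the patch): ggml_reshape_2d requires the new dimensions to multiply to the tensor's total element count, so the batch dimension must be folded in rather than dropped.

```cpp
#include "ggml.h"

// Illustrative helper: flatten a [head_dim, n_heads, seq_len, n_batch] tensor
// to 2D without losing the batch dimension. Dropping cur->ne[3] only appears
// to work when n_batch == 1, which is what hides the bug.
static struct ggml_tensor * flatten_attn_result(struct ggml_context * ctx0,
                                                struct ggml_tensor * cur) {
    GGML_ASSERT((cur->ne[0] * cur->ne[1]) * (cur->ne[2] * cur->ne[3]) == ggml_nelements(cur));
    return ggml_reshape_2d(ctx0, cur,
                           cur->ne[0] * cur->ne[1],   // head_dim * n_heads
                           cur->ne[2] * cur->ne[3]);  // seq_len  * n_batch
}
```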

- Adjusted tensor dimensions and fixed the batch size in llama-graph to ensure compatibility with stateful flash attention.
- Enhanced logging in llama-kv-cache-mixed to provide better insight during tensor operations, including checks for null pointers and buffer accessibility (sketched below).
- Updated masking logic to improve handling of quantized and FP16 tokens, ensuring correct application of attention masks.
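A hedged sketch of the kind of diagnostic check described in the logging bullet above; the function name, message text, and use of fprintf instead of the project's logging macros are placeholders.

```cpp
#include <stdio.h>
#include "ggml.h"

// Sketch only: report why a KV-cache tensor cannot be read before operating on it.
static bool kv_tensor_readable(const struct ggml_tensor * t, const char * what) {
    if (t == NULL) {
        fprintf(stderr, "%s: tensor is null\n", what);
        return false;
    }
    if (t->buffer == NULL && t->data == NULL) {
        fprintf(stderr, "%s: tensor '%s' has no backing buffer (%zu bytes expected)\n",
                what, t->name, ggml_nbytes(t));
        return false;
    }
    return true;
}
```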