
Conversation

@Zijie-Tian
Owner

A segmentation fault in the mixed KV cache implementation was resolved by fixing tensor allocation during the graph-building process.

  • The llm_graph_input_attn_kv_mixed class in src/llama-graph.h was extended to include attn_state and attn_result ggml_tensor members.
  • In llm_graph_context::build_attn_inp_kv_mixed() within src/llama-graph.cpp, these tensors are now pre-allocated and marked as inputs using ggml_set_input(). This prevents issues with unmanaged tensor buffers.
  • The build_attn_mha_with_state() function was modified to accept and utilize these pre-allocated state and result tensors, removing their dynamic creation within the function.
  • A dimension mismatch in the pre-allocated attn_result tensor was corrected from [head_dim, seq_len, n_heads, n_batch] to [head_dim, n_heads, seq_len, n_batch] to match the output of ggml_flash_attn_ext_with_state.
  • The ggml_reshape_2d operation in build_attn_mha_with_state() was updated to cur->ne[0] * cur->ne[1], cur->ne[2] to correctly flatten the attn_result tensor based on its new dimensions.
  • The call to build_attn_mha_with_state in build_attn_mixed_with_state was updated to pass the pre-allocated tensors.

These changes resolved the initial segmentation fault and a subsequent dimension mismatch, allowing the mixed KV cache with stateful flash attention to function correctly; a sketch of the pre-allocation pattern is shown below.
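A minimal sketch of the pre-allocation pattern, not the exact patch: the variable names (`inp`, `ctx0`, `head_dim`, `n_heads`, `seq_len`, `n_batch`) and the `attn_state` shape are assumptions for illustration; only the `attn_result` layout comes from the description above.

```cpp
// Sketch only. Creating the tensors while building the graph inputs and
// calling ggml_set_input() lets the backend scheduler manage their buffers,
// instead of build_attn_mha_with_state() creating them ad hoc.
inp->attn_result = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32,
                                      head_dim, n_heads, seq_len, n_batch); // matches the ggml_flash_attn_ext_with_state output order
inp->attn_state  = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32,
                                      head_dim, n_heads, seq_len, n_batch); // placeholder shape -- the real state layout may differ
ggml_set_input(inp->attn_result);
ggml_set_input(inp->attn_state);
```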

@Zijie-Tian requested a review from Copilot on June 25, 2025 07:53

Copilot AI left a comment


Pull Request Overview

This PR resolves a segfault and dimension mismatch in the mixed KV cache’s flash attention by pre-allocating state/result tensors and correcting their layout and reshape.

  • Added attn_state and attn_result members (with getters) to llm_graph_input_attn_kv_mixed
  • Pre-allocated and registered attn_state/attn_result in build_attn_inp_kv_mixed
  • Updated build_attn_mha_with_state to use pre-allocated tensors and fixed the reshape logic

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/llama-graph.h | Added attn_state/attn_result members and getters in llm_graph_input_attn_kv_mixed. |
| src/llama-graph.cpp | Pre-allocated tensors in build_attn_inp_kv_mixed, updated the build_attn_mha_with_state signature, replaced dynamic allocation, and corrected the reshape logic. |
Comments suppressed due to low confidence (1)

src/llama-graph.h:326

  • The comment for attn_result lists dimensions as [head_dim, seq_len, n_heads, n_batch] but the tensor is created with order [head_dim, n_heads, seq_len, n_batch]. Please update the comment to match the actual dimension order.
    ggml_tensor * attn_result = nullptr; // F32 [head_dim, seq_len, n_heads, n_batch] output tensor

Comment on lines +1922 to +1923
// cur has dimensions [head_dim, n_heads, seq_len, n_batch], so we flatten the first two dimensions
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2]);

Copilot AI Jun 25, 2025


This reshape only accounts for head_dim * n_heads by seq_len, dropping the n_batch dimension. The reshape dimensions must multiply to the total element count (including batch). Consider using cur->ne[0] * cur->ne[1] and cur->ne[2] * cur->ne[3] to preserve the batch size.

Suggested change
// cur has dimensions [head_dim, n_heads, seq_len, n_batch], so we flatten the first two dimensions
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2]);
// cur has dimensions [head_dim, n_heads, seq_len, n_batch], so we flatten the first two dimensions while preserving the batch size
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2] * cur->ne[3]);
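For reference, a small sketch of the invariant behind this suggestion (the helper and its name are illustrative, not part of the patch): ggml_reshape_2d requires the new dimensions to multiply to the tensor's total element count, so the batch dimension must be folded in rather than dropped.

```cpp
#include "ggml.h"

// Illustrative helper: flatten a [head_dim, n_heads, seq_len, n_batch] tensor
// to 2D without losing the batch dimension. Dropping cur->ne[3] only appears
// to work when n_batch == 1, which is what hides the bug.
static struct ggml_tensor * flatten_attn_result(struct ggml_context * ctx0,
                                                struct ggml_tensor * cur) {
    GGML_ASSERT((cur->ne[0] * cur->ne[1]) * (cur->ne[2] * cur->ne[3]) == ggml_nelements(cur));
    return ggml_reshape_2d(ctx0, cur,
                           cur->ne[0] * cur->ne[1],   // head_dim * n_heads
                           cur->ne[2] * cur->ne[3]);  // seq_len  * n_batch
}
```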

- Adjusted tensor dimensions and fixed the batch size in llama-graph to ensure compatibility with stateful flash attention.
- Enhanced logging in llama-kv-cache-mixed to provide better insight during tensor operations, including checks for null pointers and buffer accessibility (sketched below).
- Updated masking logic to improve handling of quantized and FP16 tokens, ensuring correct application of attention masks.
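A hedged sketch of the kind of diagnostic check described in the logging bullet above; the function name, message text, and use of fprintf instead of the project's logging macros are placeholders.

```cpp
#include <stdio.h>
#include "ggml.h"

// Sketch only: report why a KV-cache tensor cannot be read before operating on it.
static bool kv_tensor_readable(const struct ggml_tensor * t, const char * what) {
    if (t == NULL) {
        fprintf(stderr, "%s: tensor is null\n", what);
        return false;
    }
    if (t->buffer == NULL && t->data == NULL) {
        fprintf(stderr, "%s: tensor '%s' has no backing buffer (%zu bytes expected)\n",
                what, t->name, ggml_nbytes(t));
        return false;
    }
    return true;
}
```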