Debug flash attention mixed-kv-cache issues #14
+96 −43
A segmentation fault in the mixed KV cache implementation was resolved by fixing tensor allocation in the graph-building process.
- The `llm_graph_input_attn_kv_mixed` class in `src/llama-graph.h` was extended with `attn_state` and `attn_result` `ggml_tensor` members.
- In `llm_graph_context::build_attn_inp_kv_mixed()` in `src/llama-graph.cpp`, these tensors are now pre-allocated and marked as inputs with `ggml_set_input()`, which prevents issues with unmanaged tensor buffers (see the first sketch below).
- `build_attn_mha_with_state()` was modified to accept and use the pre-allocated `state` and `result` tensors, removing their dynamic creation inside the function.
- The `attn_result` tensor layout was corrected from `[head_dim, seq_len, n_heads, n_batch]` to `[head_dim, n_heads, seq_len, n_batch]` to match the output of `ggml_flash_attn_ext_with_state`.
- The `ggml_reshape_2d` operation in `build_attn_mha_with_state()` was updated to `cur->ne[0] * cur->ne[1], cur->ne[2]` so the `attn_result` tensor is flattened correctly given its new dimensions (see the second sketch below).
- The call to `build_attn_mha_with_state()` in `build_attn_mixed_with_state()` was updated to pass the pre-allocated tensors.

These changes resolved the initial segmentation fault and a subsequent dimension mismatch, allowing the mixed KV cache with stateful flash attention to function correctly.
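For illustration, a minimal sketch of the pre-allocation step, assuming the usual `ggml` graph-building pattern. The member layout, tensor shapes, and surrounding names (`ctx0`, `inp`, `n_embd_head`, `n_head`, `n_tokens`) are assumptions for the sketch, not the exact code from this PR:

```cpp
// Sketch: extend the graph-input class with the two pre-allocated tensors
// (src/llama-graph.h). Only the new members are shown; existing members are elided.
class llm_graph_input_attn_kv_mixed {
public:
    // ... existing members (KV views, masks, ...) ...
    ggml_tensor * attn_state  = nullptr; // running flash-attention state
    ggml_tensor * attn_result = nullptr; // attention output, written by the flash-attention op
};

// Sketch: allocate both tensors during graph building and mark them as inputs
// (llm_graph_context::build_attn_inp_kv_mixed() in src/llama-graph.cpp), so they
// end up in backend-managed buffers instead of unmanaged memory.
inp->attn_state = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32,
        n_embd_head, n_head, n_tokens, 1);   // state shape assumed for illustration
ggml_set_input(inp->attn_state);

inp->attn_result = ggml_new_tensor_4d(ctx0, GGML_TYPE_F32,
        n_embd_head, n_head, n_tokens, 1);   // corrected layout: [head_dim, n_heads, seq_len, n_batch]
ggml_set_input(inp->attn_result);
```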
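And a sketch of the corrected flattening in `build_attn_mha_with_state()`, again with assumed surrounding names; only the reshape expression itself comes from the change described above:

```cpp
// Sketch: cur points at the pre-allocated attn_result with layout
// [head_dim, n_heads, seq_len, n_batch]. Flattening head_dim * n_heads into the
// embedding dimension therefore uses ne[0]*ne[1] columns and ne[2] rows.
ggml_tensor * cur = inp->attn_result;
cur = ggml_reshape_2d(ctx0, cur, cur->ne[0] * cur->ne[1], cur->ne[2]);
```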