
Conversation

ggerganov (Member) commented on Oct 12, 2025

target #16528

  • Add Metal FA kernels for F32 K and V
  • Add Metal FA kernels for head size 32
  • Remove K and V casts with cacheless contexts (we should keep the casts for now)

Sample command for testing:

llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -e -p "$(printf 'hello %.0s' {1..510})" --pooling cls -c 512 -fa on

github-actions bot added the testing, ggml, and Apple Metal labels on Oct 12, 2025
ggerganov (Member, Author) commented on Oct 12, 2025

@JohannesGaessler @jeffbolznv Would it be possible to add support for F32 K and V tensors in the respective backends?

The issue is that these casts on master significantly increase the compute buffer size for embedding models (see #15586 (comment)):

llama.cpp/src/llama-graph.cpp, lines 1313 to 1326 in 4b2dae3:

    // this can happen when KV cache is not used (e.g. an embedding model with non-causal attn)
    if (k->type == GGML_TYPE_F32) {
        k = ggml_cast(ctx0, k, GGML_TYPE_F16);
    }

    if (v->type == GGML_TYPE_F32) {
        v = ggml_cast(ctx0, v, GGML_TYPE_F16);
    }

    cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
            hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
    cb(cur, LLAMA_TENSOR_NAME_FATTN, il);

If we remove the casts, the memory usage should be significantly reduced for this use case. But to remove them, the FA implementation has to support k->type == GGML_TYPE_F32 && v->type == GGML_TYPE_F32 in the ggml_flash_attn_ext() operator.
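
For reference, a minimal sketch of what that call site would reduce to once the backends accept F32 K and V directly (i.e. the two ggml_cast() calls above are simply dropped):

    // cacheless path: pass K and V to FA as F32, with no intermediate F16 copies
    cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, hparams.f_max_alibi_bias,
            hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
    cb(cur, LLAMA_TENSOR_NAME_FATTN, il);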

JohannesGaessler (Collaborator) commented on Oct 12, 2025

It's definitely possible, but it will require additional considerations w.r.t. SRAM limits. For the tile kernel, the FP16 vs. FP32 choice would need to be determined via a template parameter rather than the FAST_FP16_AVAILABLE macro, and we would need to work out which kernel parameterizations, if any, still fit in SRAM when storing the KV data takes twice as much memory as with FP16.
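
For illustration only (not the actual fattn-tile code; the type alias, tile sizes, and budget below are made-up placeholders), a self-contained sketch of that SRAM question: pick the KV element type via a template parameter and check whether one K/V tile pair still fits in the per-block shared memory budget.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // stand-in for a 16-bit float storage type (e.g. CUDA half)
    using f16_storage = uint16_t;

    // bytes of shared memory needed to hold one K tile and one V tile of the given shape
    template <typename KV_t>
    constexpr size_t kv_tile_bytes(size_t head_size, size_t kq_tile) {
        return 2 * sizeof(KV_t) * head_size * kq_tile;
    }

    int main() {
        constexpr size_t head_size   = 128;       // hypothetical head size
        constexpr size_t kq_tile     = 64;        // hypothetical number of KV positions per tile
        constexpr size_t smem_budget = 48 * 1024; // common default per-block shared memory limit

        const size_t f16_bytes = kv_tile_bytes<f16_storage>(head_size, kq_tile);
        const size_t f32_bytes = kv_tile_bytes<float>(head_size, kq_tile);

        printf("F16 KV tiles: %zu bytes (fits: %s)\n", f16_bytes, f16_bytes <= smem_budget ? "yes" : "no");
        printf("F32 KV tiles: %zu bytes (fits: %s)\n", f32_bytes, f32_bytes <= smem_budget ? "yes" : "no");
        return 0;
    }

With these placeholder numbers the F16 tiles fit comfortably while the F32 tiles overflow the budget, which is exactly the parameter search the comment refers to.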

JohannesGaessler (Collaborator):

Do the models for which this is relevant use GQA?

JohannesGaessler (Collaborator):

We could also consider making the operations preceding FA write back their data as FP16 in the first place. In terms of performance, that would definitely be preferable for all CUDA/ROCm GPUs except for Pascal.

ggerganov (Member, Author):

Do the models for which this is relevant use GQA?

Generally yes.

If it would make the implementation simpler, maybe we can treat F32 K and V as just another "quantization" type, where the dequantize function is a cast to F16?
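
As a rough sketch of that idea (this is not existing backend code, just the shape of it, written against ggml's public FP16 conversion helper), the "dequantize row" hook for F32 K/V data would be nothing more than a per-element cast:

    #include "ggml.h"

    // hypothetical hook: "dequantize" a row of F32 K/V data by casting it to F16,
    // mirroring how quantized types expose a dequantize-row function to the FA kernels
    static void dequantize_row_f32_to_f16(const float * x, ggml_fp16_t * y, int64_t n) {
        for (int64_t i = 0; i < n; ++i) {
            y[i] = ggml_fp32_to_fp16(x[i]);
        }
    }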

JohannesGaessler (Collaborator):

For CUDA that can definitely be done with comparatively little effort, but it would not eliminate the additional memory use; it would just shift it from the compute buffer to the buffer pool in the CUDA backend.

ggerganov marked this pull request as ready for review on October 12, 2025, 13:58
ggerganov requested a review from slaren as a code owner on October 12, 2025, 13:58
jeffbolznv (Collaborator):

I think this should be relatively straightforward in the Vulkan backend; I'll look into it. This comment is how I'd expect to implement it (we dequantize while loading, so there is no extra memory usage):

If it would make the implementation simpler, maybe we can treat F32 K and V as just another "quantization" type, where the dequantize function is a cast to F16?

jeffbolznv (Collaborator):

Done for Vulkan in #16543

JohannesGaessler (Collaborator):

Basic CUDA support in #16546.

ggerganov requested a review from CISC as a code owner on October 13, 2025, 14:30
Base automatically changed from gg/cacheless-embd to master on October 13, 2025, 19:42
ggerganov force-pushed the gg/metal-fa-f32-inp branch from f027196 to 252bb89 on October 13, 2025, 19:49
ggerganov merged commit e60f241 into master on Oct 13, 2025 (1 check passed)
ggerganov deleted the gg/metal-fa-f32-inp branch on October 13, 2025, 20:08
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 13, 2025
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 13, 2025
