CUDA: FA optimization for models using SWA #752

ikawrakow · 2025-09-02T11:54:02Z

This PR is analogous to #702 and implements the optimization for the CUDA back-end.
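For readers unfamiliar with the idea, here is a minimal host-side sketch of how a sliding-window (SWA) bound can shrink the KV range that flash attention has to visit. It is an illustration only, not the code from this PR: the helper `swa_kv_bounds`, the `KvRange` struct, the block-rounding step, and the window convention (a key at position `k` is visible to a query at position `p` when `p - n_swa < k <= p`) are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cstdint>

// Half-open range [begin, end) of KV positions that may be unmasked.
struct KvRange {
    int32_t begin;
    int32_t end;
};

// Hypothetical helper (not from the PR): for a batch of queries at positions
// q_first..q_last, return the KV range that can hold unmasked entries under a
// causal sliding window of size n_swa. Everything outside this range is fully
// masked and could be skipped by the attention kernel.
// Assumed convention: key k is visible to query p iff p - n_swa < k <= p.
// n_swa <= 0 is treated as "no sliding window".
inline KvRange swa_kv_bounds(int32_t q_first, int32_t q_last,
                             int32_t n_kv, int32_t n_swa, int32_t block) {
    // Oldest key that the first query in the batch can still see.
    int32_t lo = n_swa > 0 ? std::max<int32_t>(0, q_first - n_swa + 1) : 0;
    // Causal limit: the last query sees keys up to its own position.
    int32_t hi = std::min<int32_t>(n_kv, q_last + 1);
    // Round outward to block boundaries so whole KV tiles are processed.
    lo = (lo / block) * block;
    hi = std::min<int32_t>(n_kv, ((hi + block - 1) / block) * block);
    return {lo, hi};
}
```

With these assumptions, a single query at position 1000, a window of 128, and 32-wide KV tiles yields the range [864, 1001) instead of [0, 1001), so most of the cache is never touched.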

Here are performance comparisons between the main branch and this PR for Q4_0-quantized GPT-OSS-20B running on an RTX-4080 GPU:

Iwan Kawrakow added 3 commits September 2, 2025 13:01

Bounds for flash attention

c2500db

Add n_swa to FA parameters

be2694e

Fix it

32e223d

ikawrakow mentioned this pull request Sep 2, 2025

Refactor CUDA flash attention #745

Merged

This seems very slightly better

27e8ed6

ikawrakow mentioned this pull request Sep 3, 2025

Alternative CUDA FA for SWA models #754

Open
