CUDA: faster tile FA, add oob checks, more HSs #16492
Conversation
Been doing some tests with this branch and haven't noticed any problems so far.
I can confirm the performance changes on gfx1030 and found no issues in brief testing.
From static analysis it looks correct, but it's a bit difficult to follow what the changes to the code in fattn-tile.cu are, since this PR includes organizational and functional code changes in one commit, which I would prefer to avoid.
I agree, but in this case the changes to the kernel itself were relatively large anyway, so I think it will need to be read in full either way. Generally speaking, would you prefer I link the relevant WIP branches in cases like this?
Ideally a PR like this should simply have two commits, one with the organizational changes and one with the functional changes. If that is impractical due to how the changes came about, then yes, a note on where the intermediate states can be looked at would help.
* origin/master: (32 commits)
  metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
  graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
  opencl: fix build targeting CL 2 (ggml-org#16554)
  CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
  ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
  CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
  fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
  metal: add support for opt_step_sgd (ggml-org#16539)
  ggml : fix scalar path for computing norm (ggml-org#16558)
  CANN: Update several operators to support FP16 data format (ggml-org#16251)
  metal : add opt_step_adamw and op_sum (ggml-org#16529)
  webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
  [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
  ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
  common : handle unicode during partial json parsing (ggml-org#16526)
  common : update presets (ggml-org#16504)
  ggml : Fix FP16 ELU positive branch (ggml-org#16519)
  hparams : add check for layer index in is_recurrent (ggml-org#16511)
  ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
  CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
  ...
Changes:
- Add out-of-bounds checks w.r.t. `ne11` to the mma kernel (a sketch of the guarded-load pattern follows after this list).
- I am not yet using `fastdiv` in the FA kernels; because of this there are some combinations of GPUs, models, and batch sizes where there is a 1-2% performance regression. I intend to add `fastdiv` once I have removed the WMMA kernel; I expect this to be fixed then. Also note that the granularity in terms of tokens is now being reduced by a factor equal to the GQA ratio, so even in those cases there is now slightly less wasted compute. (A sketch of the `fastdiv` technique also follows below.)
- Fix an issue in `common.cuh` where, if one were to compile code for CC 6.1 and then run it on a device with CC >= 7.0, `FAST_FP16_AVAILABLE` and `fast_fp16_available` could be inconsistent (illustrated below).
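For readers unfamiliar with what an out-of-bounds check against `ne11` means in practice: `ne11` is the K/V sequence length in ggml's naming, and it need not be a multiple of the kernel's tile size, so the final tile must guard its loads. Below is a minimal sketch of that pattern; all names are hypothetical, not the actual identifiers in fattn-tile.cu.

```cuda
#include <cuda_fp16.h>
#include <math_constants.h>

// Minimal sketch of a guarded tile loop (hypothetical names, not the
// fattn-tile.cu code). ne11 is the K/V sequence length, which need not be a
// multiple of the tile size, so the last tile checks bounds instead of
// requiring the KV cache to be padded.
__global__ void kq_max_sketch(const half * K, float * KQ_max, const int ne11) {
    float kq_max = -CUDART_INF_F;

    for (int i0 = 0; i0 < ne11; i0 += blockDim.x) {
        const int i = i0 + threadIdx.x; // KV position handled by this thread

        // OOB check: positions past the end contribute -inf, which is a no-op
        // for the running maximum (and would vanish in a softmax), instead of
        // reading past the end of K.
        const float kq = i < ne11 ? __half2float(K[i]) : -CUDART_INF_F;

        kq_max = fmaxf(kq_max, kq);
    }

    if (threadIdx.x == 0) {
        KQ_max[blockIdx.x] = kq_max; // thread 0's partial result; the block-wide reduction is omitted
    }
}
```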
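Regarding `fastdiv`: the term refers to the standard trick of replacing a hardware integer division by a divisor that is constant for the whole kernel launch with a precomputed multiply-high and shift (Granlund/Montgomery-style magic numbers). The sketch below shows the generic technique, not the actual helper in common.cuh: `init_fastdiv` runs once on the host, `fastdiv` runs on the device.

```cuda
#include <cstdint>

// Generic sketch of the fastdiv technique; NOT the common.cuh helper. For a
// divisor d fixed per launch, precompute (mp, L) once on the host, then divide
// on the device with a multiply-high and a shift instead of a hardware divide.
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift amount, L = ceil(log2(d))
};

// Host side, once per launch. Requires d >= 1.
static fastdiv_vals init_fastdiv(const uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t(1) << L) < d) {
        ++L;
    }
    const uint32_t mp = uint32_t(((uint64_t(1) << 32)*((uint64_t(1) << L) - d))/d + 1);
    return {mp, L};
}

// Device side: computes n/d exactly for any 32-bit n, without a hardware divide.
__device__ __forceinline__ uint32_t fastdiv(const uint32_t n, const fastdiv_vals v) {
    return uint32_t((uint64_t(__umulhi(v.mp, n)) + n) >> v.L);
}
```

The design point is that divisors such as strides or the GQA ratio are fixed for the whole launch, so the host can amortize the expensive setup and the device pays only a `__umulhi` and a shift per division.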
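The `common.cuh` issue comes from mixing a compile-time check (`__CUDA_ARCH__`, baked in per compiled architecture) with a run-time check (the compute capability of the device actually in use). A hypothetical illustration of the pre-fix failure mode follows, assuming fast FP16 on CC >= 6.0 except CC 6.1; the names are made up and the real macro and helper in common.cuh may differ in details.

```cuda
// Hypothetical illustration of the pre-fix failure mode; the real macro and
// helper live in common.cuh.

// Device side: decided at compile time from the architecture the code was
// compiled for, e.g. fast FP16 on CC >= 6.0 except CC 6.1.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600 && __CUDA_ARCH__ != 610
#define MY_FAST_FP16_AVAILABLE
#endif

// Host side: decided at run time from the device actually in use.
static bool my_fast_fp16_available(const int cc) {
    return cc >= 600 && cc != 610;
}

// Failure mode: compile only for CC 6.1, then run on a CC >= 7.0 device. The
// device code (including PTX JIT-compiled from the 6.1 build) was generated
// with __CUDA_ARCH__ == 610, so MY_FAST_FP16_AVAILABLE is undefined there,
// while my_fast_fp16_available(700) returns true on the host: the two sides
// disagree about which code path is active.
```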
Performance