
Conversation

hipudding (Collaborator) commented Sep 25, 2025

Many Ascend operators internally use FP16 precision for computation. If the input data is in FP32, it must first be cast to FP16 before computation and then cast back to FP32 afterwards, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less work than FP32, leading to noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5B model shows correct accuracy and about a 10% performance gain in concurrent scenarios, together with #16270.
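To make the cast overhead concrete, below is a minimal sketch of the round trip that an FP32 input forces around an FP16 kernel. The conversion helpers `ggml_fp32_to_fp16_row` and `ggml_fp16_to_fp32_row` are real ggml functions; `npu_fp16_kernel` is a hypothetical stand-in for an Ascend operator, not the actual CANN code path.

```cpp
// Sketch of the FP32 -> FP16 -> FP32 round trip described above.
#include "ggml.h"

#include <cstdint>
#include <vector>

// Hypothetical FP16 kernel; in the real backend this would be an NPU operator.
static void npu_fp16_kernel(ggml_fp16_t * data, int64_t n) {
    (void) data;
    (void) n;
}

// FP32 in/out: two extra conversions wrap the FP16 compute.
static void op_with_fp32_io(float * x, int64_t n) {
    std::vector<ggml_fp16_t> tmp(n);
    ggml_fp32_to_fp16_row(x, tmp.data(), n);  // cast #1: FP32 -> FP16
    npu_fp16_kernel(tmp.data(), n);           // compute in FP16
    ggml_fp16_to_fp32_row(tmp.data(), x, n);  // cast #2: FP16 -> FP32
}

// FP16 in/out: the kernel runs directly, no conversions needed.
static void op_with_fp16_io(ggml_fp16_t * x, int64_t n) {
    npu_fp16_kernel(x, n);
}

int main() {
    std::vector<float> x(1024, 1.0f);
    op_with_fp32_io(x.data(), (int64_t) x.size());

    std::vector<ggml_fp16_t> y(1024);
    op_with_fp16_io(y.data(), (int64_t) y.size());
    return 0;
}
```

Letting the operators accept FP16 tensors directly puts them on the second path, which is where the saved casts come from.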


hipudding added the Ascend NPU (issues specific to Ascend NPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Sep 25, 2025
ggerganov (Member) commented

> Validation on the Qwen2 model shows correct accuracy and about 10% performance gain in concurrent scenarios.

Which model size is this speed up for?

hipudding (Collaborator, Author) commented Sep 26, 2025

> Validation on the Qwen2 model shows correct accuracy and about 10% performance gain in concurrent scenarios.
>
> Which model size is this speed up for?

Performance improved by 8%–10%. This result is based on our testing with the Qwen2.5 0.5B model using llama-parallel under 10 concurrent requests (we recently had a business case involving the 0.5B model). We also tested on Qwen2.5 7B, Qwen3-MoE, and DeepSeek V2-Lite, where we observed smaller performance gains.

On Ascend, operators such as FLASH_ATTN_EXT and MUL_MAT are computed in FP16 precision. However, in llama.cpp the intermediate results default to FP32, which introduces nontrivial casting overhead. Using FP16 for intermediate results can reduce this casting cost.

We also tried computing these operators directly in FP32, but the higher computation cost made performance worse than the cast-plus-FP16 approach.

This PR only modifies the operators so that they support both FP32 and FP16 data types. Fully adopting FP16 as the intermediate type requires further changes in other parts of the code; I will open an issue and a draft PR today to start that discussion: #16271.
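As a rough illustration of what supporting both data types means at the graph level, here is a minimal sketch (with arbitrary, assumed shapes) that builds, but does not execute, a small ggml graph in which an F16 embedding table feeds `ggml_get_rows` and the gathered rows then go through `ggml_rms_norm`. Whether such a graph then runs on the NPU without inserted casts depends on the backend's per-type operator support, which is what this change extends for CANN.

```cpp
// Sketch: graph construction only (no_alloc, no compute), with an F16 source tensor.
#include "ggml.h"

int main() {
    ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,  // metadata only, so a small pool is enough
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    ggml_context * ctx = ggml_init(params);

    // F16 embedding table: 64-dim rows, 1000-row vocab (arbitrary example sizes).
    ggml_tensor * embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 64, 1000);
    // Token ids to gather; GET_ROWS requires I32 indices.
    ggml_tensor * ids  = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 8);

    // Gather rows from the F16 table, then normalize them.
    ggml_tensor * rows = ggml_get_rows(ctx, embd, ids);
    ggml_tensor * norm = ggml_rms_norm(ctx, rows, 1e-6f);

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, norm);

    ggml_free(ctx);
    return 0;
}
```

The result types of the intermediate nodes still follow ggml's generic rules; the point of the sketch is only that F16 inputs can now be handed to these operators without a preparatory cast on the CANN side.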

hipudding (Collaborator, Author) commented

Tests pass for the modified operators:

  • FLASH_ATTN_EXT
  • MUL_MAT
  • RMS_NORM
  • GET_ROWS

Many Ascend operators internally use FP16 precision for computation.
If input data is in FP32, it must first be cast to FP16 before
computation, and then cast back to FP32 after computation, which
introduces unnecessary cast operations. Moreover, FP16 computation
requires significantly less workload compared to FP32, leading to
noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5b model shows
correct accuracy and about 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <[email protected]>
noemotiovon approved these changes Oct 9, 2025
noemotiovon self-requested a review October 9, 2025 09:21
hipudding requested review from ggerganov and slaren October 10, 2025 07:24
hipudding merged commit f9bc66c into ggml-org:master Oct 13, 2025
69 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 13, 2025
* origin/master: (32 commits)
metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
opencl: fix build targeting CL 2 (ggml-org#16554)
CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
metal: add support for opt_step_sgd (ggml-org#16539)
ggml : fix scalar path for computing norm (ggml-org#16558)
CANN: Update several operators to support FP16 data format (ggml-org#16251)
metal : add opt_step_adamw and op_sum (ggml-org#16529)
webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
common : handle unicode during partial json parsing (ggml-org#16526)
common : update presets (ggml-org#16504)
ggml : Fix FP16 ELU positive branch (ggml-org#16519)
hparams : add check for layer index in is_recurrent (ggml-org#16511)
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
...
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025