Does that mean Tenstorrent also uses bf16 for matmul accumulators? There are models where at least fp16 is not enough in specific operations; those are marked with the precision flag.
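(For reference, a minimal sketch of what that precision flag looks like on the GGML side, assuming it refers to `ggml_mul_mat_set_prec` / `GGML_PREC_F32`; reading `op_params[0]` on the backend side is an internal convention some backends follow, not a stable public getter.)

```cpp
#include "ggml.h"

// Graph-construction side: request FP32 accumulation for one matmul.
static struct ggml_tensor * mul_mat_f32_acc(
        struct ggml_context * ctx,
        struct ggml_tensor  * w,
        struct ggml_tensor  * x) {
    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    ggml_mul_mat_set_prec(y, GGML_PREC_F32); // hint: keep accumulators in FP32
    return y;
}

// Backend side: check whether a MUL_MAT node asked for FP32 accumulation.
// (op_params[0] holding the precision is how some existing backends read it.)
static bool mul_mat_wants_f32_acc(const struct ggml_tensor * dst) {
    return (enum ggml_prec) dst->op_params[0] == GGML_PREC_F32;
}
```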
Hi all,
I have been working on a GGML backend for Tenstorrent processors (link) on and off for a while. I plan on upstreaming the backend in the near-to-mid future as it is starting to provide real benefits; right now I am fixing a few things and writing missing kernels. However, their processors have a few quirks in terms of floating point behavior, trading precision for speed.
Tensors on their platform can support up to FP32, but the matrix engine (used for matmul and reductions) only supports up to BFP16 and automatically downcasts FP32 inputs to BFP16 in hardware (the vector engine does support FP32, but it is only used for activation functions). This leads to lower-than-expected accuracy for operations such as `MUL_MAT`, `RMS_NORM`, `GROUP_NORM`, `SOFT_MAX` and the like. This doesn't seem to affect LLM inference and output quality stays good, but it does make the backend fail the `test-backend-ops` test suite.
Furthermore, due to the matrix engine limitation, it might be more beneficial for the backend to use BFP16 even when GGML asks for FP32. It would reduce the memory footprint while the bulk of the math is capped at BFP16 anyway.
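To make that trade-off concrete, here is a rough sketch of the "store BF16 even when GGML asks for FP32" idea. Everything here is hypothetical illustration (the helper names are made up, and it uses plain truncation rather than round-to-nearest-even); the real backend would do this inside its buffer's `set_tensor` path.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Truncate an IEEE-754 float to bfloat16 by keeping the top 16 bits.
static inline uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16);
}

// Downcast an F32 host buffer before uploading: the matrix engine would drop
// the extra mantissa bits anyway, and this halves the device memory use.
static std::vector<uint16_t> pack_f32_as_bf16(const float * src, size_t n) {
    std::vector<uint16_t> dst(n);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = f32_to_bf16(src[i]);
    }
    return dst;
}
```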
Would this be a blocker for upstreaming? And what could I do to reduce the issues my backend could cause?
Thank you!
Martin
Edit: For context, I have to relax `max_nmse_err` for some tests. But it is not terrible; at most `~1e-4` is enough to make most tests pass.
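For reference, this is the error metric `test-backend-ops` compares against `max_nmse_err()`, as I understand it (a sketch, not the actual test code): NMSE = sum((ref - out)^2) / sum(ref^2).

```cpp
#include <cstddef>

// Normalized mean squared error between a reference output and the
// backend's output; compared against the per-test max_nmse_err threshold.
static double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, ref_sq = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) ref[i] - (double) out[i];
        err    += d * d;
        ref_sq += (double) ref[i] * (double) ref[i];
    }
    return err / ref_sq;
}

// With BF16 accumulation, a relaxed bound around 1e-4 (as mentioned above)
// is what lets most tests pass.
static bool passes_relaxed(const float * ref, const float * out, size_t n) {
    return nmse(ref, out, n) <= 1e-4;
}
```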