Does that mean Tenstorrent also uses bf16 for matmul accumulators? There are models where at least fp16 is not enough in specific operations; those are marked with the precision flag.
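(For reference, a minimal sketch of what that precision flag looks like on the GGML side, assuming it refers to `ggml_mul_mat_set_prec` / `GGML_PREC_F32`; reading `op_params[0]` on the backend side is an internal convention some backends follow, not a stable public getter.)

```cpp
#include "ggml.h"

// Graph-construction side: request FP32 accumulation for one matmul.
static struct ggml_tensor * mul_mat_f32_acc(
        struct ggml_context * ctx,
        struct ggml_tensor  * w,
        struct ggml_tensor  * x) {
    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    ggml_mul_mat_set_prec(y, GGML_PREC_F32); // hint: keep accumulators in FP32
    return y;
}

// Backend side: check whether a MUL_MAT node asked for FP32 accumulation.
// (op_params[0] holding the precision is how some existing backends read it.)
static bool mul_mat_wants_f32_acc(const struct ggml_tensor * dst) {
    return (enum ggml_prec) dst->op_params[0] == GGML_PREC_F32;
}
```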
Hi all,
I have been working on a GGML backend for Tenstorrent processors (link) on and off for a while. I plan on upstreaming the backend in the near-to-mid future as it is starting to provide real benefits; right now I am fixing a few things and writing missing kernels. However, their processors have a few quirks in terms of floating point behavior, trading precision for speed.
Tensors on their platform can support up to FP32, but the matrix engine (used for matmul and reductions) only supports up to BFP16 and automatically downcasts FP32 inputs to BFP16 in hardware (the vector engine does support FP32, but it is only used for activation functions). This leads to lower-than-expected accuracy for operations such as `MUL_MAT`, `RMS_NORM`, `GROUP_NORM`, `SOFT_MAX` and the like. This doesn't seem to affect LLM inference and output quality stays good, but it does make the backend fail the `test-backend-ops` test suite.
Furthermore, due to the matrix engine limitation, it might be more beneficial for the backend to use BFP16 even when GGML asks for FP32. It would reduce the memory footprint while the bulk of the math is capped at BFP16 anyway.
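To make that trade-off concrete, here is a rough sketch of the "store BF16 even when GGML asks for FP32" idea. Everything here is hypothetical illustration (the helper names are made up, and it uses plain truncation rather than round-to-nearest-even); the real backend would do this inside its buffer's `set_tensor` path.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Truncate an IEEE-754 float to bfloat16 by keeping the top 16 bits.
static inline uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16);
}

// Downcast an F32 host buffer before uploading: the matrix engine would drop
// the extra mantissa bits anyway, and this halves the device memory use.
static std::vector<uint16_t> pack_f32_as_bf16(const float * src, size_t n) {
    std::vector<uint16_t> dst(n);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = f32_to_bf16(src[i]);
    }
    return dst;
}
```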
Would this be a blocker for upstreaming? And what could I do to reduce the issues my backend could cause?
Thank you!
Martin
Edit: For context, I have to relax `max_nmse_err` for some tests. But it is not terrible; at most `~1e-4` is enough to make most tests pass.
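For reference, this is the error metric `test-backend-ops` compares against `max_nmse_err()`, as I understand it (a sketch, not the actual test code): NMSE = sum((ref - out)^2) / sum(ref^2).

```cpp
#include <cstddef>

// Normalized mean squared error between a reference output and the
// backend's output; compared against the per-test max_nmse_err threshold.
static double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, ref_sq = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) ref[i] - (double) out[i];
        err    += d * d;
        ref_sq += (double) ref[i] * (double) ref[i];
    }
    return err / ref_sq;
}

// With BF16 accumulation, a relaxed bound around 1e-4 (as mentioned above)
// is what lets most tests pass.
static bool passes_relaxed(const float * ref, const float * out, size_t n) {
    return nmse(ref, out, n) <= 1e-4;
}
```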