I was reading the implementation of the MXFP4/INT8 dot product for gpt-oss, and it occurred to me that since it's memory-bound, we could get extra precision essentially for free by accumulating in fp64, by using Kahan summation, or both, and it wouldn't be any slower. The same is probably true anywhere there's an accumulation.
Which does llama.cpp care about more: implementing the models faithfully, or picking up free extra precision for better outputs?
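For concreteness, here is a minimal sketch of the compensated-accumulation idea in plain C with float inputs. This is not llama.cpp's actual kernel (the real MXFP4/INT8 kernels work on quantized blocks, and the names here are made up for illustration); it just shows Kahan summation next to a naive loop and an fp64 reference accumulator.

```c
/* Sketch only: illustrates Kahan-compensated accumulation in a dot product.
 * Hypothetical code, not llama.cpp's MXFP4/INT8 kernel. */
#include <stdio.h>

/* Naive accumulation: rounding error grows with n. */
static float dot_naive(const float *x, const float *y, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Kahan summation: carry a compensation term that captures the
 * low-order bits lost in each addition. */
static float dot_kahan(const float *x, const float *y, int n) {
    float sum = 0.0f;
    float c   = 0.0f;               /* compensation for lost low-order bits */
    for (int i = 0; i < n; i++) {
        float term = x[i] * y[i] - c;
        float t    = sum + term;    /* low bits of term may be lost here */
        c   = (t - sum) - term;     /* recover what was lost */
        sum = t;
    }
    return sum;
}

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f / (float)(i + 1); y[i] = 1.0f; }

    /* fp64 accumulation as a reference for comparison. */
    double ref = 0.0;
    for (int i = 0; i < N; i++) ref += (double)x[i] * (double)y[i];

    printf("naive: %.8f  kahan: %.8f  fp64 ref: %.8f\n",
           dot_naive(x, y, N), dot_kahan(x, y, N), ref);
    return 0;
}
```

The extra compensation work is a few register-resident flops per element, which is the reason it could plausibly hide under the memory traffic in a memory-bound kernel.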