# serialize scales as bf16 and serialize in Named Data Map (#11031)
XNNPACK currently uses BF16 scales when running GEMMs with groupwise
quantized weights, but we serialize the scales as FP32 and then convert
them to BF16 before passing them to XNNPACK. By serializing the scales
as BF16 in the first place, we save both memory and file size.
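For context, a minimal sketch of the dtype change (the tensor shape here is illustrative, not the actual Llama scale layout):

```python
import torch

# Groupwise quantization scales are typically computed in fp32.
fp32_scales = torch.rand(4096, 32, dtype=torch.float32)

# Casting to bf16 before serialization halves the bytes spent on scales.
# bf16 keeps fp32's 8-bit exponent, so the dynamic range of the scales
# is preserved; only mantissa precision is reduced.
bf16_scales = fp32_scales.to(torch.bfloat16)

print(fp32_scales.numel() * fp32_scales.element_size())  # 524288 bytes
print(bf16_scales.numel() * bf16_scales.element_size())  # 262144 bytes
```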
As an additional step, we move the serialization of scales for both
channelwise and groupwise quantized weights into the named data map.
This opens up a potential future feature of swapping data: because the
scales are no longer tied to the XNNPACK payload, they can be swapped
in through the ptd file. A toy sketch of the lookup-by-name idea
follows below.
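This sketch only illustrates the concept; the dict, the key format, and the `resolve` helper are hypothetical and are not the actual ExecuTorch named-data-map API:

```python
import torch

# Hypothetical illustration: scales live under stable string keys
# instead of being embedded inside the XNNPACK payload blob.
named_data = {
    "linear1.weight.scales": torch.rand(4096, 32).to(torch.bfloat16),
    "linear2.weight.scales": torch.rand(4096, 32).to(torch.bfloat16),
}

def resolve(key: str, external: dict[str, torch.Tensor]) -> torch.Tensor:
    # Because the delegate looks scales up by name at load time, an
    # externally supplied (ptd-style) tensor under the same key can
    # replace the baked-in copy without rebuilding the payload.
    return external.get(key, named_data[key])
```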
cc @lucylq for the scale serialization
### Llama Experiments
```
-rw-r--r-- 1 maxren staff 1746392320 May 20 16:49 llama3_fp32_scales.pte
-rw-r--r-- 1 maxren staff 1707798912 May 20 18:47 llama3_bf16_scales.pte
```
We see a ~38.6 MB (38,593,408 byte) reduction in model size.