On newer hardware with A17 Pro or M4 chips, such as the iPhone 15 Pro, quantizing both activations and weights to int8 leverages optimized int8 compute on the Neural Engine, which can improve runtime latency in compute-bound models.
In this section, we demonstrate how to apply [Post Training Activation Quantization](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-algos.html#post-training-data-calibration-activation-quantization), using calibration data, to the Stable Diffusion UNet model.
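For reference, the linked coremltools algorithm can be invoked roughly as follows. This is a minimal sketch assuming coremltools >= 8.0 and its experimental activation-quantization namespace; the model path and input names are placeholders, not the repository's actual script:

```python
# Sketch: W8A8 post-training quantization of a Core ML model with
# calibration data (assumes coremltools >= 8.0; paths/names are placeholders).
import numpy as np
import coremltools as ct
import coremltools.optimize as cto
from coremltools.optimize.coreml import experimental as cto_experimental

mlmodel = ct.models.MLModel("Unet.mlpackage")  # hypothetical path

# Calibration samples: a list of dicts mapping input names to arrays,
# e.g. the recorded UNet inputs described below. "sample" is illustrative.
sample_data = [{"sample": np.random.rand(1, 4, 64, 64).astype(np.float32)}]

# 1) Quantize activations to int8 using the calibration data (A8).
act_config = cto.coreml.OptimizationConfig(
    global_config=cto_experimental.OpActivationLinearQuantizerConfig(
        mode="linear_symmetric"
    )
)
mlmodel_a8 = cto_experimental.linear_quantize_activations(
    mlmodel, act_config, sample_data
)

# 2) Quantize weights to int8 (W8), yielding W8A8 overall.
weight_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel_w8a8 = cto.coreml.linear_quantize_weights(mlmodel_a8, weight_config)
```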
Similar to Mixed-Bit Palettization (MBP) described [above](#compression-lower-than-6-bits), a per-layer analysis is first run to determine which intermediate activations are more sensitive to 8-bit compression.
Less sensitive layers are weight and activation quantized (W8A8), whereas more sensitive layers are only weight quantized (W8A16).
A set of calibration text prompts is run through the StableDiffusionPipeline, and the UNet model inputs are recorded and stored as pickle files in the `calibration_data_<model-version>` folder inside the specified output directory.
This will run the analysis on all Convolutional and Attention (Einsum) modules in the model.
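A hypothetical sketch of how such calibration inputs could be captured with a PyTorch forward pre-hook; the model ID, file names, and prompt below are illustrative and not the repository's actual implementation:

```python
# Sketch: record the tensors the UNet receives during denoising so they
# can later serve as calibration samples (names/paths are illustrative).
import os
import pickle
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

recorded_inputs = []

def record_inputs(module, args, kwargs):
    # Store positional/keyword tensors passed to the UNet on each step.
    recorded_inputs.append({
        "args": [a.detach().cpu() for a in args if torch.is_tensor(a)],
        "kwargs": {k: v.detach().cpu() for k, v in kwargs.items()
                   if torch.is_tensor(v)},
    })

handle = pipe.unet.register_forward_pre_hook(record_inputs, with_kwargs=True)
for prompt in ["a photo of an astronaut riding a horse"]:
    pipe(prompt, num_inference_steps=5)
handle.remove()

os.makedirs("calibration_data", exist_ok=True)
with open("calibration_data/sample_0.pkl", "wb") as f:
    pickle.dump(recorded_inputs, f)
```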
For each module, a compressed version is generated by quantizing only that module's weights and activations.
Then the PSNR between the outputs of the compressed and original models is calculated using the same random seed and text prompts.
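The metric itself is straightforward; below is one common PSNR definition, shown for reference. The exact variant used by the analysis script may differ (e.g., in how the peak value is chosen):

```python
# Sketch: PSNR in dB between two equally shaped output arrays;
# higher values mean the compressed model tracks the original more closely.
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray) -> float:
    peak = np.abs(reference).max()
    rmse = max(np.sqrt(np.mean((reference - test) ** 2)), 1e-12)
    return 20.0 * np.log10(peak / rmse)
```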
This analysis takes up to a few hours on a single GPU (CUDA). The number of calibration samples used to quantize the model can be reduced to speed up the process.
The PSNR threshold determines which layers will be activation quantized; it can be tuned to trade off output quality against inference latency.
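As an illustration of how the threshold partitions layers, all names and values below are invented for the example:

```python
# Sketch: layers whose per-layer PSNR stays above the threshold tolerate
# int8 activations (W8A8); the rest keep float16 activations (W8A16).
per_layer_psnr = {
    "down_blocks.0.attentions.0": 52.3,  # robust to int8 activations
    "mid_block.resnets.0": 31.8,         # sensitive, keep float16 activations
}
threshold_db = 40.0  # raise for quality, lower for latency

w8a8_layers = [n for n, p in per_layer_psnr.items() if p >= threshold_db]
w8a16_layers = [n for n, p in per_layer_psnr.items() if p < threshold_db]
```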
</details>
## <a name="using-stable-diffusion-3"></a> Using Stable Diffusion 3