
Commit f8e3614

Add support for mixed precision activation quantization for UNet model (#365)
1 parent 200f255 commit f8e3614

5 files changed: +626 -21 lines changed


README.md

Lines changed: 69 additions & 0 deletions
@@ -246,6 +246,75 @@ An example `<selected-recipe-string-key>` would be `"recipe_4.50_bit_mixedpalett

</details>

## <a name="activation-quant"></a> Activation Quantization

<details>
<summary> Details (Click to expand) </summary>

On newer hardware with A17 Pro or M4 chips, such as the iPhone 15 Pro, quantizing both activations and weights to int8 can leverage optimized compute on the Neural Engine and reduce runtime latency in compute-bound models.

In this section, we demonstrate how to apply [Post Training Activation Quantization](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-algos.html#post-training-data-calibration-activation-quantization), using calibration data, to the Stable Diffusion UNet model.
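
At a high level, calibration-based activation quantization runs a handful of representative inputs through the model, records the range each intermediate activation actually takes, and derives an int8 scale from that range. The following is a minimal PyTorch sketch of that idea, not the coremltools implementation; the `CalibrationObserver` class and the symmetric per-tensor scale formula are illustrative assumptions.

```python
import torch

class CalibrationObserver:
    """Illustrative only: track the largest activation magnitude seen during
    calibration and derive a symmetric int8 scale from it."""

    def __init__(self):
        self.abs_max = 0.0

    def __call__(self, module, inputs, output):
        # Forward hook: update the running absolute maximum of this layer's output.
        self.abs_max = max(self.abs_max, output.detach().abs().max().item())

    def scale(self):
        # Symmetric int8 quantization maps [-abs_max, abs_max] onto [-127, 127].
        return self.abs_max / 127.0

# Attach the observer, run a few calibration inputs, then read the scale that a
# W8A8 kernel would use for this layer's activations.
layer = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
observer = CalibrationObserver()
handle = layer.register_forward_hook(observer)
for _ in range(8):  # stand-in for UNet inputs generated from calibration prompts
    layer(torch.randn(1, 4, 64, 64))
handle.remove()
print(f"int8 activation scale: {observer.scale():.6f}")
```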

Similar to Mixed-Bit Palettization (MBP) described [above](#a-namecompression-lower-than-6-bitsa-advanced-weight-compression-lower-than-6-bits), a per-layer analysis is first run to determine which intermediate activations are more sensitive to 8-bit compression.
Less sensitive layers are both weight and activation quantized (W8A8), whereas more sensitive layers are only weight quantized (W8A16).

Here are the steps for applying this technique:

**Step 1:** Generate calibration data

```bash
python -m python_coreml_stable_diffusion.activation_quantization --model-version <model-version> --generate-calibration-data -o <output-dir>
```

A set of calibration text prompts is run through the `StableDiffusionPipeline`, and the UNet model inputs are recorded and stored as pickle files in the `calibration_data_<model-version>` folder inside the specified output directory.
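
Conceptually, this recording step can be pictured as a forward pre-hook on the UNet that captures its inputs while a diffusers `StableDiffusionPipeline` runs the calibration prompts. The sketch below illustrates that idea only; the helper name `record_unet_inputs`, the `sample_*.pkl` file layout, and the exact pipeline usage are assumptions, not the script's actual implementation.

```python
import os
import pickle

import torch

def record_unet_inputs(pipe, prompts, output_dir):
    """Run calibration prompts through a diffusers StableDiffusionPipeline and
    pickle each set of inputs the UNet receives (simplified illustration)."""
    os.makedirs(output_dir, exist_ok=True)
    recorded = []

    def to_cpu(value):
        return value.detach().cpu() if torch.is_tensor(value) else value

    def pre_hook(module, args, kwargs):
        # Capture the UNet call (latents, timestep, text embeddings, ...) without
        # modifying it; returning None leaves the forward pass untouched.
        recorded.append({
            "args": [to_cpu(a) for a in args],
            "kwargs": {k: to_cpu(v) for k, v in kwargs.items()},
        })
        return None

    handle = pipe.unet.register_forward_pre_hook(pre_hook, with_kwargs=True)
    try:
        for prompt in prompts:
            pipe(prompt, num_inference_steps=10)
    finally:
        handle.remove()

    for i, sample in enumerate(recorded):
        with open(os.path.join(output_dir, f"sample_{i:04d}.pkl"), "wb") as f:
            pickle.dump(sample, f)
```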

**Step 2:** Run layer-wise sensitivity analysis

```bash
python -m python_coreml_stable_diffusion.activation_quantization --model-version <model-version> --layerwise-sensitivity --calibration-nsamples <num-samples> -o <output-dir>
```

This will run the analysis on all Convolutional and Attention (Einsum) modules in the model.
For each module, a compressed version of the model is generated by quantizing only that layer's weights and activations.
Then the PSNR between the outputs of the compressed and original models is calculated, using the same random seed and text prompts.
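
For reference, the kind of PSNR score reported by the sensitivity analysis can be computed as in the sketch below; this is a generic formulation (using the reference tensor's peak magnitude as the signal level), and the script's own metric may differ in such details.

```python
import numpy as np

def psnr(reference, candidate):
    """Peak signal-to-noise ratio (dB) between two arrays of the same shape,
    using the reference's peak magnitude as the signal level."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    mse = np.mean((reference - candidate) ** 2)
    if mse == 0.0:
        return float("inf")
    peak = np.abs(reference).max()
    return 20.0 * np.log10(peak) - 10.0 * np.log10(mse)

# Example: outputs that differ only by small quantization noise score a high PSNR.
original = np.random.randn(1, 4, 64, 64)
quantized = original + 1e-2 * np.random.randn(*original.shape)
print(f"PSNR: {psnr(original, quantized):.2f} dB")
```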

This analysis takes up to a few hours on a single GPU (CUDA). The number of calibration samples used to quantize the model can be reduced to speed up the process.

The resulting JSON file looks like this:

```json
{
  "conv": {
    "conv_in": 30.74,
    "down_blocks.0.attentions.0.proj_in": 38.93,
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q": 48.15,
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k": 50.13,
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_v": 45.70,
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0": 39.56,
    ...
  },
  "einsum": {
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.einsum": 25.34,
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.einsum": 31.76,
    "down_blocks.0.attentions.1.transformer_blocks.0.attn1.einsum": 23.40,
    "down_blocks.0.attentions.1.transformer_blocks.0.attn2.einsum": 31.56,
    ...
  },
  "model_version": "stabilityai/stable-diffusion-2-1-base"
}
```

**Step 3:** Generate quantized model

Using the calibration data and the layer-wise sensitivity results, the quantized Core ML model can be generated as follows:

```bash
python -m python_coreml_stable_diffusion.activation_quantization --model-version <model-version> --quantize-pytorch --conv-psnr 38 --attn-psnr 26 -o <output-dir>
```

The PSNR thresholds determine which layers will be activation quantized. These thresholds can be tuned to trade off output quality against inference latency.
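
To illustrate how such thresholds partition the model, the sketch below reads the sensitivity JSON from Step 2 and lists the modules that would be activation quantized at given `--conv-psnr` / `--attn-psnr` values, assuming (as described above) that a higher PSNR means a layer is less sensitive and therefore safe to quantize. The helper is hypothetical and not part of the repository's CLI.

```python
import json

def select_w8a8_modules(sensitivity_json, conv_psnr=38.0, attn_psnr=26.0):
    """Return the module names whose per-layer PSNR meets the threshold and can
    therefore have activations quantized (W8A8); the rest stay W8A16."""
    with open(sensitivity_json) as f:
        sensitivity = json.load(f)
    return {
        "conv": [name for name, psnr in sensitivity["conv"].items() if psnr >= conv_psnr],
        "einsum": [name for name, psnr in sensitivity["einsum"].items() if psnr >= attn_psnr],
    }

# Hypothetical usage against the JSON produced in Step 2:
# selected = select_w8a8_modules("layerwise_sensitivity.json", conv_psnr=38, attn_psnr=26)
# print(f"{len(selected['conv'])} conv and {len(selected['einsum'])} einsum modules -> W8A8")
```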

</details>

## <a name="using-stable-diffusion-3"></a> Using Stable Diffusion 3
