Commit a56e102

atiorh and TBPer committed
Fixes #243
Co-authored-by: TBPer <[email protected]>
1 parent ebff5f7 commit a56e102

1 file changed: +16 -13 lines changed

README.md

Lines changed: 16 additions & 13 deletions
@@ -55,24 +55,22 @@ Target Device Hardware Generation:

 | Device | `--compute-unit`| `--attention-implementation` | End-to-End Latency (s) | Diffusion Speed (iter/s) |
 | --------------------- | --------------- | ---------------------------- | ---------------------- | ------------------------ |
-| iPhone 12 Mini | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 20 | 1.3 |
-| iPhone 12 Pro Max | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 17 | 1.4 |
-| iPhone 13 | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 15 | 1.7 |
-| iPhone 13 Pro Max | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 12 | 1.8 |
-| iPhone 14 | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 13 | 1.8 |
-| iPhone 14 Pro Max | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 9 | 2.3 |
-| iPad Pro (M1) | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 11 | 2.1 |
-| iPad Pro (M2) | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 8 | 2.9 |
-| Mac Studio (M1 Ultra) | `CPU_AND_GPU` | `ORIGINAL` | 4 | 6.3 |
-| Mac Studio (M2 Ultra) | `CPU_AND_GPU` | `ORIGINAL` | 3 | 7.6 |
+| iPhone 12 Mini | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 18.5 | 1.44 |
+| iPhone 12 Pro Max | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 15.4 | 1.45 |
+| iPhone 13 | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 10.8 | 2.53 |
+| iPhone 13 Pro Max | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 10.4 | 2.55 |
+| iPhone 14 | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 8.6 | 2.57 |
+| iPhone 14 Pro Max | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 7.9 | 2.69 |
+| iPad Pro (M1) | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 11.2 | 2.19 |
+| iPad Pro (M2) | `CPU_AND_NE` | `SPLIT_EINSUM_V2` | 7.0 | 3.07 |

 <details>
 <summary> Details (Click to expand) </summary>

-- This benchmark was conducted by Apple using public beta versions of iOS 17.0, iPadOS 17.0 and macOS 14.0 in June 2023.
+- This benchmark was conducted by Apple using public beta versions of iOS 17.0, iPadOS 17.0 and macOS 14.0 Seed 8 in August 2023.
 - The performance data was collected by running the `StableDiffusion` Swift pipeline.
 - Swift code is not fully optimized, introducing up to ~10% overhead unrelated to Core ML model execution.
-- The median latency value across 3 back-to-back end-to-end executions are reported
+- The median latency value across 5 back-to-back end-to-end executions are reported
 - The image generation procedure follows the standard configuration: 20 inference steps, 512x512 output image resolution, 77 text token sequence length, classifier-free guidance (batch size of 2 for unet).
 - The actual prompt length does not impact performance because the Core ML model is converted with a static shape that computes the forward pass for all of the 77 elements (`tokenizer.model_max_length`) in the text token sequence regardless of the actual length of the input text.
 - Weights are compressed to 6 bit precision. Please refer to [this section](#compression) for details.
@@ -81,7 +79,7 @@ Target Device Hardware Generation:
 - In the benchmark table, we report the best performing `--compute-unit` and `--attention-implementation` values per device. The former does not modify the Core ML model and can be applied during runtime. The latter modifies the Core ML model. Note that the best performing compute unit is model version and hardware-specific.
 - Note that the performance optimizations in this repository (e.g. `--attention-implementation`) are generally applicable to Transformers and not customized to Stable Diffusion. Better performance may be observed upon custom kernel tuning. Therefore, these numbers do not represent **peak** HW capability.
 - Performance may vary across different versions of Stable Diffusion due to architecture changes in the model itself. Each reported number is specific to the model version mentioned in that context.
-- Performance may vary due to factors like increased system load from other applications or suboptimal device thermal state. Given these factors, we do not report sub-second variance in latency.
+- Performance may vary due to factors like increased system load from other applications or suboptimal device thermal state.

 </details>
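
Editorial note on the hunks above: the benchmark details say the reported latency is the median across 5 back-to-back end-to-end executions. The following is a minimal Python sketch of that kind of measurement, not the code used for the benchmark (which ran the `StableDiffusion` Swift pipeline). The names `median_latency`, `run_once`, and `clock` are hypothetical and exist only for illustration.

```python
import statistics
import time
from typing import Callable, List


def median_latency(
    run_once: Callable[[], None],
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the median wall-clock latency in seconds over back-to-back runs."""
    latencies: List[float] = []
    for _ in range(runs):
        start = clock()
        run_once()  # hypothetical: one full end-to-end image generation
        latencies.append(clock() - start)
    # The median is less sensitive than the mean to a single slow run,
    # for example one affected by system load or thermal state.
    return statistics.median(latencies)
```

Passing the clock as a parameter is only a convenience so the function can be exercised with fixed timestamps; in normal use the default `time.perf_counter` applies.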
@@ -109,6 +107,9 @@ Target Device Hardware Generation:

 ## <a name="compression"></a> Weight Compression

+<details>
+<summary> Details (Click to expand) </summary>
+
 coremltools-7.0 supports advanced weight compression techniques for [pruning](https://coremltools.readme.io/v7.0/docs/pruning), [palettization](https://coremltools.readme.io/v7.0/docs/palettization-overview) and [linear 8-bit quantization](https://coremltools.readme.io/v7.0/docs/quantization-aware-training). For these techniques, `coremltools.optimize.torch.*` includes APIs that require fine-tuning to maintain accuracy at higher compression rates whereas `coremltools.optimize.coreml.*` includes APIs that are applied post-training and are data-free.

 We demonstrate how data-free [post-training palettization](https://coremltools.readme.io/v7.0/docs/post-training-palettization) implemented in `coremltools.optimize.coreml.palettize_weights` enables us to achieve greatly improved performance for Stable Diffusion on mobile devices. This API implements the [Fast Exact k-Means](https://arxiv.org/abs/1701.07204) algorithm for optimal weight clustering which yields more accurate palettes. Using `--quantize-nbits {2,4,6,8}` during [conversion](#converting-models-to-coreml) is going to apply this compression to the unet and text_encoder models.
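
Editorial note on the paragraph above: palettization replaces each weight with an index into a small lookup table (a palette) of at most `2**nbits` values. The sketch below illustrates the idea in plain Python using ordinary Lloyd-style 1-D k-means. It is an assumption-laden toy, not the Fast Exact k-Means algorithm and not the `coremltools` implementation; the function name `palettize` and its parameters are hypothetical.

```python
from typing import List, Tuple


def palettize(
    weights: List[float], nbits: int, iters: int = 20
) -> Tuple[List[float], List[int]]:
    """Cluster weights into at most 2**nbits palette entries (toy 1-D k-means)."""
    k = min(2 ** nbits, len(set(weights)))
    lo, hi = min(weights), max(weights)
    # Start with centroids spread evenly across the weight range.
    palette = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]

    def assign() -> List[int]:
        # Map every weight to the index of its nearest palette entry.
        return [min(range(k), key=lambda j: abs(w - palette[j])) for w in weights]

    for _ in range(iters):
        indices = assign()
        # Move each palette entry to the mean of the weights assigned to it.
        for j in range(k):
            members = [w for w, idx in zip(weights, indices) if idx == j]
            if members:
                palette[j] = sum(members) / len(members)
    return palette, assign()


palette, indices = palettize([0.0, 0.1, 0.9, 1.0], nbits=1)
# Reconstruct approximate weights by looking each index up in the palette.
dequantized = [palette[i] for i in indices]
```

With `nbits=6`, the setting used for the benchmark table, each weight would be stored as a 6-bit index into a palette of up to 64 values.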
@@ -129,6 +130,8 @@ Resources:
 - [Core ML Tools Docs: Optimizing Models](https://coremltools.readme.io/v7.0/docs/optimizing-models)
 - [WWDC23 Session Video: Use Core ML Tools for machine learning model compression](https://developer.apple.com/videos/play/wwdc2023/10047)

+</details>
+
 ## <a name="mbp"></a> MBP: Post-Training Mixed-Bit Palettization

 <details>
