Improve performance for quantized models on Power10 CPU #408
mgiessing started this conversation in Support for Targets (OS / EPs / Hardware)
Replies: 2 comments 2 replies
-
@mgiessing, int4 doesn't have a special kernel for Power10 CPU yet, and I have no plans to support it yet. Do you want to contribute?
-
@yufenglee, is that something that would need to be addressed within onnxruntime-genai, or via onnxruntime MLAS? I just synced with our optimization & kernel team, and they confirmed we have only implemented Power10 MMA (Matrix Math Accelerator) kernels for fp32 & int8 in onnxruntime/mlas. That leads to another question: are you going to support int8 quantization?
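For context, plain onnxruntime already ships dynamic int8 quantization tooling; a minimal sketch with placeholder file paths (whether the genai runtime would accept such a model is part of the question):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize MatMul/Gemm weights to int8, the precision for which Power10 MMA
# kernels already exist in onnxruntime/mlas. Paths are placeholders.
quantize_dynamic(
    model_input="phi2-fp32/model.onnx",
    model_output="phi2-int8/model.onnx",
    weight_type=QuantType.QInt8,
)
```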
-
The quantized version of the sample model (phi-2) seems to have poor performance compared to fp32.
How to reproduce?
Preparation of the models:
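A minimal sketch using the onnxruntime-genai model builder; the output directories are placeholders:

```bash
# Export phi-2 twice with the onnxruntime-genai model builder:
# an fp32 baseline and an int4-quantized variant.
python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi2-fp32 -p fp32 -e cpu
python3 -m onnxruntime_genai.models.builder -m microsoft/phi-2 -o ./phi2-int4 -p int4 -e cpu
```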
Script to run the inference:
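A minimal benchmark sketch following the phi-2 example shipped with onnxruntime-genai 0.2.0; the model path and prompt are placeholders, and API details may differ in other releases:

```python
import time

import onnxruntime_genai as og

# Placeholder path: swap in ./phi2-fp32 for the baseline run.
model = og.Model("./phi2-int4")
tokenizer = og.Tokenizer(model)

prompt = "Explain quantization of neural networks in one paragraph."
input_ids = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_ids

# Time end-to-end generation and report tokens per second.
start = time.perf_counter()
output_tokens = model.generate(params)[0]  # batch size 1
elapsed = time.perf_counter() - start

print(tokenizer.decode(output_tokens))
generated = len(output_tokens) - len(input_ids)
print(f"{generated / elapsed:.2f} tokens/s over {elapsed:.1f} s")
```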
System information
OS: AlmaLinux 9.3
IBM Power10 CPU (ppc64le)
```
$ pip3 list installed | grep onnxruntime
onnxruntime        1.17.3
onnxruntime-genai  0.2.0rc4
```
Does anyone have an idea why the quantized performance is so poor? I could imagine some data type conversion is adding overhead.
Thank you!