Performance difference for Quantized Matmul between v2.6.3 and v3.x #4663

@SriAlavandar

Description

I am benchmarking quantized matmul with benchdnn on oneDNN v2.6.3 and oneDNN v3.10.2. With a single thread (OMP_NUM_THREADS=1), v2.6.3 is up to 10-12% faster than v3.10.2.
Data type combinations tested: u8:s8:u8 and u8:s8:f32

Here are the commands I am using for this experiment:
v2.6.3:
numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --cfg=u8s8u8 --bia_dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-oscale=per_oc:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt
v3.10.2:
numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --dt=u8:s8:u8 --bia-dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-scales=src:common:1.5+wei:per_oc+dst:common:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt

Here are some sample problem sizes and the results for each version:

| M | K | N | Dtype | oneDNN v2.6.3 | oneDNN v3.10.2 | v2.6.3 / v3.10.2 |
|---|---|---|---|---|---|---|
| 200 | 13 | 512 | u8:s8:u8 | 0.011 | 0.012 | 0.92 |
| 200 | 512 | 256 | u8:s8:u8 | 0.054 | 0.056 | 0.96 |
| 200 | 256 | 128 | u8:s8:f32 | 0.014 | 0.015 | 0.93 |
| 300 | 13 | 512 | u8:s8:u8 | 0.016 | 0.018 | 0.89 |
| 300 | 512 | 256 | u8:s8:u8 | 0.081 | 0.083 | 0.98 |
| 300 | 256 | 128 | u8:s8:f32 | 0.021 | 0.022 | 0.95 |
| 400 | 13 | 512 | u8:s8:u8 | 0.021 | 0.024 | 0.88 |
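For anyone who wants to re-derive the last column, here is a minimal Python sketch that recomputes the v2.6.3 / v3.10.2 ratio and the corresponding slowdown from the timings in the table above. The helper names `speed_ratio` and `slowdown_pct` are made up for illustration and are not part of benchdnn.

```python
# Timings copied from the table above: (M, K, N, dtype, v2.6.3, v3.10.2).
ROWS = [
    (200, 13, 512, "u8:s8:u8", 0.011, 0.012),
    (200, 512, 256, "u8:s8:u8", 0.054, 0.056),
    (200, 256, 128, "u8:s8:f32", 0.014, 0.015),
    (300, 13, 512, "u8:s8:u8", 0.016, 0.018),
    (300, 512, 256, "u8:s8:u8", 0.081, 0.083),
    (300, 256, 128, "u8:s8:f32", 0.021, 0.022),
    (400, 13, 512, "u8:s8:u8", 0.021, 0.024),
]


def speed_ratio(t_old: float, t_new: float) -> float:
    """Old time divided by new time; below 1.0 means the new version is slower."""
    return t_old / t_new


def slowdown_pct(t_old: float, t_new: float) -> float:
    """Extra time taken by the new version, as a percentage of the old time."""
    return (t_new - t_old) / t_old * 100.0


if __name__ == "__main__":
    for m, k, n, dt, t_old, t_new in ROWS:
        print(
            f"{m}x{k}x{n} {dt}: ratio={speed_ratio(t_old, t_new):.2f}, "
            f"slowdown={slowdown_pct(t_old, t_new):.1f}%"
        )
```

Note that the ratio and the slowdown percentage are not interchangeable: a ratio of 0.89 corresponds to v3.10.2 taking about 12.5% more time than v2.6.3 for that shape.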

Here are sample verbose logs from the two runs for reference (300x13x512, u8:s8:u8):
v2.6.3:
onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 attr-post-ops:eltwise_relu ,,300x13:13x512:300x512,0.0158691

v3.10.2:
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_relu,,300x13:13x512,0.0180664
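To compare the two verbose lines directly, a small Python sketch can pull out the execution time. It assumes only what is visible in the two lines above, namely that the time is the last comma-separated field; the rest of the field layout differs between v2.6.3 and v3.10.2, so it is not parsed. The function name `exec_time` is hypothetical.

```python
# The two verbose lines, copied verbatim from the logs above.
V2_LINE = (
    "onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,"
    "src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 "
    "bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,"
    "attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 "
    "attr-post-ops:eltwise_relu ,,300x13:13x512:300x512,0.0158691"
)
V3_LINE = (
    "onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,"
    "src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 "
    "bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,"
    "attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32 "
    "attr-post-ops:eltwise_relu,,300x13:13x512,0.0180664"
)


def exec_time(verbose_line: str) -> float:
    """Return the last comma-separated field of a verbose line as a float.

    Assumption: the execution time is the final field, as in both lines above.
    """
    return float(verbose_line.strip().rsplit(",", 1)[1])


if __name__ == "__main__":
    t_v2 = exec_time(V2_LINE)
    t_v3 = exec_time(V3_LINE)
    print(f"v2.6.3: {t_v2}, v3.10.2: {t_v3}, ratio: {t_v2 / t_v3:.3f}")
```

The same function can be applied to every `exec` line of a full verbose log to compare timings shape by shape, provided the lines from each version are matched on the problem dimensions.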

Question:
Is this slowdown expected due to internal changes between v2.6 and v3.x?

Labels: platform:cpu-x64 (Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64), question