@chunghow-qti chunghow-qti commented Nov 21, 2025

Description

qnn::utils::TwoDimensionTranspose is a bottleneck during session creation because it runs a memcpy inside a double for loop. When the weight is large, this is very slow, and ReshapeGemmFusion calls it 3 times in total, through the following call paths:

QnnModel::ComposeGraph → ReshapeGemmFusion::AddToModelBuilder → CreateOrValidateOnQnn → qnn::utils::TwoDimensionTranspose
QNNExecutionProvider::GetCapability → QNNExecutionProvider::GetSupportedNodes → ReshapeGemmFusion::IsSupported → CreateOrValidateOnQnn → qnn::utils::TwoDimensionTranspose (do QNN OP validation)
QNNExecutionProvider::GetCapability → QNNExecutionProvider::GetSupportedNodes → onnxruntime::qnn::ReshapeGemmFusion::IsSupported → CreateOrValidateOnQnn → qnn::utils::TwoDimensionTranspose (do QNN OP validation)

This change avoids the heavy memcpy by using a dummy tensor when only shape validation is required.
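The idea can be sketched in isolation as follows. This is a minimal illustration, not the code in this PR: TransposeBytes and PrepareTransposedWeight are hypothetical names, the buffers are plain byte vectors rather than QNN tensors, and the validate_only flag is an assumed stand-in for "only shape validation is required".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an element-wise 2-D transpose: one memcpy per
// element inside a double for loop. Not the actual
// qnn::utils::TwoDimensionTranspose implementation.
std::vector<uint8_t> TransposeBytes(const std::vector<uint8_t>& src,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t elem_size) {
  assert(src.size() == rows * cols * elem_size);
  std::vector<uint8_t> dst(src.size());
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      // Element (r, c) of the source becomes element (c, r) of the result.
      std::memcpy(dst.data() + (c * rows + r) * elem_size,
                  src.data() + (r * cols + c) * elem_size, elem_size);
    }
  }
  return dst;
}

// Hypothetical helper: returns the buffer for the transposed weight.
// When validate_only is true (assumed flag), only the shape and size matter,
// so a zero-filled dummy buffer is returned and the per-element copy is skipped.
std::vector<uint8_t> PrepareTransposedWeight(const std::vector<uint8_t>& src,
                                             std::size_t rows, std::size_t cols,
                                             std::size_t elem_size,
                                             bool validate_only) {
  if (validate_only) {
    return std::vector<uint8_t>(rows * cols * elem_size);  // dummy data
  }
  return TransposeBytes(src, rows, cols, elem_size);
}
```

In this sketch the validation path still allocates a buffer of the right size but never touches the source weight, which is where the saving would come from if the real validation only inspects shape and data type.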

Motivation and Context

| Function | TwoDimensionTranspose_1 | TwoDimensionTranspose_2 | TwoDimensionTranspose_3 | SessionCreationTime |
| --- | --- | --- | --- | --- |
| original | 88.39 ms | 57.80 ms | 53.09 ms | 9.41871 s |
| avoid 2 memcpy | 51.52 ms | 12.00 ms | 8.05 ms | 9.05975 s |
