
Conversation

@GregoryComer

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #11732 by @mcr229
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/mcr229/33/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/mcr229/33/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/mcr229/32/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/mcr229/33/orig
@diff-train-skip-merge

leafs1 and others added 30 commits June 16, 2025 14:13
…uple outputs (#11647)

### Summary
This PR fixes the `channels_last_tagged_reshape_pass.py` to properly
handle tuple outputs with mixed memory formats. Previously, the pass
only checked and converted the first element of tuple outputs, which
could lead to incorrect memory formats for other elements in the tuple.
This fix is important for models that return multiple outputs with
different memory format requirements, such as a mix of convolution
outputs (which should be in NHWC format) and linear outputs (which
should be in standard format).

### Test plan
I added a new test class, `ThreeOutputsModel`, with three outputs that have different memory format requirements, and verified that it evaluates correctly given both NCHW and NHWC inputs. I also created a simpler two-input model, `ConvAddConvOutput`, which operates on different inputs and returns two outputs with different dim orders.
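For reference, a minimal sketch of what the simpler model looks like (an illustrative reconstruction, not the literal test code):

```python
import torch

class ConvAddConvOutput(torch.nn.Module):
    """Illustrative reconstruction: two inputs, two outputs whose dim
    orders differ. The conv output gets tagged NHWC by the pass, while
    the add output stays in the default contiguous format."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x, y):
        return self.conv(x), y + 1
```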
Differential Revision: D76737404

Pull Request resolved: #11727
Differential Revision: D76469624

Pull Request resolved: #11577
### Summary
Fixed a linter error.

### Test plan
CI

Co-authored-by: Guang Yang <[email protected]>
#11745)

### Summary
Running `install_dev.py` for `optimum-executorch` force-overrides the installed `executorch` and torch dependencies with the nightlies pinned in `optimum-executorch`. In ExecuTorch CI, including the benchmark jobs, we want to always run the optimum-executorch models against ExecuTorch built from source to catch issues/regressions.

### Test plan
Verified the installed deps in the CI and benchmark jobs

Co-authored-by: Guang Yang <[email protected]>
### Summary
1. Update the MediaTek backend documentation for the decoupled buffer allocator.
2. Follow the backend template.
3. Remove unnecessary instructions.

Fixes #8532 

@pytorchbot label "partner: mediatek"
Differential Revision: D76745314

Pull Request resolved: #11739
As titled, this API allows us to support multi-turn conversations by passing a `start_pos` argument to `generate_from_pos`.

This pull request introduces a new feature to support text generation
from a specific starting position (`generate_from_pos`) and includes
updates to ensure proper error handling and functionality when
`max_new_tokens` is negative. The changes primarily focus on extending
the `TextLLMRunner` class and its associated methods to accommodate this
new feature while maintaining backward compatibility.

### New Feature: Text Generation from a Specific Starting Position

* **Added `generate_from_pos` Method**: Introduced a new method `generate_from_pos` in `TextLLMRunner` to allow text generation starting from a specified position in the KV cache, including updates to the method signature, logic, and error handling (`extension/llm/runner/text_llm_runner.cpp`, `extension/llm/runner/text_llm_runner.h`).

* **Updated Documentation**: Enhanced method documentation in `TextLLMRunner` to describe the new functionality, including parameters like `start_pos` and the expected behavior (`extension/llm/runner/text_llm_runner.h`).

### Error Handling Improvements

* **Validation for `max_new_tokens`**: Added checks to ensure `max_new_tokens` is positive; if it is not, an `InvalidArgument` error is returned. This prevents invalid configurations during text generation (`extension/llm/runner/text_llm_runner.cpp`).

* **Unit Test for Negative `max_new_tokens`**: Created a new test case (`GenerateFromPosErrorsWithNegativeMaxNewTokens`) to verify that `generate_from_pos` correctly handles scenarios where `max_new_tokens` is negative (`extension/llm/runner/test/test_text_llm_runner.cpp`).
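Taken together, this enables the following multi-turn calling pattern, sketched here in Python for intuition. The real API is C++ in `extension/llm/runner/text_llm_runner.cpp`; the `runner` object and its return values below are hypothetical stand-ins, not an actual binding:

```python
start_pos = 0  # running position in the KV cache, shared across turns

def chat_turn(runner, prompt: str, max_new_tokens: int) -> str:
    """One conversation turn; `runner` is a hypothetical stand-in object."""
    global start_pos
    # Mirrors the new validation: the C++ implementation returns
    # InvalidArgument when max_new_tokens is not positive.
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    # Generation resumes from the current end of the KV cache rather
    # than position 0, so earlier turns remain in context.
    reply, tokens_consumed = runner.generate_from_pos(
        prompt, start_pos, max_new_tokens
    )
    start_pos += tokens_consumed  # prompt tokens + generated tokens
    return reply
```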
…1724)

Arm backend: Added decomposition for the MaxPool2D operator with dilation > 1.

Signed-off-by: Elena Zhelezina <[email protected]>
- Adds support for per-channel quantization in TosaQuantizer and TosaBackend
- Enables per-channel quantization for MobileNetV2 test cases


cc @digantdesai @freddan80 @per @zingo

---------

Signed-off-by: Oscar Andersson <[email protected]>
The introduction of the decomposition for linalg vector norm revealed a bug: when `dim` is None, all dimensions should be reduced.
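For context, the expected semantics (standard `torch.linalg` behavior):

```python
import torch

x = torch.randn(2, 3, 4)

# With dim=None (the default), vector_norm reduces over all dimensions,
# returning a scalar -- equivalent to the norm of the flattened tensor.
torch.testing.assert_close(
    torch.linalg.vector_norm(x),
    torch.linalg.vector_norm(x.flatten()),
)
```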

Signed-off-by: Elena Zhelezina <[email protected]>
Differential Revision: D76746854

Pull Request resolved: #11751
Differential Revision: D76791781

Pull Request resolved: #11750
### Summary
This PR uses `xnn_define_binary` and `xnn_define_unary` to define XNNPACK ops, instead of calling the individual per-op definitions.

Further changes:
1. Removes individual node definitions for unary and binary ops
2. Creates a wrapper macro to generate function defs for individual ops
using `xnn_define_binary` and `xnn_define_unary` inside.

Fixes #11584
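The change itself lives in the C++ runtime, but the shape of the refactor can be sketched in Python: one shared define helper plus generated thin per-op wrappers (all names below are illustrative stand-ins, not the real XNNPACK symbols):

```python
class Subgraph:
    """Toy stand-in for an XNNPACK subgraph under construction."""
    def __init__(self):
        self.nodes = []

    def add_node(self, op_type, inputs, outputs, flags):
        self.nodes.append((op_type, inputs, outputs, flags))

def define_unary(subgraph, op_type, input_id, output_id, flags=0):
    # Single shared implementation replacing per-op definitions.
    subgraph.add_node(op_type, [input_id], [output_id], flags)

def make_unary_definer(op_type):
    # Plays the role of the C++ wrapper macro: stamps out a per-op
    # function that delegates to the shared implementation.
    def definer(subgraph, input_id, output_id, flags=0):
        return define_unary(subgraph, op_type, input_id, output_id, flags)
    return definer

define_tanh = make_unary_definer("tanh")
define_sigmoid = make_unary_definer("sigmoid")
```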

### Test plan
```
## Build steps
cmake -DEXECUTORCH_BUILD_XNNPACK=ON ..
cmake --build cmake-out -j9

## Tests ran
./test/run_oss_cpp_tests.sh
...
100% tests passed, 0 tests failed out of 86
```
…1546)

### Summary
This PR adds 4 encoder-only models. The following stats are based on SM8750.
1. Albert (16a16w)
- Accuracy: ~22% (NOTE: nn.Module accuracy is around 24%, so the similarity between QNN and nn.Module is around 92%)
- Speed: 11ms/inf
- Script: `python examples/qualcomm/oss_scripts/albert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`
2. Bert (16a8w)
- Accuracy: ~60%
- Speed: 9ms/inf
- Script: `python examples/qualcomm/oss_scripts/bert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`
3. Distilbert (16a8w)
- Accuracy: ~59%
- Speed: 8ms/inf
- Script: `python examples/qualcomm/oss_scripts/distilbert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`
4. Eurobert (16a16w)
- Accuracy: ~54%
- Speed: 40ms/inf
- Script: `python examples/qualcomm/oss_scripts/eurobert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`



### Test plan

- E2E scripts under `test_qnn_delegate.py`
- Example script: `python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleOssScript.test_{BERT_MODEL} --model SM8750 -s $DEVICE --build_folder build-android/ -r ./ -a ./test --sentence_dataset ../wikipedia-sentences/wikisent2.txt`
- Mainline CI
- Mainline CI

Author: @haowhsu-quic, @chunit-quic, @winskuo-quic
)

### Summary
- delete the convert_bmm_to_matmul pass
- add torch.ops.aten.matmul.default to the skip_decomp_table

### Test plan
General CI
Differential Revision: D76781331

Pull Request resolved: #11759
#11596)

### Summary
Refactor the XNNPACK tester to split out reusable base components from
XNNPACK-specific parts. I've relocated the base classes to
backends/test/harness.

I've kept the tester structure pretty much unchanged, except for
replacing stage names with an enum.

It looks like Arm tests are currently importing XNNPACK's tester directly. Ideally, we'll refactor them to have their own stage implementations, but I've left that as a follow-up to minimize changes in the initial refactor.
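As a rough sketch of the resulting shape (the enum and class names here are illustrative, not the exact ones in `backends/test/harness`):

```python
from enum import Enum, auto

class StageType(Enum):
    # Illustrative stage names replacing the old string identifiers.
    EXPORT = auto()
    QUANTIZE = auto()
    TO_EDGE = auto()
    PARTITION = auto()
    TO_EXECUTORCH = auto()
    RUN = auto()

class TesterBase:
    """Backend-agnostic pipeline keyed by StageType; a backend-specific
    tester overrides individual stages instead of the whole pipeline."""

    def __init__(self):
        self.stages = {}

    def register_stage(self, stage_type: StageType, stage) -> None:
        self.stages[stage_type] = stage
```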

### Test plan
CI
… fbsource sleef (#11261)" (#11765)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11657 by
@swolchok
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/swolchok/458/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/swolchok/458/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/swolchok/458/orig
@diff-train-skip-merge

Co-authored-by: Scott Wolchok <[email protected]>
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11369 by
@ahmtox
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/11/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/11/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/11/orig
@diff-train-skip-merge

Co-authored-by: morelos <[email protected]>
Creates the dequantize_per_tensor and dequantize_per_token logic shaders and implementations, which are linked with the testing framework.

Differential Revision: [D76267107](https://our.internmc.facebook.com/intern/diff/D76267107/)

[ghstack-poisoned]
Creates the choose_qparams per_tensor and per_token logic shaders and implementations, which are linked with the testing framework.

Differential Revision: [D76436933](https://our.internmc.facebook.com/intern/diff/D76436933/)

[ghstack-poisoned]
Differential Revision: D76842266

Pull Request resolved: #11764
Differential Revision: D76483572

Pull Request resolved: #11592
…hapes

Differential Revision: D76530379

Pull Request resolved: #11611
…11778)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11757 by
@cccclai
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/cccclai/28/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/cccclai/28/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/cccclai/28/orig
@diff-train-skip-merge

Co-authored-by: Chen Lai <[email protected]>
Differential Revision: D76781745

Pull Request resolved: #11746
)

- Constant placeholders with the same values but different data types, such as int32 and fp32, shouldn't be fused into a single placeholder. Otherwise, some operators will have operands with mismatched dtypes.
- Fix the bug by adding a dtype check so that only constants with matching dtypes and the same values are fused.
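A minimal sketch of the intended check (a hypothetical helper, not the literal pass code):

```python
import torch

def can_fuse(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Fuse two constant placeholders only when dtype, shape, and values
    # all match; equal values with different dtypes (e.g. int32 vs fp32)
    # must stay separate so operand dtypes remain consistent.
    return (
        a.dtype == b.dtype
        and a.shape == b.shape
        and torch.equal(a, b)
    )
```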

Signed-off-by: Yufeng Shi <[email protected]>
hsharma35 and others added 7 commits June 23, 2025 14:09
Differential Revision: D76954785

Pull Request resolved: #11824
# Summary
Provide methods and a script to fetch all ExecuTorch benchmark data from the HUD API into two datasets, private and public. The script will:
- fetch all data from the HUD API for the input time range in UTC
- clean out records and tables containing only FAILURE_REPORT due to job-level failures
- get all private table metrics, generate `table_name`, and find the intersecting public table metrics
- generate private and public table groups
- output the data

Output types:
- run with excel-sheet export
- run with csv export
- run with dataframe format print
- run with json format print

See more guidance in README.md

The data is similar to the Excel sheet generated manually in #10982. The result should be the same as the HUD per-model data table:
![HUD per-model data table](https://github.com/user-attachments/assets/7c6cc12e-50c5-4ce2-ac87-5cac650486e3)

## Helper methods: common.py
common.py provides helper methods to convert the exported CSV and Excel sheets back to the `{"groupInfo": {}, "df": pd.DataFrame}` format.
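With plain pandas, the round-trip looks roughly like this (common.py wraps this and rebuilds the grouped records; the file name matches the workbook generated below):

```python
import pandas as pd

# Read every sheet of an exported workbook back into DataFrames
# (requires the openpyxl engine for .xlsx files).
sheets = pd.read_excel("private.xlsx", sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)
```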

# run with
``` bash
python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
--startTime "2025-04-29T09:48:57" \
--endTime "2025-05-13T22:00:00" \
--outputType "excel" \
--models "mv3"

python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
--primary-file private.xlsx \
--reference-file public.xlsx
```
Generate excel files:

[private.xlsx](https://github.com/user-attachments/files/20844977/private.xlsx)

[public.xlsx](https://github.com/user-attachments/files/20844978/public.xlsx)


For instance, you can find the result for mv3 xnnpack_q8 on the S22 Ultra with Android 14:
```

Latency Stability Analysis: table10 (Primary)
================================================================================
Model: mv3(xnnpack_q8)
Device: Samsung Galaxy S22 Ultra 5G (private)(Android 14)

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 39.08%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.

  The max/min ratio of 5.61 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.33 suggests
  occasional latency spikes that could affect tail latency sensitive applications.
```

---------

Signed-off-by: Yang Wang <[email protected]>
…diate outputs

Differential Revision: D76831086

Pull Request resolved: #11855
…ups==1 (#11774)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11730 by
@mcr229
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/mcr229/31/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/mcr229/31/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/mcr229/31/orig
@diff-train-skip-merge

---------

Co-authored-by: Max Ren <[email protected]>
Co-authored-by: Gregory Comer <[email protected]>
Fixes some bugs with how enum fields are used.
Update documentation to use the new `export_llm` instead of the old
`export_llama`.
@pytorch-bot

pytorch-bot bot commented Jun 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11863

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b7572d0 with merge base 0c12dcd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the `CLA Signed` label Jun 23, 2025 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
leafs1 and others added 4 commits June 23, 2025 16:36
### Summary
This PR adds support for the tanh operator in ExecuTorch via XNNPACK, enabling optimized execution of `torch.tanh` on the XNNPACK backend. The implementation includes updates to operator configuration, serialization, and runtime handling. The tanh operator is now properly registered in the XNNPACK partition config and mapped to XNNPACK's `xnn_create_tanh_operator` API in the compiler.

### Test plan
I added a new test class, `TestTanh`, with a simple torch model containing a tanh op. It asserts that the XNNPACK delegate is invoked for the tanh op instead of the default torch implementation.
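A minimal sketch of the kind of module under test (illustrative; the real test goes through the XNNPACK tester harness):

```python
import torch

class TanhModule(torch.nn.Module):
    def forward(self, x):
        return torch.tanh(x)

# After partitioning and lowering for XNNPACK, the test checks that the
# graph contains a delegate call in place of the aten tanh op.
```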
…ups==1

Pull Request resolved: #11730

Supporting Quantized Transposed Convs with Groups being 1.

Previously, there was some support for quantized transposed convolutions, but only when the channel axis is 1 and groups is 1. The current quantizer didn't support this because it only allows quantizing along the zero dim, which is generally the output channels. However, for transposed convs, the dimensions of the weights are:
```
[in_channels, out_channels/groups, h, w]
```

Since we want to keep quantization along the output channels, we now need to quantize along axis = 1.

The reason we require groups to be one is that XNNPACK takes in filters with the dimensions:
```
[out_channels, H, W, in_channels/groups]
```

Since we are quantizing along the output channels, in PyTorch we expect to have out_channels/groups scales, but in XNNPACK we have out_channels scales! Realistically, we would need to support this with some affine quantization, where we provide a scale for every group and every out_channel. However, for now, we just enforce the constraint that groups == 1.
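A small sketch of why the quantization axis flips (standard PyTorch weight shapes; the quantize call illustrates the per-channel scheme, not the quantizer code itself):

```python
import torch

conv = torch.nn.Conv2d(4, 8, 3)             # weight: [out, in, h, w] -> [8, 4, 3, 3]
deconv = torch.nn.ConvTranspose2d(4, 8, 3)  # weight: [in, out/groups, h, w] -> [4, 8, 3, 3]

# Per-output-channel quantization therefore uses axis=0 for Conv2d but
# axis=1 for ConvTranspose2d. With groups == 1, out_channels/groups ==
# out_channels, so there is exactly one scale per output channel.
w = deconv.weight.detach()
scales = w.abs().amax(dim=(0, 2, 3)) / 127.0  # one scale per output channel
zero_points = torch.zeros(w.size(1), dtype=torch.int64)
qw = torch.quantize_per_channel(w, scales, zero_points, axis=1, dtype=torch.qint8)
print(qw.q_per_channel_scales().shape)  # torch.Size([8])
```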
ghstack-source-id: 291033630
@exported-using-ghexport

Differential Revision: [D76631781](https://our.internmc.facebook.com/intern/diff/D76631781/)
…groups ==1

Pull Request resolved: #11731

Here we support dynamically quantized Deconvolutions.

There is some refactoring of the previous diff, but in general, we just remove the constraint in the dynamism check that the convolution isn't transposed. For the same reasons as before, this only supports channel_axis = 1 and groups = 1.
ghstack-source-id: 291033632
@exported-using-ghexport

Differential Revision: [D76638904](https://our.internmc.facebook.com/intern/diff/D76638904/)
Pull Request resolved: #11732

Allow distinguishing between transposed convs and regular convs during operator selection. Previously, we grouped all conv targets together (transposed and regular), but now we enable better per-operator selection.
ghstack-source-id: 291033631

Differential Revision: [D76641838](https://our.internmc.facebook.com/intern/diff/D76641838/)
@github-actions

github-actions bot commented Sep 2, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the `stale` label (PRs inactive for over 60 days) Sep 2, 2025.