
Conversation

@GregoryComer

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #11732 by @mcr229
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/mcr229/33/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/mcr229/33/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/mcr229/32/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/mcr229/33/orig
@diff-train-skip-merge

leafs1 and others added 30 commits June 16, 2025 14:13
…uple outputs (#11647)

### Summary
This PR fixes the `channels_last_tagged_reshape_pass.py` to properly
handle tuple outputs with mixed memory formats. Previously, the pass
only checked and converted the first element of tuple outputs, which
could lead to incorrect memory formats for other elements in the tuple.
This fix is important for models that return multiple outputs with
different memory format requirements, such as a mix of convolution
outputs (which should be in NHWC format) and linear outputs (which
should be in standard format).

### Test plan
I added a new test class, `ThreeOutputsModel`, with three outputs that have different memory format requirements, and verified that it evaluates correctly given both NCHW and NHWC inputs. I also created a simpler two-input model, `ConvAddConvOutput`, which operates on different inputs and returns two outputs with different dim orders.
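For reference, a minimal sketch of what the simpler model looks like (an illustrative reconstruction, not the literal test code):

```python
import torch

class ConvAddConvOutput(torch.nn.Module):
    """Illustrative reconstruction: two inputs, two outputs whose dim
    orders differ. The conv output gets tagged NHWC by the pass, while
    the add output stays in the default contiguous format."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x, y):
        return self.conv(x), y + 1
```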
Differential Revision: D76737404

Pull Request resolved: #11727
Differential Revision: D76469624

Pull Request resolved: #11577
### Summary
Fixed a linter error.

### Test plan
CI

Co-authored-by: Guang Yang <[email protected]>
#11745)

### Summary
Running `install_dev.py` for `optimum-executorch` force-overrides the installed `executorch` and torch dependencies with the nightlies pinned in `optimum-executorch`. In ExecuTorch CI, including the benchmark jobs, we want to always run the optimum-executorch models against ExecuTorch built from source to catch issues/regressions.

### Test plan
Verified the installed deps in the CI and benchmark jobs

Co-authored-by: Guang Yang <[email protected]>
### Summary
1. Update the MediaTek backend documentation for the decoupled buffer allocator.
2. Follow the backend template.
3. Remove unnecessary instructions.

Fixes #8532 

@pytorchbot label "partner: mediatek"
Differential Revision: D76745314

Pull Request resolved: #11739
As titled, this API allows us to support multi-turn conversations by passing a `start_pos` argument to `generate_from_pos`.

This pull request introduces a new feature to support text generation
from a specific starting position (`generate_from_pos`) and includes
updates to ensure proper error handling and functionality when
`max_new_tokens` is negative. The changes primarily focus on extending
the `TextLLMRunner` class and its associated methods to accommodate this
new feature while maintaining backward compatibility.

### New Feature: Text Generation from a Specific Starting Position

* **Added `generate_from_pos` Method**: Introduced a new method `generate_from_pos` in `TextLLMRunner` to allow text generation starting from a specified position in the KV cache, including updates to the method signature, logic, and error handling (`extension/llm/runner/text_llm_runner.cpp`, `extension/llm/runner/text_llm_runner.h`).

* **Updated Documentation**: Enhanced method documentation in `TextLLMRunner` to describe the new functionality, including parameters like `start_pos` and the expected behavior (`extension/llm/runner/text_llm_runner.h`).

### Error Handling Improvements

* **Validation for `max_new_tokens`**: Added checks to ensure `max_new_tokens` is positive; if it is not, an `InvalidArgument` error is returned. This prevents invalid configurations during text generation (`extension/llm/runner/text_llm_runner.cpp`).

* **Unit Test for Negative `max_new_tokens`**: Created a new test case (`GenerateFromPosErrorsWithNegativeMaxNewTokens`) to verify that `generate_from_pos` correctly handles scenarios where `max_new_tokens` is negative (`extension/llm/runner/test/test_text_llm_runner.cpp`).
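Taken together, this enables the following multi-turn calling pattern, sketched here in Python for intuition. The real API is C++ in `extension/llm/runner/text_llm_runner.cpp`; the `runner` object and its return values below are hypothetical stand-ins, not an actual binding:

```python
start_pos = 0  # running position in the KV cache, shared across turns

def chat_turn(runner, prompt: str, max_new_tokens: int) -> str:
    """One conversation turn; `runner` is a hypothetical stand-in object."""
    global start_pos
    # Mirrors the new validation: the C++ implementation returns
    # InvalidArgument when max_new_tokens is not positive.
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    # Generation resumes from the current end of the KV cache rather
    # than position 0, so earlier turns remain in context.
    reply, tokens_consumed = runner.generate_from_pos(
        prompt, start_pos, max_new_tokens
    )
    start_pos += tokens_consumed  # prompt tokens + generated tokens
    return reply
```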
…1724)

Arm backend: Added decomposition for the MaxPool2D operator with dilation > 1.

Signed-off-by: Elena Zhelezina <[email protected]>
- Adds support for per-channel quantization in TosaQuantizer and TosaBackend
- Enables per-channel quantization for MobileNetV2 test cases


cc @digantdesai @freddan80 @per @zingo

---------

Signed-off-by: Oscar Andersson <[email protected]>
The introduction of the decomposition for linalg vector norm revealed a bug: when `dim` is None, all dimensions should be reduced.
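For context, the expected semantics (standard `torch.linalg` behavior):

```python
import torch

x = torch.randn(2, 3, 4)

# With dim=None (the default), vector_norm reduces over all dimensions,
# returning a scalar -- equivalent to the norm of the flattened tensor.
torch.testing.assert_close(
    torch.linalg.vector_norm(x),
    torch.linalg.vector_norm(x.flatten()),
)
```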

Signed-off-by: Elena Zhelezina <[email protected]>
Differential Revision: D76746854

Pull Request resolved: #11751
Differential Revision: D76791781

Pull Request resolved: #11750
### Summary
This PR uses `xnn_define_binary` and `xnn_define_unary` to define XNNPACK ops, instead of calling the individual per-op definitions.

Further changes:
1. Removes individual node definitions for unary and binary ops
2. Creates a wrapper macro to generate function defs for individual ops
using `xnn_define_binary` and `xnn_define_unary` inside.

Fixes #11584
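The change itself lives in the C++ runtime, but the shape of the refactor can be sketched in Python: one shared define helper plus generated thin per-op wrappers (all names below are illustrative stand-ins, not the real XNNPACK symbols):

```python
class Subgraph:
    """Toy stand-in for an XNNPACK subgraph under construction."""
    def __init__(self):
        self.nodes = []

    def add_node(self, op_type, inputs, outputs, flags):
        self.nodes.append((op_type, inputs, outputs, flags))

def define_unary(subgraph, op_type, input_id, output_id, flags=0):
    # Single shared implementation replacing per-op definitions.
    subgraph.add_node(op_type, [input_id], [output_id], flags)

def make_unary_definer(op_type):
    # Plays the role of the C++ wrapper macro: stamps out a per-op
    # function that delegates to the shared implementation.
    def definer(subgraph, input_id, output_id, flags=0):
        return define_unary(subgraph, op_type, input_id, output_id, flags)
    return definer

define_tanh = make_unary_definer("tanh")
define_sigmoid = make_unary_definer("sigmoid")
```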

### Test plan
```
## Build steps
cmake -DEXECUTORCH_BUILD_XNNPACK=ON ..
cmake --build cmake-out -j9

## Tests ran
./test/run_oss_cpp_tests.sh
...
100% tests passed, 0 tests failed out of 86
```
…1546)

### Summary
This PR adds 4 encoder-only models. The following stats are based on SM8750.
1. Albert (16a16w)
- Accuracy: ~22% (NOTE: nn.Module accuracy is around 24%, so the similarity between QNN and nn.Module is around 92%)
- Speed: 11ms/inf
- Script: `python examples/qualcomm/oss_scripts/albert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`
2. Bert (16a8w)
- Accuracy: ~60%
- Speed: 9ms/inf
- Script: `python examples/qualcomm/oss_scripts/bert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`
3. Distilbert (16a8w)
- Accuracy: ~59%
- Speed: 8ms/inf
- Script: `python examples/qualcomm/oss_scripts/distilbert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`
4. Eurobert (16a16w)
- Accuracy: ~54%
- Speed: 40ms/inf
- Script: `python examples/qualcomm/oss_scripts/eurobert.py -b build-android -s $DEVICE -m SM8750 --dataset ../wikipedia-sentences/wikisent2.txt`



### Test plan

- E2E scripts under `test_qnn_delegate.py`
- Example script: `python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleOssScript.test_{BERT_MODEL} --model SM8750 -s $DEVICE --build_folder build-android/ -r ./ -a ./test --sentence_dataset ../wikipedia-sentences/wikisent2.txt`
- Mainline CI
- Mainline CI

Author: @haowhsu-quic, @chunit-quic, @winskuo-quic
)

### Summary
- delete the convert_bmm_to_matmul pass
- add torch.ops.aten.matmul.default to the skip_decomp_table

### Test plan
General CI
Differential Revision: D76781331

Pull Request resolved: #11759
#11596)

### Summary
Refactor the XNNPACK tester to split out reusable base components from
XNNPACK-specific parts. I've relocated the base classes to
backends/test/harness.

I've kept the tester structure pretty much unchanged, except for
replacing stage names with an enum.

It looks like Arm tests are currently importing XNNPACK's tester directly. Ideally, we'll refactor them to have their own stage implementations, but I've left that as a follow-up to minimize changes in the initial refactor.
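As a rough sketch of the resulting shape (the enum and class names here are illustrative, not the exact ones in `backends/test/harness`):

```python
from enum import Enum, auto

class StageType(Enum):
    # Illustrative stage names replacing the old string identifiers.
    EXPORT = auto()
    QUANTIZE = auto()
    TO_EDGE = auto()
    PARTITION = auto()
    TO_EXECUTORCH = auto()
    RUN = auto()

class TesterBase:
    """Backend-agnostic pipeline keyed by StageType; a backend-specific
    tester overrides individual stages instead of the whole pipeline."""

    def __init__(self):
        self.stages = {}

    def register_stage(self, stage_type: StageType, stage) -> None:
        self.stages[stage_type] = stage
```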

### Test plan
CI
… fbsource sleef (#11261)" (#11765)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11657 by
@swolchok
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/swolchok/458/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/swolchok/458/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/swolchok/458/orig
@diff-train-skip-merge

Co-authored-by: Scott Wolchok <[email protected]>
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11369 by
@ahmtox
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/ahmtox/11/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/11/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/ahmtox/11/orig
@diff-train-skip-merge

Co-authored-by: morelos <[email protected]>
Creates the dequantize_per_tensor and dequantize_per_token logic shaders and implementations, which are linked with the testing framework.

Differential Revision: [D76267107](https://our.internmc.facebook.com/intern/diff/D76267107/)

[ghstack-poisoned]
Creates the choose_qparams per_tensor and per_token logic shaders and implementations, which are linked with the testing framework.

Differential Revision: [D76436933](https://our.internmc.facebook.com/intern/diff/D76436933/)

[ghstack-poisoned]
Differential Revision: D76842266

Pull Request resolved: #11764
Differential Revision: D76483572

Pull Request resolved: #11592
…hapes

Differential Revision: D76530379

Pull Request resolved: #11611
…11778)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11757 by
@cccclai
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/cccclai/28/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/cccclai/28/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/cccclai/28/orig
@diff-train-skip-merge

Co-authored-by: Chen Lai <[email protected]>
Differential Revision: D76781745

Pull Request resolved: #11746
)

- Constant placeholders with the same values but different data types, such as int32 and fp32, shouldn't be fused into a single placeholder. Otherwise, some operators will have operands with mismatched dtypes.
- Fix the bug by adding a dtype check so that only constants with matching dtypes and the same values are fused.
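A minimal sketch of the intended check (a hypothetical helper, not the literal pass code):

```python
import torch

def can_fuse(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Fuse two constant placeholders only when dtype, shape, and values
    # all match; equal values with different dtypes (e.g. int32 vs fp32)
    # must stay separate so operand dtypes remain consistent.
    return (
        a.dtype == b.dtype
        and a.shape == b.shape
        and torch.equal(a, b)
    )
```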

Signed-off-by: Yufeng Shi <[email protected]>
hsharma35 and others added 7 commits June 23, 2025 14:09
Differential Revision: D76954785

Pull Request resolved: #11824
# Summary
Provide methods and a script to fetch all ExecuTorch benchmark data from the HUD API into two datasets, private and public. The script will:
- fetch all data from the HUD API for the input time range in UTC
- clean out records and tables containing only FAILURE_REPORT due to job-level failures
- get all private table metrics, generate `table_name`, and find the intersecting public table metrics
- generate private and public table groups
- output the data

Output types:
- run with excel-sheet export
- run with csv export
- run with dataframe format print
- run with json format print

See more guidance in README.md

The data is similar to the Excel sheet generated manually in #10982. The result should be the same as the HUD per-model data table:
![HUD per-model data table](https://github.com/user-attachments/assets/7c6cc12e-50c5-4ce2-ac87-5cac650486e3)

## Helper methods: common.py
common.py provides helper methods to convert the exported CSV and Excel sheets back to the `{"groupInfo": {}, "df": pd.DataFrame}` format.
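With plain pandas, the round-trip looks roughly like this (common.py wraps this and rebuilds the grouped records; the file name matches the workbook generated below):

```python
import pandas as pd

# Read every sheet of an exported workbook back into DataFrames
# (requires the openpyxl engine for .xlsx files).
sheets = pd.read_excel("private.xlsx", sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)
```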

# run with
``` bash
python3 .ci/scripts/benchmark_tooling/get_benchmark_analysis_data.py \
--startTime "2025-04-29T09:48:57" \
--endTime "2025-05-13T22:00:00" \
--outputType "excel" \
--models "mv3"

python3 .ci/scripts/benchmark_tooling/analyze_benchmark_stability.py \
--primary-file private.xlsx \
--reference-file public.xlsx
```
Generate excel files:

[private.xlsx](https://github.com/user-attachments/files/20844977/private.xlsx)

[public.xlsx](https://github.com/user-attachments/files/20844978/public.xlsx)


For instance, you can find the result for mv3 xnnpack_q8 on the S22 Ultra with Android 14:
```

Latency Stability Analysis: table10 (Primary)
================================================================================
Model: mv3(xnnpack_q8)
Device: Samsung Galaxy S22 Ultra 5G (private)(Android 14)

Dataset Overview:
  - Number of samples: 88
  - Date range: 2025-04-29 09:48:57+00:00 to 2025-05-13 21:08:36+00:00

Central Tendency Metrics:
  - Mean latency: 2.91 ms
  - Median latency (P50): 2.54 ms
  - Mean trimmed latency: 2.41 ms
  - Median trimmed latency: 2.15 ms

Dispersion Metrics:
  - Standard deviation: 1.14 ms
  - Coefficient of variation (CV): 39.08%
  - Interquartile range (IQR): 0.82 ms
  - Trimmed standard deviation: 0.76 ms
  - Trimmed coefficient of variation: 31.60%

Percentile Metrics:
  - P50 (median): 2.54 ms
  - P90: 3.88 ms
  - P95: 4.60 ms
  - P99: 5.91 ms

Inter-Jitter Metrics (variability between runs):
  - Max/Min ratio: 5.6103
  - P99/P50 ratio: 2.3319
  - Mean rolling std (window=5): 0.79 ms

Intra-Jitter Metrics (variability within runs):
  - Mean trimming effect ratio: 15.37%
  - Max trimming effect ratio: 38.83%

Stability Assessment:
  - Overall stability score: 0.0/100
  - Overall stability rating: Poor

Interpretation:
  The benchmark shows poor stability (score: 0.0/100) with significant
  variation between runs (CV: 39.08%).
  Performance is unpredictable and may lead to inconsistent user experience.

  The significant difference between raw and trimmed means suggests
  considerable intra-run jitter (15.4%) with occasional outliers within benchmark runs.

  The max/min ratio of 5.61 indicates
  substantial performance differences between the best and worst runs.

  The P99/P50 ratio of 2.33 suggests
  occasional latency spikes that could affect tail latency sensitive applications.
```

---------

Signed-off-by: Yang Wang <[email protected]>
…diate outputs

Differential Revision: D76831086

Pull Request resolved: #11855
…ups==1 (#11774)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: #11730 by
@mcr229
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/mcr229/31/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/mcr229/31/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/mcr229/31/orig
@diff-train-skip-merge

---------

Co-authored-by: Max Ren <[email protected]>
Co-authored-by: Gregory Comer <[email protected]>
Fixes some bugs with how enum fields are used.
Update documentation to use the new `export_llm` instead of the old
`export_llama`.
@pytorch-bot

pytorch-bot bot commented Jun 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11863

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b7572d0 with merge base 0c12dcd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the `CLA Signed` label Jun 23, 2025 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
leafs1 and others added 4 commits June 23, 2025 16:36
### Summary
This PR adds support for the tanh operator in ExecuTorch via XNNPACK, enabling optimized execution of `torch.tanh` on the XNNPACK backend. The implementation includes updates to operator configuration, serialization, and runtime handling. The tanh operator is now properly registered in the XNNPACK partition config and mapped to XNNPACK's `xnn_create_tanh_operator` API in the compiler.

### Test plan
I added a new test class, `TestTanh`, with a simple torch model containing a tanh op. It asserts that the XNNPACK delegate is invoked for the tanh op instead of the default torch implementation.
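A minimal sketch of the kind of module under test (illustrative; the real test goes through the XNNPACK tester harness):

```python
import torch

class TanhModule(torch.nn.Module):
    def forward(self, x):
        return torch.tanh(x)

# After partitioning and lowering for XNNPACK, the test checks that the
# graph contains a delegate call in place of the aten tanh op.
```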
…ups==1

Pull Request resolved: #11730

Supporting Quantized Transposed Convs with Groups being 1.

Previously, there was some support for quantized transposed convolutions, but only when the channel axis is 1 and groups is 1. The current quantizer didn't support this because it only allows quantizing along the zero dim, which is generally the output channels. However, for transposed convs, the dimensions of the weights are:
```
[in_channels, out_channels/groups, h, w]
```

Since we want to keep quantization along the output channels, we now need to quantize along axis = 1.

The reason we require groups to be one is that XNNPACK takes in filters with the dimensions:
```
[out_channels, H, W, in_channels/groups]
```

Since we are quantizing along the output channels, in PyTorch we expect to have out_channels/groups scales, but in XNNPACK we have out_channels scales! Realistically, we would need to support this with some affine quantization, where we provide a scale for every group and every out_channel. However, for now, we just enforce the constraint that groups == 1.
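A small sketch of why the quantization axis flips (standard PyTorch weight shapes; the quantize call illustrates the per-channel scheme, not the quantizer code itself):

```python
import torch

conv = torch.nn.Conv2d(4, 8, 3)             # weight: [out, in, h, w] -> [8, 4, 3, 3]
deconv = torch.nn.ConvTranspose2d(4, 8, 3)  # weight: [in, out/groups, h, w] -> [4, 8, 3, 3]

# Per-output-channel quantization therefore uses axis=0 for Conv2d but
# axis=1 for ConvTranspose2d. With groups == 1, out_channels/groups ==
# out_channels, so there is exactly one scale per output channel.
w = deconv.weight.detach()
scales = w.abs().amax(dim=(0, 2, 3)) / 127.0  # one scale per output channel
zero_points = torch.zeros(w.size(1), dtype=torch.int64)
qw = torch.quantize_per_channel(w, scales, zero_points, axis=1, dtype=torch.qint8)
print(qw.q_per_channel_scales().shape)  # torch.Size([8])
```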
ghstack-source-id: 291033630
@exported-using-ghexport

Differential Revision: [D76631781](https://our.internmc.facebook.com/intern/diff/D76631781/)
…groups ==1

Pull Request resolved: #11731

Here we support dynamically quantized Deconvolutions.

There is some refactoring of the previous diff, but in general, we just remove the constraint in the dynamism check that the convolution isn't transposed. For the same reasons as before, this only supports channel_axis = 1 and groups = 1.
ghstack-source-id: 291033632
@exported-using-ghexport

Differential Revision: [D76638904](https://our.internmc.facebook.com/intern/diff/D76638904/)
Pull Request resolved: #11732

Allow distinguishing between transposed convs and regular convs during operator selection. Previously, we grouped all conv targets together (transposed and regular), but now we enable better per-operator selection.
ghstack-source-id: 291033631

Differential Revision: [D76641838](https://our.internmc.facebook.com/intern/diff/D76641838/)
@github-actions

github-actions bot commented Sep 2, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the `stale` label (PRs inactive for over 60 days) Sep 2, 2025.