
Conversation

@swolchok swolchok (Contributor) commented Jul 1, 2025

The handwritten optimized code is similar to what we should be getting from the optimized portable op, as follows.

handwritten optimized code:

  • if the input type matches the output type, then perform a vectorized loop
  • otherwise, generate specific mixed-dtype kernels, which aren't vectorized.

optimized portable op:

  • if the input type matches the output type, then perform a vectorized loop. (dtype_specialized_elementwise_fn_impl in elementwise_util.h)
  • otherwise, generate one specific kernel per compute type. Those kernels use non-inlined function calls to do loads and stores, trading off performance for a significant size reduction. (apply_elementwise_fn_generic_impl in elementwise_util.h)

Both cases in the portable op variant also use parallel_for. A rough sketch of the two paths follows.
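
To make the shared structure concrete, here is a hedged sketch of the two paths. All names and signatures are illustrative stand-ins for the elementwise_util.h helpers, vectorization and parallel_for are elided, and the compute type is fixed to float for brevity:

```cpp
#include <cstdint>

// Fast path (cf. dtype_specialized_elementwise_fn_impl): input and output
// share a dtype, so the compiler sees concrete types and can vectorize.
template <typename T, typename Op>
void dtype_specialized_impl(const T* in, T* out, int64_t n, Op op) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = op(in[i]);
  }
}

// Generic path (cf. apply_elementwise_fn_generic_impl): one instantiation
// per compute type. Loads and stores go through non-inlined function
// pointers that convert between the runtime dtypes and the compute type,
// so a single kernel serves every (input, output) dtype pair; this is the
// size-for-speed trade-off described above.
using LoadFn = float (*)(const void* src, int64_t i);
using StoreFn = void (*)(void* dst, int64_t i, float v);

template <typename Op>
void generic_impl(const void* in, void* out, int64_t n,
                  LoadFn load, StoreFn store, Op op) {
  for (int64_t i = 0; i < n; ++i) {
    store(out, i, op(load(in, i)));
  }
}

// Example converting load/store pair for an int32 tensor computed in float:
float load_i32_as_float(const void* src, int64_t i) {
  return static_cast<float>(static_cast<const int32_t*>(src)[i]);
}
void store_float_as_i32(void* dst, int64_t i, float v) {
  static_cast<int32_t*>(dst)[i] = static_cast<int32_t>(v);
}
```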

I attempted to do a performance test, but I found that
`torch.mul(some_tensor, 2.0)` is exported as a call to mul.Tensor,
*not* mul.Scalar. 41e7ffa added the ability to pass our tests if we
do emit mul.Scalar for this, but the follow-up diff to make that
happen seems not to have landed. So another reason to delete this is
that (if I understand correctly) it's not used: we have no specific
evidence that we need it to exist, or that we can't just use the
optimized portable op.
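
For reference, a minimal repro sketch of that observation (the module name and tensor shape are arbitrary; this just prints the exported graph so you can see which overload appears):

```python
import torch

class MulByScalar(torch.nn.Module):
    def forward(self, x):
        return torch.mul(x, 2.0)

# Export and print the graph; the mul shows up as aten.mul.Tensor,
# not aten.mul.Scalar, matching the observation above.
ep = torch.export.export(MulByScalar(), (torch.randn(4),))
print(ep.graph_module.graph)
```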

[ghstack-poisoned]
@swolchok swolchok (Contributor, Author) commented Jul 1, 2025

Stack from ghstack (oldest at bottom):

@swolchok swolchok requested a review from manuelcandales as a code owner July 1, 2025 20:34
@pytorch-bot pytorch-bot bot commented Jul 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12145

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 80a49db with merge base 7e28a04:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

swolchok added a commit that referenced this pull request Jul 1, 2025
ghstack-source-id: 9a25f5a
ghstack-comment-id: 3025417542
Pull-Request-resolved: #12145
@facebook-github-bot facebook-github-bot added the CLA Signed label Jul 1, 2025
@github-actions github-actions bot commented Jul 1, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@swolchok swolchok requested a review from kimishpatel July 3, 2025 20:22
@kimishpatel kimishpatel (Contributor) left a comment

I trust that the summary holds true. I think the question of multi-threading remains, in that it is not always beneficial to multithread. Our portable implementation should account for that; otherwise, for some cases, I won't be surprised if we see a perf regression.
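
For illustration, a minimal sketch of that kind of guard, assuming a simple element-count threshold; the constant, the chunking scheme, and the helper name are all illustrative rather than ExecuTorch's actual heuristic:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical guard: run the elementwise body sequentially when the
// tensor is too small for threading overhead to pay off.
template <typename F>
void maybe_parallel_elementwise(int64_t numel, const F& body) {
  constexpr int64_t kGrainSize = 32768;  // illustrative threshold; tune per target
  if (numel < kGrainSize) {
    body(0, numel);  // small workload: thread-launch cost would dominate
    return;
  }
  const unsigned nthreads =
      std::max(1u, std::thread::hardware_concurrency());
  const int64_t chunk = (numel + nthreads - 1) / nthreads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nthreads; ++t) {
    const int64_t begin = t * chunk;
    const int64_t end = std::min<int64_t>(numel, begin + chunk);
    if (begin < end) {
      workers.emplace_back([=] { body(begin, end); });
    }
  }
  for (auto& w : workers) {
    w.join();
  }
}
```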

@swolchok swolchok merged commit 2d095b8 into main Jul 8, 2025
94 of 97 checks passed
@swolchok swolchok deleted the gh/swolchok/488/head branch July 8, 2025 21:14
Tanish2101 pushed a commit to Tanish2101/executorch that referenced this pull request Jul 9, 2025
swolchok added a commit that referenced this pull request Jul 9, 2025
This triggered internal failures; kernels/optimized's tests don't build with Buck in OSS because they use ATen.


ghstack-source-id: 27f599d
ghstack-comment-id: 3054364234
Pull-Request: #12321
swolchok added a commit that referenced this pull request Jul 9, 2025
