Delete opt_mul_scalar_out #12145
Conversation
Stack from ghstack (oldest at bottom):
The handwritten optimized code is similar to what we should be getting from the optimized portable op, as follows.

Handwritten optimized code:
- If the input type matches the output type, perform a vectorized loop.
- Otherwise, generate specific mixed-dtype kernels, which aren't vectorized.

Optimized portable op:
- If the input type matches the output type, perform a vectorized loop (dtype_specialized_elementwise_fn_impl in elementwise_util.h).
- Otherwise, generate one specific kernel per compute type. Those kernels use non-inlined function calls to do loads and stores, trading off performance for a significant size reduction (apply_elementwise_fn_generic_impl in elementwise_util.h).

Both cases in the portable op variant also use parallel_for. (A sketch of the two paths is shown below.)

I attempted to do a performance test, but I found that `torch.mul(some_tensor, 2.0)` is exported as a call to mul.Tensor, *not* mul.Scalar. 41e7ffa added the ability to pass our tests if we do emit mul.Scalar for this, but the follow-up diff to make that happen seems not to have landed. So I think another reason to delete this is that (if I understand correctly) it's not used; therefore we don't have specific knowledge that we need it to exist and can't just use the optimized portable op.

ghstack-source-id: 9a25f5a
ghstack-comment-id: 3025417542
Pull-Request-resolved: #12145
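For concreteness, here is a minimal C++ sketch of the two strategies described above. It is an illustration under assumptions, not the actual ExecuTorch code: the function names (`mul_scalar_same_dtype`, `mul_scalar_mixed_dtype`) and the load/store callback types are hypothetical and only loosely mirror dtype_specialized_elementwise_fn_impl and apply_elementwise_fn_generic_impl.

```cpp
#include <cstddef>

// Illustrative sketch only: names and signatures are hypothetical and do not
// match the actual code in elementwise_util.h.

// Same-dtype fast path: a plain typed loop that the compiler can
// auto-vectorize (analogous in spirit to dtype_specialized_elementwise_fn_impl).
template <typename T>
void mul_scalar_same_dtype(const T* in, T scalar, T* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i] * scalar;
  }
}

// Mixed-dtype path: one kernel per compute type; element loads and stores go
// through non-inlined function pointers, which blocks vectorization but keeps
// code size small (analogous in spirit to apply_elementwise_fn_generic_impl).
using LoadFn = double (*)(const void* data, std::size_t index);
using StoreFn = void (*)(double value, void* data, std::size_t index);

void mul_scalar_mixed_dtype(
    const void* in, double scalar, void* out, std::size_t n,
    LoadFn load, StoreFn store) {
  for (std::size_t i = 0; i < n; ++i) {
    store(load(in, i) * scalar, out, i);
  }
}
```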
kimishpatel left a comment
I trust that the summary holds true. I think the question of multi-threading remains, in that it is not always beneficial to multithread. Our portable implementation should account for that; otherwise, for some cases, I won't be surprised if we see a perf regression.
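To make the threading concern concrete, a size threshold like the sketch below would keep small tensors on the calling thread. This is a hedged illustration: the parallel_for stand-in, the kMinElementsForParallel constant, and the grain size are assumptions for the example, not the portable kernels' actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for a parallel_for primitive (the real one used by the portable
// ops may differ): splits [begin, end) into grain_size chunks, one thread each.
void parallel_for(std::size_t begin, std::size_t end, std::size_t grain_size,
                  const std::function<void(std::size_t, std::size_t)>& f) {
  std::vector<std::thread> workers;
  for (std::size_t start = begin; start < end; start += grain_size) {
    workers.emplace_back(f, start, std::min(start + grain_size, end));
  }
  for (auto& t : workers) {
    t.join();
  }
}

// Assumed threshold: below this element count, thread launch/synchronization
// overhead likely outweighs any speedup, so stay single-threaded.
constexpr std::size_t kMinElementsForParallel = 1 << 15;

void mul_scalar_maybe_parallel(const float* in, float s, float* out,
                               std::size_t n) {
  if (n < kMinElementsForParallel) {
    for (std::size_t i = 0; i < n; ++i) {
      out[i] = in[i] * s;  // small input: plain sequential loop
    }
    return;
  }
  parallel_for(0, n, /*grain_size=*/4096,
               [&](std::size_t begin, std::size_t end) {
                 for (std::size_t i = begin; i < end; ++i) {
                   out[i] = in[i] * s;
                 }
               });
}
```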