The handwritten optimized code is similar to what we should be getting from the optimized portable op, as follows.

Handwritten optimized code (sketch below):
- If the input type matches the output type, perform a vectorized loop.
- Otherwise, generate specific mixed-dtype kernels, which aren't vectorized.
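
To make the handwritten pattern concrete, here is a minimal, self-contained sketch under assumed names (illustrative only, not the actual kernel code): the same-dtype path is a tight loop the compiler can auto-vectorize, while each mixed-dtype combination gets its own scalar loop.

```cpp
#include <cstddef>
#include <cstdint>

// Fast path: input and output are both float, so the body is a plain
// multiply that the compiler can auto-vectorize.
void mul_scalar_float(const float* in, float scalar, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i] * scalar;
  }
}

// Slow path: one specific kernel per (input, output) dtype pair, e.g.
// int32 input -> float output. Not vectorized; converts element by element.
void mul_scalar_int32_to_float(
    const int32_t* in, float scalar, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<float>(in[i]) * scalar;
  }
}
```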
Optimized portable op (sketch below):
- If the input type matches the output type, perform a vectorized loop (`dtype_specialized_elementwise_fn_impl` in elementwise_util.h).
- Otherwise, generate one specific kernel per compute type. Those kernels use non-inlined function calls to do loads and stores, trading off performance for a significant size reduction (`apply_elementwise_fn_generic_impl` in elementwise_util.h).

Both cases in the portable op variant also use `parallel_for`.
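
For comparison, here is a rough sketch of what the generic portable path trades away. All names here are hypothetical; this is not the real `apply_elementwise_fn_generic_impl`, just the same idea: the compute loop is instantiated once per compute type, element access goes through non-inlined function pointers that convert to and from that compute type, and the outer loop is split into ranges the way a `parallel_for` would split them.

```cpp
#include <cstddef>
#include <cstdint>

// Type-erased load/store: one tiny conversion function per dtype, reached
// through a function pointer, so only one compute loop per compute type
// needs to be instantiated (smaller code, slower element access).
using LoadFn = float (*)(const void* ptr, std::size_t i);
using StoreFn = void (*)(void* ptr, std::size_t i, float value);

float load_float(const void* p, std::size_t i) {
  return static_cast<const float*>(p)[i];
}
float load_int32(const void* p, std::size_t i) {
  return static_cast<float>(static_cast<const int32_t*>(p)[i]);
}
void store_float(void* p, std::size_t i, float v) {
  static_cast<float*>(p)[i] = v;
}

// One kernel per compute type (float here). Loads and stores are indirect
// calls, so the loop body won't vectorize, but the same kernel serves every
// input/output dtype combination.
void mul_scalar_generic(const void* in, LoadFn load, void* out, StoreFn store,
                        float scalar, std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) {
    store(out, i, load(in, i) * scalar);
  }
}

// parallel_for-style driver: split [0, n) into grains; a real implementation
// would hand each range to a worker thread instead of running them in order.
void mul_scalar_parallel(const void* in, LoadFn load, void* out, StoreFn store,
                         float scalar, std::size_t n, std::size_t grain) {
  for (std::size_t begin = 0; begin < n; begin += grain) {
    std::size_t end = begin + grain < n ? begin + grain : n;
    mul_scalar_generic(in, load, out, store, scalar, begin, end);
  }
}
```

Calling `mul_scalar_parallel(int32_data, load_int32, float_out, store_float, 2.0f, n, 1024)` reuses the single float compute loop for an int32 -> float multiply; that reuse is the size win, and the per-element indirect calls are the performance cost.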
I attempted to do a performance test, but I found that `torch.mul(some_tensor, 2.0)` is exported as a call to `mul.Tensor`, *not* `mul.Scalar`.
41e7ffa added the ability for our tests to pass if we do emit `mul.Scalar` for this, but the follow-up diff to actually make that happen seems not to have landed. So I think another reason to delete this is that (if I understand correctly) it isn't used; we therefore have no specific evidence that it needs to exist, i.e. that we can't just use the optimized portable op instead.