Commit 2d095b8

Delete opt_mul_scalar_out (#12145)
The handwritten optimized code is similar to what we should be getting from the optimized portable op, as follows.

Handwritten optimized code:
- If the input type matches the output type, perform a vectorized loop.
- Otherwise, generate specific mixed-dtype kernels, which aren't vectorized.

Optimized portable op:
- If the input type matches the output type, perform a vectorized loop (dtype_specialized_elementwise_fn_impl in elementwise_util.h).
- Otherwise, generate one specific kernel per compute type. Those kernels use non-inlined function calls to do loads and stores, trading off performance for a significant size reduction (apply_elementwise_fn_generic_impl in elementwise_util.h).

Both cases in the portable op variant also use parallel_for.

I attempted to do a performance test, but I found that `torch.mul(some_tensor, 2.0)` is exported as a call to mul.Tensor, *not* mul.Scalar. 41e7ffa added the ability to pass our tests if we do emit mul.Scalar for this, but the follow-up diff to make that happen seems not to have landed. So another reason to delete this is that (if I understand correctly) it's not used, which means we have no specific evidence that it needs to exist or that the optimized portable op wouldn't suffice.
1 parent 14085eb commit 2d095b8
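
For context on the two-path dispatch the commit message describes, the following is a minimal, self-contained C++ sketch of the same pattern. It is illustrative only and does not reproduce the actual elementwise_util.h APIs (dtype_specialized_elementwise_fn_impl / apply_elementwise_fn_generic_impl); the names mul_scalar_same_dtype, mul_scalar_generic, load_int32_as_float, and store_float are hypothetical.

// Illustrative sketch only; these names are hypothetical and this is not the
// ExecuTorch elementwise_util.h implementation.
#include <cstddef>
#include <cstdint>

// Fast path: input and output share a dtype, so each instantiation is a tight
// loop over the concrete type that the compiler can auto-vectorize.
template <typename T>
void mul_scalar_same_dtype(const T* in, T scalar, T* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[i] * scalar;
  }
}

// Generic path: mixed dtypes. One loop is instantiated per *compute* type
// (float here), and per-element loads/stores go through function pointers
// instead of being inlined, shrinking code size at some performance cost.
using LoadFn = float (*)(const void* base, std::size_t index);
using StoreFn = void (*)(void* base, std::size_t index, float value);

void mul_scalar_generic(const void* in, LoadFn load, float scalar,
                        void* out, StoreFn store, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    store(out, i, load(in, i) * scalar);
  }
}

// Example load/store pair: int32 input written to a float output.
float load_int32_as_float(const void* base, std::size_t i) {
  return static_cast<float>(static_cast<const std::int32_t*>(base)[i]);
}
void store_float(void* base, std::size_t i, float v) {
  static_cast<float*>(base)[i] = v;
}

The fast path pays code size for one vectorizable loop per dtype; the generic path pays an indirect call per element in exchange for far fewer instantiated kernels, which is the size-for-speed trade-off described above. A real implementation would also split the iteration range with parallel_for, as the commit message notes both portable paths do.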

File tree

2 files changed: +0 −62 lines


kernels/optimized/cpu/op_mul.cpp

Lines changed: 0 additions & 57 deletions
@@ -210,63 +210,6 @@ Tensor& opt_mul_out(
   return out;
 }
 
-Tensor& opt_mul_scalar_out(
-    KernelRuntimeContext& ctx,
-    const Tensor& a,
-    const Scalar& b,
-    Tensor& out) {
-  (void)ctx;
-
-  ScalarType a_type = a.scalar_type();
-  ScalarType common_type =
-      utils::promote_type_with_scalar(a_type, b, /*half_to_float*/ false);
-  ScalarType out_type = out.scalar_type();
-
-  ET_CHECK(common_type == out_type);
-
-  if (common_type == ScalarType::Half || common_type == ScalarType::BFloat16) {
-    common_type = ScalarType::Float;
-  }
-
-  // Resize for dynamic shape
-  auto error = resize_tensor(out, a.sizes());
-  ET_CHECK_MSG(error == Error::Ok, "Failed to resize output tensor.");
-
-  if (a_type == common_type && a_type == out_type &&
-      a_type != ScalarType::Half && a_type != ScalarType::BFloat16) {
-    ET_SWITCH_REALB_TYPES(a_type, ctx, "mul.Scalar_out", CTYPE, [&]() {
-      CTYPE b_casted = utils::scalar_to<CTYPE>(b);
-
-      using Vec = at::vec::Vectorized<CTYPE>;
-      at::vec::map<CTYPE>(
-          [b_casted](Vec x) { return x * Vec(b_casted); },
-          out.mutable_data_ptr<CTYPE>(),
-          a.const_data_ptr<CTYPE>(),
-          out.numel());
-    });
-  } else {
-    ET_SWITCH_REALHBBF16_TYPES(a_type, ctx, "mul.Scalar_out", CTYPE_A, [&]() {
-      ET_SWITCH_REALB_TYPES(
-          common_type, ctx, "mul.Scalar_out", CTYPE_IN, [&]() {
-            ET_SWITCH_REALHBBF16_TYPES(
-                out_type, ctx, "mul.Scalar_out", CTYPE_OUT, [&]() {
-                  CTYPE_IN b_casted = utils::scalar_to<CTYPE_IN>(b);
-
-                  const size_t n = a.numel();
-                  const CTYPE_A* a_data = a.const_data_ptr<CTYPE_A>();
-                  CTYPE_OUT* out_data = out.mutable_data_ptr<CTYPE_OUT>();
-                  for (auto i = 0; i < n; ++i) {
-                    out_data[i] = static_cast<CTYPE_OUT>(
-                        static_cast<CTYPE_IN>(a_data[i]) * b_casted);
-                  }
-                });
-          });
-    });
-  }
-
-  return out;
-}
-
 } // namespace native
 } // namespace executor
 } // namespace torch

kernels/optimized/optimized.yaml

Lines changed: 0 additions & 5 deletions
@@ -82,11 +82,6 @@
   - arg_meta: null
     kernel_name: torch::executor::opt_mul_out
 
-- op: mul.Scalar_out
-  kernels:
-  - arg_meta: null
-    kernel_name: torch::executor::opt_mul_scalar_out
-
 - op: native_layer_norm.out
   kernels:
   - arg_meta: null
