When the wasm_f32x4_convert_i32x4 intrinsic gets its input from an instruction that clears the top bits, the conversion is compiled into the f32x4.convert_i32x4_u variant instead of f32x4.convert_i32x4_s; for example:
#include <wasm_simd128.h>

v128_t plsno(v128_t x)
{
    // The unsigned shift here changes the convert instruction; that's a problem
    // because u32->f32 conversion is much slower than i32->f32 on pre-AVX512 HW.
    x = wasm_u32x4_shr(x, 1);
    return wasm_f32x4_convert_i32x4(x);
}
With -msimd128 -O2 this compiles into:
local.get 0
i32.const 1
i32x4.shr_u
f32x4.convert_i32x4_u
end_function
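
Since the source uses the signed intrinsic, the expected output would presumably be the same sequence with the signed conversion kept:

local.get 0
i32.const 1
i32x4.shr_u
f32x4.convert_i32x4_s
end_function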
The unsigned form is a problem because on x64 hardware, f32x4.convert_i32x4_u is lowered into a long multi-instruction sequence unless the browser implements an AVX512 code path and the hardware supports it. This needlessly slows down otherwise efficient SIMD kernels.