In #2951, we change the bfloat16 llvm.fmul to a float32 llvm.fmul to get the correct result, which is the same logic as the elementwise fmul.
The reason could be that the bfloat16 fmul emitted by the Triton pipeline is converted to a half-type fmul after IGC compilation. Examples are uploaded in fmul_bf16.zip.
We should confirm which component causes this issue: either the LLVMToSPIRV translator mistranslates the bfloat16 fmul, or IGC does not handle the bfloat16 fmul correctly.
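
To illustrate why a bfloat16 fmul lowered as a half fmul would give wrong results, here is a minimal pure-Python sketch. It only emulates the suspected failure mode at the bit level; it is not taken from the Triton, LLVMToSPIRV, or IGC sources, and all helper names (`f32_to_bf16_bits`, `fmul_via_f32`, `fmul_as_half`, etc.) are hypothetical.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Round a float to bfloat16 (round-to-nearest-even) and return its 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Interpret 16 bits as bfloat16 by widening them to float32."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def half_bits_to_float(b: int) -> float:
    """Interpret 16 bits as IEEE half (fp16)."""
    return struct.unpack("<e", struct.pack("<H", b & 0xFFFF))[0]

def float_to_half_bits(x: float) -> int:
    """Round a float to IEEE half and return its 16 bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def fmul_via_f32(a_bits: int, b_bits: int) -> int:
    # Workaround path: upcast bf16 -> f32, multiply, truncate back to bf16.
    return f32_to_bf16_bits(bf16_bits_to_f32(a_bits) * bf16_bits_to_f32(b_bits))

def fmul_as_half(a_bits: int, b_bits: int) -> int:
    # Suspected buggy path (assumption): the same 16 bits are multiplied as fp16.
    return float_to_half_bits(half_bits_to_float(a_bits) * half_bits_to_float(b_bits))

a = f32_to_bf16_bits(1.5)  # 0x3FC0
b = f32_to_bf16_bits(2.0)  # 0x4000

good = fmul_via_f32(a, b)
bad = fmul_as_half(a, b)
print(hex(good), bf16_bits_to_f32(good))  # 0x4040 3.0
print(hex(bad), bf16_bits_to_f32(bad))    # 0x43c0 384.0
```

Comparing the raw output bits of the failing kernel against both paths for a few known inputs may help tell whether the values are being multiplied with half semantics, although it does not by itself identify which component introduces the change.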