```
using DoubleFloats, BenchmarkTools, LinearAlgebra

const n = 1000
A = randn(Double64, n, n)
B = randn(Double64, n, n)
C = zeros(Double64, n, n)
@btime mul!($C, $A, $B);
```

gives

```
  21.144 s (0 allocations: 0 bytes)
```

while

```
@btime mul!($C, $A, $B, true, false);
```

gives

```
  13.402 s (3 allocations: 29.81 KiB)
```

Shouldn't `mul!(C, A, B)` be implemented as `mul!(C, A, B, true, false)`? And why do the two calls differ in performance and allocations?