At the end of the Generalised ufuncs section, there is a short comment noting that NumPy is better at matrix multiplication than a naive for loop. We could add another sentence briefly explaining that naive matrix multiplication is very cache-inefficient and (very roughly) how BLAS gets around that. (And that NumPy uses BLAS, of course.)
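
If a small illustration would help alongside that sentence, something like the sketch below might work. It is only a rough picture, not how BLAS is actually implemented: `matmul_naive` and `matmul_blocked` are hypothetical names made up here, the block size of 16 is arbitrary, and real BLAS libraries do considerably more (packing, SIMD kernels, threading). The idea it tries to show is that the naive inner loop walks down a column of `b`, which is strided access for a C-ordered array, whereas a blocked version works on small tiles that can stay in cache while they are reused.

```python
import numpy as np


def matmul_naive(a, b):
    """Textbook triple loop over scalars.

    For C-ordered arrays, the innermost loop reads b[p, j] with p varying,
    i.e. it strides down a column of b, touching a new row on every step.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out


def matmul_blocked(a, b, block=16):
    """Same result, computed tile by tile (loop tiling).

    Each step multiplies a (block x block) tile of a with a tile of b and
    accumulates into a tile of out, so the data being reused is small.
    The per-tile product uses @ only to keep the sketch short; the point
    is the loop structure, not the speed of this Python code.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # Slices that run past the edge are clipped by NumPy,
                # so ragged tiles at the borders are handled for free.
                out[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, p0:p0 + block]
                    @ b[p0:p0 + block, j0:j0 + block]
                )
    return out


# Small, deliberately non-multiple-of-block shapes for a sanity check.
rng = np.random.default_rng(0)
a = rng.standard_normal((24, 40))
b = rng.standard_normal((40, 16))

reference = a @ b  # typically dispatched to BLAS for float arrays
result_naive = matmul_naive(a, b)
result_blocked = matmul_blocked(a, b)
```

All three should agree up to floating-point rounding. Timing the pure-Python loops against `a @ b` mostly measures interpreter overhead rather than cache behaviour, so the sketch is probably best presented as an explanation of the access pattern, not as a benchmark.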