We might be able to implement _extract_patches more efficiently for some special cases, e.g.
- when dilation=1, we can look into using something like the im2col_2d function in ASDL (if indeed faster);
- when using KFAC-reduce, we should use Felix's implementation in einconv (see the corresponding paper).
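
As a rough illustration of the dilation=1 case, here is a minimal sketch that builds the patches from strided views with `torch.Tensor.unfold` and checks them against `torch.nn.functional.unfold`. The function names, the assumed (batch, channels, height, width) input layout and the (batch, num_patches, channels * kh * kw) output layout are assumptions made for this example; none of it is taken from ASDL or einconv, and whether it is actually faster would need benchmarking.

```python
import torch
import torch.nn.functional as F


def extract_patches_strided(x, kernel_size, stride=1, padding=0):
    """Hypothetical dilation=1 patch extraction via strided views.

    x: tensor of shape (batch, channels, height, width)
    returns: tensor of shape (batch, out_h * out_w, channels * kh * kw)
    """
    kh, kw = kernel_size
    # Zero-pad width (last dim) and height (second-to-last dim) symmetrically.
    x = F.pad(x, (padding, padding, padding, padding))
    # Sliding windows along height, then width:
    # resulting shape is (batch, channels, out_h, out_w, kh, kw).
    patches = x.unfold(2, kh, stride).unfold(3, kw, stride)
    b, c, out_h, out_w = patches.shape[:4]
    # Reorder to (batch, out_h, out_w, channels, kh, kw) so that flattening
    # matches the (channels, kh, kw) feature ordering of F.unfold.
    patches = patches.permute(0, 2, 3, 1, 4, 5)
    return patches.reshape(b, out_h * out_w, c * kh * kw)


def extract_patches_reference(x, kernel_size, stride=1, padding=0):
    """Reference based on F.unfold, used here only for comparison."""
    # F.unfold returns (batch, channels * kh * kw, num_patches).
    cols = F.unfold(x, kernel_size, stride=stride, padding=padding)
    return cols.transpose(1, 2)


x = torch.randn(2, 3, 8, 8)
fast = extract_patches_strided(x, (3, 3), stride=2, padding=1)
ref = extract_patches_reference(x, (3, 3), stride=2, padding=1)
```

The strided version only creates views until the final `reshape`, which is where the copy happens; any speed-up over `F.unfold` is therefore not guaranteed and depends on shapes and device.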