Flash Attention 4 #318

Description

@karimknaebel

Given that torch's FlexAttention just added an FA4 backend, do you expect NATTEN with the FlexAttention backend to become faster than your built-in kernels? :)
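For context, expressing neighborhood attention through FlexAttention means encoding the sliding-window sparsity as a mask over query/key positions; a minimal sketch of the idea in 1-D is below. This uses torch's public `flex_attention` / `create_block_mask` API directly rather than NATTEN's actual backend, and the window size and tensor shapes are illustrative assumptions.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 4, 1024, 64   # batch, heads, sequence length, head dim (illustrative)
WINDOW = 129                  # odd neighborhood window size (illustrative)

def neighborhood_mask(b, h, q_idx, kv_idx):
    # Each query attends only to keys within WINDOW // 2 positions of itself.
    return (q_idx - kv_idx).abs() <= WINDOW // 2

# B=None / H=None broadcast the mask over batch and heads.
block_mask = create_block_mask(neighborhood_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
```

With this formulation, any kernel improvement in FlexAttention's backends (such as an FA4-based one) would flow through to the neighborhood-attention pattern without changes to the mask logic.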
