There are currently 3 shapes (causal=false, d_head=64) that have performance <95% of XeTLA:

- 32, 32, 512: 78%
- 4, 32, 4096: 82%
- 2, 32, 8192: 93%