
Thanks for sharing this excellent implementation of ring attention.
Here are my test results on 2× A100 GPUs (connected via NVLink). Judging from the results, the memory usage of ring attention (`ring_flash_attn_qkvpacked_func`) seems to be very large, which is not what I expected. Are there any possible causes?