🚀 The feature, motivation and pitch
Currently, one of the main gaps in the CUDA backend is that we can't run SDPA (scaled dot-product attention) as a single kernel. Instead we have to decompose it into separate ops, which adds extra latency.
We should have a single Triton SDPA kernel for the CUDA backend.
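To make the gap concrete, here is a rough pure-Python sketch of the difference between a decomposed SDPA and a single-pass one. It is only an illustration, not Triton code and not what the backend currently emits. The function names `sdpa_decomposed` and `sdpa_fused` are hypothetical, and the fused variant assumes an online-softmax formulation, which is one common way to fuse attention and not necessarily how the requested kernel would be written.

```python
import math


def sdpa_decomposed(q, k, v):
    """Three separate steps, each materializing a full intermediate."""
    scale = 1.0 / math.sqrt(len(q[0]))
    # Step 1: scores = (Q @ K^T) * scale, a full [n_q, n_k] matrix.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    # Step 2: row-wise softmax, another full [n_q, n_k] matrix.
    probs = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    # Step 3: out = probs @ V.
    d_v = len(v[0])
    return [[sum(p * vj[c] for p, vj in zip(row, v)) for c in range(d_v)]
            for row in probs]


def sdpa_fused(q, k, v):
    """One pass over the keys per query row; no score matrix is stored."""
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        denom = 0.0
        acc = [0.0] * d_v
        for kj, vj in zip(k, v):
            s = scale * sum(a * b for a, b in zip(qi, kj))
            new_max = max(running_max, s)
            # Rescale the partial sums accumulated so far to the new max.
            correction = math.exp(running_max - new_max)
            w = math.exp(s - new_max)
            denom = denom * correction + w
            acc = [a * correction + w * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out
```

Both functions should agree up to floating-point error. The decomposed version corresponds to launching separate matmul, softmax, and matmul ops with intermediates written out in between, which is where the extra latency comes from.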
Alternatives
No response
Additional context
No response
RFC (Optional)
No response