Support custom Triton SDPA kernel in CUDA backend

### 🚀 The feature, motivation and pitch

Currently, one of the main gaps in the CUDA backend is that we don't support the SDPA (scaled dot-product attention) kernel in one step; we have to decompose it, which introduces extra latency.
We should have a single Triton SDPA kernel for the CUDA backend.
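
To illustrate the gap, here is a minimal plain-Python sketch. It is not the backend's actual decomposition and not a Triton kernel. The function names `sdpa_decomposed` and `sdpa_fused` are hypothetical, and it assumes a single head with no mask or dropout. The decomposed path materializes the full score and probability matrices between steps, while the fused path uses an online softmax to produce each output row in one pass, which is roughly the structure a single fused kernel could follow.

```python
import math


def sdpa_decomposed(q, k, v):
    """SDPA as three separate steps, each producing a full intermediate."""
    scale = 1.0 / math.sqrt(len(q[0]))
    # Step 1: scores = (Q @ K^T) * scale, an L_q x L_k intermediate.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    # Step 2: row-wise softmax, a second L_q x L_k intermediate.
    probs = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        probs.append([e / total for e in exps])
    # Step 3: out = probs @ V.
    dim_v = len(v[0])
    return [[sum(p * vj[d] for p, vj in zip(row, v)) for d in range(dim_v)]
            for row in probs]


def sdpa_fused(q, k, v):
    """One pass per query row with an online softmax; no score matrix is stored."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        denom = 0.0
        acc = [0.0] * dim_v
        for kj, vj in zip(k, v):
            s = scale * sum(a * b for a, b in zip(qi, kj))
            new_max = max(running_max, s)
            # Rescale what was accumulated under the old maximum.
            correction = math.exp(running_max - new_max)
            w = math.exp(s - new_max)
            denom = denom * correction + w
            acc = [a * correction + w * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out
```

Both functions should agree up to floating-point rounding; the difference is only in how much intermediate state is written out between steps, which is where the extra latency of the decomposed path comes from.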

### Alternatives

_No response_

### Additional context

_No response_

### RFC (Optional)

_No response_