[Feature]: Enable MLA prefill-only operator

### 🚀 The feature, motivation and pitch

Add a FlashInfer-based MLA operator that runs in MHA mode (no weight absorption), works with the KV cache, takes ragged inputs, and is used for prefill only. It uses `flashinfer.BatchPrefillWithRaggedKVCacheWrapper`.
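For context, "ragged" means the tokens of all requests are packed back to back without padding, and an offsets array marks where each request starts and ends. Below is a minimal sketch of building such offsets from per-request lengths, assuming the usual CSR-style convention. The helper name `build_indptr` and the example lengths are hypothetical, and how the array is handed to the wrapper should be checked against the FlashInfer documentation.

```python
from itertools import accumulate


def build_indptr(seq_lens):
    """Return CSR-style offsets: request i owns rows indptr[i]:indptr[i + 1]."""
    return [0, *accumulate(seq_lens)]


# Three prefill requests with 3, 5 and 2 tokens (illustrative values).
seq_lens = [3, 5, 2]
indptr = build_indptr(seq_lens)
```

In a prefill-only setting with no previously cached prefix, the query and key/value offsets would be the same array, since every request contributes the same number of query and key/value rows.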

This kernel is optimized for prefill.
Its inputs are full-head Q/K/V, so it performs MLA in MHA mode (no absorption).
The operator should compute `ckv` and `kpe`, write them to the compressed cache, and then proceed with MHA.
This can probably be done with other FlashInfer kernels.
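The data flow described above could look roughly like the pure-Python reference below. It is only a sketch under stated assumptions, not the proposed implementation: the weight names (`w_dkv`, `w_kr`, `w_uk`, `w_uv`), the dimensions, and the function `mla_prefill_mha` are all hypothetical, RoPE is omitted for brevity, and the plain loops stand in for the FlashInfer kernel.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def mla_prefill_mha(hidden, q_heads, w_dkv, w_kr, w_uk, w_uv, cache):
    """hidden: [tokens][d_model]; q_heads: [tokens][heads][d_nope + d_rope]."""
    base = len(cache)  # tokens already in the cache before this call
    # Step 1: compress each token and write (ckv, kpe) to the compressed cache.
    for h in hidden:
        ckv = matvec(w_dkv, h)  # compressed KV latent
        kpe = matvec(w_kr, h)   # positional key part (RoPE omitted here)
        cache.append((ckv, kpe))
    # Step 2: expand to full per-head K/V (no absorption) and run causal MHA.
    out = []
    for t, q_t in enumerate(q_heads):
        visible = cache[: base + t + 1]
        row = []
        for head, q in enumerate(q_t):
            keys = [matvec(w_uk[head], ckv) + kpe for ckv, kpe in visible]
            vals = [matvec(w_uv[head], ckv) for ckv, _ in visible]
            scale = 1.0 / math.sqrt(len(q))
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            row.append([
                sum(w * v[i] for w, v in zip(weights, vals)) / total
                for i in range(len(vals[0]))
            ])
        out.append(row)
    return out


# Tiny illustrative sizes; none of these come from a real model config.
rng = random.Random(0)
d_model, d_c, d_rope, d_nope, d_v, heads, tokens = 8, 4, 2, 3, 3, 2, 5
w_dkv = rand_matrix(d_c, d_model, rng)
w_kr = rand_matrix(d_rope, d_model, rng)
w_uk = [rand_matrix(d_nope, d_c, rng) for _ in range(heads)]
w_uv = [rand_matrix(d_v, d_c, rng) for _ in range(heads)]
hidden = rand_matrix(tokens, d_model, rng)
q_heads = [rand_matrix(heads, d_nope + d_rope, rng) for _ in range(tokens)]
cache = []
out = mla_prefill_mha(hidden, q_heads, w_dkv, w_kr, w_uk, w_uv, cache)
```

The point of the sketch is the ordering: the cache only ever holds the compressed `(ckv, kpe)` pair per token, while the full per-head K and V are materialized on the fly for the attention step.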

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

[Feature]: Enable MLA prefill-only operator #8424

🚀 The feature, motivation and pitch

Alternatives

Additional context

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Feature]: Enable MLA prefill-only operator #8424

Description

🚀 The feature, motivation and pitch

Alternatives

Additional context

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions