
[Feature]: DeepSeek V3.2 prefill is 2x slower than V3.1 #9933

@Shang-Pin

Description

🚀 The feature, motivation and pitch

We are running DeepSeek V3.2 in TensorRT-LLM. We benchmarked prefill speed for both V3.1 and V3.2:

import time

import openai

def test():
    t = time.time()
    text = "Hello " * 150000  # large prompt (~150k repetitions) to stress prefill
    openai.completions.create(model="", prompt="\n\n<|User|>%s    <|Assistant|>      **</think>**" % text, max_tokens=100)
    print(time.time() - t)

DeepSeek V3.1 takes 8 seconds
DeepSeek V3.2 takes 18 seconds
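
Because max_tokens=100, the measured time also includes 100 decode steps. A streaming variant that records time-to-first-token isolates the prefill phase more cleanly; here is a minimal sketch, assuming an OpenAI-compatible TensorRT-LLM server (the base_url and api_key values are placeholders):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

def time_to_first_token(prompt: str) -> float:
    t0 = time.time()
    stream = client.completions.create(model="", prompt=prompt, max_tokens=100, stream=True)
    for _ in stream:
        return time.time() - t0  # first streamed chunk roughly marks the end of prefill
    raise RuntimeError("stream produced no chunks")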

Alternatives

The sparse attention in V3.2 is supposed to make prefill faster than, or at least on par with, V3.1.
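
For intuition: dense prefill attention computes on the order of n^2 query-key scores for an n-token prompt, while a sparse scheme that lets each query attend to at most k tokens computes on the order of n*k. A rough back-of-envelope sketch (k = 2048 is an illustrative assumption, not the actual V3.2 configuration):

n = 150_000  # prompt length in tokens, the order of magnitude of this benchmark
k = 2_048    # assumed per-query token budget under sparse attention

dense_pairs = n * n    # dense prefill: every query scores every key
sparse_pairs = n * k   # sparse prefill: each query scores at most k keys
print("dense/sparse score ratio: %.0fx" % (dense_pairs / sparse_pairs))  # ~73x fewer scores

On this rough count alone, sparse prefill should do far less attention work, which is why a 2x slowdown is unexpected.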

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • General perf<NV>: Broad performance issues not specific to a particular component
  • feature request: New feature or request. This includes new model, dtype, functionality support
