
[Feature]: DeepSeek V3.2 prefill is 2x slower than V3.1 #9933

@Shang-Pin

Description

🚀 The feature, motivation and pitch

We are running DeepSeek V3.2 in TensorRT-LLM. We benchmarked prefill speed for both V3.1 and V3.2:

import time

import openai

def test():
    t = time.time()
    text = "Hello " * 150000  # large prompt (~150k repetitions) to stress prefill
    openai.completions.create(model="", prompt="\n\n<|User|>%s    <|Assistant|>      **</think>**" % text, max_tokens=100)
    print(time.time() - t)

DeepSeek V3.1 takes 8 seconds
DeepSeek V3.2 takes 18 seconds
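
Because max_tokens=100, the measured time also includes 100 decode steps. A streaming variant that records time-to-first-token isolates the prefill phase more cleanly; here is a minimal sketch, assuming an OpenAI-compatible TensorRT-LLM server (the base_url and api_key values are placeholders):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

def time_to_first_token(prompt: str) -> float:
    t0 = time.time()
    stream = client.completions.create(model="", prompt=prompt, max_tokens=100, stream=True)
    for _ in stream:
        return time.time() - t0  # first streamed chunk roughly marks the end of prefill
    raise RuntimeError("stream produced no chunks")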

Alternatives

The sparse attention in V3.2 is supposed to make prefill faster than, or at least on par with, V3.1.
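
For intuition: dense prefill attention computes on the order of n^2 query-key scores for an n-token prompt, while a sparse scheme that lets each query attend to at most k tokens computes on the order of n*k. A rough back-of-envelope sketch (k = 2048 is an illustrative assumption, not the actual V3.2 configuration):

n = 150_000  # prompt length in tokens, the order of magnitude of this benchmark
k = 2_048    # assumed per-query token budget under sparse attention

dense_pairs = n * n    # dense prefill: every query scores every key
sparse_pairs = n * k   # sparse prefill: each query scores at most k keys
print("dense/sparse score ratio: %.0fx" % (dense_pairs / sparse_pairs))  # ~73x fewer scores

On this rough count alone, sparse prefill should do far less attention work, which is why a 2x slowdown is unexpected.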

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • General perf<NV>: Broad performance issues not specific to a particular component
  • feature request: New feature or request. This includes new model, dtype, functionality support
