Labels
General perf: <NV>Broad performance issues not specific to a particular component
feature request: New feature or request. This includes new model, dtype, functionality support
Description
🚀 The feature, motivation and pitch
We are running DeepSeek V3.2 in TensorRT-LLM. We benchmarked prefill speed for both V3.1 and V3.2 with the following script:
import time
import openai

def test():
    t = time.time()
    # Build a very long prompt so that prefill dominates the request time.
    text = "Hello " * 150000
    openai.completions.create(model="", prompt="\n\n<|User|>%s <|Assistant|> **</think>**" % text, max_tokens=100)
    print(time.time() - t)
DeepSeek V3.1 takes 8 seconds
DeepSeek V3.2 takes 18 seconds
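The timings above include up to 100 generated tokens, so they are not a pure prefill measurement. A minimal sketch of one way to narrow the measurement is to request a single token and repeat the call a few times. The helper names `time_prefill` and `summarize` are hypothetical, and `create` is assumed to be a callable with the same keyword arguments as `openai.completions.create`. If the server reuses cached prefixes for identical prompts, repeats after the first may understate prefill cost.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_prefill(create: Callable[..., object], prompt: str, repeats: int = 3) -> List[float]:
    """Time `create` with max_tokens=1 so that decode time is minimal.

    `create` is a hypothetical stand-in for a completions call such as
    openai.completions.create; model="" mirrors the script in this issue.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        create(model="", prompt=prompt, max_tokens=1)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Report the fastest and the median run, in seconds."""
    return {"min": min(timings), "median": median(timings)}


# Example usage against a running server (not executed here):
#   import openai
#   prompt = "\n\n<|User|>%s <|Assistant|> **</think>**" % ("Hello " * 150000)
#   print(summarize(time_prefill(openai.completions.create, prompt)))
```

Running this against both model versions with the same prompt would show whether the gap comes from prefill itself rather than from decoding.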
Alternatives
The sparse attention in V3.2 is supposed to make prefill faster, or at least no slower, than in the earlier model.
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.