Closed
Labels: Customized kernels <NV>, bug
Description
System Info
Debian 12
256GB RAM
TRT-LLM: 0.16.0
Running on an H100 80GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Build a TensorRT-LLM engine for Gemma 3
- Run inference with a very long sequence (>>1024 tokens, i.e., well beyond the local window attention size of 1024)
- Observe that the outputs are incorrect

An example of a long prompt would be: "Can you repeat this exact paragraph: '<long paragraph>'"; this will break with the currently built Gemma 3 engine.
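For concreteness, here is a minimal repro sketch using TRT-LLM's high-level `LLM` API. The checkpoint ID, prompt construction, and sampling settings are assumptions for illustration, not the exact setup from this report:

```python
# Hypothetical repro sketch. The checkpoint ID, prompt length, and sampling
# settings are assumptions; any Gemma 3 variant with 1024-token sliding-window
# layers should exercise the same code path.
from tensorrt_llm import LLM, SamplingParams

# Build/load an engine from a Hugging Face checkpoint. Gemma 3 interleaves
# global-attention layers with local sliding-window-attention layers
# (window size 1024), which is the path this bug exercises.
llm = LLM(model="google/gemma-3-1b-it")  # assumed checkpoint

# Construct a prompt that is well past the 1024-token local window.
paragraph = "The quick brown fox jumps over the lazy dog. " * 300  # ~3000 tokens
prompt = f'Can you repeat this exact paragraph: "{paragraph}"'

outputs = llm.generate([prompt], SamplingParams(max_tokens=128, temperature=0.0))
# On affected builds, the echoed paragraph degrades once generation depends on
# tokens that fell outside the local window.
print(outputs[0].outputs[0].text)
```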
Expected behavior
- Outputs should match the Hugging Face (HF) reference implementation (see the sketch below)
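A reference output can be generated with Hugging Face transformers on the same prompt; as above, the checkpoint ID is an assumption:

```python
# Hypothetical HF reference run for the same prompt; checkpoint ID is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

paragraph = "The quick brown fox jumps over the lazy dog. " * 300
prompt = f'Can you repeat this exact paragraph: "{paragraph}"'

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens; this is the expected (correct) echo.
print(tokenizer.decode(generated[0, inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```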
Actual behavior
- Outputs do not match HF; generations degrade once the input exceeds the 1024-token local attention window
Additional notes
I have a fix I'd like to contribute - #9961
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.