Different output from HF and TensorRT-LLM #2754
Unanswered
Ericoool9614 asked this question in Q&A
Model: InternVL2-8B
Precision: BF16, no quantization
No sampling (temperature=0 and do_sample=False in the HF generation_config, so both runs use greedy search)
Single-GPU execution, no model parallelism
No batching (batch size = 1)
Inference is performed with the Hugging Face model.chat method and the TensorRT-LLM MultimodalModelRunner.run() method.
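
Before digging into logits, a quick check is to locate the first token at which the two greedy generations diverge. A minimal sketch, assuming `hf_response` and `trt_response` hold the text returned by `model.chat` and `MultimodalModelRunner.run()` respectively (placeholder strings below), and that the InternVL2-8B tokenizer is applied to both:

```python
from transformers import AutoTokenizer

# Assumed checkpoint path; adjust to your local copy of InternVL2-8B.
MODEL_PATH = "OpenGVLab/InternVL2-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Placeholders: paste the actual generations from the two runtimes here.
hf_response = "..."   # output of model.chat(...)
trt_response = "..."  # output of MultimodalModelRunner.run(...)

def first_divergence(a: list[int], b: list[int]):
    """Index of the first differing token ID, or None if the sequences match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

hf_ids = tokenizer(hf_response, add_special_tokens=False).input_ids
trt_ids = tokenizer(trt_response, add_special_tokens=False).input_ids

idx = first_divergence(hf_ids, trt_ids)
if idx is None:
    print("outputs are token-identical")
else:
    print(f"first divergence at token {idx}: "
          f"HF={tokenizer.decode(hf_ids[idx:idx+1])!r} vs "
          f"TRT={tokenizer.decode(trt_ids[idx:idx+1])!r}")
```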
Replies: 1 comment

The difference appears after the first logits output from runtime.generation.handle_per_step.
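
If the logits already differ at the first decoding step, a numerical comparison can tell whether this is a small BF16/kernel discrepancy that flips a near-tied greedy choice or a genuinely large mismatch. A sketch, assuming the step-0 logits from each runtime have been dumped to the (hypothetical) files below, e.g. via torch.save on the HF forward output and on the logits seen in runtime.generation.handle_per_step:

```python
import torch

# Hypothetical dump files produced by torch.save(logits, path) in each runtime.
hf_logits = torch.load("hf_step0_logits.pt").float().flatten()
trt_logits = torch.load("trt_step0_logits.pt").float().flatten()

# Magnitude of the numerical gap between the two logit vectors.
diff = (hf_logits - trt_logits).abs()
print(f"max |diff| = {diff.max().item():.6f}, mean |diff| = {diff.mean().item():.6f}")

# Greedy search only cares about the argmax, so check whether it agrees.
hf_top = hf_logits.argmax().item()
trt_top = trt_logits.argmax().item()
print(f"HF argmax = {hf_top}, TRT argmax = {trt_top}, match = {hf_top == trt_top}")

# A small top-1/top-2 gap means tiny BF16 or kernel-order differences can
# legitimately flip the greedy token and cascade into different outputs.
top2 = hf_logits.topk(2).values
print(f"HF top-1/top-2 gap = {(top2[0] - top2[1]).item():.6f}")
```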