I'm using TensorRT-LLM to run Whisper Large, a 1.5B-parameter encoder-decoder model. I switched from an L40S to an H100 PCIe, which has more than double the memory bandwidth and more than double the theoretical FP16 TFLOPS, yet performance with the Executor API and in-flight batching (IFB) only increased by around 40%. There must be a bottleneck in compute, memory bandwidth, or somewhere else. My question is: how do I assess that?
I prepare all requests up front and submit them at once, so there is no overhead from data loading or anything other than inference.
GPU utilization is 100% according to nvidia-smi.
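A first-pass way to frame this is a back-of-the-envelope roofline check. The sketch below is a simplification under stated assumptions: the peak dense-FP16 TFLOPS and HBM bandwidth figures are approximate public datasheet values (not from this issue), and the traffic model counts only one full weight read per decode step, ignoring the KV cache, activations, and the encoder. It compares the decode step's arithmetic intensity against each GPU's ridge point to suggest whether the phase is memory- or compute-bound.

```python
# Rough roofline check for the autoregressive decode phase.
# Hardware peaks below are approximate datasheet values (assumptions);
# the traffic model ignores KV-cache reads, activations, and the encoder.

PARAMS = 1.5e9          # Whisper Large parameter count (from the issue)
BYTES_PER_PARAM = 2     # FP16 weights

# Approximate dense-FP16 tensor-core peaks and memory bandwidth (assumed):
GPUS = {
    "L40S":      {"peak_tflops": 362.0, "bw_gbs": 864.0},
    "H100 PCIe": {"peak_tflops": 756.0, "bw_gbs": 2000.0},
}

def decode_step_intensity(batch_size: int) -> float:
    """FLOPs per byte for one batched decode step.

    Each step reads every weight once (~2 bytes/param at FP16) and performs
    ~2 FLOPs per parameter per sequence, so intensity grows with batch size.
    """
    flops = 2 * PARAMS * batch_size
    bytes_moved = BYTES_PER_PARAM * PARAMS
    return flops / bytes_moved  # simplifies to ~batch_size at FP16

for name, g in GPUS.items():
    # Ridge point: intensity at which the GPU shifts from memory- to
    # compute-bound (peak FLOP/s divided by peak bytes/s).
    ridge = g["peak_tflops"] * 1e12 / (g["bw_gbs"] * 1e9)
    print(f"{name}: ridge point ~{ridge:.0f} FLOPs/byte")
    for bs in (1, 16, 64, 256):
        ai = decode_step_intensity(bs)
        bound = "memory-bound" if ai < ridge else "compute-bound"
        print(f"  batch {bs:>3}: intensity ~{ai:.0f} FLOPs/byte -> {bound}")
```

Under these assumptions, decode stays memory-bound well past batch 256 on both GPUs, so the expected speedup would track the bandwidth ratio (~2.3x); an observed 1.4x would then point at costs the roofline doesn't model, such as host-side scheduling, kernel launch overhead, or a smaller effective batch than intended. Note also that nvidia-smi's utilization figure only indicates that some kernel was resident on the GPU during the sampling window, not that compute or bandwidth is saturated; a profiler such as Nsight Systems or Nsight Compute (which includes a per-kernel roofline analysis) gives a much more reliable picture of where the time actually goes.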