I'm using TensorRT-LLM to run Whisper Large, a 1.5B-parameter encoder-decoder model. I switched from an L40S to an H100 PCIe, which has more than double the memory bandwidth and more than double the theoretical FP16 TFLOPS, yet performance with the Executor API and in-flight batching (IFB) only increased by around 40%. There must be a bottleneck in compute, memory bandwidth, or somewhere else. My question is: how do I assess that?
I prepare all requests up front and submit them at once, so there is no overhead from data loading or anything other than inference.
GPU utilization is 100% according to nvidia-smi.
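A first-pass way to frame this is a back-of-the-envelope roofline check. The sketch below is a simplification under stated assumptions: the peak dense-FP16 TFLOPS and HBM bandwidth figures are approximate public datasheet values (not from this issue), and the traffic model counts only one full weight read per decode step, ignoring the KV cache, activations, and the encoder. It compares the decode step's arithmetic intensity against each GPU's ridge point to suggest whether the phase is memory- or compute-bound.

```python
# Rough roofline check for the autoregressive decode phase.
# Hardware peaks below are approximate datasheet values (assumptions);
# the traffic model ignores KV-cache reads, activations, and the encoder.

PARAMS = 1.5e9          # Whisper Large parameter count (from the issue)
BYTES_PER_PARAM = 2     # FP16 weights

# Approximate dense-FP16 tensor-core peaks and memory bandwidth (assumed):
GPUS = {
    "L40S":      {"peak_tflops": 362.0, "bw_gbs": 864.0},
    "H100 PCIe": {"peak_tflops": 756.0, "bw_gbs": 2000.0},
}

def decode_step_intensity(batch_size: int) -> float:
    """FLOPs per byte for one batched decode step.

    Each step reads every weight once (~2 bytes/param at FP16) and performs
    ~2 FLOPs per parameter per sequence, so intensity grows with batch size.
    """
    flops = 2 * PARAMS * batch_size
    bytes_moved = BYTES_PER_PARAM * PARAMS
    return flops / bytes_moved  # simplifies to ~batch_size at FP16

for name, g in GPUS.items():
    # Ridge point: intensity at which the GPU shifts from memory- to
    # compute-bound (peak FLOP/s divided by peak bytes/s).
    ridge = g["peak_tflops"] * 1e12 / (g["bw_gbs"] * 1e9)
    print(f"{name}: ridge point ~{ridge:.0f} FLOPs/byte")
    for bs in (1, 16, 64, 256):
        ai = decode_step_intensity(bs)
        bound = "memory-bound" if ai < ridge else "compute-bound"
        print(f"  batch {bs:>3}: intensity ~{ai:.0f} FLOPs/byte -> {bound}")
```

Under these assumptions, decode stays memory-bound well past batch 256 on both GPUs, so the expected speedup would track the bandwidth ratio (~2.3x); an observed 1.4x would then point at costs the roofline doesn't model, such as host-side scheduling, kernel launch overhead, or a smaller effective batch than intended. Note also that nvidia-smi's utilization figure only indicates that some kernel was resident on the GPU during the sampling window, not that compute or bandwidth is saturated; a profiler such as Nsight Systems or Nsight Compute (which includes a per-kernel roofline analysis) gives a much more reliable picture of where the time actually goes.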