How to determine bottleneck factors? #2722

@MahmoudAshraf97

Description

I'm using TensorRT-LLM to run Whisper Large, a 1.5B-parameter encoder-decoder model. I switched from an L40S to an H100 PCIe, which has more than double the memory bandwidth and more than double the theoretical FP16 TFLOPS, but performance with the Executor API and in-flight batching (IFB) only increased by around 40%. There must be a bottleneck in compute, memory bandwidth, or somewhere else; my question is how to assess that.
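One way to frame the question is a back-of-the-envelope roofline check. The sketch below is hypothetical: the TFLOPS/bandwidth figures are approximate published specs (dense FP16, no sparsity), and `decode_intensity` is a crude model of autoregressive decoding that ignores KV-cache and activation traffic.

```python
# Hypothetical roofline check -- spec numbers are approximate published
# values, not measurements from this setup.

SPECS = {
    # name: (peak dense FP16 TFLOPS, memory bandwidth in TB/s), approximate
    "L40S": (362.0, 0.864),
    "H100-PCIe": (756.0, 2.0),
}

def ridge_point(tflops: float, tb_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) where a kernel shifts from
    memory-bound to compute-bound.  TFLOP / TB reduces to FLOP / byte."""
    return tflops / tb_s

def decode_intensity(batch_size: int) -> float:
    """Very rough intensity of batched autoregressive decoding: each
    FP16 weight (2 bytes) is read once per step and contributes ~2 FLOPs
    per sequence in the batch, so AI ~= batch_size FLOP/byte.
    (KV-cache traffic would push this lower.)"""
    return float(batch_size)

def classify(batch_size: int, gpu: str) -> str:
    rp = ridge_point(*SPECS[gpu])
    return "memory-bound" if decode_intensity(batch_size) < rp else "compute-bound"

for gpu in SPECS:
    print(f"{gpu}: ridge point ~{ridge_point(*SPECS[gpu]):.0f} FLOP/byte, "
          f"batch=8 decode is {classify(8, gpu)}")
```

Under these assumptions the decode phase sits far below the ridge point on both GPUs, i.e. it is memory-bound unless the effective batch is in the hundreds. For real numbers rather than estimates, Nsight Compute's "Speed of Light" section reports achieved DRAM and SM throughput per kernel, which answers the compute-vs-bandwidth question directly.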

I prepare all requests and submit them at once, so there is no overhead from data loading or anything other than inference itself.
GPU utilization is 100% according to nvidia-smi.
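A quick sanity check on the observed 40% gain: if the workload were purely bandwidth-bound or purely compute-bound, the hardware specs predict the speedup the L40S-to-H100 move should have delivered. The numbers below are the same approximate specs as above, not measured values.

```python
# Hypothetical sanity check with approximate published specs.
l40s_bw, h100_bw = 0.864, 2.0          # memory bandwidth, TB/s
l40s_fp16, h100_fp16 = 362.0, 756.0    # dense FP16 TFLOPS

bw_speedup = h100_bw / l40s_bw         # expected if memory-bound, ~2.3x
flops_speedup = h100_fp16 / l40s_fp16  # expected if compute-bound, ~2.1x
observed = 1.4

print(f"bandwidth-bound prediction: {bw_speedup:.2f}x")
print(f"compute-bound prediction:   {flops_speedup:.2f}x")
print(f"observed:                   {observed:.2f}x")
# Observed is well below both predictions, which suggests the limiter is
# neither raw compute nor raw bandwidth: e.g. CPU-side scheduling, kernel
# launch overhead, or batch sizes too small to saturate the GPU.
```

Note that nvidia-smi's 100% "GPU utilization" only means at least one kernel was resident during the sampling window; it does not imply the SMs or memory system are saturated, so it cannot rule out these other bottlenecks on its own.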
