[Feature]: Reduce the large variance in the perf regression test

### 🚀 The feature, motivation and pitch

The `test_perf.py` regression test shows large variance when run in CI. It runs on two different device types, H100 NVL and H100 PCIe, and both the measured perf and the clock frequencies differ between them. We need the ability to define a threshold per device type if possible; otherwise, the test should run on only one device type by applying a filter in the CI definitions (DevOps said they use two device kinds because of a machine shortage). See the attached Excel chart, which analyzes the perf variance against the clock frequencies. The correlation implies that the difference in perf comes from two different frequency levels, which in turn come from the two different device types being used.

[test_report_test_perf_metric_token_throughput_llama_v3.1_8b_instruct-bench-_autodeploy-float16-maxbs_512-maxnt_2.csv](https://github.com/user-attachments/files/22922456/test_report_test_perf_metric_token_throughput_llama_v3.1_8b_instruct-bench-_autodeploy-float16-maxbs_512-maxnt_2.csv)
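A minimal sketch of what a per-device-type threshold could look like, assuming the test can obtain the device name as a string (for example `"NVIDIA H100 NVL"`). Everything here is hypothetical and not taken from `test_perf.py`: the names `DEVICE_THRESHOLDS`, `DEFAULT_THRESHOLD`, `normalize_device_name`, `threshold_for`, and `is_regression` are made up for illustration, and the tolerance values are placeholders rather than measured numbers.

```python
from typing import Optional

# Hypothetical relative tolerances per device type.
# The values are placeholders, not derived from the attached data.
DEVICE_THRESHOLDS = {
    "H100 NVL": 0.05,
    "H100 PCIe": 0.10,
}

# Fallback tolerance for device types without an explicit entry (assumption).
DEFAULT_THRESHOLD = 0.10


def normalize_device_name(raw_name: str) -> Optional[str]:
    """Map a raw device name to a key in DEVICE_THRESHOLDS, or None if unknown."""
    upper = raw_name.upper()
    for key in DEVICE_THRESHOLDS:
        # A key matches when each of its tokens appears in the raw name.
        if all(token in upper for token in key.upper().split()):
            return key
    return None


def threshold_for(raw_name: str) -> float:
    """Return the tolerance for this device type, or the default if unknown."""
    key = normalize_device_name(raw_name)
    if key is None:
        return DEFAULT_THRESHOLD
    return DEVICE_THRESHOLDS[key]


def is_regression(measured: float, baseline: float, raw_name: str) -> bool:
    """True when throughput falls below the baseline by more than the tolerance."""
    tolerance = threshold_for(raw_name)
    return measured < baseline * (1.0 - tolerance)
```

Because the two device types appear to run at different frequency levels, a single shared baseline may still be too coarse. The same lookup could also return a per-device baseline if that turns out to be needed.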

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

[Feature]: Solve the Perf regression test large variance #8391
