Closed
Labels
- AutoDeploy: <NV> AutoDeploy Backend
- General perf: <NV> Broad performance issues not specific to a particular component
- feature request: New feature or request. This includes new model, dtype, functionality support
Description
🚀 The feature, motivation and pitch
The `test_perf.py` regression test shows large variance when run in CI. It runs on two different device types, H100 NVL and H100 PCIe, and both the measured perf and the clock frequencies differ between them. We need the ability to define a threshold per device type if possible. Otherwise, CI should use only one device type, by applying a filter in the CI definitions (DevOps said they use two device kinds because of a machine shortage). See the attached Excel chart, which analyzes the perf variance against the clock frequencies. It implies that the perf difference comes from the two device types running at two different frequency levels.
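To make the per-device threshold idea concrete, here is a minimal sketch of how a lookup could work. Everything in it is an assumption for illustration: the names `PERF_TOLERANCE_BY_DEVICE`, `tolerance_for`, and `within_threshold` are hypothetical and not part of `test_perf.py`, the tolerance numbers are placeholders rather than measured values, and the device-name substrings should be checked against what the CI machines actually report.

```python
# Hypothetical relative tolerances per device type. The numbers are
# placeholders for illustration, not values measured in CI.
PERF_TOLERANCE_BY_DEVICE = {
    "H100 NVL": 0.10,
    "H100 PCIE": 0.15,
}

# Fallback used when the device name matches none of the keys above.
DEFAULT_TOLERANCE = 0.20


def tolerance_for(device_name: str) -> float:
    """Return the tolerance for a device, matched by case-insensitive substring."""
    upper_name = device_name.upper()
    for key, tolerance in PERF_TOLERANCE_BY_DEVICE.items():
        if key in upper_name:
            return tolerance
    return DEFAULT_TOLERANCE


def within_threshold(measured: float, baseline: float, device_name: str) -> bool:
    """True if `measured` is no more than the device's tolerance below `baseline`.

    Assumes a higher-is-better metric such as throughput.
    """
    return measured >= baseline * (1.0 - tolerance_for(device_name))
```

In a test, the device name could come from something like `torch.cuda.get_device_name()`, so the same test body would apply a different threshold depending on which machine CI scheduled it on.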
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
Projects
Status
Done