[tritonbench] fix tritonbench noise issue #132
Quick fyi just in case you missed it:
@nWEIdia I get an error: "docker: Error response from daemon: Requested CPUs are not available - requested 10, available: 140-167" (https://github.com/pytorch/pytorch-integration-testing/actions/runs/20970063138/job/60270801988). Is this set of CPU cores fixed? How do I pin to a single CPU core, e.g., can I use
It becomes tricky, because Meta's provisioning scripts have their own way of dividing the CPU cores among the 8 runners (users Alice, Bob, and so on through Henry), and each runner is confined to its own set of cores.
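Since each runner only sees its own cores, a fixed core id such as 10 can fall outside the allowed set. The following is a minimal sketch of checking a requested core against a range string like the `140-167` reported in the error above. The helper names `parse_cpu_list` and `pick_core` are hypothetical and are not part of tritonbench or the provisioning scripts.

```python
from typing import List


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a Linux-style CPU list such as "140-167" or "0-3,8" into sorted CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range "lo-hi" is inclusive on both ends.
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pick_core(requested: int, available_spec: str) -> int:
    """Return `requested` if it is allowed, otherwise the first allowed core."""
    available = parse_cpu_list(available_spec)
    if not available:
        raise ValueError("no CPUs available")
    return requested if requested in available else available[0]


if __name__ == "__main__":
    # Core 10 is not in 140-167, so the sketch falls back to the first allowed core.
    print(pick_core(10, "140-167"))
```

Falling back to the first allowed core is only one possible policy; failing loudly may be preferable when the benchmark expects a specific core.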
In the multi-tenancy setup, the CPUs are sliced so that each user has an equal, non-overlapping share: https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/multi-tenant/playbooks/setup-host.yml#L206. This assumes that all CPU cores are the same. Pasting the snippet here for @nWEIdia's visibility: So that is the reason why, in an 8-user setup,
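As an illustration of equal, non-overlapping slicing, here is a minimal sketch. It is not the playbook's actual logic: the function `slice_cpus` is hypothetical, and the total of 224 logical CPUs is an assumption chosen only because it reproduces the `140-167` range from the error message for one of 8 users.

```python
from typing import List


def slice_cpus(total_cpus: int, num_users: int) -> List[range]:
    """Split CPU ids 0..total_cpus-1 into equal, contiguous, non-overlapping slices.

    Any remainder cores (when total_cpus is not divisible by num_users)
    are left unassigned in this sketch.
    """
    per_user = total_cpus // num_users
    return [range(i * per_user, (i + 1) * per_user) for i in range(num_users)]


if __name__ == "__main__":
    # Assumed host size: 224 logical CPUs shared by 8 users (28 cores each).
    for index, cpus in enumerate(slice_cpus(224, 8)):
        print(f"user {index}: {cpus.start}-{cpus.stop - 1}")
```

Under these assumptions, the sixth user (index 5) would be confined to cores 140 through 167, so a request for core 10 would be rejected.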
huydhn
left a comment
LGTM!
nWEIdia
left a comment
Shipit!
@huydhn it seems I still can't run Docker pinned to 1 CPU core. The error is at https://github.com/pytorch/pytorch-integration-testing/actions/runs/20979471728/job/60301236292: "docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process:" Meanwhile, I will try the non-Docker version at #118.
This is replaced by #118.
We found that the DGX B200 runner's CPU behavior is unstable when the process is allowed to migrate across multiple CPU cores. To mitigate this, we pin the process to a single CPU core.
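For illustration, pinning a process to one core from inside Python could look like the sketch below. It is not necessarily how this PR or #118 implements the pinning: the function `pin_to_single_core` is hypothetical, and `os.sched_getaffinity` / `os.sched_setaffinity` are only available on Linux and some other Unix platforms.

```python
import os
from typing import Optional


def pin_to_single_core(preferred: Optional[int] = None) -> int:
    """Pin the current process to exactly one CPU core and return its id.

    The core is chosen from the set the process is already allowed to use,
    so the call stays valid on runners confined to a subset of the host's cores.
    """
    allowed = sorted(os.sched_getaffinity(0))  # pid 0 means the calling process
    core = preferred if preferred in allowed else allowed[0]
    os.sched_setaffinity(0, {core})
    return core


if __name__ == "__main__":
    core = pin_to_single_core()
    print(f"pinned to core {core}; affinity is now {os.sched_getaffinity(0)}")
```

Child processes started afterwards inherit the affinity mask, so calling this before launching the benchmark keeps the whole run on the chosen core.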
Mitigates #130
Test plan:
https://github.com/pytorch/pytorch-integration-testing/actions/runs/20979471728
Manual validation on the DGX host:
Before:
After: