
Conversation

@xuzhao9 (Contributor) commented Jan 13, 2026

We found that benchmark results on the DGX B200 runner are unstable when the process is allowed to migrate across CPU cores. We pin the process to a single CPU core to mitigate this.

Mitigates #130

Test plan:

https://github.com/pytorch/pytorch-integration-testing/actions/runs/20979471728

Manual validation on the DGX host:

Before:

$ docker run -e CONDA_ENV=triton-main --gpus all --privileged -it ghcr.io/meta-pytorch/tritonbench:latest bash -c '. /workspace/setup_instance.sh; cd /workspace/tritonbench; python run.py --op embedding --only torch_embedding,liger_embedding --bwd'
           (B, T, D, V)    torch_embedding-latency    liger_embedding-latency
-----------------------  -------------------------  -------------------------
   (32, 512, 768, 1024)         0.532576 (±62.89%)         0.501024 (±73.63%)
   (32, 512, 768, 2048)         0.434016 (±74.22%)        0.089152 (±243.36%)
   (32, 512, 768, 4096)        0.204800 (±214.33%)         0.430048 (±78.37%)
   (32, 512, 768, 8192)         0.501600 (±50.62%)         0.496928 (±52.07%)
  (32, 512, 768, 16384)         0.410432 (±70.73%)         0.500512 (±60.92%)
  (32, 512, 768, 32768)         0.522272 (±27.65%)         0.351200 (±54.89%)
  (32, 512, 768, 65536)          0.453696 (±0.92%)          0.222272 (±1.84%)
 (32, 512, 768, 131072)          0.575424 (±0.86%)          0.337888 (±0.95%)
  (8, 2048, 4096, 1024)         0.563136 (±43.71%)          0.355264 (±0.61%)
  (8, 2048, 4096, 2048)         0.522272 (±30.48%)         0.514912 (±36.42%)
  (8, 2048, 4096, 4096)          0.538624 (±0.78%)         0.432160 (±64.47%)
  (8, 2048, 4096, 8192)          0.824416 (±0.62%)          0.502656 (±0.61%)
 (8, 2048, 4096, 16384)          1.240096 (±0.49%)          0.603008 (±0.66%)
 (8, 2048, 4096, 32768)          1.506272 (±0.48%)          0.770944 (±2.40%)
 (8, 2048, 4096, 65536)          1.874880 (±0.43%)          1.076320 (±0.19%)
(8, 2048, 4096, 131072)          2.512864 (±0.32%)          1.680320 (±0.25%)
                average         0.8260860005393624          0.554038003552705

After:

$ docker run --cpuset-cpus 10 -e CONDA_ENV=triton-main --gpus all --privileged -it ghcr.io/meta-pytorch/tritonbench:latest bash -c '. /workspace/setup_instance.sh; cd /workspace/tritonbench; python run.py --op embedding --only torch_embedding,liger_embedding --bwd'


           (B, T, D, V)    torch_embedding-latency    liger_embedding-latency
-----------------------  -------------------------  -------------------------
   (32, 512, 768, 1024)          0.176096 (±1.25%)          0.085056 (±4.82%)
   (32, 512, 768, 2048)          0.186368 (±0.12%)          0.089056 (±2.23%)
   (32, 512, 768, 4096)          0.202784 (±2.08%)          0.094240 (±3.46%)
   (32, 512, 768, 8192)          0.247904 (±1.65%)          0.101344 (±0.22%)
  (32, 512, 768, 16384)          0.325600 (±0.99%)          0.119872 (±1.84%)
  (32, 512, 768, 32768)          0.381088 (±1.04%)          0.160640 (±1.91%)
  (32, 512, 768, 65536)          0.453792 (±0.85%)          0.222240 (±0.98%)
 (32, 512, 768, 131072)          0.575328 (±0.85%)          0.338016 (±1.20%)
  (8, 2048, 4096, 1024)          0.317440 (±0.06%)          0.355168 (±0.85%)
  (8, 2048, 4096, 2048)          0.378976 (±1.32%)          0.383904 (±1.06%)
  (8, 2048, 4096, 4096)          0.536608 (±0.78%)          0.431104 (±0.51%)
  (8, 2048, 4096, 8192)          0.823200 (±0.64%)          0.501984 (±0.60%)
 (8, 2048, 4096, 16384)          1.240224 (±0.51%)          0.603072 (±0.70%)
 (8, 2048, 4096, 32768)          1.505280 (±0.47%)          0.769088 (±0.54%)
 (8, 2048, 4096, 65536)          1.872960 (±0.39%)          1.076320 (±0.28%)
(8, 2048, 4096, 131072)          2.513056 (±0.30%)          1.680384 (±0.37%)
                average         0.7335439994931221        0.43821800500154495

@nWEIdia commented Jan 13, 2026

Quick FYI in case you missed it:

docker run --runtime=nvidia <the rest stays the same>

can be used to completely replace --gpus all --privileged.
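For example, the "Before" command above could be launched like this (a sketch based on that suggestion; whether NVIDIA_VISIBLE_DEVICES is needed depends on the image, so it is included here as an assumption):

# Sketch only: launch the same benchmark via the NVIDIA container runtime
# instead of --gpus all --privileged. NVIDIA_VISIBLE_DEVICES=all is an
# assumption in case the image does not already set it.
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -e CONDA_ENV=triton-main -it \
  ghcr.io/meta-pytorch/tritonbench:latest \
  bash -c '. /workspace/setup_instance.sh; cd /workspace/tritonbench; python run.py --op embedding --only torch_embedding,liger_embedding --bwd'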

@nWEIdia mentioned this pull request Jan 13, 2026
@xuzhao9 (Contributor, Author) commented Jan 13, 2026

@nWEIdia I get an error: "docker: Error response from daemon: Requested CPUs are not available - requested 10, available: 140-167" (https://github.com/pytorch/pytorch-integration-testing/actions/runs/20970063138/job/60270801988). Is this set of CPU cores fixed? How do I pin to a single CPU core, e.g., can I use --cpuset-cpus 140?

@nWEIdia commented Jan 13, 2026

It becomes tricky, as Meta's provisioning scripts have their own way of dividing CPU cores among the 8 runners (users Alice/Bob through Henry), and each runner is confined to its assigned CPU cores.
cc @huydhn for ideas.

@huydhn (Contributor) commented Jan 14, 2026

In the multi-tenancy setup, the CPUs are sliced so that each user gets an equivalent, non-overlapping share: https://github.com/meta-pytorch/pytorch-gha-infra/blob/main/multi-tenant/playbooks/setup-host.yml#L206. This is under the assumption that all CPU cores are the same.

Pasting the snippet here for @nWEIdia's visibility:

[Slice]
AllowedCPUs={{ (cpu_cores.stdout | int // ansible_loop.length | int) * ansible_loop.index0 }}-{{ ((cpu_cores.stdout | int // ansible_loop.length | int) * ansible_loop.index) - 1 }}
MemoryMax={{ memory_per_user }}
TasksMax=10000
DevicePolicy=closed

That is the reason why, in an 8-user setup, AllowedCPUs is different for each user. Let me see if there is a bash command to find out which CPU cores are allowed. In the above example, 140-167 is the set of CPU cores assigned to the runner (user) that picked up the job, so we can pin to any CPU in that list.
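As a concrete illustration of the arithmetic in that template (a sketch that assumes a 224-core host divided among 8 users; the real values come from cpu_cores.stdout and ansible_loop above):

# Sketch: reproduce the AllowedCPUs ranges from the template, assuming
# 224 CPU cores and 8 users. These numbers are illustrative assumptions.
CPU_CORES=224
NUM_USERS=8
PER_USER=$((CPU_CORES / NUM_USERS))              # 28 cores per user
for i in $(seq 0 $((NUM_USERS - 1))); do
  START=$((PER_USER * i))
  END=$((PER_USER * (i + 1) - 1))
  echo "user $i: AllowedCPUs=${START}-${END}"    # user 5 gets 140-167
done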

@xuzhao9 (Contributor, Author) commented Jan 14, 2026

@huydhn @nWEIdia I verified that the allowed core list can be extracted with the command taskset -pc $$.
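A sketch of how a job could act on that (the variable names are illustrative; taskset prints something like "pid 1234's current affinity list: 140-167"):

# Sketch: read this shell's CPU affinity list and pick the first allowed core.
# ALLOWED and FIRST_CORE are illustrative names, not from the PR.
ALLOWED=$(taskset -pc $$ | awk '{print $NF}')                  # e.g. "140-167"
FIRST_CORE=$(echo "$ALLOWED" | cut -d',' -f1 | cut -d'-' -f1)  # e.g. "140"
echo "Pinning benchmark to CPU core ${FIRST_CORE}"
# The core could then be passed to docker run via --cpuset-cpus "${FIRST_CORE}",
# or used with taskset -c "${FIRST_CORE}" for a non-Docker run.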

@huydhn (Contributor) left a comment

LGTM!

@nWEIdia left a comment

Shipit!

@xuzhao9 (Contributor, Author) commented Jan 14, 2026

@huydhn it seems I still cannot run docker pinned to a single CPU core. The error is at https://github.com/pytorch/pytorch-integration-testing/actions/runs/20979471728/job/60301236292: "docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process:"

Meanwhile, I will try the non-Docker version in #118.

@xuzhao9 (Contributor, Author) commented Jan 14, 2026

This is replaced by #118

@xuzhao9 closed this Jan 14, 2026
@xuzhao9 deleted the xz9/tritonbench-fix branch January 15, 2026