NVSHMEM4Py Integration #33

Open: Willy-Chan wants to merge 2 commits into master from nvshmem4py_bindings_integration
Willy-Chan

This PR replaces the existing custom Python bindings with NVSHMEM4Py, the official Python language binding for NVSHMEM.

The scope of changes includes:

  • Migrate host-side initialization from static linking (libnvshmem.a) to dynamic linking (libnvshmem_host.so)
  • Remove the existing bindings and replace them with the corresponding NVSHMEM4Py core APIs
  • Add a helper function for NVSHMEM initialization
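
For readers unfamiliar with NVSHMEM4Py, the following is a minimal sketch of what an initialization helper of this kind might look like. It is not the helper added in this PR: the function name `init_nvshmem_via_torch` is hypothetical, and the sketch assumes a `torch.distributed` process group is already initialized and that the `nvshmem.core` calls (`get_unique_id`, `init` with `initializer_method="uid"`) behave as described in the NVSHMEM4Py documentation. Check the signatures against the installed version before relying on them.

```python
def init_nvshmem_via_torch(local_rank: int) -> None:
    """Hypothetical helper: initialize NVSHMEM on this rank via a broadcast UID.

    Assumes torch.distributed.init_process_group() has already been called.
    """
    # Third-party imports are deferred so that this module can be imported
    # on machines without the GPU stack installed.
    import torch
    import torch.distributed as dist
    import nvshmem.core as nvshmem
    from cuda.core.experimental import Device

    # Bind this process to its local GPU for both PyTorch and cuda.core.
    torch.cuda.set_device(local_rank)
    dev = Device(local_rank)
    dev.set_current()

    rank = dist.get_rank()
    nranks = dist.get_world_size()

    # Rank 0 creates the unique ID; every other rank receives it via broadcast.
    payload = [nvshmem.get_unique_id() if rank == 0 else None]
    dist.broadcast_object_list(payload, src=0)

    # Assumed NVSHMEM4Py call: UID-based bootstrap of the NVSHMEM runtime.
    nvshmem.init(
        device=dev,
        uid=payload[0],
        rank=rank,
        nranks=nranks,
        initializer_method="uid",
    )
```

Each rank would call this once after setting up the process group, and would be expected to call `nvshmem.finalize()` before exiting.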

Willy-Chan force-pushed the nvshmem4py_bindings_integration branch from 428a067 to d258021 on August 7, 2025 00:04
Changes include dynamic linking with host-side initialization, deletion
of existing bindings, addition of nvshmem4py, and addition of helper
functions.
Willy-Chan force-pushed the nvshmem4py_bindings_integration branch from 0477c4c to c13a7fe on August 7, 2025 00:26

Willy-Chan commented Aug 7, 2025

Confirmed functionality on the following system configurations:

  1. DGX-H100 with InfiniBand
  2. DGX-H100 with AWS-EFA
  3. B200 with MNNVL

The table below shows the benchmark performance difference for dispatch and combine with NVSHMEM4Py, using MoEConfig(128, 8, 7168, 128):

| System | Experts | E/Tok | Hidden Dim | Tokens | Dispatch Latency Perf Diff (lower is better) | Combine Latency Perf Diff (lower is better) |
|---|---|---|---|---|---|---|
| H100 | 128 | 8 | 7168 | 128 | 2.15% | 2.96% |
| H100 (IB, 2 Node) | 128 | 8 | 7168 | 128 | 0.84% | -2.94% |
| H100 (IB, 4 Node) | 128 | 8 | 7168 | 128 | 0.59% | 1.39% |
| H100 (IB, 8 Node) | 128 | 8 | 7168 | 128 | 1.74% | -1.11% |
| H100 (EFA, 2 Node) | 128 | 8 | 7168 | 128 | -2.24% | 1.27% |
| H100 (EFA, 4 Node) | 128 | 8 | 7168 | 128 | 2.27% | -2.11% |
| H100 (EFA, 8 Node) | 128 | 8 | 7168 | 128 | -1.41% | -3.67% |
| B200 | 128 | 8 | 7168 | 128 | 1.08% | 0.60% |
| B200 (NVLink, 2 Node) | 128 | 8 | 7168 | 128 | -2.23% | 0.81% |
| B200 (NVLink, 4 Node) | 128 | 8 | 7168 | 128 | 1.94% | -3.09% |
| B200 (NVLink, 8 Node) | 128 | 8 | 7168 | 128 | -5.60% | -2.90% |

The percentages are computed from kernel latencies measured with the provided benchmark; a negative percentage indicates that NVSHMEM4Py is faster.
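
The exact formula is not stated here. One reading consistent with the sign convention is the relative change in latency against the existing bindings, sketched below. The function name, the choice of baseline, and the sample latencies are assumptions for illustration, not measurements from this PR.

```python
def perf_diff_percent(baseline_us: float, candidate_us: float) -> float:
    """Relative latency change of the candidate versus the baseline, in percent.

    Negative means the candidate (assumed here to be NVSHMEM4Py) is faster.
    """
    if baseline_us <= 0:
        raise ValueError("baseline latency must be positive")
    return (candidate_us - baseline_us) * 100.0 / baseline_us


# Made-up latencies in microseconds, purely to show the sign convention.
faster = perf_diff_percent(baseline_us=200.0, candidate_us=190.0)  # candidate is faster
slower = perf_diff_percent(baseline_us=200.0, candidate_us=204.0)  # candidate is slower
print(f"{faster:+.2f}% {slower:+.2f}%")
```

Under this reading, most entries in the table are within a few percent of zero in either direction.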
