Add multi-node scaling for match template program #94

Merged
jdickerson95 merged 11 commits into Lucaslab-Berkeley:development_v1.1 from mgiammar:mdg_multinode_scaling on Sep 20, 2025

@mgiammar (Member) commented Sep 10, 2025

Overview

Adds support for running the match template program across multi-node, multi-GPU systems using the torch.distributed package. A new file, run_distributed_match_template.py, is needed to set up the distributed environment and to keep the code paths simple for the more common single-node, multi-GPU runs.
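
For readers unfamiliar with torch.distributed, here is a minimal sketch of the kind of setup such a launcher script typically performs. It is not the code in this PR: the function names (`read_dist_env`, `shard_bounds`, `init_distributed`) are hypothetical, and splitting work into contiguous per-rank slices is an assumption made for illustration. The environment variables are the ones torchrun exports for each process.

```python
import os


def read_dist_env(environ=os.environ):
    """Read the rank variables that torchrun exports for each process.

    Falls back to single-process defaults when the variables are absent.
    """
    rank = int(environ.get("RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    local_rank = int(environ.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank


def shard_bounds(n_items, rank, world_size):
    """Return the [start, stop) slice of n_items owned by this rank.

    Hypothetical work split: the first (n_items % world_size) ranks
    each receive one extra item so the load differs by at most one.
    """
    base, extra = divmod(n_items, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def init_distributed():
    """Join the default process group and pin this process to its GPU.

    torch is imported lazily so the helpers above work without it.
    """
    import torch
    import torch.distributed as dist

    rank, world_size, local_rank = read_dist_env()
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # With no init_method given, PyTorch defaults to env:// and reads
    # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.
    dist.init_process_group(backend=backend)
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```

Under this sketch, each process would call `init_distributed()` once at startup and then use `shard_bounds` to pick its share of the search. The actual partitioning used by the match template program may differ.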

Todo before merging

  • Check that program executes properly on a distributed environment
  • Remove debugging print statements
  • Add click options to the Python script
  • Include basic docs on distributed match template

@mgiammar (Member, Author)

Closes #89

@mgiammar (Member, Author)

@jdickerson95 I'm using development_v1.1 as a quasi-staging branch for the new backend changes, but I think this PR is adding enough that it warrants a brief review to ensure what I'm proposing to add makes sense.

@mgiammar self-assigned this on Sep 11, 2025
@mgiammar added the enhancement (New feature or request) label on Sep 11, 2025
@mgiammar added this to the v1.1 release milestone on Sep 11, 2025
@mgiammar linked an issue on Sep 11, 2025 that may be closed by this pull request
@jdickerson95 (Contributor) left a comment

Just remove the unnecessary .cpu() calls (unless they are in fact needed) and then I'm happy for it to be merged.

@jdickerson95 merged commit d5bcf92 into Lucaslab-Berkeley:development_v1.1 on Sep 20, 2025
8 checks passed

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Support for native multi-node execution

2 participants