Removes MPI dependency from MNNVL AllReduce #1379
Conversation
Summary of Changes
Hello @pranavm-nvidia, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've refactored the MNNVL AllReduce kernel to remove its dependency on MPI, opting instead for `torch.distributed`. This change streamlines the communication layer, making it more consistent with PyTorch's ecosystem and potentially simplifying multi-node setups. The update involves modifying core communication logic, adjusting JIT environment paths, and adapting unit tests to the new distributed framework.
Highlights
- Communication Backend Migration: The core change migrates the communication backend for MNNVL AllReduce from `mpi4py` to `torch.distributed`, aligning it with PyTorch's native distributed capabilities.
- `MpiComm` Class Refactoring: The `MpiComm` class has been refactored and its usage updated to integrate seamlessly with `torch.distributed`'s process groups, specifically by passing a `group` object to relevant constructors and functions.
- Collective Operations Update: All collective communication operations, such as all-gather and broadcast, have been transitioned from `mpi4py`'s methods to `torch.distributed`'s `all_gather_object` and `broadcast_object_list`.
- JIT Environment Path Corrections: The JIT environment paths for various dependencies like CUTLASS and SPDLOG have been corrected to point to the `3rdparty` directory, resolving previous non-existent directory issues.
- Unit Test Adaptation: The multi-node all-reduce unit tests have been updated to initialize and utilize `torch.distributed` for communication and synchronization, including specific handling for SLURM environments.
```python
# TODO (pranavm): Check if this is right?
# Why were these pointing to non-existent directories? Must be a copy missing somewhere?
_package_root = pathlib.Path(__file__).resolve().parents[2]
FLASHINFER_DATA = _package_root
```
cc: @yzh119
```python
# FLASHINFER_SRC_DIR = _package_root / "data" / "src"
FLASHINFER_TVM_BINDING_DIR = _package_root / "data" / "tvm_binding"
FLASHINFER_AOT_DIR = _package_root / "data" / "aot"
# TODO (pranavm): Check if this is right?
```
Hi @pranavm-nvidia, we shouldn't change the logic here.
When you install flashinfer (using `pip install`), the `data` folder will be created:
Lines 64 to 86 in 40471bf

```toml
[tool.setuptools.package-dir]
"flashinfer.data" = "."
"flashinfer.data.aot" = "build/aot-ops-package-dir"
"flashinfer.data.cutlass" = "3rdparty/cutlass"
"flashinfer.data.spdlog" = "3rdparty/spdlog"

[tool.setuptools.package-data]
"flashinfer.data" = [
    "csrc/**",
    "include/**",
    "tvm_binding/**",
    "version.txt"
]
"flashinfer.data.aot" = [
    "**"
]
"flashinfer.data.cutlass" = [
    "include/**",
    "tools/util/include/**"
]
"flashinfer.data.spdlog" = [
    "include/**",
]
```
All of the original paths, such as `_package_root / "include"` or `_package_root / "3rdparty"`, will be gone when you build the packages.
If you develop locally in editable mode, you will not see the difference because your package root is the repository path, but if you install the flashinfer package from PyPI in a new environment, you will find there is no `include` or `3rdparty` path in the flashinfer package directory; that's why there is a `data` path designed to handle these source-code dependencies.
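The two layouts the reviewer describes (repo checkout with `3rdparty/` vs. installed wheel with `data/`) can be reconciled with a first-existing-candidate lookup. `resolve_dep_dir` below is a hypothetical helper written to illustrate the idea, not flashinfer's actual code, and the demonstration uses a throwaway temp tree standing in for an installed package:

```python
import pathlib
import tempfile


def resolve_dep_dir(candidates):
    """Return the first existing directory among candidate layouts.

    An installed wheel ships headers under <pkg>/data/... (per the
    [tool.setuptools.package-dir] mapping quoted above), while a source
    checkout keeps them under the repo's 3rdparty/ tree.
    """
    for cand in candidates:
        if cand.is_dir():
            return cand
    raise FileNotFoundError(f"none of {candidates} exists")


# Demonstration: fabricate only the installed-package layout.
root = pathlib.Path(tempfile.mkdtemp())
installed = root / "data" / "cutlass" / "include"
installed.mkdir(parents=True)
repo = root / "3rdparty" / "cutlass" / "include"  # deliberately not created

chosen = resolve_dep_dir([installed, repo])
print(chosen == installed)  # → True: the layout that exists wins
```

A fallback like this lets the same module work in editable mode and from a wheel, which is one way to avoid hard-coding either layout.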
Thanks, I had a feeling it was a setup issue on my end. What's the correct way to test local changes then? Do I always need to reinstall the package when I make a change? So far I was just setting `PYTHONPATH`.
What's the correct way to test local changes then? Do I always need to reinstall the package when I make a change?
You can make an editable installation:

```shell
pip install -e . -v
```

and you don't have to reinstall the package for source code changes (because it's editable).
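A quick way to confirm which copy of a package Python will actually import (your editable checkout vs. a copy in site-packages) is to resolve its module spec. `package_location` is a hypothetical helper for illustration, not part of flashinfer; the example probes the stdlib `json` module so it runs anywhere:

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# After `pip install -e .`, package_location("flashinfer") should point into
# your repository checkout rather than into site-packages.
print(package_location("json"))  # stdlib example: a path ending in json/__init__.py
```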
📌 Description

This change removes the MPI dependency from the MNNVL AllReduce kernel, using `torch.distributed` instead.

🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks

- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes
There are two hacks here I want to call out: