Add FlightRecorder support for ProcessGroupXCCL #1867
Conversation
Pull Request Overview
Enhance ProcessGroupXCCL with FlightRecorder support to enable on-demand debug trace dumps via named pipes and to record XCCL events.
- Added DumpPipe and HeartbeatMonitor to watch for dump signals and write traces (a rough sketch of this mechanism follows below).
- Integrated FlightRecorder calls into collective and point-to-point workflows.
- Introduced an `Options` struct for configuring group metadata and initialized fmt in CMake.
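For context, here is a minimal, self-contained sketch of how a dump-pipe watcher of this kind typically works: a rank-local named pipe is polled with a non-blocking read, and any write to the pipe signals that the FlightRecorder trace should be dumped. The class shape, pipe-path scheme, and polling details are illustrative assumptions modeled on the NCCL backend's pattern, not the exact torch-xpu-ops code.

```cpp
// Illustrative sketch only; the actual DumpPipe in this PR may differ.
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <string>

class DumpPipeSketch {
 public:
  // basePath is assumed to come from an environment variable or Options;
  // each rank gets its own FIFO so a dump can be requested per process.
  DumpPipeSketch(const std::string& basePath, int rank) {
    if (basePath.empty()) {
      return; // feature disabled when no pipe path is configured
    }
    filename_ = basePath + "_" + std::to_string(rank) + ".pipe";
    unlink(filename_.c_str());        // drop a stale pipe from a previous run
    mkfifo(filename_.c_str(), 0666);  // create the named pipe
    fd_ = open(filename_.c_str(), O_RDONLY | O_NONBLOCK);
  }

  // Polled by a monitor thread: returns true if any bytes were written to
  // the pipe since the last check, i.e. a trace dump was requested.
  bool shouldDump() {
    if (fd_ < 0) {
      return false;
    }
    char buf[128];
    return read(fd_, buf, sizeof(buf)) > 0;
  }

  ~DumpPipeSketch() {
    if (fd_ >= 0) {
      close(fd_);
      unlink(filename_.c_str());
    }
  }

 private:
  std::string filename_;
  int fd_ = -1;
};
```

A heartbeat/monitor thread can then call `shouldDump()` periodically and trigger the debug-info dump when it returns true.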
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/xccl/ProcessGroupXCCL.hpp | Added DumpPipe, HeartbeatMonitor, trace fields, and new API methods |
| src/xccl/ProcessGroupXCCL.cpp | Implemented heartbeat thread, dumpDebuggingInfo, and event recording |
| src/xccl/FlightRecorderXCCL.cpp | Specialized FlightRecorder for XPUEvent |
| src/BuildOnLinux.cmake | Linked fmt::fmt-header-only for XCCL builds |
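To make the record/retire/dump flow in ProcessGroupXCCL.cpp concrete, below is a simplified, self-contained illustration of flight-recorder style event recording. The real FlightRecorder is templated on the event type (XPUEvent here) and stores much richer per-entry state (sequence ids, tensor shapes, start/end events, timing, status); the names and fields below are only for illustration.

```cpp
// Simplified, self-contained sketch of a bounded trace buffer; not the
// actual FlightRecorder API.
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

struct TraceEntry {
  uint64_t id;
  std::string opName;  // e.g. "allreduce", "send", "recv"
  bool completed = false;
};

class SimpleRecorder {
 public:
  explicit SimpleRecorder(size_t maxEntries) : maxEntries_(maxEntries) {}

  // Called when a collective or point-to-point op is enqueued; the returned
  // id is kept on the work object and retired when the op completes.
  std::optional<uint64_t> record(std::string opName) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (entries_.size() == maxEntries_) {
      entries_.pop_front();  // bounded ring buffer: drop the oldest entry
    }
    entries_.push_back({nextId_, std::move(opName), false});
    return nextId_++;
  }

  // Called from work completion to mark the entry as finished.
  void retire(uint64_t id) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto& e : entries_) {
      if (e.id == id) {
        e.completed = true;
        return;
      }
    }
  }

  // Called when the dump pipe or heartbeat monitor requests a trace dump.
  std::vector<TraceEntry> dump() {
    std::lock_guard<std::mutex> lock(mutex_);
    return {entries_.begin(), entries_.end()};
  }

 private:
  size_t maxEntries_;
  uint64_t nextId_ = 0;
  std::deque<TraceEntry> entries_;
  std::mutex mutex_;
};
```

Each work object stores the id returned by record() and retires it on completion; the dump path serializes the buffer when a dump is requested.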
Comments suppressed due to low confidence (2)
src/xccl/ProcessGroupXCCL.hpp:475
- The `globalRank()` method is declared but no implementation is provided, leading to a linker error. Please implement it or remove the declaration.
  `const int& globalRank() const;`
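One possible way to resolve this, modeled on how the NCCL backend exposes its global rank (an assumption about the intended behavior here), is to define the method in ProcessGroupXCCL.cpp:

```cpp
// Hypothetical fix: return the rank captured at construction time.
const int& ProcessGroupXCCL::globalRank() const {
  static int globalRank = rank_;
  return globalRank;
}
```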
src/xccl/ProcessGroupXCCL.hpp:163
std::optional
is used here but<optional>
is not included. Add#include <optional>
to ensure the header compiles independently.
std::optional<uint64_t> trace_id_;
@frost-intel Could you please add a unit test to cover this new feature?
@frost-intel Please help fix this build error:

  435 | const c10::intrusive_ptr<Backend::Options> options_;
      | ^~~~~~~~
/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.hpp:420:12: warning: ‘uint64_t c10d::ProcessGroupXCCL::xcclCommCounter_’ [-Wreorder]
  420 | uint64_t xcclCommCounter_{0};
      | ^~~~~~~~~~~~~~~~
/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp:346:1: warning: when initialized here [-Wreorder]
  346 | ProcessGroupXCCL::ProcessGroupXCCL(
      | ^~~~~~~~~~~~~~~~
/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp: In member function ‘const std::vector<long unsigned int>& c10d::ProcessGroupXCCL::groupRanks() const’:
/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp:411:17: error: ‘struct c10d::Backend::Options’ has no member named ‘global_ranks_in_group’
  411 | if (options_->global_ranks_in_group.empty() && local_id_ == 0) {
      |               ^~~~~~~~~~~~~~~~~~~~~
/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp:416:20: error: ‘struct c10d::Backend::Options’ has no member named ‘global_ranks_in_group’
  416 | return options_->global_ranks_in_group;
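The errors show that groupRanks() reads global_ranks_in_group through the base Backend::Options, which does not define that field. A hedged sketch of the kind of derived Options struct that would resolve this, modeled on ProcessGroupNCCL::Options (the actual fields and constructor in this PR may differ):

```cpp
// Sketch only: field names and the constructor are assumptions based on the
// NCCL backend, not necessarily what this PR implements.
struct Options : Backend::Options {
  explicit Options() : Backend::Options("xccl") {}

  // Global ranks belonging to this (sub)group; groupRanks() can return this
  // directly, and the FlightRecorder records it for trace correlation.
  std::vector<uint64_t> global_ranks_in_group;
  std::string group_name;
};

// ProcessGroupXCCL would then hold a c10::intrusive_ptr<Options> (rather than
// the base Backend::Options) so options_->global_ranks_in_group resolves.
```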
This reverts commit 68a889b.
Thanks. Let's land this PR to avoid blocking the upstream PR in PyTorch.
cc @frost-intel
This PR provides initial support for FlightRecorder, which enables debug trace dumps for distributed jobs.
Features added:
- A DumpPipe and HeartbeatMonitor that watch for dump signals and write debug traces on demand
- FlightRecorder event recording integrated into the collective and point-to-point workflows
- An Options struct for configuring group metadata

Compared to NCCL, some FlightRecorder features are not yet supported; these could be added in a later PR.