Skip to content

Add FlightRecorder support for ProcessGroupXCCL #1867

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Aug 15, 2025
Merged

Conversation

frost-intel
Copy link
Contributor

This PR provides initial support for FlightRecorder, which allows debug trace dumps for distributed jobs.

Features added:

  1. Heartbeat Monitor thread which regularly checks if a dump signal has been received via pipe files, and on trigger writes traces to file (for each rank seperately)
  2. Logic to record XCCL work events

Compared to NCCL, we don't have some features. These could be added in a later PR:

  1. No support for event timing/duration
  2. No support for Watchdog thread which allows for debug dump on error or timeout, with other additional features (remote error detection, etc)

@Copilot Copilot AI review requested due to automatic review settings July 17, 2025 15:08
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Enhance ProcessGroupXCCL with FlightRecorder support to enable on-demand debug trace dumps via named pipes and record XCCL events.

  • Added DumpPipe and HeartbeatMonitor to watch for dump signals and write traces.
  • Integrated FlightRecorder calls into collective and point-to-point workflows.
  • Introduced Options struct for configuring group metadata and initialized fmt in CMake.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
src/xccl/ProcessGroupXCCL.hpp Added DumpPipe, HeartbeatMonitor, trace fields, and new API methods
src/xccl/ProcessGroupXCCL.cpp Implemented heartbeat thread, dumpDebuggingInfo, and event recording
src/xccl/FlightRecorderXCCL.cpp Specialized FlightRecorder for XPUEvent
src/BuildOnLinux.cmake Linked fmt::fmt-header-only for XCCL builds
Comments suppressed due to low confidence (2)

src/xccl/ProcessGroupXCCL.hpp:475

  • The globalRank() method is declared but no implementation is provided, leading to a linker error. Please implement it or remove the declaration.
  const int& globalRank() const;

src/xccl/ProcessGroupXCCL.hpp:163

  • std::optional is used here but <optional> is not included. Add #include <optional> to ensure the header compiles independently.
    std::optional<uint64_t> trace_id_;

@zhangxiaoli73
Copy link

@frost-intel Could you please add a UT to cover this new feature?

@guangyey
Copy link
Contributor

@frost-intel please help fix build error :

48Z   435 |   const c10::intrusive_ptr<Backend::Options> options_;
2025-08-08T23:01:22.8721066Z       |                                              ^~~~~~~~
2025-08-08T23:01:22.8722653Z /home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.hpp:420:12: warning:   ‘uint64_t c10d::ProcessGroupXCCL::xcclCommCounter_’ [-Wreorder]
2025-08-08T23:01:22.8723989Z   420 |   uint64_t xcclCommCounter_{0};
2025-08-08T23:01:22.8724730Z       |            ^~~~~~~~~~~~~~~~
2025-08-08T23:01:22.8725961Z /home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp:346:1: warning:   when initialized here [-Wreorder]
2025-08-08T23:01:22.8727211Z   346 | ProcessGroupXCCL::ProcessGroupXCCL(
2025-08-08T23:01:22.8727954Z       | ^~~~~~~~~~~~~~~~
2025-08-08T23:01:22.8729580Z /home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp: In member function ‘const std::vector<long unsigned int>& c10d::ProcessGroupXCCL::groupRanks() const’:
2025-08-08T23:01:22.8731797Z /home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp:411:17: error: ‘struct c10d::Backend::Options’ has no member named ‘global_ranks_in_group’
2025-08-08T23:01:22.8733238Z   411 |   if (options_->global_ranks_in_group.empty() && local_id_ == 0) {
2025-08-08T23:01:22.8734170Z       |                 ^~~~~~~~~~~~~~~~~~~~~
2025-08-08T23:01:22.8735744Z /home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/src/xccl/ProcessGroupXCCL.cpp:416:20: error: ‘struct c10d::Backend::Options’ has no member named ‘global_ranks_in_group’
2025-08-08T23:01:22.8737110Z   416 |   return options_->global_ranks_in_group;

Copy link
Contributor

@guangyey guangyey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@guangyey guangyey requested a review from zhangxiaoli73 August 14, 2025 03:33
@guangyey
Copy link
Contributor

Let's land this PR to prevent blocking upstream PR in PyTorch.

@guangyey guangyey enabled auto-merge August 15, 2025 03:26
@guangyey guangyey added this pull request to the merge queue Aug 15, 2025
Merged via the queue into main with commit 77cc792 Aug 15, 2025
81 of 84 checks passed
@guangyey guangyey deleted the frost/flight_recorder branch August 15, 2025 03:43
@guangyey
Copy link
Contributor

cc @frost-intel
This PR will be included in pytorch/pytorch#160062

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants