
Conversation

frost-intel (Contributor)

As a follow-up to #1867, this PR adds tests for the FlightRecorder on XCCL and moves some definitions from ProcessGroupXCCL::Options to Backend::Options.

These tests are largely based on pytorch/test/distributed/test_c10d_nccl.py, but omit the following tests:

  • test_short_json, since JSON dumps are not supported in ProcessGroupXCCL
  • test_trace_while_all_works_retired, since _wait_for_pending_works isn't supported by XCCL
  • test_trace_while_active, since XCCL hangs when an op is called on only one rank
  • test_trace_while_stuck, since XCCL hangs when an op is called on only one rank
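For reference, a minimal sketch of the test pattern these are based on, using the NCCL dump entry point from test_c10d_nccl.py (the XCCL equivalent isn't spelled out in this thread, and entry field names may vary across PyTorch versions):

```python
# Sketch only: mirrors the flight-recorder test pattern in
# pytorch/test/distributed/test_c10d_nccl.py; the XCCL suite is assumed
# to follow the same shape with an XCCL-specific dump entry point.
import pickle

import torch

# Flight recording must be enabled before the process group is created,
# e.g. os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "1000"


def dump_trace_entries():
    """Deserialize the flight-recorder ring buffer (NCCL entry point shown)."""
    trace = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
    return trace["entries"]


def summarize(entries):
    """Print the collective name, sequence id and state recorded per entry."""
    for e in entries:
        print(e["profiling_name"], e["collective_seq_id"], e["state"])
```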

frost-intel marked this pull request as ready for review August 27, 2025 20:36
Copilot AI review requested due to automatic review settings August 28, 2025 12:32
Copilot AI left a comment

Pull Request Overview

This PR adds FlightRecorder tests for XCCL (Intel XPU Collective Communications Library) as a follow-up to #1867. The tests validate flight recording functionality for distributed operations on Intel XPU devices, including trace dumping, timing, and various collective operations.

  • Adds comprehensive test suite for XCCL FlightRecorder functionality based on NCCL tests
  • Moves global_ranks_in_group and group_name from ProcessGroupXCCL::Options to Backend::Options
  • Adds conditional recording parameter to initWork method to control when flight recording occurs

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

  • test/xpu/distributed/test_c10d_xccl.py: adds XCCLTraceTestBase and XCCLTraceTest classes with comprehensive flight recorder tests
  • src/xccl/ProcessGroupXCCL.hpp: removes group-specific options and adds a record parameter to the initWork method
  • src/xccl/ProcessGroupXCCL.cpp: implements conditional flight recording and fixes the sequence counting logic


self._verify_trace(
    t,
    include_collectives=include_collectives,
    is_json=True,
Copilot AI Aug 28, 2025

The is_json parameter is set to True but the test is using pickle format, not JSON. This should be False since pickle.loads() is used to deserialize the trace data.

Suggested change
is_json=True,
is_json=False,
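For context, a minimal sketch of the two parse paths the is_json flag selects between, shown with the NCCL dump entry points these tests are modeled on (the XCCL names may differ; the JSON entry point is the one ProcessGroupXCCL does not provide, per the PR description):

```python
import json
import pickle

import torch


def load_trace(is_json: bool):
    # Sketch only: entry points are the NCCL ones from test_c10d_nccl.py;
    # the XCCL suite is assumed to expose an equivalent pickle-based dump.
    if is_json:
        # JSON dumps come from a dedicated entry point, which ProcessGroupXCCL
        # does not support (hence test_short_json is skipped in this PR).
        return json.loads(torch._C._distributed_c10d._dump_nccl_trace_json())
    # The default dump is a pickled dict, so it must be parsed with pickle,
    # which is why is_json should be False in the flagged test.
    return pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
```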


Comment on lines +1100 to +1107
output_tensor = torch.zeros(sum_len, 2).to(self.rank)
expected_tensor = torch.ones(sum_len, 2).to(self.rank)
input_tensor = torch.ones(output_split_sizes[self.rank], 2).to(self.rank)

dist.all_gather(
    list(torch.split(output_tensor, output_split_sizes)), input_tensor
)
torch.xpu.synchronize(device=self.rank)
Copilot AI Aug 28, 2025

Using self.rank as device argument to .to() is incorrect. It should use self.local_device to properly specify the XPU device, similar to other tests in this file.

Suggested change
-output_tensor = torch.zeros(sum_len, 2).to(self.rank)
-expected_tensor = torch.ones(sum_len, 2).to(self.rank)
-input_tensor = torch.ones(output_split_sizes[self.rank], 2).to(self.rank)
-dist.all_gather(
-    list(torch.split(output_tensor, output_split_sizes)), input_tensor
-)
-torch.xpu.synchronize(device=self.rank)
+output_tensor = torch.zeros(sum_len, 2).to(self.local_device)
+expected_tensor = torch.ones(sum_len, 2).to(self.local_device)
+input_tensor = torch.ones(output_split_sizes[self.rank], 2).to(self.local_device)
+dist.all_gather(
+    list(torch.split(output_tensor, output_split_sizes)), input_tensor
+)
+torch.xpu.synchronize(device=self.local_device)
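A hypothetical illustration of the distinction behind this suggestion: the global rank is not, in general, a valid local device index, so tensors should be placed on an explicit local device. The helper name and mapping below are illustrative only, not taken from the PR:

```python
import torch


def local_device_for(rank: int, devices_per_node: int) -> torch.device:
    # Illustrative mapping only; the test base class is assumed to compute
    # self.local_device along similar lines.
    return torch.device("xpu", rank % devices_per_node)


# Global rank 3 on a node with 2 XPUs maps to xpu:1; passing the raw rank (3)
# as a device index would point past the devices that actually exist.
assert local_device_for(3, 2) == torch.device("xpu", 1)
```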


dist.all_gather(
    list(torch.split(output_tensor, output_split_sizes)), input_tensor
)
torch.xpu.synchronize(device=self.rank)
Copilot AI Aug 28, 2025

Using self.rank as device argument is incorrect. It should use self.local_device to properly specify the XPU device for synchronization.

Suggested change
torch.xpu.synchronize(device=self.rank)
torch.xpu.synchronize(device=self.local_device)


Comment on lines +1147 to +1148
output_tensors = torch.zeros(2, 2).to(self.rank)
input_tensors = [torch.ones(2, 2).to(self.rank) for _ in range(self.world_size)]
Copilot AI Aug 28, 2025

Using self.rank as device argument to .to() is incorrect. It should use self.local_device to properly specify the XPU device.

Suggested change
output_tensors = torch.zeros(2, 2).to(self.rank)
input_tensors = [torch.ones(2, 2).to(self.rank) for _ in range(self.world_size)]
output_tensors = torch.zeros(2, 2).to(self.local_device)
input_tensors = [torch.ones(2, 2).to(self.local_device) for _ in range(self.world_size)]


dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
self.assertEqual(output_tensors, input_tensors[self.rank] * self.world_size)

torch.xpu.synchronize(device=self.rank)
Copilot AI Aug 28, 2025

Using self.rank as device argument is incorrect. It should use self.local_device to properly specify the XPU device for synchronization.

Suggested change
torch.xpu.synchronize(device=self.rank)
torch.xpu.synchronize(device=self.local_device)


@@ -171,7 +168,8 @@ class TORCH_API ProcessGroupXCCL : public Backend {
       bool isP2P,
       const char* profilingTitle = nullptr,
       const std::vector<at::Tensor>& inputs = {},
-      const std::vector<at::Tensor>& outputs = {});
+      const std::vector<at::Tensor>& outputs = {},
+      bool record = false);

Does it follow the same logic as the NCCL backend?

frost-intel (Contributor, Author)

Yes, NCCL uses the same logic. In some cases the work created by initWork needs to be recorded, but in other cases recording it causes improper access in the FlightRecorder.

auto device = inputs[0].device();
const auto key = std::to_string(device.index());
auto comm = getXCCLComm(key, device, opType);

if (!coalescing_state_) {


@Chao1Han Please help check why this condition was not needed previously.

Contributor

frost's modification is correct; this is indeed a bug. The community fixed it ten months ago, but since we had never dumped seqCollective_, we never noticed it.

Chao1Han (Contributor) commented Aug 29, 2025

Hi @frost-intel, could you apt install clang-format and format ProcessGroupXCCL.cpp/ProcessGroupXCCL.hpp before pushing the code? That would avoid periodic code-cleanup PRs like https://github.com/intel/torch-xpu-ops/pull/1960/files.
