@Chao1Han Chao1Han commented Oct 15, 2025

Usage

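    import pickle
    import time

    import torch
    import torch.distributed as dist

    # Assumes the default process group is already initialized with the XCCL
    # backend and that `device` is this rank's XPU device.
    pg = dist.distributed_c10d._get_default_group()
    pg._enable_collectives_timing()
    x = torch.ones([2, 2]).to(device)
    num_repeats = 10
    for _ in range(num_repeats):
        dist.all_reduce(x)
    # Presumably gives the asynchronous collectives time to finish before the dump.
    time.sleep(1)
    t = pickle.loads(torch._C._distributed_c10d._dump_xccl_trace())
    for seq in range(num_repeats):
        duration = t["entries"][seq]["duration_ms"]
        print(duration)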
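The per-entry durations printed above are often easier to read as a summary. The following is a minimal sketch, not part of the PR: `summarize_durations` is a hypothetical helper, and it assumes only the trace layout used in the snippet above (`t["entries"][seq]["duration_ms"]`). Skipping entries that lack a `duration_ms` value is an extra assumption, made here for robustness. A hand-written dictionary stands in for the unpickled trace so the example runs without an XPU device.

```python
from statistics import mean


def summarize_durations(trace, num_entries):
    """Summarize duration_ms over the first num_entries trace entries.

    Hypothetical helper: assumes trace["entries"] is a list of dicts that
    may carry a "duration_ms" value, as in the usage snippet.
    """
    durations = []
    for entry in trace["entries"][:num_entries]:
        value = entry.get("duration_ms")
        # Assumption: an entry without a recorded duration is skipped.
        if value is not None:
            durations.append(float(value))
    if not durations:
        return None
    return {
        "count": len(durations),
        "min_ms": min(durations),
        "max_ms": max(durations),
        "mean_ms": mean(durations),
    }


# Stand-in for pickle.loads(torch._C._distributed_c10d._dump_xccl_trace()).
fake_trace = {
    "entries": [
        {"duration_ms": 0.5},
        {"duration_ms": 1.5},
        {"duration_ms": 1.0},
    ]
}
summary = summarize_durations(fake_trace, 3)
print(summary)
```

With a real trace, `t` from the usage snippet would replace `fake_trace`.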

@Copilot Copilot AI review requested due to automatic review settings October 15, 2025 05:09

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds timing-event support to XCCL (XPU Collective Communication Library). It introduces a cache of reusable XPU events and records start and end events on each work object, so that the duration of an operation can be measured.

Key changes:

  • Introduces XPUEventCache class for efficient event object reuse and timing support
  • Adds timing functionality to WorkXCCL with start/end events and duration calculation
  • Updates point-to-point communication operations to support timing and preprocessing/postprocessing hooks
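The event-reuse idea in the first bullet can be sketched in a few lines. This is an illustration of the general pooling pattern only, not the PR's C++ implementation: `FakeEvent` and `EventCache` are hypothetical names, and keeping one free list per timing mode is an assumption about the design. The free lists are double-ended queues, taken from the front as in the reviewed snippet (`events.front()` followed by `events.pop_front()`).

```python
import threading
from collections import deque


class FakeEvent:
    """Hypothetical stand-in for a device event; not an XPU API."""

    def __init__(self, timing):
        self.timing = timing


class EventCache:
    """Pool that hands out cached events before creating new ones."""

    def __init__(self):
        self._lock = threading.Lock()
        # Index 0 holds events without timing, index 1 events with timing
        # (an assumed layout for this sketch).
        self._free = (deque(), deque())
        self.created = 0

    def acquire(self, timing):
        with self._lock:
            free = self._free[int(timing)]
            if free:
                # Reuse the oldest cached event instead of allocating.
                return free.popleft()
            self.created += 1
        return FakeEvent(timing)

    def release(self, event):
        # Return the event to the free list that matches its timing mode.
        with self._lock:
            self._free[int(event.timing)].append(event)


cache = EventCache()
start = cache.acquire(timing=True)
end = cache.acquire(timing=True)
cache.release(start)
cache.release(end)
reused = cache.acquire(timing=True)
print(reused is start, cache.created)
```

The lock keeps acquire and release safe when several threads share one cache, which is one plausible reason to pool events at all: creation cost is paid once per event rather than once per collective.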

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/xccl/XPUEventCache.hpp | Defines the XPUEventCache class interface for managing cached XPU events |
| src/xccl/XPUEventCache.cpp | Implements event caching logic with timing support and thread-local device mapping |
| src/xccl/ProcessGroupXCCL.hpp | Adds timing support fields and template method overloads for point-to-point operations |
| src/xccl/ProcessGroupXCCL.cpp | Integrates event caching, timing functionality, and refactors point-to-point operations |

    // new one.
    if (!events.empty()) {
      event = events.front();
      events.pop_front();
Copilot AI Oct 15, 2025

[nitpick] Consider checking if the cached event is still valid or resetting its state before reusing it, as events may retain previous state that could affect timing accuracy.

Suggested change

    - events.pop_front();
    + events.pop_front();
    + // Reset the event's state before reuse
    + event->reset();
