[wip] Symm support #1922

Open
wants to merge 44 commits into
base: main
44 commits
57d67ea
add XPU symm
zhangxiaoli73 Jun 4, 2025
5756399
correct include
zhangxiaoli73 Jun 4, 2025
bcf9cc4
remove XPUGuard
zhangxiaoli73 Jun 4, 2025
c9d3e38
debug
zhangxiaoli73 Jun 4, 2025
13e0f21
debug
zhangxiaoli73 Jun 9, 2025
762ccfd
debug
zhangxiaoli73 Jun 9, 2025
0661f94
debug
zhangxiaoli73 Jun 9, 2025
4cf53e0
debug
zhangxiaoli73 Jun 9, 2025
9abc672
debug
zhangxiaoli73 Jun 9, 2025
2713e8b
debug
zhangxiaoli73 Jun 9, 2025
99ff4ce
debug
zhangxiaoli73 Jun 9, 2025
f638c3d
debug
zhangxiaoli73 Jun 9, 2025
33fdeb1
debug
zhangxiaoli73 Jun 10, 2025
413c718
add debug logs
zhangxiaoli73 Jun 10, 2025
b4d4dd4
check device type
zhangxiaoli73 Jun 12, 2025
8e9dc93
debug async ops
zhangxiaoli73 Jun 12, 2025
ba1e1d4
debug
zhangxiaoli73 Jun 13, 2025
4a71bfa
refine to void*
zhangxiaoli73 Jun 13, 2025
01a8c69
debug
zhangxiaoli73 Jun 13, 2025
bbae412
debug copy
zhangxiaoli73 Jun 13, 2025
da8b020
debug sharded handle
zhangxiaoli73 Jun 16, 2025
055b038
debug
zhangxiaoli73 Jun 17, 2025
b0fc7e8
debug
zhangxiaoli73 Jun 17, 2025
b118c92
enable torch-ccl exchange
zhangxiaoli73 Jun 17, 2025
3d9acd2
remove unneeded
zhangxiaoli73 Jun 17, 2025
29a74cd
fix a bug and move to local IPC exchange
zhangxiaoli73 Jun 19, 2025
f9ae0d5
add symm copy_buffer API
zhangxiaoli73 Jun 19, 2025
1973dfe
barrier with MPI
zhangxiaoli73 Jun 23, 2025
2b9edef
support arc
zhangxiaoli73 Jun 24, 2025
e269c12
refine barrier
zhangxiaoli73 Jun 24, 2025
9c36685
workaroud barrier
zhangxiaoli73 Jun 25, 2025
369fe16
refine ipc exchange
zhangxiaoli73 Jul 3, 2025
226dab1
refine ipc exchange
zhangxiaoli73 Jul 10, 2025
b6fb65a
reabse and then workaround barrier
zhangxiaoli73 Jul 15, 2025
65643c3
impl base class virtual function
zhangxiaoli73 Jul 17, 2025
3fbe388
format
Chao1Han Jul 31, 2025
4ce540f
Merge remote-tracking branch 'origin/main' into symm-support-bak
Chao1Han Jul 31, 2025
eeb81ae
rm hardcode ze
Chao1Han Jul 31, 2025
b756e41
clean code
Chao1Han Aug 5, 2025
ac8015d
update
Chao1Han Aug 7, 2025
b6199ad
avoid symbol conflict
Chao1Han Aug 7, 2025
d7130fe
refine IPCExchang
Chao1Han Aug 8, 2025
4065ad1
rm header
Chao1Han Aug 8, 2025
5345b3c
update
Chao1Han Aug 12, 2025
446 changes: 446 additions & 0 deletions src/xccl/IPCExchange.hpp

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion src/xccl/ProcessGroupXCCL.cpp
@@ -1911,7 +1911,7 @@ c10::intrusive_ptr<Work> ProcessGroupXCCL::barrier(const BarrierOptions& opts) {
}

auto currentStream = at::xpu::getCurrentXPUStream(barDevIdx);
- currentStream.synchronize();
+ // currentStream.synchronize(); // zl_debug workaround for symm barrier
Copilot AI Aug 11, 2025

The commented-out synchronization appears to be a temporary workaround. It should be properly addressed before production deployment, because skipping the stream synchronization may affect the correctness of the barrier.

Suggested change
- // currentStream.synchronize(); // zl_debug workaround for symm barrier
+ currentStream.synchronize(); // Ensure stream synchronization for barrier

return nullptr;
}
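
If the workaround has to stay for a while, one option is to make the skip explicit instead of commenting the call out, so the default path still synchronizes. The following is a minimal sketch under stated assumptions, not code from this PR: `FakeStream` and `finishBarrier` are hypothetical names, and `FakeStream` only stands in for the stream object returned by `at::xpu::getCurrentXPUStream`.

```cpp
// Hypothetical stand-in for an XPU stream; it only counts synchronize() calls
// so the behaviour of the guard can be observed without a device.
struct FakeStream {
  int syncCalls = 0;
  void synchronize() { ++syncCalls; }
};

// Hypothetical helper: synchronizes the stream by default and skips the call
// only when the caller explicitly opts into the symm-barrier workaround.
template <typename Stream>
void finishBarrier(Stream& stream, bool skipSync) {
  if (!skipSync) {
    stream.synchronize();
  }
}
```

With a guard like this, the regular barrier path keeps its synchronization, and the workaround becomes a visible, searchable switch rather than a commented-out line. Where the `skipSync` flag would come from (a build option, a runtime setting, or the symmetric-memory code path itself) is left open here.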
