Skip to content

[wip] Symm support #1922

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 44 commits into
base: main
Choose a base branch
from
Open

[wip] Symm support #1922

wants to merge 44 commits into from

Conversation

Chao1Han
Copy link
Contributor

No description provided.

@Copilot Copilot AI review requested due to automatic review settings August 11, 2025 02:34
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request adds symmetric memory support for XPU devices. The implementation provides Level Zero-based memory allocation, inter-process communication (IPC), and symmetric memory management functionality.

  • Adds Level Zero exception handling and dynamic library loading for XPU
  • Implements XPU-specific symmetric memory allocator and memory management classes
  • Provides IPC-based memory exchange mechanisms for multi-process communication

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
ze_exception.hpp Level Zero exception handling and dynamic library loading utilities
XPUSymmetricMemoryTypes.hpp Type definitions for XPU symmetric memory implementation
XPUSymmetricMemoryUtils.hpp Utility classes for symmetric memory operations and IPC communication
XPUSymmetricMemoryUtils.cpp Implementation of IPC channels and memory mapping utilities
XPUSymmetricMemory.hpp Main symmetric memory class declarations for XPU
XPUSymmetricMemory.cpp Implementation of XPU symmetric memory allocator and management
ProcessGroupXCCL.cpp Temporary workaround for barrier synchronization
IPCExchange.hpp Inter-process communication utilities for memory handle exchange

if (ze_handle != nullptr) {
return true;
}
const char* lib_names[] = {"/usr/lib/x86_64-linux-gnu/libze_loader.so"};
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded library path is platform-specific and may not work across different Linux distributions or architectures. Consider adding multiple common library paths or using a more flexible discovery mechanism.

Suggested change
const char* lib_names[] = {"/usr/lib/x86_64-linux-gnu/libze_loader.so"};
const char* lib_names[] = {
"/usr/lib/x86_64-linux-gnu/libze_loader.so",
"/usr/lib64/libze_loader.so",
"/usr/lib/libze_loader.so",
"/usr/local/lib/libze_loader.so",
"libze_loader.so"
};

Copilot uses AI. Check for mistakes.

ze_result_t result = (x); \
if (result != ZE_RESULT_SUCCESS) { \
auto e = zeException(result); \
std::cout << "Throw " << e.what() << std::endl; \
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::cout for error logging in a library is not recommended. Consider using a proper logging framework or removing debug output from production code.

Copilot uses AI. Check for mistakes.

#define zeCheck(x) \
if (x != ZE_RESULT_SUCCESS) { \
auto e = zeException(x); \
std::cout << "Throw " << e.what() << std::endl; \
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::cout for error logging in a library is not recommended. Consider using a proper logging framework or removing debug output from production code.

Copilot uses AI. Check for mistakes.

size_t seq_id_ = 0;
};

// Teturns a pointer of virtual address that is mapped to the physical memory
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: 'Teturns' should be 'Returns'.

Suggested change
// Teturns a pointer of virtual address that is mapped to the physical memory
// Returns a pointer of virtual address that is mapped to the physical memory

Copilot uses AI. Check for mistakes.

sycl::context sycl_ctx = current_queue.get_context();
ze_context_handle_t ze_context =
sycl::get_native<sycl::backend::ext_oneapi_level_zero>(sycl_ctx);
std::cout << "zl_debug in map_block to get virtual address " << std::endl;
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug output should be removed from production code. Consider using a proper logging framework or removing these debug statements.

Suggested change
std::cout << "zl_debug in map_block to get virtual address " << std::endl;

Copilot uses AI. Check for mistakes.

c10::Device local_device(c10::DeviceType::XPU, local_device_idx);
c10::DeviceGuard guard(local_device);

// todo: zl_debug
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO comment should be addressed or removed before merging to production.

Suggested change
// todo: zl_debug

Copilot uses AI. Check for mistakes.

#include <sys/socket.h>
#include <unistd.h>

// todo: fixed with kernel barrier
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO comment should be addressed or removed before merging to production.

Suggested change
// todo: fixed with kernel barrier

Copilot uses AI. Check for mistakes.

@@ -1911,7 +1911,7 @@ c10::intrusive_ptr<Work> ProcessGroupXCCL::barrier(const BarrierOptions& opts) {
}

auto currentStream = at::xpu::getCurrentXPUStream(barDevIdx);
currentStream.synchronize();
// currentStream.synchronize(); // zl_debug workaround for symm barrier
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented out synchronization appears to be a temporary workaround. This should be properly addressed before production deployment as it may affect correctness.

Suggested change
// currentStream.synchronize(); // zl_debug workaround for symm barrier
currentStream.synchronize(); // Ensure stream synchronization for barrier

Copilot uses AI. Check for mistakes.

~allreducer() {
if (initialized) {
std::cerr << "Warning: allreducer destroyed without calling release()"
<< std::endl;
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::cerr for warnings in a library may not be appropriate. Consider using a proper logging framework or throwing an exception if cleanup is critical.

Suggested change
<< std::endl;
throw std::runtime_error("allreducer destroyed without calling release()");

Copilot uses AI. Check for mistakes.

}
}
} catch (const std::exception& e) {
std::cerr << "Warning: Level Zero cleanup failed: " << e.what()
Copy link
Preview

Copilot AI Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::cerr for warnings in a library may not be appropriate. Consider using a proper logging framework.

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants