-
Notifications
You must be signed in to change notification settings - Fork 49
[wip] Symm support #1922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[wip] Symm support #1922
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request adds symmetric memory support for XPU devices. The implementation provides Level Zero-based memory allocation, inter-process communication (IPC), and symmetric memory management functionality.
- Adds Level Zero exception handling and dynamic library loading for XPU
- Implements XPU-specific symmetric memory allocator and memory management classes
- Provides IPC-based memory exchange mechanisms for multi-process communication
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 13 comments.
Show a summary per file
File | Description |
---|---|
ze_exception.hpp | Level Zero exception handling and dynamic library loading utilities |
XPUSymmetricMemoryTypes.hpp | Type definitions for XPU symmetric memory implementation |
XPUSymmetricMemoryUtils.hpp | Utility classes for symmetric memory operations and IPC communication |
XPUSymmetricMemoryUtils.cpp | Implementation of IPC channels and memory mapping utilities |
XPUSymmetricMemory.hpp | Main symmetric memory class declarations for XPU |
XPUSymmetricMemory.cpp | Implementation of XPU symmetric memory allocator and management |
ProcessGroupXCCL.cpp | Temporary workaround for barrier synchronization |
IPCExchange.hpp | Inter-process communication utilities for memory handle exchange |
if (ze_handle != nullptr) { | ||
return true; | ||
} | ||
const char* lib_names[] = {"/usr/lib/x86_64-linux-gnu/libze_loader.so"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded library path is platform-specific and may not work across different Linux distributions or architectures. Consider adding multiple common library paths or using a more flexible discovery mechanism.
const char* lib_names[] = {"/usr/lib/x86_64-linux-gnu/libze_loader.so"}; | |
const char* lib_names[] = { | |
"/usr/lib/x86_64-linux-gnu/libze_loader.so", | |
"/usr/lib64/libze_loader.so", | |
"/usr/lib/libze_loader.so", | |
"/usr/local/lib/libze_loader.so", | |
"libze_loader.so" | |
}; |
Copilot uses AI. Check for mistakes.
ze_result_t result = (x); \ | ||
if (result != ZE_RESULT_SUCCESS) { \ | ||
auto e = zeException(result); \ | ||
std::cout << "Throw " << e.what() << std::endl; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::cout for error logging in a library is not recommended. Consider using a proper logging framework or removing debug output from production code.
Copilot uses AI. Check for mistakes.
#define zeCheck(x) \ | ||
if (x != ZE_RESULT_SUCCESS) { \ | ||
auto e = zeException(x); \ | ||
std::cout << "Throw " << e.what() << std::endl; \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::cout for error logging in a library is not recommended. Consider using a proper logging framework or removing debug output from production code.
Copilot uses AI. Check for mistakes.
size_t seq_id_ = 0; | ||
}; | ||
|
||
// Teturns a pointer of virtual address that is mapped to the physical memory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in comment: 'Teturns' should be 'Returns'.
// Teturns a pointer of virtual address that is mapped to the physical memory | |
// Returns a pointer of virtual address that is mapped to the physical memory |
Copilot uses AI. Check for mistakes.
src/xccl/XPUSymmetricMemoryUtils.cpp
Outdated
sycl::context sycl_ctx = current_queue.get_context(); | ||
ze_context_handle_t ze_context = | ||
sycl::get_native<sycl::backend::ext_oneapi_level_zero>(sycl_ctx); | ||
std::cout << "zl_debug in map_block to get virtual address " << std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug output should be removed from production code. Consider using a proper logging framework or removing these debug statements.
std::cout << "zl_debug in map_block to get virtual address " << std::endl; |
Copilot uses AI. Check for mistakes.
src/xccl/XPUSymmetricMemory.cpp
Outdated
c10::Device local_device(c10::DeviceType::XPU, local_device_idx); | ||
c10::DeviceGuard guard(local_device); | ||
|
||
// todo: zl_debug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment should be addressed or removed before merging to production.
// todo: zl_debug |
Copilot uses AI. Check for mistakes.
#include <sys/socket.h> | ||
#include <unistd.h> | ||
|
||
// todo: fixed with kernel barrier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment should be addressed or removed before merging to production.
// todo: fixed with kernel barrier |
Copilot uses AI. Check for mistakes.
@@ -1911,7 +1911,7 @@ c10::intrusive_ptr<Work> ProcessGroupXCCL::barrier(const BarrierOptions& opts) { | |||
} | |||
|
|||
auto currentStream = at::xpu::getCurrentXPUStream(barDevIdx); | |||
currentStream.synchronize(); | |||
// currentStream.synchronize(); // zl_debug workaround for symm barrier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented out synchronization appears to be a temporary workaround. This should be properly addressed before production deployment as it may affect correctness.
// currentStream.synchronize(); // zl_debug workaround for symm barrier | |
currentStream.synchronize(); // Ensure stream synchronization for barrier |
Copilot uses AI. Check for mistakes.
~allreducer() { | ||
if (initialized) { | ||
std::cerr << "Warning: allreducer destroyed without calling release()" | ||
<< std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::cerr for warnings in a library may not be appropriate. Consider using a proper logging framework or throwing an exception if cleanup is critical.
<< std::endl; | |
throw std::runtime_error("allreducer destroyed without calling release()"); |
Copilot uses AI. Check for mistakes.
} | ||
} | ||
} catch (const std::exception& e) { | ||
std::cerr << "Warning: Level Zero cleanup failed: " << e.what() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using std::cerr for warnings in a library may not be appropriate. Consider using a proper logging framework.
Copilot uses AI. Check for mistakes.
No description provided.