[store] zero copy for get_tensor() and batch_get_tensor() #1192

Conversation
Summary of Changes: Hello @zxpdemonio, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly optimizes the data retrieval process for tensors within the Mooncake store's Python integration by introducing a new zero-copy retrieval path.
Code Review
This pull request introduces a 'zero-copy' optimization for get_tensor() and batch_get_tensor() by adding a new get_allocated_internal method. This avoids an extra memory copy when retrieving tensors. The changes to get_tensor() and the new C++ backend methods look good, but the implementation for batch_get_tensor() has some critical issues, including double data fetching, a performance regression due to loss of batching, and incorrect GIL handling. I've also suggested an improvement in the new get_allocated_internal function to use modern C++ for safer memory management.
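To make the "loss of batching" concern concrete, here is a minimal, purely illustrative sketch (not the Mooncake API; all names below are hypothetical) of why issuing one request per key instead of a single batched request is a performance regression: the per-call overhead is paid once per key instead of once per batch.

```python
# Illustrative only: a fake store that counts round trips, to show how a
# per-key loop multiplies call overhead compared with one batched call.

def get_many_per_key(store, keys):
    # One round trip per key: overhead grows linearly with the batch size.
    return [store.get(k) for k in keys]

def get_many_batched(store, keys):
    # A single round trip amortizes the overhead across all keys.
    return store.batch_get(keys)

class FakeStore:
    """Counts round trips so the difference is observable."""
    def __init__(self, data):
        self.data = data
        self.round_trips = 0
    def get(self, key):
        self.round_trips += 1
        return self.data[key]
    def batch_get(self, keys):
        self.round_trips += 1
        return [self.data[k] for k in keys]

store = FakeStore({f"k{i}": i for i in range(8)})
get_many_per_key(store, list(store.data))
per_key_trips = store.round_trips       # one trip per key
store.round_trips = 0
get_many_batched(store, list(store.data))
batched_trips = store.round_trips       # a single trip for the whole batch
print(per_key_trips, batched_trips)
```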
mooncake-store/src/real_client.cpp
Outdated
char *data_ptr = new char[total_length];
if (!data_ptr) {
    LOG(ERROR) << "Failed to allocate memory for length: " << total_length;
    return tl::unexpected(ErrorCode::INTERNAL_ERROR);
}

// register the buffer
auto register_result = register_buffer_internal(
    reinterpret_cast<void *>(data_ptr), total_length);
if (!register_result) {
    LOG(ERROR) << "Failed to register buffer";
    delete[] data_ptr;
    return tl::unexpected(register_result.error());
}

// Create slices for the allocated buffer
std::vector<Slice> slices;
allocateSlices(slices, replica, data_ptr);

// Get the object data
auto get_result = client_->Get(key, query_result.value(), slices);

// unregister the buffer for whatever cases
auto unregister_result =
    unregister_buffer_internal(reinterpret_cast<void *>(data_ptr));
if (!unregister_result) {
    LOG(WARNING) << "Failed to unregister buffer";
}

if (!get_result) {
    delete[] data_ptr;
    LOG(ERROR) << "Get failed for key: " << key;
    return tl::unexpected(get_result.error());
}

// return the data ptr transferring the ownership to the caller
data_length = total_length;
return data_ptr;
This function manually manages the memory for data_ptr using new[] and delete[]. This can be error-prone, especially with multiple return paths. Using std::unique_ptr would leverage RAII for automatic memory management, making the code safer and cleaner by removing the need for manual delete[] calls in error paths.
auto data_ptr = std::make_unique<char[]>(total_length);

// register the buffer
auto register_result = register_buffer_internal(
    reinterpret_cast<void *>(data_ptr.get()), total_length);
if (!register_result) {
    LOG(ERROR) << "Failed to register buffer";
    return tl::unexpected(register_result.error());
}

// Create slices for the allocated buffer
std::vector<Slice> slices;
allocateSlices(slices, replica, data_ptr.get());

// Get the object data
auto get_result = client_->Get(key, query_result.value(), slices);

// unregister the buffer for whatever cases
auto unregister_result =
    unregister_buffer_internal(reinterpret_cast<void *>(data_ptr.get()));
if (!unregister_result) {
    LOG(WARNING) << "Failed to unregister buffer";
}

if (!get_result) {
    LOG(ERROR) << "Get failed for key: " << key;
    return tl::unexpected(get_result.error());
}

// return the data ptr, transferring ownership to the caller
data_length = total_length;
return data_ptr.release();
I will fix the code format.
@zxpdemonio Don't forget to use
Test result for batch_get_tensor_into() (test case 3), compared with standard batch get (test case 1) and TP batch get (test case 2): test_benchmark_01_batch_put_get (main.TestMooncakeBenchmark.test_benchmark_01_batch_put_get)

Add zero copy for get tensor with TP: Benchmark: Standard Batch Put/Get. ... --- Running zero copy Batch Benchmark (TP=4, 5 iters) ---
Pull request overview
This PR implements zero-copy tensor retrieval operations for the Mooncake distributed store. The main enhancement is adding new *_into API methods that allow tensors to be retrieved directly into pre-allocated, user-managed buffers, avoiding intermediate memory allocations and copies. This is particularly beneficial for high-performance inference scenarios where memory bandwidth and latency are critical.
Key Changes:
- Added four new zero-copy API methods: get_tensor_into, batch_get_tensor_into, get_tensor_into_with_tp, and batch_get_tensor_into_with_tp
- Implemented array_creators_without_free infrastructure to create numpy array views without memory ownership transfer
- Added comprehensive functional and performance tests for all new zero-copy operations
- Increased default buffer sizes for testing (4 GiB → 16 GiB for global segment, 2 GB → 8 GB for local buffer)
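The core idea behind the *_into methods is that the caller owns a pre-allocated buffer and the result is exposed as a numpy view over that buffer rather than a freshly allocated copy. The sketch below illustrates this with plain numpy; it is not the PR's actual binding code, and HEADER_BYTES is a made-up placeholder standing in for sizeof(TensorMetadata).

```python
import numpy as np

HEADER_BYTES = 16  # placeholder for sizeof(TensorMetadata); not the real value

buffer = bytearray(HEADER_BYTES + 4 * 8)       # user-managed memory
payload = np.arange(8, dtype=np.float32)
buffer[HEADER_BYTES:] = payload.tobytes()       # stand-in for the store filling the buffer

# Zero-copy view over the caller's buffer: no bytes are duplicated.
view = np.frombuffer(buffer, dtype=np.float32, offset=HEADER_BYTES)
assert view.base is not None                    # a view, not an owning copy

# Mutating the underlying buffer is visible through the view.
buffer[HEADER_BYTES:HEADER_BYTES + 4] = np.float32(42.0).tobytes()
print(view[0])
```

Because the array does not own its memory, the caller must keep the buffer alive for as long as any view exists; that is the lifetime contract a "without free" array creator implies.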
Reviewed changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/test_tensor_api.py | Added test utilities (DTYPE_MAP, verify_tensor_equality) and comprehensive test cases for zero-copy operations with single/batch tensors and tensor parallelism support. Updated benchmark tests and increased default buffer sizes. |
| mooncake-integration/store/store_py.cpp | Implemented four new zero-copy methods that write directly to user-provided buffers, including tensor parallel variants. Added Python bindings for the new API methods. |
| mooncake-integration/integration_utils.h | Added create_typed_array_without_free and array_creators_without_free to create numpy arrays without memory ownership transfer for zero-copy operations. |
| mooncake-store/src/dummy_client.cpp | Whitespace-only change (added newline at end of file). |
| mooncake-store/include/dummy_client.h | Whitespace-only change (added newline at end of file). |
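For context, a minimal sketch of what a helper like verify_tensor_equality in scripts/test_tensor_api.py might look like; the PR's actual implementation may differ (this version operates on numpy arrays rather than torch tensors).

```python
import numpy as np

def verify_tensor_equality(original, received, rtol=0, atol=0, verbose=True):
    """Compare two tensors for shape, dtype, and (optionally tolerant) value equality."""
    original = np.asarray(original)
    received = np.asarray(received)
    if original.shape != received.shape or original.dtype != received.dtype:
        if verbose:
            print(f"shape/dtype mismatch: {original.shape}/{original.dtype} "
                  f"vs {received.shape}/{received.dtype}")
        return False
    ok = np.allclose(original, received, rtol=rtol, atol=atol)
    if verbose and not ok:
        bad = np.count_nonzero(~np.isclose(original, received, rtol=rtol, atol=atol))
        print(f"{bad} elements differ")
    return ok

a = np.arange(6, dtype=np.float32).reshape(2, 3)
print(verify_tensor_equality(a, a.copy()))
print(verify_tensor_equality(a, a + 1, verbose=False))
```

With rtol=0 and atol=0 (the defaults above, matching the signature shown in the review), this degenerates to exact equality, which is the right default for a zero-copy round-trip test.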
scripts/test_tensor_api.py
Outdated
DEFAULT_GLOBAL_SEGMENT_SIZE = 16 * 1024 * 1024 * 1024  # 4 GiB
DEFAULT_LOCAL_BUFFER_SIZE = 8 * 1024 * 1024 * 1024  # 2 GB
Copilot AI (Dec 17, 2025)
The comment indicates the old values (4 GiB, 2 GB) but the constants have been changed to 16 GiB and 8 GB. The comments should be updated to reflect the new values to avoid confusion.
Suggested change:
DEFAULT_GLOBAL_SEGMENT_SIZE = 16 * 1024 * 1024 * 1024  # 16 GiB
DEFAULT_LOCAL_BUFFER_SIZE = 8 * 1024 * 1024 * 1024  # 8 GiB
scripts/test_tensor_api.py
Outdated
for rank in range(tp_size):
    batch_size = len(keys)
    buffer_spacing = 64 * 1024 * 1024  # 1GB per tensor slot
The comment says "1GB per tensor slot" but the buffer_spacing is set to 64 MB. The comment should be updated to match the actual buffer size.
Suggested change:
buffer_spacing = 64 * 1024 * 1024  # 64MB per tensor slot
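The buffer_spacing constant implies a fixed-stride layout: each tensor in a batch gets a fixed-size slot inside one large registered buffer, so slot addresses can be computed arithmetically without any per-tensor allocation. The sketch below is purely illustrative; the PR's actual layout may differ.

```python
# Fixed-stride slot layout inside one pre-registered buffer.
buffer_spacing = 64 * 1024 * 1024   # 64 MB per tensor slot
batch_size = 4

base_addr = 0                        # stand-in for the registered buffer's base address
slot_offsets = [base_addr + i * buffer_spacing for i in range(batch_size)]
print(slot_offsets)

# Each tensor must fit within its slot, or it would overrun the next one.
max_tensor_bytes = buffer_spacing
```

This is also why the mismatched size comments flagged above matter: a reader sizing tensors against "1GB per slot" would silently overrun a 64 MB slot.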
scripts/test_tensor_api.py
Outdated
split_dim = 0
batch_size = len(self.keys)
self.store.remove_all()
buffer_spacing = 64 * 1024 * 1024  # 1GB per tensor slot
The comment says "1GB per tensor slot" but the buffer_spacing is set to 64 MB. The comment should be updated to match the actual buffer size.
Suggested change:
buffer_spacing = 64 * 1024 * 1024  # 64MB per tensor slot
    np_array = array_creators_without_free[dtype_index](
        static_cast<char *>(buffer), sizeof(TensorMetadata),
        tensor_size);
} else {
    LOG(ERROR) << "Unsupported dtype enum: " << dtype_index;
    return pybind11::none();
}
The code uses array_creators_without_free but the variable name is inconsistent - it should check the correct array size. Also, line 677 has the same pattern. Both locations reference array_creators.size() in the condition but use array_creators_without_free in the actual call, which could be confusing. Consider checking array_creators_without_free.size() for consistency.
if (dtype_index >= 0 &&
    dtype_index < static_cast<int>(array_creators.size())) {
    // This call MUST take ownership of exported_data
    np_array = array_creators_without_free[dtype_index](
        static_cast<char *>(buffer), sizeof(TensorMetadata),
        tensor_size);
} else {
The code uses array_creators_without_free but checks against array_creators.size() in the condition. For consistency and clarity, consider checking array_creators_without_free.size() instead, matching the array being used.
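The pattern both comments describe can be summarized as: always bounds-check against the same dispatch table you are about to index, so the check and the access cannot drift apart. A small illustrative sketch (names are stand-ins, not the bindings' real identifiers):

```python
import numpy as np

def make_view(dtype):
    """Build a creator that exposes raw bytes as a typed numpy view (no copy)."""
    def creator(raw, offset, nbytes):
        return np.frombuffer(raw, dtype=dtype, offset=offset,
                             count=nbytes // np.dtype(dtype).itemsize)
    return creator

# Stand-in for the bindings' dispatch table, indexed by a dtype enum.
array_creators_without_free = [make_view(np.float32), make_view(np.int64)]

def create_array(dtype_index, raw, offset, nbytes):
    # Bounds check against the table actually being indexed.
    if 0 <= dtype_index < len(array_creators_without_free):
        return array_creators_without_free[dtype_index](raw, offset, nbytes)
    raise ValueError(f"Unsupported dtype enum: {dtype_index}")

buf = np.arange(4, dtype=np.float32).tobytes()
arr = create_array(0, buf, 0, 16)
print(arr)   # [0. 1. 2. 3.]
```

In the C++ code this means checking array_creators_without_free.size() rather than array_creators.size() whenever array_creators_without_free is the table being indexed.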
scripts/test_tensor_api.py
Outdated
def verify_tensor_equality(original, received, rtol=0, atol=0, verbose=True):
    """
    compare two tensors。
Chinese punctuation used in the comment. The period (。) should be replaced with an English period (.).
Suggested change:
compare two tensors.
scripts/test_tensor_api.py
Outdated
def test_benchmark_03_batch_put_get_into(self):
    """Benchmark: Zero copy Batch Get."""
    self.store.remove_all()
    buffer_spacing = 300 * 1024 * 1024  # 1GB per tensor slot
The comment says "1GB per tensor slot" but the buffer_spacing is set to 300 MB. The comment should be updated to match the actual buffer size.
Suggested change:
buffer_spacing = 300 * 1024 * 1024  # 300MB per tensor slot
scripts/test_tensor_api.py
Outdated
registered_buffers = []  # Keep track of (ptr, size) for cleanup

for rank in range(tp_size):
    buffer_spacing = 64 * 1024 * 1024  # 1GB per tensor slot
The comment says "1GB per tensor slot" but the buffer_spacing is set to 64 MB. The comment should be updated to match the actual buffer size.
Suggested change:
buffer_spacing = 64 * 1024 * 1024  # 64MB per tensor slot
scripts/test_tensor_api.py
Outdated
import argparse
import unittest
import torch
import struct
Import of 'struct' is not used.
Suggested change: remove the unused import struct.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>