-
Notifications
You must be signed in to change notification settings - Fork 26
Adding chunked_pack to pack table chunks when device memory is insufficient #726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: niranda perera <[email protected]>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Signed-off-by: niranda perera <[email protected]>
| // TODO: there is a possibility that bounce buffer destructor is a called before the | ||
| // async copies are completed. Should we synchronize the stream here? | ||
| bounce_buf->unlock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@madsbk WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operations here are stream ordered. As long as the stream used by table is the same as the stream used by the bounce buffer, everything is correct.
Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
madsbk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nirandaperera, overall looks good.
Co-authored-by: Mads R. B. Kristensen <[email protected]>
Signed-off-by: niranda perera <[email protected]>
|
@madsbk Can you take another look 🙂 |
| if (overbooking > 0) { | ||
| // there is not enough memory to pack the table. | ||
| size_t avail_dev_mem = pack_res.size() - overbooking; | ||
| RAPIDSMPF_EXPECTS( | ||
| avail_dev_mem > 1 << 20, | ||
| "not enough device memory for the bounce buffer", | ||
| std::runtime_error | ||
| ); | ||
| auto bounce_buf = br->allocate(avail_dev_mem, stream(), pack_res); | ||
|
|
||
| packed_data = std::make_unique<PackedData>( | ||
| chunked_pack(table_view(), *bounce_buf, reservation) | ||
| ); | ||
| } else { | ||
| // if there is enough memory to pack the table, use `cudf::pack` | ||
| auto packed_columns = | ||
| cudf::pack(table_view(), stream(), br->device_mr()); | ||
| // clear the reservation as we are done with it. | ||
| pack_res.clear(); | ||
| packed_data = std::make_unique<PackedData>( | ||
| std::move(packed_columns.metadata), | ||
| br->move(std::move(packed_columns.gpu_data), stream()) | ||
| ); | ||
|
|
||
| // Handle the case where `cudf::pack` allocates slightly more than | ||
| // the input size. This can occur because cudf uses aligned | ||
| // allocations, which may exceed the requested size. To | ||
| // accommodate this, we allow some wiggle room. | ||
| if (packed_data->data->size > reservation.size()) { | ||
| if (packed_data->data->size | ||
| <= reservation.size() | ||
| + total_packing_wiggle_room(table_view())) | ||
| { | ||
| reservation = | ||
| br->reserve( | ||
| MemoryType::HOST, packed_data->data->size, true | ||
| ) | ||
| .first; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to a helper function, I think nesting becomes a problem.
It would also be worth exploring if we even need the non-chunked version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, unless there are real performance issues with chunked_pack we should try and always use that. Can we check?
Also agree, let's make a helper function that encapsulates this (so that we can use it outside of tablechunk-copy as well -> this pattern of cudf::pack happens in many places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let me add a benchmark.
| * @return The total amount of extra memory to reserve for packing. | ||
| */ | ||
| inline size_t total_packing_wiggle_room(cudf::table_view const& table) { | ||
| return packing_wiggle_room_per_column * static_cast<size_t>(table.num_columns()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: This is likely not enough if the table has many nested columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I was just mimicking the current impl we have here
https://github.com/rapidsai/rapidsmpf/blob/main/cpp/src/streaming/cudf/table_chunk.cpp#L187-L189
| // all copies are done on the same stream, so we can omit the stream parameter | ||
| cudf::device_span<uint8_t> buf_span( | ||
| reinterpret_cast<uint8_t*>(bounce_buf_ptr), chunk_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let use use cuda::std::span.
Also the comment is meaningless, device_span has no stream parameter to its ctor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment refers to the stream_view arg in L299 😇
| .first; | ||
| if (overbooking > 0) { | ||
| // there is not enough memory to pack the table. | ||
| size_t avail_dev_mem = pack_res.size() - overbooking; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Integer overflow. suppose we can't make a reservation, so pack_res.size() is zero, and overbooking is estimated_memory_usage(...). Then this will be (size_t)(-overbooking).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wence- I'm not sure if this is true. Reservation is made with overbooking
https://github.com/nirandaperera/rapidsmpf/blob/Make-unbounded-fanout-state-spillable/cpp/src/memory/buffer_resource.cpp#L58-L64
So, it will be
{MemoryReservation(mem_type, this, size), overbooking};I think the case you are referring to, is for without overbooking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overbooking is positive. Can it happen that overbooking is larger than pack_res.size()? If yes, then you have integer overflow.
| if (overbooking > 0) { | ||
| // there is not enough memory to pack the table. | ||
| size_t avail_dev_mem = pack_res.size() - overbooking; | ||
| RAPIDSMPF_EXPECTS( | ||
| avail_dev_mem > 1 << 20, | ||
| "not enough device memory for the bounce buffer", | ||
| std::runtime_error | ||
| ); | ||
| auto bounce_buf = br->allocate(avail_dev_mem, stream(), pack_res); | ||
|
|
||
| packed_data = std::make_unique<PackedData>( | ||
| chunked_pack(table_view(), *bounce_buf, reservation) | ||
| ); | ||
| } else { | ||
| // if there is enough memory to pack the table, use `cudf::pack` | ||
| auto packed_columns = | ||
| cudf::pack(table_view(), stream(), br->device_mr()); | ||
| // clear the reservation as we are done with it. | ||
| pack_res.clear(); | ||
| packed_data = std::make_unique<PackedData>( | ||
| std::move(packed_columns.metadata), | ||
| br->move(std::move(packed_columns.gpu_data), stream()) | ||
| ); | ||
|
|
||
| // Handle the case where `cudf::pack` allocates slightly more than | ||
| // the input size. This can occur because cudf uses aligned | ||
| // allocations, which may exceed the requested size. To | ||
| // accommodate this, we allow some wiggle room. | ||
| if (packed_data->data->size > reservation.size()) { | ||
| if (packed_data->data->size | ||
| <= reservation.size() | ||
| + total_packing_wiggle_room(table_view())) | ||
| { | ||
| reservation = | ||
| br->reserve( | ||
| MemoryType::HOST, packed_data->data->size, true | ||
| ) | ||
| .first; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, unless there are real performance issues with chunked_pack we should try and always use that. Can we check?
Also agree, let's make a helper function that encapsulates this (so that we can use it outside of tablechunk-copy as well -> this pattern of cudf::pack happens in many places.
Co-authored-by: Mads R. B. Kristensen <[email protected]> Co-authored-by: Lawrence Mitchell <[email protected]>
Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
This PR enables using
cudf::chunked_packwhen copying aTableChunkto Host memory, IF there is not enough device memory for acudf::packoperation.