Conversation

@nirandaperera
Contributor

@nirandaperera nirandaperera commented Dec 9, 2025

This PR enables using cudf::chunked_pack when copying a TableChunk to host memory if there is not enough device memory for a cudf::pack operation.

Signed-off-by: niranda perera <[email protected]>
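
The idea in the description, moving a large payload through a small fixed-size staging buffer, can be sketched on the host alone. This is only a model under stated assumptions: `copy_via_bounce_buffer` is a hypothetical name, plain `std::vector`s stand in for the device table and the device bounce buffer, and `std::memcpy` stands in for the packing step and the device-to-host copy. It is not the PR's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-only model (hypothetical helper, not part of rapidsmpf or cudf):
// copy `src` into a new destination by staging every chunk through a
// fixed-size bounce buffer, so the extra memory needed is `bounce_size`
// rather than `src.size()`.
std::vector<std::uint8_t> copy_via_bounce_buffer(
    std::vector<std::uint8_t> const& src, std::size_t bounce_size
) {
    assert(bounce_size > 0);
    std::vector<std::uint8_t> bounce(bounce_size);
    std::vector<std::uint8_t> dst(src.size());
    std::size_t offset = 0;
    while (offset < src.size()) {
        // The last chunk may be shorter than the bounce buffer.
        std::size_t const n = std::min(bounce_size, src.size() - offset);
        // Stage one chunk (stands in for packing into the bounce buffer).
        std::memcpy(bounce.data(), src.data() + offset, n);
        // Drain the chunk to its final location (stands in for the host copy).
        std::memcpy(dst.data() + offset, bounce.data(), n);
        offset += n;
    }
    return dst;
}
```

The trade-off is more, smaller copies in exchange for a bounded staging allocation, which is why the review asks for a benchmark before dropping the non-chunked path.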
@copy-pr-bot

copy-pr-bot bot commented Dec 9, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@nirandaperera nirandaperera added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Dec 9, 2025
@nirandaperera nirandaperera marked this pull request as ready for review December 10, 2025 00:08
@nirandaperera nirandaperera requested a review from a team as a code owner December 10, 2025 00:08
Comment on lines 318 to 320
// TODO: there is a possibility that the bounce buffer destructor is called before the
// async copies are completed. Should we synchronize the stream here?
bounce_buf->unlock();
Contributor Author

@madsbk WDYT?

Member

The operations here are stream ordered. As long as the table and the bounce buffer use the same stream, everything is correct.
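
A toy model of that argument, assuming the bounce buffer's release is itself enqueued on the stream (stream-ordered deallocation). `ToyStream` and `copy_then_release_same_stream` are hypothetical names; a FIFO of callables stands in for a CUDA stream, so this only illustrates the ordering guarantee and says nothing about what happens when two different streams are involved.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

// Toy stand-in for a CUDA stream: work items run strictly in enqueue order.
class ToyStream {
  public:
    void enqueue(std::function<void()> task) { tasks_.push(std::move(task)); }

    // Run every pending task in FIFO order.
    void synchronize() {
        while (!tasks_.empty()) {
            tasks_.front()();
            tasks_.pop();
        }
    }

  private:
    std::queue<std::function<void()>> tasks_;
};

// Enqueue an "async copy" out of a bounce buffer, then enqueue the buffer's
// release on the same stream. Returns true if the copy saw a live buffer,
// copied the expected contents, and the buffer was released afterwards.
bool copy_then_release_same_stream() {
    ToyStream stream;
    auto bounce = std::make_shared<std::vector<int>>(4, 7);
    std::vector<int> host(4, 0);
    bool buffer_was_alive = false;

    stream.enqueue([&] {
        buffer_was_alive = (bounce != nullptr);
        if (bounce) host = *bounce;
    });
    // Release is queued behind the copy, so it cannot run first.
    stream.enqueue([&] { bounce.reset(); });

    stream.synchronize();
    return buffer_was_alive && host == std::vector<int>(4, 7) && bounce == nullptr;
}
```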

Signed-off-by: niranda perera <[email protected]>
Signed-off-by: niranda perera <[email protected]>
Member

@madsbk madsbk left a comment

Thanks @nirandaperera, overall looks good.

@nirandaperera nirandaperera requested a review from madsbk December 10, 2025 18:12
@nirandaperera
Contributor Author

@madsbk Can you take another look 🙂

Comment on lines 180 to 218
if (overbooking > 0) {
    // there is not enough memory to pack the table.
    size_t avail_dev_mem = pack_res.size() - overbooking;
    RAPIDSMPF_EXPECTS(
        avail_dev_mem > 1 << 20,
        "not enough device memory for the bounce buffer",
        std::runtime_error
    );
    auto bounce_buf = br->allocate(avail_dev_mem, stream(), pack_res);

    packed_data = std::make_unique<PackedData>(
        chunked_pack(table_view(), *bounce_buf, reservation)
    );
} else {
    // if there is enough memory to pack the table, use `cudf::pack`
    auto packed_columns =
        cudf::pack(table_view(), stream(), br->device_mr());
    // clear the reservation as we are done with it.
    pack_res.clear();
    packed_data = std::make_unique<PackedData>(
        std::move(packed_columns.metadata),
        br->move(std::move(packed_columns.gpu_data), stream())
    );

    // Handle the case where `cudf::pack` allocates slightly more than
    // the input size. This can occur because cudf uses aligned
    // allocations, which may exceed the requested size. To
    // accommodate this, we allow some wiggle room.
    if (packed_data->data->size > reservation.size()) {
        if (packed_data->data->size
            <= reservation.size()
                   + total_packing_wiggle_room(table_view()))
        {
            reservation =
                br->reserve(
                      MemoryType::HOST, packed_data->data->size, true
                )
                    .first;
        }
Member

Can we move this to a helper function? I think the nesting is becoming a problem.
It would also be worth exploring whether we even need the non-chunked version.

Contributor

Yeah, unless there are real performance issues with chunked_pack we should try and always use that. Can we check?

Also agree, let's make a helper function that encapsulates this, so that we can use it outside of the TableChunk copy as well; this pattern of cudf::pack happens in many places.

Contributor Author

Okay, let me add a benchmark.

 * @return The total amount of extra memory to reserve for packing.
 */
inline size_t total_packing_wiggle_room(cudf::table_view const& table) {
    return packing_wiggle_room_per_column * static_cast<size_t>(table.num_columns());
Contributor

note: This is likely not enough if the table has many nested columns.
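
To make the concern concrete, here is a sketch that compares a per-top-level-column estimate with one that also counts nested children. Everything in it is an assumption for illustration: `ToyColumn` is a hypothetical stand-in for a column with children, and `kWiggleRoomPerColumn` is an arbitrary value, not the real `packing_wiggle_room_per_column`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column that may own child columns
// (as list and struct columns do).
struct ToyColumn {
    std::vector<ToyColumn> children;
};

// Arbitrary illustrative value, not the constant used by the PR.
constexpr std::size_t kWiggleRoomPerColumn = 1024;

// Count a column plus all of its descendants.
std::size_t count_columns_recursive(ToyColumn const& col) {
    std::size_t n = 1;
    for (auto const& child : col.children) {
        n += count_columns_recursive(child);
    }
    return n;
}

// Estimate based on top-level columns only, mirroring the shape of
// total_packing_wiggle_room in the excerpt above.
std::size_t wiggle_room_top_level(std::vector<ToyColumn> const& table) {
    return kWiggleRoomPerColumn * table.size();
}

// Estimate that also accounts for every nested child column.
std::size_t wiggle_room_nested(std::vector<ToyColumn> const& table) {
    std::size_t n = 0;
    for (auto const& col : table) {
        n += count_columns_recursive(col);
    }
    return kWiggleRoomPerColumn * n;
}
```

For a table with one flat column and one list-of-struct column whose struct has two fields, the top-level estimate covers 2 columns while the nested count is 5, so alignment padding on the three child columns would be unaccounted for.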

Contributor Author

Comment on lines +300 to +302
// all copies are done on the same stream, so we can omit the stream parameter
cudf::device_span<uint8_t> buf_span(
    reinterpret_cast<uint8_t*>(bounce_buf_ptr), chunk_size
Contributor

Let's use cuda::std::span.

Also, the comment is meaningless: device_span has no stream parameter in its ctor.

Contributor Author

The comment refers to the stream_view arg in L299 😇

    .first;
if (overbooking > 0) {
    // there is not enough memory to pack the table.
    size_t avail_dev_mem = pack_res.size() - overbooking;
Contributor

issue: Integer overflow (unsigned wraparound). Suppose we can't make a reservation, so pack_res.size() is zero and overbooking is estimated_memory_usage(...). Then this will be (size_t)(-overbooking).

Contributor Author

@wence- I'm not sure this is true. The reservation is made with overbooking allowed:
https://github.com/nirandaperera/rapidsmpf/blob/Make-unbounded-fanout-state-spillable/cpp/src/memory/buffer_resource.cpp#L58-L64
So, it will be

{MemoryReservation(mem_type, this, size), overbooking};

I think the case you are referring to is the one without overbooking.

Contributor

Overbooking is positive. Can it happen that overbooking is larger than pack_res.size()? If yes, then you have integer overflow.
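
A minimal sketch of the failure mode and one possible guard, assuming both quantities are `size_t` as in the excerpt. `available_unchecked` and `available_checked` are hypothetical helper names, not functions from the PR. If `overbooking` can ever exceed the reservation size, the unchecked subtraction wraps to a huge value, which would also pass the `avail_dev_mem > 1 << 20` check rather than trip it.

```cpp
#include <cstddef>
#include <limits>
#include <optional>

// Unchecked: unsigned arithmetic wraps modulo 2^N, so the result is a very
// large number whenever overbooking > reserved.
std::size_t available_unchecked(std::size_t reserved, std::size_t overbooking) {
    return reserved - overbooking;
}

// Checked: report "nothing available" instead of wrapping around.
std::optional<std::size_t> available_checked(
    std::size_t reserved, std::size_t overbooking
) {
    if (overbooking > reserved) {
        return std::nullopt;
    }
    return reserved - overbooking;
}
```

A caller could then treat `std::nullopt` the same way as "not enough device memory for the bounce buffer" instead of attempting an enormous allocation.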

Co-authored-by: Mads R. B. Kristensen <[email protected]>
Co-authored-by: Lawrence Mitchell <[email protected]>
3 participants