GH-43660: [C++][Compute] Avoid ZeroCopyCastExec when casting Binary offset -> Binary offset types #48171

Merged
zanmato1984 merged 15 commits into apache:main from bodo-ai:scott/cast_string on Dec 17, 2025
Conversation

scott-routledge2 (Contributor) commented on Nov 19, 2025

Rationale for this change

Casting Binary offset -> Binary offset types relies on ZeroCopyCastExec, which propagates the offset of the input to the output. This can lead to larger allocations than necessary when casting arrays with offsets.

See #43660 and
#43661 for more context.

What changes are included in this PR?

Ensure the output array has a small offset (it can still be non-zero, since reusing the null bitmap requires in_offset % 8 == out_offset % 8)
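To illustrate the alignment constraint above, here is a minimal Python sketch of the bit arithmetic (not the actual Arrow C++ code): the validity bitmap packs 8 elements per byte, so reusing it can only drop whole bytes from the front, and the output offset must keep the input offset's remainder modulo 8.

```python
def rebase_validity(in_offset: int):
    """Given a logical input offset, compute the output's small logical
    offset and the number of whole bytes to slice off the front of the
    reused validity bitmap. Slicing only whole bytes keeps every validity
    bit addressable, so in_offset % 8 == out_offset % 8 always holds."""
    out_offset = in_offset % 8   # bit position inside the first kept byte
    byte_slice = in_offset // 8  # whole bitmap bytes dropped from the front
    return out_offset, byte_slice

# An input offset of 19 keeps bit 3 of byte 2 as the first valid bit.
print(rebase_validity(19))  # -> (3, 2)
```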

Are these changes tested?

Ran unit tests and benchmarked locally.

Are there any user-facing changes?

No

@felipecrv felipecrv self-requested a review November 19, 2025 18:17
felipecrv (Contributor) left a comment:

Great improvement! But you should add tests that cast [Large]String arrays with non-zero offsets, including offsets greater than 8. I don't think your current implementation would work without also slicing the offsets buffer.

Comment on lines +315 to +321
if (output->buffers[0]) {
// If reusing the null bitmap, ensure offset into the first byte is the same as input.
output->offset = input_arr->offset % 8;
output->buffers[0] = SliceBuffer(output->buffers[0], input_arr->offset / 8);
} else {
output->offset = 0;
}
felipecrv (Contributor) commented on Nov 19, 2025:
I believe you should also slice the offsets buffer when you change the array offset.

Suggested change
if (output->buffers[0]) {
// If reusing the null bitmap, ensure offset into the first byte is the same as input.
output->offset = input_arr->offset % 8;
output->buffers[0] = SliceBuffer(output->buffers[0], input_arr->offset / 8);
} else {
output->offset = 0;
}
// slice buffers to reduce allocation when casting the offsets buffer
auto length = input_arr->length;
int64_t buffer_offset = 0;
if (output->null_count != 0 && output->buffers[0]) {
// avoiding reallocation of the validity buffer by allowing some padding bits
output->offset = input_arr->offset % 8;
buffer_offset = input_arr->offset / 8;
} else {
output->offset = 0;
buffer_offset = input_arr->offset;
}
output->buffers[0] = SliceBuffer(output->buffers[0], buffer_offset, length);
output->buffers[1] = SliceBuffer(output->buffers[1], buffer_offset, length);

felipecrv (Contributor):
Please review my suggestions and tell if something is weird. This stuff is hard.

scott-routledge2 (Contributor Author):

Thanks for the suggestions! I didn't slice the offsets buffers originally because they are reallocated by CastBinaryToBinaryOffsets with the correct offset. However, that is only true for the case where the offset type changes; in the case where the offset type stays the same, we would either have to slice here or call ZeroCopyCastExec as in the comment above?

felipecrv (Contributor):

But if you don't slice, they get reallocated with the same length, no? Nevertheless, it creates a very weird relationship between this code and the function, making it very non-obvious how it is correct.

scott-routledge2 (Contributor Author) commented on Dec 3, 2025:

Sorry for the late reply. What you are saying makes sense; however, I am still a little confused about the specifics of the slicing here. Wouldn't we want to slice the offsets buffer by a different value than the validity buffer?

For example, if we are casting a slice that has length 8 and offset 8, we would slice the validity buffer by 1 and be left with a buffer of length 1 representing the 8 null bits for the elements in our casted slice. We would also slice the offsets buffer by 1, which would leave a buffer of length 17*offset_size - 1, out of alignment.

Similarly, in the case where we have output->null_count == 0 (and buffers[0] != nullptr), we would slice the offsets buffer by 8, leaving a buffer of size 17*offset_size - 8, and we would also slice the validity buffer by 8, which would go out of bounds.

Wouldn't we want to slice the offsets buffer by input_arr->offset * sizeof(typename I::offset_type) and the validity buffer by input_arr->offset / 8?

Edit: I think the "offset" in SliceBuffer is the byte offset, as opposed to the logical offset.
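Since SliceBuffer takes byte offsets, the two buffers are indeed sliced by different byte amounts for the same logical offset. A small Python sketch of the mapping (an illustrative helper, not Arrow code; assuming a 4-byte input offset type by default):

```python
def buffer_byte_offsets(logical_offset: int, offset_type_size: int = 4):
    """Map one logical array offset to the byte offset used to slice each
    buffer: the validity bitmap packs 8 elements per byte and can only be
    sliced at byte granularity, while the offsets buffer stores one
    fixed-width integer per element."""
    validity_bytes = logical_offset // 8               # whole bitmap bytes
    offsets_bytes = logical_offset * offset_type_size  # one entry per element
    return validity_bytes, offsets_bytes

# Logical offset 8: drop 1 bitmap byte but 32 bytes of int32 offsets.
print(buffer_byte_offsets(8))  # -> (1, 32)
```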

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Nov 19, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 3, 2025
if (output->buffers[0]) {
output->buffers[0] = SliceBuffer(output->buffers[0], offset / 8);
}
output->buffers[1] = SliceBuffer(output->buffers[1], offset * input_offset_type_size);
scott-routledge2 (Contributor Author) commented on Dec 3, 2025:

I believe that for this slicing logic to be fully correct, it needs to take output->offset into account when slicing, as well as the minimum size requirements. It would look something like

int64_t offset_buffer_offset = (input_offset - output_offset) * offset_type_size;
int64_t offset_buffer_length = (length + output_offset + 1) * offset_type_size;
if (offset_buffer_length < minimum_size) {
  // update output->offset so that the buffer meets the minimum size
}

?

zanmato1984 (Contributor):
The current code seems correct to me. I'm not sure what's the concern here.

scott-routledge2 (Contributor Author):

When I was testing the buffer-slicing logic here (for example, casting a slice with length 3 and offset 1), I was seeing an error that looked like:

'datum.make_array()->ValidateFull()' failed with Invalid: Offsets buffer size (bytes): 16 isn't large enough for length: 3 and offset: 1

Technically the offsets buffer needs to be big enough to hold offset + length + 1 elements, whereas as written it will only hold length + 1 elements. However, because of the code structure here, this path isn't actually reachable, since we always reallocate the offsets buffer with the correct size in CastBinaryToBinaryOffsets anyway.
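The size requirement behind that validation error can be sketched in plain Python (a hypothetical check mirroring the rule, not Arrow's actual validator): an array reads offsets[offset] through offsets[offset + length], so the buffer must hold offset + length + 1 entries.

```python
def offsets_buffer_big_enough(buffer_bytes: int, length: int, offset: int,
                              offset_type_size: int = 4) -> bool:
    """Check whether an offsets buffer of buffer_bytes bytes can serve an
    array with the given logical offset and length. The array dereferences
    entries offset..offset+length inclusive, i.e. offset + length + 1
    fixed-width integers."""
    required = (offset + length + 1) * offset_type_size
    return buffer_bytes >= required

# The failing case from the error message: 16 bytes of int32 offsets hold
# 4 entries, but length 3 at offset 1 needs 5 entries (20 bytes).
print(offsets_buffer_big_enough(16, length=3, offset=1))  # -> False
print(offsets_buffer_big_enough(20, length=3, offset=1))  # -> True
```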

zanmato1984 (Contributor) left a comment:
This is a nice improvement. Some nits.

@scott-routledge2 scott-routledge2 marked this pull request as ready for review December 10, 2025 00:14
zanmato1984 (Contributor) left a comment:
LGTM in general.

Some nits.

zanmato1984 (Contributor) left a comment:
+1. Thanks for this nice improvement!

zanmato1984 (Contributor):
Let's wait a while longer to see if @felipecrv has other comments; then I'll merge. Thanks @scott-routledge2!

zanmato1984 (Contributor):
@github-actions crossbow submit -g cpp -g python

github-actions (bot):
Revision: 4929843

Submitted crossbow builds: ursacomputing/crossbow @ actions-6816eda394

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.14 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-cuda-cpp-ubuntu-24.04-cuda-13.0.2 GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-cuda-python-ubuntu-24.04-cuda-13.0.2 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

zanmato1984 (Contributor):
The failures seem unrelated. I'm merging now.

conbench-apache-arrow (bot):
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 86166d5.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 8 possible false positives for unstable benchmarks that are known to sometimes produce them.
