
GH-48105: [C++][Parquet][IPC] Cap allocated memory when fuzzing #48108

Merged
pitrou merged 3 commits into apache:main from pitrou:gh48105-fuzzing-cap-memory
Nov 15, 2025


pitrou (Member) commented Nov 12, 2025

Rationale for this change

OSS-Fuzz will trigger an out-of-memory crash if the allocated memory goes beyond a predefined limit (usually 2560 MB, though that can be configured). For Parquet and IPC, however, it is legitimate to allocate a lot of memory when decompressing data, so this can happen on both valid and invalid input files.

Unfortunately, OSS-Fuzz enforces this memory limit not by instrumenting malloc and having it return NULL when the limit is reached, but by periodically checking allocated memory from a separate thread. This can be solved by implementing our own allocator with an upper limit, as the mupdf project did in google/oss-fuzz#1830.

What changes are included in this PR?

  1. Implement a CappedMemoryPool
  2. Use the CappedMemoryPool with a hardcoded limit in the Parquet and IPC fuzz targets
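
To illustrate the idea, here is a minimal, self-contained sketch of a capping wrapper around an allocator. It does not use Arrow: the `Pool`, `MallocPool` and `CappedPool` names and the boolean return convention are assumptions made for this example, not the API added by this PR. The sketch is also single-threaded for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for a memory pool interface (not Arrow's MemoryPool).
class Pool {
 public:
  virtual ~Pool() = default;
  // Returns true and sets *out on success, false if the allocation is refused.
  virtual bool Allocate(int64_t size, uint8_t** out) = 0;
  virtual void Free(uint8_t* buffer, int64_t size) = 0;
  virtual int64_t bytes_allocated() const = 0;
};

// malloc-backed pool that tracks the number of outstanding bytes.
class MallocPool : public Pool {
 public:
  bool Allocate(int64_t size, uint8_t** out) override {
    if (size < 0) return false;
    // Request at least one byte so that a zero-sized allocation is not NULL.
    const std::size_t n = size > 0 ? static_cast<std::size_t>(size) : 1;
    auto* ptr = static_cast<uint8_t*>(std::malloc(n));
    if (ptr == nullptr) return false;
    allocated_ += size;
    *out = ptr;
    return true;
  }

  void Free(uint8_t* buffer, int64_t size) override {
    assert(size <= allocated_);
    std::free(buffer);
    allocated_ -= size;
  }

  int64_t bytes_allocated() const override { return allocated_; }

 private:
  int64_t allocated_ = 0;
};

// Wrapper that refuses any allocation that would push the wrapped pool
// beyond `limit` bytes, instead of letting the process grow unbounded.
class CappedPool : public Pool {
 public:
  CappedPool(Pool* wrapped, int64_t limit) : wrapped_(wrapped), limit_(limit) {}

  bool Allocate(int64_t size, uint8_t** out) override {
    if (size < 0) return false;
    // Compare against the remaining budget to avoid overflowing size + allocated.
    if (size > limit_ - wrapped_->bytes_allocated()) return false;
    return wrapped_->Allocate(size, out);
  }

  void Free(uint8_t* buffer, int64_t size) override { wrapped_->Free(buffer, size); }

  int64_t bytes_allocated() const override { return wrapped_->bytes_allocated(); }

 private:
  Pool* wrapped_;
  int64_t limit_;
};
```

With a 1024-byte limit, for instance, a 1000-byte allocation succeeds and a following 100-byte allocation is refused until the first buffer is freed. The caller sees an ordinary allocation failure that it can report as an error, rather than being killed by an external watchdog.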

Are these changes tested?

Yes, by additional unit tests.

Are there any user-facing changes?

No.

pitrou (Member Author) commented Nov 12, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: c63e380

Submitted crossbow builds: ursacomputing/crossbow @ actions-a5b0deeb96

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

pitrou marked this pull request as ready for review November 12, 2025 13:30
pitrou requested a review from wgtmac as a code owner November 12, 2025 13:30
pitrou requested a review from zanmato1984 November 12, 2025 13:30
pitrou force-pushed the gh48105-fuzzing-cap-memory branch from c63e380 to ff94690 November 12, 2025 16:57

pitrou (Member Author) commented Nov 12, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: ff94690

Submitted crossbow builds: ursacomputing/crossbow @ actions-8fa1900da3

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

zanmato1984 (Contributor) left a comment

Thanks for working on this. I like the idea of having a concrete memory pool implementation to limit the total allocated memory, which shows the power of the memory pool abstraction. One concern, though.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Nov 12, 2025

  Status Allocate(int64_t size, int64_t alignment, uint8_t** out) override {
    const auto attempted = size + wrapped_->bytes_allocated();
    if (ARROW_PREDICT_FALSE(attempted > bytes_allocated_limit_)) {
      return OutOfMemory(attempted);

Member

I like the idea to cap the memory allocation. We have seen memory issues while reading large Parquet files in a multi-tenant environment. At least it can help protect the stability.

pitrou (Member Author)

With the limitation that it probably won't account for all memory allocations (for example STL structures used to represent metadata).
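
As a small illustration of that limitation, the sketch below uses a hypothetical `CountingPool` and `FileMetadata` (neither is Arrow code) to show that memory held by standard containers goes through the default allocator and never reaches the pool's counter.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical pool that only knows about allocations made through it.
struct CountingPool {
  int64_t bytes_allocated = 0;

  uint8_t* Allocate(int64_t size) {
    auto* ptr = static_cast<uint8_t*>(std::malloc(static_cast<std::size_t>(size)));
    if (ptr != nullptr) bytes_allocated += size;
    return ptr;
  }

  void Free(uint8_t* ptr, int64_t size) {
    std::free(ptr);
    bytes_allocated -= size;
  }
};

// Hypothetical metadata holder: the vector and its strings use the
// default std::allocator, so the pool above never sees their memory.
struct FileMetadata {
  std::vector<std::string> column_names;
};

FileMetadata MakeMetadata(int num_columns) {
  FileMetadata metadata;
  for (int i = 0; i < num_columns; ++i) {
    metadata.column_names.push_back("column_" + std::to_string(i));
  }
  return metadata;
}
```

Building a `FileMetadata` with a thousand column names leaves `bytes_allocated` at zero, so a cap enforced on the pool alone would not notice that memory.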

pitrou force-pushed the gh48105-fuzzing-cap-memory branch from ff94690 to 5e279d4 November 13, 2025 08:59

zanmato1984 (Contributor) left a comment

I see outstanding fuzz targets, namely arrow/csv/fuzz.cc, arrow/ipc/tensor_stream_fuzz.cc, and parquet/arrow/fuzz.cc. Do we want to instrument those as well?

ARROW_EXPORT MemoryPool* fuzzing_memory_pool();

// Optionally log the outcome of fuzzing an input
ARROW_EXPORT void NoteFuzzStatus(const Status&, const uint8_t* data, int64_t size);

Contributor

Do you think a name like LogFuzzStatus would be more straightforward?

pitrou (Member Author)

Probably, yes :-)

pitrou (Member Author) commented Nov 14, 2025

> I see outstanding fuzz targets, namely arrow/csv/fuzz.cc, arrow/ipc/tensor_stream_fuzz.cc, and parquet/arrow/fuzz.cc. Do we want to instrument those as well?

Those are just the top-level "main" files for fuzzing, but they call the functions that are modified in this PR.

pitrou force-pushed the gh48105-fuzzing-cap-memory branch from 5e279d4 to 0b0f8f9 November 15, 2025 09:07

pitrou (Member Author) commented Nov 15, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: 0b0f8f9

Submitted crossbow builds: ursacomputing/crossbow @ actions-562ff7b1ac

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

zanmato1984 (Contributor) left a comment

+1. Thanks for working on this.

pitrou merged commit 7c3d486 into apache:main Nov 15, 2025
45 of 48 checks passed
pitrou removed the "awaiting committer review" label Nov 15, 2025
pitrou deleted the gh48105-fuzzing-cap-memory branch November 15, 2025 10:41

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7c3d486.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
