GH-48105: [C++][Parquet][IPC] Cap allocated memory when fuzzing #48108
pitrou merged 3 commits into apache:main
Conversation
@github-actions crossbow submit -g cpp
Revision: c63e380
Submitted crossbow builds: ursacomputing/crossbow @ actions-a5b0deeb96
Force-pushed from c63e380 to ff94690
@github-actions crossbow submit -g cpp
Revision: ff94690
Submitted crossbow builds: ursacomputing/crossbow @ actions-8fa1900da3
zanmato1984 left a comment
Thanks for working on this. I like the idea of having a concrete memory pool implementation to limit the total allocated memory - a nice way of showing the power of the memory pool abstraction. One concern, though.
Status Allocate(int64_t size, int64_t alignment, uint8_t** out) override {
  const auto attempted = size + wrapped_->bytes_allocated();
  if (ARROW_PREDICT_FALSE(attempted > bytes_allocated_limit_)) {
    return OutOfMemory(attempted);
I like the idea of capping memory allocation. We have seen memory issues while reading large Parquet files in a multi-tenant environment. At the very least it can help protect stability.
With the limitation that it probably won't account for all memory allocations (for example STL structures used to represent metadata).
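For context, here is a minimal, Arrow-independent sketch of the capping technique that the Allocate() override quoted above implements. CappedAllocator and all of its members are illustrative names only; the PR's actual CappedMemoryPool wraps an arrow::MemoryPool (the wrapped_ member in the diff) and returns a Status rather than a null pointer.

#include <atomic>
#include <cstdint>
#include <cstdlib>

class CappedAllocator {
 public:
  explicit CappedAllocator(int64_t limit_bytes) : limit_bytes_(limit_bytes) {}

  // Refuse any allocation that would push the running total past the limit,
  // mirroring the check in the Allocate() override quoted above.
  void* Allocate(int64_t size) {
    const int64_t attempted = bytes_allocated_.load() + size;
    if (size < 0 || attempted > limit_bytes_) {
      return nullptr;  // the caller turns this into an OutOfMemory-style error
    }
    void* ptr = std::malloc(static_cast<size_t>(size));
    if (ptr != nullptr) {
      bytes_allocated_ += size;
    }
    return ptr;
  }

  void Free(void* ptr, int64_t size) {
    if (ptr != nullptr) {
      std::free(ptr);
      bytes_allocated_ -= size;
    }
  }

  int64_t bytes_allocated() const { return bytes_allocated_.load(); }

 private:
  const int64_t limit_bytes_;
  std::atomic<int64_t> bytes_allocated_{0};
};

Because the accounting only covers allocations that go through this path, it shares the limitation mentioned above: memory allocated elsewhere (such as STL containers holding metadata) is not counted.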
Force-pushed from ff94690 to 5e279d4
zanmato1984 left a comment
I see outstanding fuzz targets, namely arrow/csv/fuzz.cc, arrow/ipc/tensor_stream_fuzz.cc, and parquet/arrow/fuzz.cc. Do we want to instrument those as well?
cpp/src/arrow/util/fuzz_internal.h (Outdated)

ARROW_EXPORT MemoryPool* fuzzing_memory_pool();

// Optionally log the outcome of fuzzing an input
ARROW_EXPORT void NoteFuzzStatus(const Status&, const uint8_t* data, int64_t size);
Do you think a name like LogFuzzStatus would be more straightforward?
Those are just the top-level "main" files for fuzzing, but they call the functions that are modified in this PR.
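The header declaration above only promises to "optionally log the outcome of fuzzing an input". As a purely illustrative sketch (not the implementation in this PR), such a helper could look like the following; the fuzz_example namespace and the ARROW_FUZZ_VERBOSE environment variable are made up for this example, and only arrow::Status's ok() and ToString() are assumed from the real library.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include "arrow/status.h"

namespace fuzz_example {

// Hypothetical logging helper with the same signature as the declaration above.
void NoteFuzzStatus(const arrow::Status& status, const uint8_t* data, int64_t size) {
  // Stay quiet during normal fuzzing; only log when explicitly requested
  // (ARROW_FUZZ_VERBOSE is a made-up variable name for this sketch).
  if (std::getenv("ARROW_FUZZ_VERBOSE") == nullptr) return;
  std::fprintf(stderr, "fuzzed %lld input bytes -> %s\n",
               static_cast<long long>(size),
               status.ok() ? "OK" : status.ToString().c_str());
  (void)data;  // the data pointer could also be used to dump the offending input
}

}  // namespace fuzz_example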
Force-pushed from 5e279d4 to 0b0f8f9
@github-actions crossbow submit -g cpp
Revision: 0b0f8f9
Submitted crossbow builds: ursacomputing/crossbow @ actions-562ff7b1ac
zanmato1984 left a comment
+1. Thanks for working on this.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7c3d486. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
OSS-Fuzz reports an out-of-memory crash if allocated memory goes beyond a predefined limit (usually 2560 MB, though that can be configured). For Parquet and IPC, however, it is legitimate to allocate a lot of memory when decompressing data, so the limit can be hit on both valid and invalid input files.
Unfortunately, OSS-Fuzz checks this memory limit not by instrumenting malloc and having it return NULL when the limit is reached, but by checking allocated memory periodically from a separate thread. This can be worked around by implementing our own custom allocator with an upper limit, exactly as the mupdf project did in google/oss-fuzz#1830.
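To make the mechanism concrete, here is a self-contained, hedged sketch of the general mupdf-style approach, not the actual code in this PR: the fuzz entry point routes large allocations through a capped helper, so an oversized buffer request becomes an ordinary rejected input instead of an OSS-Fuzz out-of-memory report. kMemoryLimit and TryAllocate are made-up names; only the standard libFuzzer entry point signature is assumed.

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

namespace {
// Keep the self-imposed cap below the OSS-Fuzz default of 2560 MB.
constexpr int64_t kMemoryLimit = int64_t{2048} * 1024 * 1024;
int64_t g_bytes_allocated = 0;

uint8_t* TryAllocate(int64_t size) {
  if (size < 0 || g_bytes_allocated + size > kMemoryLimit) return nullptr;
  auto* ptr = static_cast<uint8_t*>(std::malloc(static_cast<size_t>(size)));
  if (ptr != nullptr) g_bytes_allocated += size;
  return ptr;
}
}  // namespace

extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
  if (size < sizeof(int64_t)) return 0;
  // Pretend the input declares a decompressed size, as a malicious Parquet/IPC file might.
  int64_t declared_size;
  std::memcpy(&declared_size, data, sizeof(declared_size));
  uint8_t* buffer = TryAllocate(declared_size);
  if (buffer == nullptr) {
    // Over the cap (or invalid): reject the input gracefully instead of letting
    // OSS-Fuzz's external memory check flag the process as out of memory.
    return 0;
  }
  // ... a real target would decompress/parse into `buffer` here ...
  std::free(buffer);
  g_bytes_allocated -= declared_size;
  return 0;  // libFuzzer convention: non-crashing inputs always return 0
}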
What changes are included in this PR?
- A new CappedMemoryPool memory pool implementation
- Use of CappedMemoryPool with a hardcoded limit in the Parquet and IPC fuzz targets

Are these changes tested?
Yes, by additional unit tests.
Are there any user-facing changes?
No.