GH-47184: [Parquet][C++] Avoid multiplication overflow in FixedSizeBinaryBuilder::Reserve (#47185)
pitrou merged 12 commits into apache:main
Conversation
It may be better if we also show the requested capacity, like:
arrow/cpp/src/arrow/array/builder_base.h
Lines 286 to 294 in 25f8f00
We have Status::CapacityError for such errors.
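To illustrate the suggestion, here is a minimal, Arrow-free sketch of a `CapacityError`-style guard. The `Status` struct below is a hypothetical stand-in for `arrow::Status` (the real class lives in `arrow/status.h`), and `__builtin_mul_overflow` is the GCC/Clang builtin that Arrow's `MultiplyWithOverflow` wraps; the function name `CheckFixedSizeReserve` is illustrative, not Arrow's API.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Minimal stand-in for arrow::Status, for illustration only.
struct Status {
  bool ok_flag;
  std::string msg;
  static Status OK() { return {true, {}}; }
  static Status CapacityError(std::string m) { return {false, std::move(m)}; }
  bool ok() const { return ok_flag; }
};

// Shape of the guard discussed here: reject a Reserve request whose byte
// size (capacity * byte_width) does not fit in int64_t.
Status CheckFixedSizeReserve(int64_t capacity, int32_t byte_width) {
  int64_t capacity_bytes;
  if (__builtin_mul_overflow(capacity, byte_width, &capacity_bytes)) {
    return Status::CapacityError("Reserve: capacity * byte_width overflows int64_t");
  }
  return Status::OK();
}
```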
I don't know why I can't reproduce the test failure on macOS M1... 🤔
cpp/src/parquet/decoder.cc
Outdated
I don't know whether we can decrease the overhead of this.
Did the inequality happen in the fuzz test?
See https://github.com/apache/arrow/actions/runs/17131673477/job/48597348959
The fuzzer generates random underlying data from seeds, and this caused a heap buffer overflow when the type mismatches. Maybe a boundary check on the buffer is more lightweight, since it only prevents the heap buffer overflow.
A size check would enforce that the underlying data is all the same size, but it introduces extra size checking.
Can you run the Parquet benchmarks to see if there are any regressions?
I've changed the check from the Builder (which is commonly used) to only DeltaByteArrayDecoder (which is the only place that can cause FLBA to have this problem).
Did you forget to push your changes?
I don't know; I'll remove this and give it a try.
I think we need this change. I added a cout, ran the fuzzer, and found there are so many error messages now:
Status FixedSizeBinaryBuilder::Resize(int64_t capacity) {
  RETURN_NOT_OK(CheckCapacity(capacity));
  int64_t dest_capacity_bytes;
  if (ARROW_PREDICT_FALSE(
          MultiplyWithOverflow(capacity, byte_width_, &dest_capacity_bytes))) {
    std::cout << "Resize: capacity overflows (requested: " << capacity
              << ", byte_width: " << byte_width_ << ")" << std::endl;
    return Status::CapacityError("Resize: capacity overflows (requested: ", capacity,
                                 ", byte_width: ", byte_width_, ")");
  }
  RETURN_NOT_OK(byte_builder_.Resize(dest_capacity_bytes));
  return ArrayBuilder::Resize(capacity);
}
Perhaps the problem here is:
Status ReadColumn(int i, const std::vector<int>& row_groups, ColumnReader* reader,
                  std::shared_ptr<ChunkedArray>* out) {
  BEGIN_PARQUET_CATCH_EXCEPTIONS
  // TODO(wesm): This calculation doesn't make much sense when we have repeated
  // schema nodes
  int64_t records_to_read = 0;
  for (auto row_group : row_groups) {
    // Can throw exception
    std::cout << "Numvalues:"
              << reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values()
              << '\n';
    records_to_read +=
        reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values();
  }
- num_values is int64_t::max()
- Generally, the builder would throw an exception. For fixed-size types, the value exceeds the bounds.
Perhaps for the builder, the FLBA column is just so huge that it causes the issue. Integer and float types might also have this problem.
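To make the failure mode above concrete, here is a hypothetical sketch (names are illustrative, not Arrow's): each row group's ColumnChunk metadata reports an attacker-controlled `num_values`, and the plain sum in `ReadColumn` is fed straight into `Reserve()`. A checked addition shows where the count can blow up to `int64_t`'s maximum.

```cpp
#include <cstdint>
#include <vector>

// Sum the per-row-group num_values, saturating at INT64_MAX instead of
// overflowing. The real code just adds; the saturation here only marks
// the point where a hostile file can inflate the count.
int64_t SumRecordsToRead(const std::vector<int64_t>& num_values_per_row_group) {
  int64_t records_to_read = 0;
  for (int64_t n : num_values_per_row_group) {
    if (__builtin_add_overflow(records_to_read, n, &records_to_read)) {
      return INT64_MAX;  // saturate on overflow
    }
  }
  return records_to_read;
}
```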
Oh, it's because ReserveValues is called above this:
void ReserveValues(int64_t extra_values) override {
  ARROW_DCHECK(!uses_values_);
  TypedRecordReader::ReserveValues(extra_values);
  PARQUET_THROW_NOT_OK(array_builder_.Reserve(extra_values));
}
This override (with !uses_values_) is only used by ByteArray and FLBA, so the other types don't hit this case... And ByteArray doesn't have a multiplier, which makes it less likely to hit this.
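The multiplier point can be sketched like this (illustrative code, not Arrow's actual implementation): an FLBA builder must reserve `extra_values * byte_width` bytes, so a huge `extra_values` can overflow `int64_t`, while ByteArray reserves per-value offsets without such a multiplier.

```cpp
#include <cstdint>

// Returns true if reserving extra_values fixed-size values of byte_width
// bytes each would overflow the int64_t byte count.
bool FlbaReserveWouldOverflow(int64_t extra_values, int32_t byte_width) {
  int64_t reserve_bytes;
  return __builtin_mul_overflow(extra_values, byte_width, &reserve_bytes);
}
```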
@pitrou gentle ping for ideas here: should I prevent the overflow in ReserveValues?
Yes, I think this change is ok. Sorry for the delay.
cpp/src/parquet/decoder.cc
Outdated
Should we do this in GetInternal instead? It will then benefit the Decode method as well.
I'll try to put it at the end of GetInternal.
gentle ping @pitrou
… using FixedSizedBuilder
This reverts commit 11c7ad250a610fdcd054341efb7662467f3558a1.
@github-actions crossbow submit -g cpp
@github-actions crossbow submit -g cpp
Revision: 08b020b
Submitted crossbow builds: ursacomputing/crossbow @ actions-f7e0ba7b53
Thanks again for fixing this @mapleFU!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 37dcfcc. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details.
…SizeBinaryBuilder::Reserve (apache#47185)

### Rationale for this change
Fix issue found by OSS-Fuzz: https://oss-fuzz.com/testcase?key=6634425377161216

### What changes are included in this PR?
This patch adds more checks directly in the builder.

### Are these changes tested?
By apache/arrow-testing#110

### Are there any user-facing changes?
No.

* GitHub Issue: apache#47184

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>