
GH-47184: [Parquet][C++] Avoid multiplication overflow in FixedSizeBinaryBuilder::Reserve #47185

Merged
pitrou merged 12 commits into apache:main from mapleFU:GH-47184 on Sep 16, 2025

Member

@mapleFU mapleFU commented Jul 24, 2025

Rationale for this change

Fix issue found by OSS-Fuzz: https://oss-fuzz.com/testcase?key=6634425377161216

What changes are included in this PR?

This patch adds more checks directly in the builder.

Are these changes tested?

By apache/arrow-testing#110

Are there any user-facing changes?

no

@github-actions

⚠️ GitHub issue #47184 has been automatically assigned in GitHub to PR creator.

Member

@kou kou left a comment

+1

Member

It may be better to also show the requested capacity, like the following existing check does:

if (ARROW_PREDICT_FALSE(new_capacity < 0)) {
  return Status::Invalid(
      "Resize capacity must be positive (requested: ", new_capacity, ")");
}
if (ARROW_PREDICT_FALSE(new_capacity < length_)) {
  return Status::Invalid("Resize cannot downsize (requested: ", new_capacity,
                         ", current length: ", length_, ")");
}

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Jul 25, 2025
@mapleFU mapleFU requested a review from pitrou August 19, 2025 15:24
Member

@pitrou pitrou left a comment

@mapleFU You updated the parquet-testing submodule, but you should update arrow-testing instead, no?

Member

We have Status::CapacityError for such errors.

@pitrou pitrou changed the title GH-47184: [Parquet][C++] Try prevent from too large buffer when using FixedSizeBuilder GH-47184: [Parquet][C++] Avoid multiplication overflow in FixedSizeBuilder::Reserve Aug 19, 2025
@pitrou pitrou changed the title GH-47184: [Parquet][C++] Avoid multiplication overflow in FixedSizeBuilder::Reserve GH-47184: [Parquet][C++] Avoid multiplication overflow in FixedSizeBinaryBuilder::Reserve Aug 19, 2025
Member Author

mapleFU commented Aug 22, 2025

I don't know why I cannot reproduce the failed test on macOS M1... 🤔

Member Author

I don't know whether we can decrease the overhead of this.

Member

Did the inequality happen in the fuzz test?

Member Author

See https://github.com/apache/arrow/actions/runs/17131673477/job/48597348959

The fuzzer generates random underlying data from seeds, and this caused a heap buffer overflow when the types mismatch. Maybe a bounds check on the buffer is more lightweight, since it only prevents the heap buffer overflow.

A size check would enforce that the underlying data all have the same size. However, it introduces an additional size check.
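
The two options weighed above can be contrasted in a small sketch. This is only an illustration under assumptions: `ByteSlice`, `FitsWidth`, and `HasExactWidth` are hypothetical names, not Arrow or Parquet APIs, and the thread does not show how the actual check is written.

```cpp
#include <cstdint>

// Hypothetical view of one decoded value (not an Arrow/Parquet type).
struct ByteSlice {
  const uint8_t* ptr;
  uint32_t len;
};

// Bounds check: only guarantees that reading byte_width bytes stays inside
// the slice, which is enough to rule out a heap buffer overflow.
bool FitsWidth(const ByteSlice& value, uint32_t byte_width) {
  return value.len >= byte_width;
}

// Size check: stricter, requires the value to be exactly byte_width bytes,
// as a fixed-length column expects.
bool HasExactWidth(const ByteSlice& value, uint32_t byte_width) {
  return value.len == byte_width;
}
```

A value longer than the declared width passes the bounds check but fails the size check; a shorter value fails both.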

Member

pitrou commented Aug 25, 2025

Can you run the Parquet benchmarks to see if there are any regressions?

Member Author

mapleFU commented Aug 25, 2025

I've changed the check from the Builder (which is commonly used) to only DeltaByteArrayDecoder (which is the only place that can cause FLBA to have this problem).

Member

pitrou commented Aug 25, 2025

> I've changed the check from the Builder (which is commonly used) to only DeltaByteArrayDecoder (which is the only place that can cause FLBA to have this problem).

Did you forget to push your changes?
(also, please rebase :-))

Member

Do we still need this change?

Member Author

I don't know. I'll remove this and give it a try.

Member Author

I think we need this change. I added a std::cout, ran the fuzzing, and found that there are now many error messages:

Status FixedSizeBinaryBuilder::Resize(int64_t capacity) {
  RETURN_NOT_OK(CheckCapacity(capacity));
  int64_t dest_capacity_bytes;
  if (ARROW_PREDICT_FALSE(
          MultiplyWithOverflow(capacity, byte_width_, &dest_capacity_bytes))) {
    std::cout << "Resize: capacity overflows (requested: " << capacity <<
                                 ", byte_width: " << byte_width_ << std::endl;
    return Status::CapacityError("Resize: capacity overflows (requested: ", capacity,
                                 ", byte_width: ", byte_width_, ")");
  }
  RETURN_NOT_OK(byte_builder_.Resize(dest_capacity_bytes));
  return ArrayBuilder::Resize(capacity);
}
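
For readers unfamiliar with the pattern, here is a minimal self-contained sketch of an overflow-checked multiplication like the one the snippet above relies on. `MultiplyOverflows` is a hypothetical stand-in written for this example; it is not the implementation of Arrow's `MultiplyWithOverflow`, and it assumes non-negative operands (as would hold after a capacity check).

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: stores a * b in *out and returns false on success.
// Returns true (leaving *out untouched) if the product would not fit in
// int64_t. Assumes a >= 0 and b >= 0.
bool MultiplyOverflows(int64_t a, int64_t b, int64_t* out) {
  if (b != 0 && a > std::numeric_limits<int64_t>::max() / b) {
    return true;  // a * b would exceed INT64_MAX
  }
  *out = a * b;
  return false;
}
```

With a byte width of 16, a requested capacity of `INT64_MAX` is reported as an overflow, while a capacity of 10 yields 160 bytes.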

Member Author

Perhaps the problem here is:

  Status ReadColumn(int i, const std::vector<int>& row_groups, ColumnReader* reader,
                    std::shared_ptr<ChunkedArray>* out) {
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    // TODO(wesm): This calculation doesn't make much sense when we have repeated
    // schema nodes
    int64_t records_to_read = 0;
    for (auto row_group : row_groups) {
      // Can throw exception
      std::cout << "Numvalues:" << reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values() << '\n';
      records_to_read +=
          reader_->metadata()->RowGroup(row_group)->ColumnChunk(i)->num_values();
    }
1. num_values is the maximum int64_t value.
2. Generally, the builder will throw an exception. For a fixed-size type, the value would exceed the bounds.

Perhaps, for the builder, the FLBA column is just so huge that it causes the issue. The integer and float types might also have this problem.
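
If the per-chunk counts come from untrusted metadata, the running total itself can overflow. A minimal sketch of a checked accumulation follows; `SumNumValues` is a hypothetical helper for illustration only, and the thread does not say whether the reader performs such a check.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: sums per-row-group value counts into *total.
// Returns false (leaving *total untouched) if any count is negative or
// the running sum would exceed INT64_MAX.
bool SumNumValues(const std::vector<int64_t>& counts, int64_t* total) {
  int64_t sum = 0;
  for (int64_t n : counts) {
    if (n < 0 || sum > std::numeric_limits<int64_t>::max() - n) {
      return false;
    }
    sum += n;
  }
  *total = sum;
  return true;
}
```

A single chunk reporting `INT64_MAX` values is still accepted by this sum, so a later multiplication by the byte width needs its own check.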

Member Author

@mapleFU mapleFU Aug 26, 2025

Oh, it's because ReserveValues is called above this:

  void ReserveValues(int64_t extra_values) override {
    ARROW_DCHECK(!uses_values_);
    TypedRecordReader::ReserveValues(extra_values);
    PARQUET_THROW_NOT_OK(array_builder_.Reserve(extra_values));
  }

The !uses_values_ path is only taken by ByteArray and FLBA, so other types don't hit this case... And ByteArray does not have a multiplier, so it is less likely to hit this.

Member Author

@pitrou gentle ping for ideas here: should I prevent the overflow in ReserveValues?

Member

Yes, I think this change is ok. Sorry for the delay.

Member

Should we do this in GetInternal instead? It will then benefit the Decode method as well.

Member Author

I will try to put it at the end of GetInternal.

Member Author

mapleFU commented Sep 9, 2025

gentle ping @pitrou

Member

@pitrou pitrou left a comment

Thanks a lot @mapleFU . Can you just rebase so that we get updated CI results?

Member

pitrou commented Sep 16, 2025

@github-actions crossbow submit -g cpp

Member

pitrou commented Sep 16, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: 08b020b

Submitted crossbow builds: ursacomputing/crossbow @ actions-f7e0ba7b53

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou merged commit 37dcfcc into apache:main Sep 16, 2025
37 of 39 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Sep 16, 2025
Member

pitrou commented Sep 16, 2025

Thanks again for fixing this @mapleFU !

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 37dcfcc.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Oct 15, 2025
…SizeBinaryBuilder::Reserve (apache#47185)

* GitHub Issue: apache#47184

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>