GH-49288: [C++][ORC] Add OrcFileFragment with stripe-level subsetting #49289

Open
ShreyeshArangath wants to merge 4 commits into apache:main from ShreyeshArangath:shreyesh/pycapsule-protocol-impl

@ShreyeshArangath ShreyeshArangath commented Feb 15, 2026

Rationale for this change

The ORC dataset integration currently lacks stripe-level subsetting support. When scanning ORC files through the Dataset API, there is no way to select specific stripes; the entire file is always read. This is a gap compared to ParquetFileFragment, which provides row-group-level subsetting via Subset(), row_groups(), and MakeFragment(..., row_groups).

What changes are included in this PR?

Modeled after the ParquetFileFragment design, this PR introduces stripe-aware ORC fragments so that callers can target specific stripes during planning and scanning instead of always reading the full file. It adds a small, consistent API surface in C++ (Python is covered by a separate issue):

  • An ORC-specific fragment type that represents either the full file or a subset of the file defined by stripe IDs.
  • Fragment subsetting via a subset(...)/Subset(...) API, analogous to Parquet row-group subsetting.
  • Scan behavior that honors the stripe selection, so execution reads only the requested stripes.
  • Correct row counting for subset fragments: row counts reflect only the selected stripes.
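
The behavior listed above can be illustrated with a minimal, dependency-free sketch. Everything here is hypothetical: `ToyOrcFile` and `ToyOrcFragment` are stand-ins invented for illustration and are not the Arrow classes added by this PR. The sketch only assumes that a fragment holds an optional list of stripe IDs, that an absent list means "the whole file", and that row counts are summed over the selected stripes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for ORC footer metadata: number of rows per stripe.
struct ToyOrcFile {
  std::vector<int64_t> stripe_rows;
};

// Hypothetical fragment: either the whole file (no stripe IDs) or a subset.
class ToyOrcFragment {
 public:
  explicit ToyOrcFragment(ToyOrcFile file,
                          std::optional<std::vector<int>> stripe_ids = std::nullopt)
      : file_(std::move(file)), stripe_ids_(std::move(stripe_ids)) {
    if (stripe_ids_) Validate(*stripe_ids_);
  }

  // Stripes this fragment would read: the selection, or every stripe.
  std::vector<int> stripe_ids() const {
    if (stripe_ids_) return *stripe_ids_;
    std::vector<int> all(file_.stripe_rows.size());
    for (std::size_t i = 0; i < all.size(); ++i) all[i] = static_cast<int>(i);
    return all;
  }

  // Returns a new fragment over the same file, restricted to `ids`.
  ToyOrcFragment Subset(std::vector<int> ids) const {
    return ToyOrcFragment(file_, std::move(ids));
  }

  // Row count reflects only the selected stripes.
  int64_t CountRows() const {
    int64_t total = 0;
    for (int id : stripe_ids()) {
      // Guaranteed by Validate() at construction time.
      assert(id >= 0 && static_cast<std::size_t>(id) < file_.stripe_rows.size());
      total += file_.stripe_rows[static_cast<std::size_t>(id)];
    }
    return total;
  }

 private:
  // Rejects stripe IDs that do not exist in the file.
  void Validate(const std::vector<int>& ids) const {
    for (int id : ids) {
      if (id < 0 || static_cast<std::size_t>(id) >= file_.stripe_rows.size()) {
        throw std::out_of_range("stripe id out of range");
      }
    }
  }

  ToyOrcFile file_;
  std::optional<std::vector<int>> stripe_ids_;
};
```

With a toy file whose stripes hold 100, 250, and 50 rows, the full fragment counts 400 rows, while `Subset({0, 2})` counts 150. How the real implementation validates IDs or reports errors is not stated in this description; throwing `std::out_of_range` is an assumption of the sketch.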

Are these changes tested?

  • Unit tested

Are there any user-facing changes?

The C++ API has the following changes:

  • OrcFileFragment class with stripe_ids() and Subset() methods
  • OrcFileFormat::MakeFragment(source, partition_expression, physical_schema, stripe_ids) overload

There are no breaking changes, though: existing ORC scanning behavior is unchanged when no stripe IDs are specified.
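
One way to see why an added overload leaves existing callers untouched is the following sketch. The names `ToyFragment` and `MakeToyFragment` are hypothetical and simplified (the real `MakeFragment` also takes a partition expression and physical schema, omitted here); modeling "no stripe IDs specified" as an empty `std::optional` is an assumption, not something this description confirms.

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fragment record; nullopt means "read the whole file".
struct ToyFragment {
  std::string path;
  std::optional<std::vector<int>> stripe_ids;

  bool reads_whole_file() const { return !stripe_ids.has_value(); }
};

// Existing-style factory: callers that never mention stripes keep
// compiling and keep getting a whole-file fragment.
ToyFragment MakeToyFragment(std::string path) {
  return {std::move(path), std::nullopt};
}

// Added overload: same factory name, plus an explicit stripe selection.
ToyFragment MakeToyFragment(std::string path, std::vector<int> stripe_ids) {
  return {std::move(path), std::move(stripe_ids)};
}
```

Calling `MakeToyFragment("data.orc")` resolves to the first overload exactly as before, while `MakeToyFragment("data.orc", {1, 3})` opts in to stripe selection.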

  • AI Usage disclosure: I did use an AI IDE to make this PR but I reviewed the code and stand behind its changes and tests.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@ShreyeshArangath ShreyeshArangath changed the title [C++][ORC] Add OrcFileFragment with stripe-level subsetting GH-49288: [C++][ORC] Add OrcFileFragment with stripe-level subsetting Feb 15, 2026
@github-actions

⚠️ GitHub issue #49288 has been automatically assigned in GitHub to PR creator.

@ShreyeshArangath ShreyeshArangath marked this pull request as ready for review February 15, 2026 05:18