@devanshu0987
Contributor

This adds a preimage implementation for the floor() function that transforms floor(x) = N into x >= N AND x < N+1. This enables statistics-based predicate pushdown for queries using floor().

For example, a query like:
SELECT * FROM t WHERE floor(price) = 100

is rewritten to:
SELECT * FROM t WHERE price >= 100 AND price < 101

This allows the query engine to leverage min/max statistics from Parquet row groups, significantly reducing the amount of data scanned.
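
Conceptually, the rewrite replaces the equality on floor with a half-open range on its argument. Below is a minimal standalone sketch using datafusion_expr's expression builders; `rewrite_floor_eq` is a hypothetical name, and the PR implements this through the preimage API from #19946 rather than a free function:

```rust
use datafusion_expr::{col, lit, Expr};

/// Hypothetical sketch: floor(x) = n  ==>  x >= n AND x < n + 1.
/// The half-open range is what lets min/max statistics prune files
/// and row groups.
fn rewrite_floor_eq(x: Expr, n: f64) -> Expr {
    x.clone().gt_eq(lit(n)).and(x.lt(lit(n + 1.0)))
}

// Example: rewrite_floor_eq(col("price"), 100.0) builds
// `price >= 100 AND price < 101`.
```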

Benchmarks on the ClickBench hits dataset show:

  • 80% file pruning (89 out of 111 files skipped)
  • 70x fewer rows scanned (1.4M vs 100M)
CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'benchmarks/data/hits_partitioned/';

-- Test the floor preimage optimization
EXPLAIN ANALYZE SELECT COUNT(*) FROM hits WHERE floor(CAST("CounterID" AS DOUBLE)) = 62;

| Metric | Before (no preimage) | After (with preimage) |
|---|---|---|
| Files pruned | 111 → 111 (0 pruned) | 111 → 22 (89 pruned) |
| Row groups pruned | 325 → 325 (0 pruned) | 51 → 4 (47 pruned) |
| Rows scanned | 99,997,497 | 1,410,000 |
| Output rows | 738,172 | 738,172 |
| Pruning predicate | None | CAST(CounterID_max) >= 62 AND CAST(CounterID_min) < 63 |

Which issue does this PR close?

  • Closes #.

Rationale for this change

#19946

The epic above introduced the pre-image API. This PR uses that API to provide a preimage for the floor function where it is applicable.

What changes are included in this PR?

  • A preimage implementation for the floor() function that rewrites floor(x) = N into x >= N AND x < N+1, plus unit tests covering it.

Are these changes tested?

  • Unit tests added
  • Existing SLT tests pass for this change.

Are there any user-facing changes?

No

@github-actions github-actions bot added the functions Changes to functions implementation label Jan 29, 2026
Contributor

@comphead comphead left a comment


The PR looks good to me, thanks @devanshu0987 and @masonh22 for the review. Decimal type support makes sense; WDYT, can decimals be addressed in this PR or in a follow-up?

@devanshu0987
Contributor Author

devanshu0987 commented Jan 30, 2026

Hi @comphead and @masonh22

Can you advise on what I should do?

  • The preimage result is an Interval::try_new(lower, upper).
  • Inside Interval::try_new, there is a data type assert:

        assert_eq_or_internal_err!(
            lower.data_type(),
            upper.data_type(),
            "Endpoints of an Interval should have the same type"
        );

  • The data type check for Decimal types also includes precision and scale:

        ScalarValue::Decimal32(_, precision, scale) => {
            DataType::Decimal32(*precision, *scale)
        }

  • When we add two Decimals, the precision changes: result_precision = max(s1, s2) + max(p1 - s1, p2 - s2) + 1
    • Which makes sense, to handle the overflow case: 99 + 1 = 100.
  • Hence, when we add 1 to find the upper bound, the result's precision is one greater than the lower bound's, the endpoint data types differ, and Interval::try_new fails (see the sketch after this list):

        SELECT arrow_typeof(CAST(100.00 AS DECIMAL(5,2)) + CAST(1.00 AS DECIMAL(5,2)));
        +-----------------------------------------+
        | arrow_typeof(Float64(100) + Float64(1)) |
        +-----------------------------------------+
        | Decimal128(6, 2)                        |
        +-----------------------------------------+

  • I am hesitant to write my own manual calculation to keep the precision the same after the addition (scale < 0, scale > 0, etc.).
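
A minimal sketch of the failure mode, assuming ScalarValue::Decimal128 and Interval::try_new behave as described above (the exact crate and module paths may differ by DataFusion version):

```rust
use datafusion_common::ScalarValue;
use datafusion_expr::interval_arithmetic::Interval;

fn main() {
    // Lower bound: 100.00 as Decimal128(precision = 5, scale = 2).
    let lower = ScalarValue::Decimal128(Some(10_000), 5, 2);
    // Upper bound computed as lower + 1: decimal addition widens the
    // precision by one, yielding 101.00 as Decimal128(6, 2).
    let upper = ScalarValue::Decimal128(Some(10_100), 6, 2);

    // The endpoints now have different data types, so the
    // "Endpoints of an Interval should have the same type" check
    // rejects the interval.
    assert!(Interval::try_new(lower, upper).is_err());
}
```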

@masonh22
Contributor

@devanshu0987 that's a great point that I didn't consider. My first thought would be to always rescale the min to whatever scale is used by the max using something like rescale_decimal(). I'm not sure if that would be the simplest solution though.
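
A rough sketch of that idea: cast the lower endpoint to the upper endpoint's type before constructing the interval. try_new_rescaled is a hypothetical helper (not the actual rescale_decimal() mentioned above), and it assumes ScalarValue::cast_to from datafusion_common:

```rust
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::interval_arithmetic::Interval;

/// Hypothetical helper: widen the lower endpoint to the upper
/// endpoint's data type (e.g. Decimal128(5, 2) -> Decimal128(6, 2))
/// so both endpoints share one type before building the interval.
fn try_new_rescaled(lower: ScalarValue, upper: ScalarValue) -> Result<Interval> {
    let lower = lower.cast_to(&upper.data_type())?;
    Interval::try_new(lower, upper)
}
```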

@comphead
Contributor

If decimal support for preimage floor is not trivial and makes this PR much more complex, we should address it in a follow-up IMO. I created #20080.

@comphead
Contributor

@devanshu0987 please address changes for debug_assert and #20059 (comment) and I think the PR is good to go

@devanshu0987
Contributor Author

> @devanshu0987 that's a great point that I didn't consider. My first thought would be to always rescale the min to whatever scale is used by the max using something like rescale_decimal(). I'm not sure if that would be the simplest solution though.

This feels like it will work.
I will give this a go in the next PR.
Thanks.

@devanshu0987
Contributor Author

> @devanshu0987 please address changes for debug_assert and #20059 (comment) and I think the PR is good to go

Hi @comphead , I have taken care of the comments. Please take another look.

@sdf-jkl
Contributor

sdf-jkl commented Feb 1, 2026

Hi @devanshu0987, I didn't have a chance to take a close look at everything, but I saw you're using unit tests instead of SLT tests. I'm working on a similar issue for the date_part function (#19733), and there I only added new SLT tests and no unit tests. A few of those use EXPLAIN to make sure the original function no longer appears in the logical and physical plans.
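
For reference, a sketch of what such an SLT test could look like. The table, the values, and the expected plan lines are illustrative only; the real EXPLAIN output depends on the DataFusion version:

```
statement ok
CREATE TABLE t(price DOUBLE) AS VALUES (99.5), (100.3), (101.0);

# The plan should contain the rewritten range predicate, not floor().
query TT
EXPLAIN SELECT * FROM t WHERE floor(price) = 100;
----
logical_plan
01)Filter: t.price >= Float64(100) AND t.price < Float64(101)
02)  TableScan: t projection=[price]
```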

Contributor

@comphead comphead left a comment


Thanks everyone for helping this PR 👍

@comphead comphead added this pull request to the merge queue Feb 1, 2026
Merged via the queue into apache:main with commit 51c0475 Feb 1, 2026
28 checks passed
@devanshu0987
Contributor Author

> Hi @devanshu0987, I didn't have a chance to take a close look at everything, but I saw you're using unit tests instead of SLT tests. I'm working on a similar issue for the date_part function (#19733), and there I only added new SLT tests and no unit tests. A few of those use EXPLAIN to make sure the original function no longer appears in the logical and physical plans.

Correctness is covered by the existing floor SLT tests.
You are right, though: it would be a good idea to validate the EXPLAIN output as well.
I will add those tests in a new PR and tag you.

@alamb
Contributor

alamb commented Feb 2, 2026

FYI @sdf-jkl

