@devanshu0987
Contributor

This adds a preimage implementation for the floor() function that transforms floor(x) = N into x >= N AND x < N+1. This enables statistics-based predicate pushdown for queries using floor().

For example, a query like:
SELECT * FROM t WHERE floor(price) = 100

is rewritten to:
SELECT * FROM t WHERE price >= 100 AND price < 101

This allows the query engine to leverage min/max statistics from Parquet row groups, significantly reducing the amount of data scanned.
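
Conceptually, the rewrite replaces the equality on floor with a half-open range on its argument. Below is a minimal standalone sketch using datafusion_expr's expression builders; `rewrite_floor_eq` is a hypothetical name, and the PR implements this through the preimage API from #19946 rather than a free function:

```rust
use datafusion_expr::{col, lit, Expr};

/// Hypothetical sketch: floor(x) = n  ==>  x >= n AND x < n + 1.
/// The half-open range is what lets min/max statistics prune files
/// and row groups.
fn rewrite_floor_eq(x: Expr, n: f64) -> Expr {
    x.clone().gt_eq(lit(n)).and(x.lt(lit(n + 1.0)))
}

// Example: rewrite_floor_eq(col("price"), 100.0) builds
// `price >= 100 AND price < 101`.
```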

Benchmarks on the ClickBench hits dataset show:

  • 80% file pruning (89 out of 111 files skipped)
  • 70x fewer rows scanned (1.4M vs 100M)
CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'benchmarks/data/hits_partitioned/';

-- Test the floor preimage optimization
EXPLAIN ANALYZE SELECT COUNT(*) FROM hits WHERE floor(CAST("CounterID" AS DOUBLE)) = 62;

| Metric | Before (no preimage) | After (with preimage) |
|---|---|---|
| Files pruned | 111 → 111 (0 pruned) | 111 → 22 (89 pruned) |
| Row groups pruned | 325 → 325 (0 pruned) | 51 → 4 (47 pruned) |
| Rows scanned | 99,997,497 | 1,410,000 |
| Output rows | 738,172 | 738,172 |
| Pruning predicate | None | CAST(CounterID_max) >= 62 AND CAST(CounterID_min) < 63 |

Which issue does this PR close?

  • Closes #.

Rationale for this change

#19946

The epic above introduced the pre-image API. This PR uses that API to provide a preimage for the floor function where it is applicable.

What changes are included in this PR?

  • A preimage implementation for the floor() function that rewrites floor(x) = N into x >= N AND x < N+1, plus unit tests covering it.

Are these changes tested?

  • Unit tests added
  • Existing SLT tests pass for this change.

Are there any user-facing changes?

No

@github-actions github-actions bot added the functions Changes to functions implementation label Jan 29, 2026
Contributor

@comphead comphead left a comment


The PR looks good to me, thanks @devanshu0987 and @masonh22 for the review. Decimal type support makes sense; WDYT, can decimals be addressed in this PR or in a follow-up?

@devanshu0987
Contributor Author

devanshu0987 commented Jan 30, 2026

Hi @comphead and @masonh22

Can you advise on what I should do?

  • The preimage result is an Interval::try_new(lower, upper).
  • Inside Interval::try_new, there is a data type assert:

        assert_eq_or_internal_err!(
            lower.data_type(),
            upper.data_type(),
            "Endpoints of an Interval should have the same type"
        );

  • The data type check for Decimal types also includes precision and scale:

        ScalarValue::Decimal32(_, precision, scale) => {
            DataType::Decimal32(*precision, *scale)
        }

  • When we add two Decimals, the precision changes: result_precision = max(s1, s2) + max(p1 - s1, p2 - s2) + 1
    • Which makes sense, to handle the overflow case: 99 + 1 = 100.
  • Hence, when we add 1 to find the upper bound, the result's precision is one greater than the lower bound's, the endpoint data types differ, and Interval::try_new fails (see the sketch after this list):

        SELECT arrow_typeof(CAST(100.00 AS DECIMAL(5,2)) + CAST(1.00 AS DECIMAL(5,2)));
        +-----------------------------------------+
        | arrow_typeof(Float64(100) + Float64(1)) |
        +-----------------------------------------+
        | Decimal128(6, 2)                        |
        +-----------------------------------------+

  • I am hesitant to write my own manual calculation to keep the precision the same after the addition (scale < 0, scale > 0, etc.).
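
A minimal sketch of the failure mode, assuming ScalarValue::Decimal128 and Interval::try_new behave as described above (the exact crate and module paths may differ by DataFusion version):

```rust
use datafusion_common::ScalarValue;
use datafusion_expr::interval_arithmetic::Interval;

fn main() {
    // Lower bound: 100.00 as Decimal128(precision = 5, scale = 2).
    let lower = ScalarValue::Decimal128(Some(10_000), 5, 2);
    // Upper bound computed as lower + 1: decimal addition widens the
    // precision by one, yielding 101.00 as Decimal128(6, 2).
    let upper = ScalarValue::Decimal128(Some(10_100), 6, 2);

    // The endpoints now have different data types, so the
    // "Endpoints of an Interval should have the same type" check
    // rejects the interval.
    assert!(Interval::try_new(lower, upper).is_err());
}
```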

@masonh22
Contributor

@devanshu0987 that's a great point that I didn't consider. My first thought would be to always rescale the min to whatever scale is used by the max using something like rescale_decimal(). I'm not sure if that would be the simplest solution though.
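
A rough sketch of that idea: cast the lower endpoint to the upper endpoint's type before constructing the interval. try_new_rescaled is a hypothetical helper (not the actual rescale_decimal() mentioned above), and it assumes ScalarValue::cast_to from datafusion_common:

```rust
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::interval_arithmetic::Interval;

/// Hypothetical helper: widen the lower endpoint to the upper
/// endpoint's data type (e.g. Decimal128(5, 2) -> Decimal128(6, 2))
/// so both endpoints share one type before building the interval.
fn try_new_rescaled(lower: ScalarValue, upper: ScalarValue) -> Result<Interval> {
    let lower = lower.cast_to(&upper.data_type())?;
    Interval::try_new(lower, upper)
}
```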

@comphead
Contributor

If decimal support for preimage floor is not trivial and makes this PR much more complex, we should address it in a follow-up IMO. I created #20080.

@comphead
Contributor

@devanshu0987 please address changes for debug_assert and #20059 (comment) and I think the PR is good to go

@devanshu0987
Contributor Author

> @devanshu0987 that's a great point that I didn't consider. My first thought would be to always rescale the min to whatever scale is used by the max using something like rescale_decimal(). I'm not sure if that would be the simplest solution though.

This feels like it will work.
I will give this a go in the next PR.
Thanks.

@devanshu0987
Contributor Author

> @devanshu0987 please address changes for debug_assert and #20059 (comment) and I think the PR is good to go

Hi @comphead , I have taken care of the comments. Please take another look.

@sdf-jkl
Contributor

sdf-jkl commented Feb 1, 2026

Hi @devanshu0987, I didn't have a chance to take a close look at everything, but I saw you're using unit tests instead of SLT tests. I'm working on a similar issue for the date_part function (#19733), and there I only added new SLT tests and no unit tests. A few of those use EXPLAIN to make sure the original function no longer appears in the logical and physical plans.
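
For reference, a sketch of what such an SLT test could look like. The table, the values, and the expected plan lines are illustrative only; the real EXPLAIN output depends on the DataFusion version:

```
statement ok
CREATE TABLE t(price DOUBLE) AS VALUES (99.5), (100.3), (101.0);

# The plan should contain the rewritten range predicate, not floor().
query TT
EXPLAIN SELECT * FROM t WHERE floor(price) = 100;
----
logical_plan
01)Filter: t.price >= Float64(100) AND t.price < Float64(101)
02)  TableScan: t projection=[price]
```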

Contributor

@comphead comphead left a comment


Thanks everyone for helping this PR 👍

@comphead comphead added this pull request to the merge queue Feb 1, 2026
Merged via the queue into apache:main with commit 51c0475 Feb 1, 2026
28 checks passed
@devanshu0987
Contributor Author

> Hi @devanshu0987, I didn't have a chance to take a close look at everything, but I saw you're using unit tests instead of SLT tests. I'm working on a similar issue for the date_part function (#19733), and there I only added new SLT tests and no unit tests. A few of those use EXPLAIN to make sure the original function no longer appears in the logical and physical plans.

Correctness is covered by the existing floor SLT tests.
You are right, though: it would be a good idea to validate the EXPLAIN output as well.
I will add those tests in a new PR and tag you.

@alamb
Contributor

alamb commented Feb 2, 2026

FYI @sdf-jkl

