Performance regression in 0.25.0: per-row checks on Polars LazyFrames are ~10× slower #405

@boccileonardo

Description

col_vals_not_null and col_vals_le on a Polars LazyFrame slowed down ~7-11× between
0.24.0 and 0.25.0. The regression is largest when the column has no failures (the
typical pass case), and scales linearly with row count.
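
One way to check the linear-scaling claim is to time the same check at several row counts and compare the cost per row: if the cost scales linearly, seconds per row should stay roughly constant. The sketch below is only an illustration. `per_row_cost` and `make_sum_check` are hypothetical helpers, not part of pointblank or polars, and the stand-in workload is a plain Python `sum` so the snippet runs without either library installed.

```python
import time
from typing import Callable, Iterable


def per_row_cost(
    make_check: Callable[[int], Callable[[], object]],
    row_counts: Iterable[int],
) -> dict[int, float]:
    """Time one check at each row count and return seconds per row."""
    costs = {}
    for n in row_counts:
        check = make_check(n)  # build data and check outside the timed region
        check()                # untimed warm-up
        t0 = time.perf_counter()
        check()
        costs[n] = (time.perf_counter() - t0) / n
    return costs


# Hypothetical stand-in workload (assumption: replace with a real validation).
def make_sum_check(n: int) -> Callable[[], object]:
    data = list(range(n))
    return lambda: sum(data)


for n, cost in per_row_cost(make_sum_check, [100_000, 200_000, 400_000]).items():
    print(f"rows={n:>9,}  {cost * 1e9:8.2f} ns/row")
```

To apply this to the regression, `make_check` would build an `n`-row LazyFrame and return a lambda that runs one of the `pb.Validate(...).interrogate()` calls from the reproducible example.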

Disclaimer: I used AI to format this issue and its associated MRE.

Reproducible example

import time
import polars as pl
import pointblank as pb

N = 2_000_000
lf = pl.DataFrame(
    {
        "pk": pl.int_range(0, N, eager=True),
        "val": pl.int_range(0, N, eager=True),
    }
).lazy()


def t(label, fn):
    fn()  # warm-up
    t0 = time.perf_counter()
    fn()
    print(f"  {label:20s} {time.perf_counter() - t0:.3f}s")


print(f"pointblank={pb.__version__}  polars={pl.__version__}  rows={N:,}")
t("col_vals_not_null", lambda: pb.Validate(data=lf).col_vals_not_null(columns="pk").interrogate())
t("col_vals_le",       lambda: pb.Validate(data=lf).col_vals_le(columns="val", value=N).interrogate())
t("rows_distinct",     lambda: pb.Validate(data=lf).rows_distinct(columns_subset=["pk"]).interrogate())

Run with:

uv run --with 'pointblank==0.24.0' --with 'polars==1.41.2' mre.py
uv run --with 'pointblank==0.25.0' --with 'polars==1.41.2' mre.py

Result

Median of 3 iters on a 2M-row, ~46-column LazyFrame:

| check | 0.24.0 | 0.25.0 | slowdown |
|---|---|---|---|
| col_vals_not_null ×3 | 0.007 s | 0.077 s | 11× |
| col_vals_le (date ≤ today) | 0.004 s | 0.031 s | |
| rows_distinct (composite PK) | 0.104 s | 0.142 s | 1.4× |
| col_schema_match | 0.017 s | 0.016 s | 1.0× |
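
The figures above are medians of 3 iterations, whereas the `t` helper in the reproducible example times a single run after warm-up. A minimal sketch of a median-based timer and a slowdown ratio follows. `median_seconds` and `slowdown` are hypothetical helper names introduced here for illustration, and this is not necessarily how the reported numbers were produced.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], iters: int = 3, warmup: int = 1) -> float:
    """Return the median wall-clock time of `fn` over `iters` timed runs."""
    for _ in range(warmup):
        fn()  # untimed warm-up
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)


def slowdown(old_seconds: float, new_seconds: float) -> float:
    """Ratio of new to old timing; values above 1.0 mean the new version is slower."""
    return new_seconds / old_seconds


# Ratio for the col_vals_not_null timings reported above.
print(f"{slowdown(0.007, 0.077):.0f}x")
```

In the reproducible example, `median_seconds` could stand in for `t` by passing the same lambdas, running the script once per pointblank version and feeding the two medians to `slowdown`.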

Development environment

Tested on polars==1.41.2, Python 3.12, macOS arm64.
