
Add option for lazily evaluating data #491

Merged
mallport merged 12 commits into main from add-lazy
Feb 25, 2026
Conversation

@mallport (Contributor) commented Feb 24, 2026

This change makes the from_polars() method accept a Polars LazyFrame. Only the columns that are pseudonymized are materialized; the rest are chunked and streamed directly from bucket to bucket.


gs_path = GSPath(cloud_path=file_path, client=client)
cloud_path = GSPath(cloud_path=file_path, client=client)

file_handle = gs_path.open(mode="wb")
@mallport (Contributor, Author) commented Feb 24, 2026

After we upgraded Polars, we no longer have to use file handles for writing to GCS; we can just supply the "gs://" file path, and Polars automatically infers the authentication from the environment.

@skykanin (Contributor) left a comment

@skykanin made 4 comments.
Reviewable status: 0 of 19 files reviewed, 5 unresolved discussions (waiting on mallport).


src/dapla_pseudo/v1/pseudo.py line 122 at r2 (raw file):

            if hierarchical and type(Pseudonymize.dataset) is pl.LazyFrame:
                raise ValueError(
                    "Hierarchical datasets are not supported for Polars LazyFrames."

We should probably note this in the dapla manual.


src/dapla_pseudo/v1/result.py line 157 at r2 (raw file):

                return pandas_df
            case pl.LazyFrame() as ldf:
                pandas_df = ldf.collect().to_pandas()

.collect realises the LazyFrame as a DataFrame. This might be a footgun for some users, and it's probably never what you want if you're using a LazyFrame to begin with. We should at least document this behaviour in the pydoc comment for to_pandas, or decide not to support this case and throw an exception.


tests/v1/integration/test_pseudonymize.py line 152 at r2 (raw file):

@pytest.mark.usefixtures("setup")
@integration_test()

Would it be possible to add a regression test to ensure we're using an expected amount of memory when loading large datasets as a lazyframe and pseudonymizing a few columns?

@skykanin (Contributor) left a comment

@skykanin reviewed 21 files and all commit messages, made 2 comments, and resolved 2 discussions.
Reviewable status: 21 of 22 files reviewed, 5 unresolved discussions (waiting on mallport).


tests/v1/integration/conftest.py line 20 at r3 (raw file):

        # Need to disable local file logging to avoid getting gcloud perm error
        # subprocess.run(["gcloud", "config", "set", "core/disable_file_logging", "True"])

This should be uncommented again


tests/v1/integration/test_lazy_regression.py line 53 at r4 (raw file):

@pytest.mark.usefixtures("setup")
@integration_test()
def test_lazy_projection_memory_regression() -> None:

We should test the difference in memory usage between loading and pseudonymizing the same data in a LazyFrame vs a regular DataFrame.

@skykanin (Contributor) left a comment
LGTM

@skykanin reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on mallport).

@mallport mallport merged commit 2388b78 into main Feb 25, 2026
18 of 19 checks passed
@mallport mallport deleted the add-lazy branch February 25, 2026 12:12
