
Add option for lazily evaluating data #491

Merged
mallport merged 12 commits into main from add-lazy
Feb 25, 2026
Conversation

@mallport (Contributor) commented Feb 24, 2026

This change makes the from_polars() method accept a Polars LazyFrame. Only the columns that are pseudonymized are materialized; the rest are chunked and streamed directly from bucket to bucket.


gs_path = GSPath(cloud_path=file_path, client=client)
cloud_path = GSPath(cloud_path=file_path, client=client)

file_handle = gs_path.open(mode="wb")
@mallport (Contributor, Author) commented Feb 24, 2026

After we upgraded Polars, we no longer have to use file handles for writing to GCS; we can just supply the "gs://" file path, and Polars automatically infers the authentication from the environment.

@skykanin (Contributor) left a comment

@skykanin made 4 comments.
Reviewable status: 0 of 19 files reviewed, 5 unresolved discussions (waiting on mallport).


src/dapla_pseudo/v1/pseudo.py line 122 at r2 (raw file):

            if hierarchical and type(Pseudonymize.dataset) is pl.LazyFrame:
                raise ValueError(
                    "Hierarchical datasets are not supported for Polars LazyFrames."

We should probably note this in the dapla manual.


src/dapla_pseudo/v1/result.py line 157 at r2 (raw file):

                return pandas_df
            case pl.LazyFrame() as ldf:
                pandas_df = ldf.collect().to_pandas()

.collect realises the LazyFrame as a DataFrame. This might be a footgun for some users, and it's probably never what you want if you're using a LazyFrame to begin with. We should at least document this behaviour in the pydoc comment for to_pandas, or decide not to support this case and throw an exception.


tests/v1/integration/test_pseudonymize.py line 152 at r2 (raw file):

@pytest.mark.usefixtures("setup")
@integration_test()

Would it be possible to add a regression test to ensure we're using an expected amount of memory when loading large datasets as a lazyframe and pseudonymizing a few columns?

@skykanin (Contributor) left a comment

@skykanin reviewed 21 files and all commit messages, made 2 comments, and resolved 2 discussions.
Reviewable status: 21 of 22 files reviewed, 5 unresolved discussions (waiting on mallport).


tests/v1/integration/conftest.py line 20 at r3 (raw file):

        # Need to disable local file logging to avoid getting gcloud perm error
        # subprocess.run(["gcloud", "config", "set", "core/disable_file_logging", "True"])

This should be uncommented again


tests/v1/integration/test_lazy_regression.py line 53 at r4 (raw file):

@pytest.mark.usefixtures("setup")
@integration_test()
def test_lazy_projection_memory_regression() -> None:

We should test the difference in memory usage between loading and pseudonymizing the same data in a LazyFrame vs a regular DataFrame.

@skykanin (Contributor) left a comment
LGTM

@skykanin reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on mallport).

@mallport mallport merged commit 2388b78 into main Feb 25, 2026
18 of 19 checks passed
@mallport mallport deleted the add-lazy branch February 25, 2026 12:12
