Conversation
| gs_path = GSPath(cloud_path=file_path, client=client) | ||
| cloud_path = GSPath(cloud_path=file_path, client=client) | ||
|
|
||
| file_handle = gs_path.open(mode="wb") |
There was a problem hiding this comment.
After we upgraded Polars, we no longer have to use file handles for writing to GCS, we can just supply the "gs://"-filepath and Polars automatically infers the authentication from the environment
skykanin
left a comment
There was a problem hiding this comment.
@skykanin made 4 comments.
Reviewable status: 0 of 19 files reviewed, 5 unresolved discussions (waiting on mallport).
src/dapla_pseudo/v1/pseudo.py line 122 at r2 (raw file):
if hierarchical and type(Pseudonymize.dataset) is pl.LazyFrame: raise ValueError( "Hierarchical datasets are not supported for Polars LazyFrames."
We should probably note this in the dapla manual
src/dapla_pseudo/v1/result.py line 157 at r2 (raw file):
return pandas_df case pl.LazyFrame() as ldf: pandas_df = ldf.collect().to_pandas()
.collect realises the lazyframe as a dataframe. This might be a footgun for some users and it's probably never what you want if you're using a lazyframe to begin with. We should at least document this behaviour in the pydoc comment for to_pandas or decide not to support this case and throw an exception
tests/v1/integration/test_pseudonymize.py line 152 at r2 (raw file):
@pytest.mark.usefixtures("setup") @integration_test()
Would it be possible to add a regression test to ensure we're using an expected amount of memory when loading large datasets as a lazyframe and pseudonymizing a few columns?
skykanin
left a comment
There was a problem hiding this comment.
@skykanin reviewed 21 files and all commit messages, made 2 comments, and resolved 2 discussions.
Reviewable status: 21 of 22 files reviewed, 5 unresolved discussions (waiting on mallport).
tests/v1/integration/conftest.py line 20 at r3 (raw file):
# Need to disable local file logging to avoid getting gcloud perm error # subprocess.run(["gcloud", "config", "set", "core/disable_file_logging", "True"])
This should be uncommented again
tests/v1/integration/test_lazy_regression.py line 53 at r4 (raw file):
@pytest.mark.usefixtures("setup") @integration_test() def test_lazy_projection_memory_regression() -> None:
We should test the difference in memory usage between loading and pseudonymizing the same data in a lazyframe vs a regular dataframe
skykanin
left a comment
There was a problem hiding this comment.
@skykanin reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on mallport).
This change makes it so that the
from_polars()method can accept a PolarsLazyFrame. Only the columns that are pseudonymized should be materialized, the rest is chunked and streamed directly from bucket-to-bucketThis change is