Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds a to_lazy_polars() method to the Dataset class, enabling users to access dataset data as a polars LazyFrame for efficient query execution. Unlike the existing to_polars() method which uses FoundrySqlServer, this new method reads data directly from S3 using polars' lazy evaluation API.
Changes:
- Added
to_lazy_polars()method to Dataset class that returns a polars LazyFrame by directly accessing parquet files from S3 - Added test infrastructure for two new test datasets (iris_hive_partitioned and iris_parquet) with corresponding schema definition (IRIS_SCHEMA_HIVE)
- Added integration tests to verify the method works with both regular parquet and hive-partitioned datasets
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libs/foundry-dev-tools/src/foundry_dev_tools/resources/dataset.py | Implements the to_lazy_polars() method with S3-based data access, error handling for datasets without transactions, and comprehensive documentation |
| tests/integration/utils.py | Adds IRIS_SCHEMA_HIVE constant and test dataset fixtures (iris_hive_partitioned, iris_parquet) for integration testing |
| tests/integration/resources/test_dataset.py | Adds two integration tests verifying the method works correctly with parquet and hive-partitioned datasets |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
libs/foundry-dev-tools/src/foundry_dev_tools/resources/dataset.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 8 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def to_lazy_polars(self, transaction_rid: str | None = None) -> pl.LazyFrame: | ||
| """Get dataset as a :py:class:`polars.LazyFrame`. | ||
|
|
||
| Returns a lazy polars DataFrame that can be queried efficiently using | ||
| polars' lazy evaluation API. The data is accessed directly from S3 | ||
| without going through FoundrySqlServer. | ||
|
|
||
| Example: | ||
| >>> ds = ctx.get_dataset_by_path("/path/to/dataset") | ||
| >>> lf = ds.to_lazy_polars() | ||
| >>> result = lf.filter(pl.col("age") > 25).select(["name", "age"]) | ||
| >>> # Execute and collect results | ||
| >>> df = result.collect() | ||
|
|
||
| Returns: | ||
| pl.LazyFrame: A lazy polars DataFrame | ||
|
|
||
| Note: | ||
| This method uses the S3 API to directly access dataset files. | ||
| For hive-partitioned datasets, polars will automatically read | ||
| the partition structure. | ||
| """ |
There was a problem hiding this comment.
The docstring is missing an "Args:" section to document the transaction_rid parameter. Following the existing codebase convention (as seen in methods like transaction_context and upload_schema), methods with parameters should document them in an Args section. Please add documentation explaining what transaction_rid is and when it should be provided.
|
My tests are passing :) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| msg = f"Dataset has no committed transactions: {self.rid=}" | ||
| raise DatasetHasNoTransactionsError(info=msg) |
There was a problem hiding this comment.
to_lazy_polars() raises DatasetHasNoTransactionsError but the message it constructs is about "no committed transactions". The exception's default message is still "Dataset has no transactions.", which can be misleading in the (common) case where the dataset has an OPEN transaction but no committed ones yet. Consider overriding the error's message here (pass message="Dataset has no committed transactions.") or adjust the wording so the header and info are consistent.
| msg = f"Dataset has no committed transactions: {self.rid=}" | |
| raise DatasetHasNoTransactionsError(info=msg) | |
| base_msg = "Dataset has no committed transactions." | |
| info_msg = f"{base_msg} {self.rid=}" | |
| raise DatasetHasNoTransactionsError(message=base_msg, info=info_msg) |
| """Returns the last transaction or None if there are no transactions. | ||
|
|
||
| Args: | ||
| include_open_exclusive_transaction: If True, includes open transactions |
There was a problem hiding this comment.
The new include_open_exclusive_transaction param is documented as including "open transactions". Since the underlying API flag is specifically for open exclusive transactions, it would be clearer to mirror that terminology in the docstring to avoid implying it affects other transaction states.
| include_open_exclusive_transaction: If True, includes open transactions | |
| include_open_exclusive_transaction: If True, includes open exclusive transactions |
* feat: add to_lazy_polars to Dataset * remove not required variables * add test data * add unit test for exception * allow passing arbitrary transaction_rid * add doc entry * fix docstring * fix: get_last_transaction() now excludes open transactions in to_lazy_polars() * test with explicit transaction passing
Summary
@bernhardschaefer Can you test this please?
Checklist