feat: add to_lazy_polars to Dataset#116

Merged
nicornk merged 9 commits into main from feat/lazy-polars-dataset on Feb 20, 2026

Conversation

Contributor

@nicornk nicornk commented Feb 19, 2026

Summary

@bernhardschaefer Can you test this please?

Checklist

  • You agree with our CLA
  • Included tests (or not applicable).
  • Updated documentation (or not applicable).
  • Used pre-commit hooks to format and lint the code.


Copilot AI left a comment


Pull request overview

This PR adds a to_lazy_polars() method to the Dataset class, enabling users to access dataset data as a polars LazyFrame for efficient query execution. Unlike the existing to_polars() method, which goes through FoundrySqlServer, this new method reads data directly from S3 using polars' lazy evaluation API.

Changes:

  • Added to_lazy_polars() method to Dataset class that returns a polars LazyFrame by directly accessing parquet files from S3
  • Added test infrastructure for two new test datasets (iris_hive_partitioned and iris_parquet) with corresponding schema definition (IRIS_SCHEMA_HIVE)
  • Added integration tests to verify the method works with both regular parquet and hive-partitioned datasets

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files reviewed:

  • libs/foundry-dev-tools/src/foundry_dev_tools/resources/dataset.py: Implements the to_lazy_polars() method with S3-based data access, error handling for datasets without transactions, and comprehensive documentation
  • tests/integration/utils.py: Adds the IRIS_SCHEMA_HIVE constant and test dataset fixtures (iris_hive_partitioned, iris_parquet) for integration testing
  • tests/integration/resources/test_dataset.py: Adds two integration tests verifying the method works correctly with parquet and hive-partitioned datasets



Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 8 changed files in this pull request and generated no new comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.



Comment on lines +803 to +824
def to_lazy_polars(self, transaction_rid: str | None = None) -> pl.LazyFrame:
    """Get dataset as a :py:class:`polars.LazyFrame`.

    Returns a lazy polars DataFrame that can be queried efficiently using
    polars' lazy evaluation API. The data is accessed directly from S3
    without going through FoundrySqlServer.

    Example:
        >>> ds = ctx.get_dataset_by_path("/path/to/dataset")
        >>> lf = ds.to_lazy_polars()
        >>> result = lf.filter(pl.col("age") > 25).select(["name", "age"])
        >>> # Execute and collect results
        >>> df = result.collect()

    Returns:
        pl.LazyFrame: A lazy polars DataFrame

    Note:
        This method uses the S3 API to directly access dataset files.
        For hive-partitioned datasets, polars will automatically read
        the partition structure.
    """

Copilot AI Feb 19, 2026


The docstring is missing an "Args:" section to document the transaction_rid parameter. Following the existing codebase convention (as seen in methods like transaction_context and upload_schema), methods with parameters should document them in an Args section. Please add documentation explaining what transaction_rid is and when it should be provided.

@bernhardschaefer
Contributor

My tests are passing :)


Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.




Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 9 changed files in this pull request and generated 2 comments.



Comment on lines +840 to +841
msg = f"Dataset has no committed transactions: {self.rid=}"
raise DatasetHasNoTransactionsError(info=msg)

Copilot AI Feb 20, 2026


to_lazy_polars() raises DatasetHasNoTransactionsError but the message it constructs is about "no committed transactions". The exception's default message is still "Dataset has no transactions.", which can be misleading in the (common) case where the dataset has an OPEN transaction but no committed ones yet. Consider overriding the error's message here (pass message="Dataset has no committed transactions.") or adjust the wording so the header and info are consistent.

Suggested change:
-    msg = f"Dataset has no committed transactions: {self.rid=}"
-    raise DatasetHasNoTransactionsError(info=msg)
+    base_msg = "Dataset has no committed transactions."
+    info_msg = f"{base_msg} {self.rid=}"
+    raise DatasetHasNoTransactionsError(message=base_msg, info=info_msg)

"""Returns the last transaction or None if there are no transactions.

Args:
include_open_exclusive_transaction: If True, includes open transactions

Copilot AI Feb 20, 2026


The new include_open_exclusive_transaction param is documented as including "open transactions". Since the underlying API flag is specifically for open exclusive transactions, it would be clearer to mirror that terminology in the docstring to avoid implying it affects other transaction states.

Suggested change:
-    include_open_exclusive_transaction: If True, includes open transactions
+    include_open_exclusive_transaction: If True, includes open exclusive transactions

@nicornk nicornk merged commit 594d1ea into main Feb 20, 2026
3 checks passed
@nicornk nicornk deleted the feat/lazy-polars-dataset branch February 20, 2026 07:55
nicornk added a commit that referenced this pull request Feb 20, 2026
* feat: add to_lazy_polars to Dataset

* remove not required variables

* add test data

* add unit test for exception

* allow passing arbitrary transaction_rid

* add doc entry

* fix docstring

* fix: get_last_transaction() now excludes open transactions in to_lazy_polars()

* test with explicit transaction passing


3 participants