Problem: ParquetSet is neither discoverable nor interoperable
The object returned by sinan.download() is a ParquetSet, but its current API makes it unnecessarily hard to use and violates common Python and data ecosystem conventions.
Current issues
The ParquetSet object:
- Prints a filesystem path via __str__(), which misleads users into assuming it is a path-like object
- Is not iterable, breaking standard Python expectations for a "set"-like container
- Does not expose any explicit path attributes (.path, .paths, .files)
- Is not compatible with pandas or polars readers
- Does not document the correct way to load the underlying parquet data
As a result, users are forced to reverse-engineer the object's behavior, effectively turning them into testers.
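The symptoms listed above can be illustrated with a small stand-in class. This is only a sketch: `OpaqueParquetSet`, its constructor, and the example path are hypothetical and are not the library's actual implementation. It simply mimics an object whose only link to the filesystem is `__str__()`.

```python
import os


class OpaqueParquetSet:
    """Hypothetical stand-in (not the real class): only __str__ reveals the path."""

    def __init__(self, path):
        self._path = path

    def __str__(self):
        return self._path


def raises_type_error(func, arg):
    """Return True if func(arg) raises TypeError, False otherwise."""
    try:
        func(arg)
    except TypeError:
        return True
    return False


files = OpaqueParquetSet("/tmp/example.parquet")  # made-up path for illustration

printed = str(files)                                # looks like a plain path
not_path_like = raises_type_error(os.fspath, files)  # no __fspath__ defined
not_iterable = raises_type_error(iter, files)        # no __iter__ defined
has_path_attrs = any(hasattr(files, name) for name in ("path", "paths", "files"))
```

Under these assumptions the object prints like a path, yet `os.fspath()` and `iter()` both reject it, and none of the conventional attribute names exist.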
Violated principles
- Principle of Least Surprise
- Self-describing API
- Interoperability with the Python data ecosystem
Proposed solution
Implement the filesystem protocol by adding __fspath__ to ParquetSet:
    class ParquetSet:
        def __fspath__(self):
            return str(self)
This small change would immediately enable native compatibility with:
    pd.read_parquet(files)
    pl.read_parquet(files)
    pl.scan_parquet(files)
No breaking changes, no new abstractions, and no additional documentation burden.
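As a rough check of what the proposal buys, the sketch below applies the same `__fspath__` method to a stand-in class and exercises it with the standard library only. `PathLikeParquetSet`, its constructor, and the example path are hypothetical; the real ParquetSet may store its location differently.

```python
import os
from pathlib import Path


class PathLikeParquetSet:
    """Hypothetical stand-in (not the real class) with the proposed method added."""

    def __init__(self, path):
        self._path = path

    def __str__(self):
        return self._path

    def __fspath__(self):
        # Same body as in the proposal: reuse the string form as the path.
        return str(self)


files = PathLikeParquetSet("/tmp/example.parquet")  # made-up path for illustration

as_text = os.fspath(files)                      # the protocol returns the path string
as_path = Path(files)                           # pathlib accepts any os.PathLike
is_path_like = isinstance(files, os.PathLike)   # recognized via __fspath__
```

Any reader that accepts `os.PathLike` input should then take the object directly. Whether a given pandas or polars version does so for arbitrary path-like objects, and for a location that holds several parquet files, is worth verifying against the versions in use.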
Benefits
- Restores expected Python behavior
- Enables seamless integration with pandas and polars
- Reduces API surface and user confusion
- Eliminates the need for helper methods such as to_dataframe()
- Improves usability without altering internal design
This change optimizes developer experience while preserving the original intent of ParquetSet.