ParquetSet is misleading and not interoperable with the Python data ecosystem #228

@jturingdev

Problem: ParquetSet is neither discoverable nor interoperable

The object returned by sinan.download() is a ParquetSet, but its current API makes it unnecessarily hard to use and violates common Python and data ecosystem conventions.

Current issues

The ParquetSet object:

  • Prints a filesystem path via __str__(), which misleads users into assuming it is a path-like object
  • Is not iterable, breaking standard Python expectations for a “set”-like container
  • Does not expose any explicit path attributes (.path, .paths, .files)
  • Is not compatible with pandas or polars readers
  • Does not document the correct way to load the underlying parquet data

As a result, users are forced to reverse-engineer the object behavior, effectively turning them into testers.
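The first point can be illustrated with a minimal sketch. It uses a hypothetical stand-in class (`PathPrintingOnly`, with a made-up `_path` attribute and example path) rather than the real ParquetSet, and assumes only what is described above: the object prints a path but does not implement the path protocol.

```python
import os


class PathPrintingOnly:
    """Hypothetical stand-in: prints a path but defines no __fspath__."""

    def __init__(self, path):
        self._path = str(path)

    def __str__(self):
        return self._path


obj = PathPrintingOnly("data/example.parquet")  # made-up path for illustration

print(obj)  # looks like a path when printed

try:
    os.fspath(obj)
    accepted = True
except TypeError:
    accepted = False  # os.fspath rejects objects that lack __fspath__

print(accepted)
print(os.fspath(str(obj)))  # an explicit str() conversion is accepted
```

Under these assumptions, the only way to hand such an object to a path-consuming function is an explicit `str()` call, which nothing in the object's interface advertises.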

Violated principles

  • Principle of Least Surprise
  • Self-describing API
  • Interoperability with the Python data ecosystem

Proposed solution

Implement the file system path protocol (os.PathLike) by adding __fspath__ to ParquetSet:

class ParquetSet:
    def __fspath__(self):
        return str(self)
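A self-contained sketch of the effect follows. The class here is a stand-in, not the library's implementation: the constructor, the `_path` attribute, and the example path are assumptions made only so the snippet runs on its own.

```python
import os
from pathlib import Path


class ParquetSet:
    """Stand-in for the real class; `_path` is a hypothetical attribute."""

    def __init__(self, path):
        self._path = str(path)

    def __str__(self):
        # Mirrors the behavior described in this issue: printing shows a path.
        return self._path

    def __fspath__(self):
        # The proposed addition: expose the same path through os.PathLike.
        return str(self)


files = ParquetSet("data/example.parquet")  # made-up path for illustration

print(os.fspath(files))                # data/example.parquet
print(isinstance(files, os.PathLike))  # True
print(Path(files).suffix)              # .parquet
```

Any function that calls `os.fspath()` on its argument, or builds a `pathlib.Path` from it, would then accept the object without an explicit `str()` conversion.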

This small change should let readers that accept path-like objects take a ParquetSet directly:

import pandas as pd
import polars as pl
# files is the ParquetSet returned by sinan.download()
pd.read_parquet(files)
pl.read_parquet(files)
pl.scan_parquet(files)

This requires no breaking changes, no new abstractions, and no additional documentation burden.


Benefits

  • Restores expected Python behavior
  • Enables seamless integration with pandas and polars
  • Keeps the API surface small and reduces user confusion
  • Eliminates the need for helper methods such as to_dataframe()
  • Improves usability without altering internal design

This change optimizes developer experience while preserving the original intent of ParquetSet.
