Summary
The Python connector currently makes users construct `"<profile>#<share>.<schema>.<table>"` strings for common read paths. That works, but it is awkward, easy to get wrong, and makes object-oriented workflows difficult.
Arrow-native consumption is also underexposed. This makes integration with Arrow consumers such as DuckDB less direct than it should be, and it forces users toward eager materialization even when they want batch-oriented reads.
The same problem exists for change data feed: CDF is still only exposed through legacy free functions, so it does not participate in the new object model.
Motivation
Today, a representative Python workflow looks like this:
```python
import delta_sharing

profile_file = "recipient.share"
table_url = profile_file + "#share.schema.table"
data = delta_sharing.load_as_pandas(table_url, limit=10)
```
Pain points:
- Users must manually build and parse `table_url` strings.
- The API shape does not reflect the underlying concepts already present in the connector (`SharingClient`, `Table`, snapshots).
- Arrow-native use cases are not first-class.
- Lazy batch-oriented consumption for engines like DuckDB is not easy to discover.
- CDF is disconnected from the new table-oriented object model.
Proposal
Add an additive object-based API alongside the existing URL-based API.
Snapshot surface
```python
import delta_sharing

client = delta_sharing.SharingClient("recipient.share")
table = client.table("share.schema.table")

pdf = table.snapshot(limit=10).to_pandas()
arrow_table = table.snapshot(limit=10).to_arrow()
batches = table.snapshot(limit=10).to_record_batches()
reader = table.snapshot(limit=10).to_record_batch_reader()
```
Also add a URL-based Arrow helper for parity:
```python
arrow_table = delta_sharing.load_as_arrow("recipient.share#share.schema.table", limit=10)
```
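To make the builder/materializer split concrete, here is a minimal pure-Python sketch of the proposed shape: query configuration (such as `limit`) lives on the snapshot object, and `to_*()` methods materialize it. All classes and the `to_rows()` materializer below are illustrative stand-ins, not the real connector implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Snapshot:
    # Query configuration is captured here, not on the materializers.
    table_name: str
    limit: Optional[int] = None

    def _read(self) -> list[dict[str, Any]]:
        # Stand-in for the remote read; the real connector would fetch files here.
        rows = [{"id": i} for i in range(100)]
        return rows[: self.limit] if self.limit is not None else rows

    def to_rows(self) -> list[dict[str, Any]]:
        # Materializer: drains the configured read. to_pandas()/to_arrow()
        # would be sibling adapters over the same _read() path.
        return self._read()

@dataclass
class Table:
    name: str

    def snapshot(self, limit: Optional[int] = None) -> Snapshot:
        return Snapshot(self.name, limit=limit)

@dataclass
class SharingClient:
    profile: str

    def table(self, name: str) -> Table:
        return Table(name)

client = SharingClient("recipient.share")
rows = client.table("share.schema.table").snapshot(limit=10).to_rows()
```

The key design point this illustrates: `snapshot(limit=10)` returns a cheap configuration object, so calling several materializers on separate `snapshot(...)` calls stays side-effect free.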
CDF surface
```python
import delta_sharing

client = delta_sharing.SharingClient("recipient.share")
table = client.table("share.schema.table")

changes = table.changes(starting_version=5)
pdf = changes.to_pandas()
arrow_table = changes.to_arrow()
batches = changes.to_record_batches()
reader = changes.to_record_batch_reader()
```
Design goals
- Keep the existing URL-based APIs working unchanged.
- Make the new API additive, not a replacement.
- Keep query configuration on `snapshot(...)` and `changes(...)`, with `to_*()` methods acting as materializers.
- Support both eager Arrow materialization and lazy Arrow batch consumption.
- Make it easy for engines like DuckDB to consume a `RecordBatchReader` directly.
- Bring CDF into the same object model without changing legacy CDF semantics.
Compatibility requirements
This should not disrupt existing users.
- `load_as_pandas(...)` remains supported.
- `load_table_changes_as_pandas(...)` remains supported.
- The `"<profile>#<share>.<schema>.<table>"` format remains supported.
- New examples should demonstrate the object-based API.
- Existing syntax should remain documented as a compatibility path.
Implementation notes
The implementation should make pandas an adapter over shared reader logic rather than the only primary surface.
In particular:
- `SharingClient.table("share.schema.table")` should return a first-class table handle.
- `table.snapshot(...)` should configure snapshot reads.
- `to_arrow()`, `to_record_batches()`, and `to_record_batch_reader()` should share a common Arrow read path.
- `table.changes(...)` should mirror `table.snapshot(...)` and expose the same materializers.
- Legacy CDF behavior should be preserved: only use delta format when explicitly requested.
Docs and examples
If we proceed, the PR should include:
- Python README updates for the new table-handle, snapshot, Arrow, and CDF APIs.
- Example updates showing snapshot-oriented pandas and Arrow syntax.
- A new Arrow quickstart that demonstrates `to_arrow`, `to_record_batches`, `to_record_batch_reader`, and DuckDB integration.
- Any extra example dependency requirements, such as `duckdb`, documented explicitly.
Validation
The PR should include:
- Unit tests for Arrow table reads.
- Unit tests for lazy `RecordBatch` and `RecordBatchReader` reads.
- A regression test asserting the legacy `load_as_pandas(...)` result matches the new table-handle `snapshot(...).to_pandas()` result for the same table.
- CDF tests covering `table.changes(...).to_pandas()`, `to_arrow()`, `to_record_batches()`, and `to_record_batch_reader()`.
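The parity regression test could take roughly this shape. Both read functions below are stubs standing in for the real paths (`delta_sharing.load_as_pandas(table_url)` and `client.table(name).snapshot().to_pandas()`), since the new API does not exist yet; only the comparison pattern is the point.

```python
import pandas as pd

def legacy_read() -> pd.DataFrame:
    # Stand-in for delta_sharing.load_as_pandas(table_url, limit=10).
    return pd.DataFrame({"id": [1, 2, 3]})

def snapshot_read() -> pd.DataFrame:
    # Stand-in for client.table(name).snapshot(limit=10).to_pandas().
    return pd.DataFrame({"id": [1, 2, 3]})

def test_snapshot_matches_legacy() -> None:
    # assert_frame_equal checks values, dtypes, and index, so it catches
    # subtle divergences a simple equality check would miss.
    pd.testing.assert_frame_equal(legacy_read(), snapshot_read())

test_snapshot_matches_legacy()
```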
Open questions
Is `client.table("share.schema.table")` the right naming, and is `table.changes(...)` the right extension point for object-based CDF?