Python connector: add first-class table handles and object-based Arrow/CDF read APIs #860

@zacdav-db

Description

Summary

The Python connector currently requires users to construct "<profile>#<share>.<schema>.<table>" strings for common read paths. That works, but it is awkward, error-prone, and makes object-oriented workflows difficult.

Arrow-native consumption is also underexposed. This makes integration with Arrow consumers such as DuckDB less direct than it should be, and it forces users toward eager materialization even when they want batch-oriented reads.

The same problem exists for change data feed: CDF is still only exposed through legacy free functions, so it does not participate in the new object model.

Motivation

Today, a representative Python workflow looks like this:

profile_file = "recipient.share"
table_url = profile_file + "#share.schema.table"
data = delta_sharing.load_as_pandas(table_url, limit=10)

Pain points:

  • Users must manually build and parse table_url strings.
  • The API shape does not reflect the underlying concepts already present in the connector (SharingClient, Table, snapshots).
  • Arrow-native use cases are not first-class.
  • Lazy batch-oriented consumption for engines like DuckDB is not easy to discover.
  • CDF is disconnected from the new table-oriented object model.
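The first pain point can be made concrete with a sketch of the string handling users currently write by hand. `parse_table_url` below is a hypothetical helper, not part of the connector; it only illustrates the parsing burden the URL format pushes onto callers.

```python
# Hypothetical sketch of the string handling users write by hand today.
# parse_table_url is illustrative only; it is not part of delta-sharing.
def parse_table_url(table_url: str) -> tuple[str, str, str, str]:
    """Split '<profile>#<share>.<schema>.<table>' into its four parts."""
    profile, _, fqn = table_url.partition("#")
    parts = fqn.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected '<share>.<schema>.<table>', got {fqn!r}")
    share, schema, table = parts
    return profile, share, schema, table

print(parse_table_url("recipient.share#share.schema.table"))
# → ('recipient.share', 'share', 'schema', 'table')
```

Every caller that needs the pieces back (for logging, caching, retries) repeats some variant of this, which is exactly what a first-class table handle would absorb.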

Proposal

Add an additive object-based API alongside the existing URL-based API.

Snapshot surface

client = delta_sharing.SharingClient("recipient.share")
table = client.table("share.schema.table")

pdf = table.snapshot(limit=10).to_pandas()
arrow_table = table.snapshot(limit=10).to_arrow()
batches = table.snapshot(limit=10).to_record_batches()
reader = table.snapshot(limit=10).to_record_batch_reader()

Also add a URL-based Arrow helper for parity:

arrow_table = delta_sharing.load_as_arrow("recipient.share#share.schema.table", limit=10)

CDF surface

client = delta_sharing.SharingClient("recipient.share")
table = client.table("share.schema.table")

changes = table.changes(starting_version=5)
pdf = changes.to_pandas()
arrow_table = changes.to_arrow()
reader = changes.to_record_batch_reader()

Design goals

  • Keep the existing URL-based APIs working unchanged.
  • Make the new API additive, not a replacement.
  • Keep query configuration on snapshot(...) and changes(...), with to_*() methods acting as materializers.
  • Support both eager Arrow materialization and lazy Arrow batch consumption.
  • Make it easy for engines like DuckDB to consume a RecordBatchReader directly.
  • Bring CDF into the same object model without changing legacy CDF semantics.

Compatibility requirements

This should not disrupt existing users.

  • load_as_pandas(...) remains supported.
  • load_table_changes_as_pandas(...) remains supported.
  • The "<profile>#<share>.<schema>.<table>" format remains supported.
  • New examples should demonstrate the object-based API.
  • Existing syntax should remain documented as a compatibility path.

Implementation notes

The implementation should make pandas one adapter over shared reader logic rather than the sole primary surface.

In particular:

  • SharingClient.table("share.schema.table") should return a first-class table handle.
  • table.snapshot(...) should configure snapshot reads.
  • to_arrow(), to_record_batches(), and to_record_batch_reader() should share a common Arrow read path.
  • table.changes(...) should mirror table.snapshot(...) and expose the same materializers.
  • Legacy CDF behavior should be preserved: only use delta format when explicitly requested.
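The layering above can be sketched in pure Python, with the REST and Arrow plumbing stubbed out as plain lists. Only the class and method names (`Table`, `snapshot`, `to_arrow`, `to_record_batches`) follow the proposal; everything else is illustrative.

```python
# Minimal sketch of the proposed layering: one shared read path, with the
# to_*() methods as materializers over it. Lists stand in for RecordBatches.
class Snapshot:
    def __init__(self, table, limit=None):
        self._table = table
        self._limit = limit

    def _read_batches(self):
        # The single shared read path all materializers build on.
        rows = self._table._rows
        if self._limit is not None:
            rows = rows[: self._limit]
        # Yield fixed-size "batches" standing in for Arrow RecordBatches.
        for i in range(0, len(rows), 2):
            yield rows[i : i + 2]

    def to_record_batches(self):
        return list(self._read_batches())

    def to_arrow(self):
        # Eager materialization is just concatenation of the lazy path.
        return [row for batch in self._read_batches() for row in batch]


class Table:
    def __init__(self, name, rows):
        self.name = name
        self._rows = rows

    def snapshot(self, limit=None):
        return Snapshot(self, limit=limit)


t = Table("share.schema.table", rows=[1, 2, 3, 4, 5])
print(t.snapshot(limit=3).to_arrow())    # → [1, 2, 3]
print(t.snapshot().to_record_batches())  # → [[1, 2], [3, 4], [5]]
```

Keeping query configuration on `snapshot(...)` and making every `to_*()` a thin materializer over one read path is what lets pandas, Arrow tables, and batch readers stay consistent with each other.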

Docs and examples

If we proceed, the PR should include:

  • Python README updates for the new table-handle, snapshot, Arrow, and CDF APIs.
  • Example updates showing snapshot-oriented pandas and Arrow syntax.
  • A new Arrow quickstart that demonstrates to_arrow, to_record_batches, to_record_batch_reader, and DuckDB integration.
  • Any extra example dependency requirements, such as duckdb, documented explicitly.

Validation

The PR should include:

  • Unit tests for Arrow table reads.
  • Unit tests for lazy RecordBatch and RecordBatchReader reads.
  • A regression test asserting the legacy load_as_pandas(...) result matches the new table-handle snapshot(...).to_pandas(...) result for the same table.
  • CDF tests covering table.changes(...).to_pandas(), to_arrow(), to_record_batches(), and to_record_batch_reader().

Open questions

Is client.table("share.schema.table") the right naming, and is table.changes(...) the right extension point for object-based CDF?
