
[python] Add ReadBuilder.explain() for scan-plan visibility #7869

Open
TheR1sing3un wants to merge 6 commits into apache:master from TheR1sing3un:py-pypaimon-explain

@TheR1sing3un

What

Add ReadBuilder.explain() returning a structured ExplainResult so users
can see what a PyPaimon read will actually do — target snapshot, pushed-down
predicate / projection / limit, partition / bucket / file-stats pruning
funnel, and split-level execution signals (raw-convertible ratio, deletion-
vector ratio, level histogram, split-size skew).

The default __str__ is a compact debug layout; verbose=True lists every
split. explain() reads only the manifest list and the manifests; data files
are never opened.

Why

Plan exposes only splits and snapshot_id today; FileScanner already
does partition / bucket / file-stats pruning but none of that is visible to
users. The only way to inspect cost is reading INFO logs or walking
plan().splits() by hand. Apache Paimon Java has no SQL EXPLAIN of its own
either (that comes from Flink / Spark); this PR is scoped to scan-plan
visibility, not query planning.

Sample output

Partitioned primary-key (PK) table with HASH_FIXED buckets, predicate dt = '2026-05-12' AND id = 7:

== PyPaimon Scan Plan ==
Table:              default.demo (PK, HASH_FIXED)
Snapshot:           5  (schema 0)
Predicate:          (dt = '2026-05-12') AND (id = 7)
Projection:         [dt, id, val]
Limit:              100

Partition pruning:  20 -> 4  (pruned 16)
Bucket pruning:     4 -> 1  (pruned 3)
File skipping:      1 -> 1  (pruned 0)

Splits:             1
  raw-convertible:  1 / 1
  with DV:          0 / 1
  all-above-L0:     0 / 1
  files/split:      min=1  max=1  avg=1.00
  size/split:       min=2.6 KiB  p50=2.6 KiB  p95=2.6 KiB  max=2.6 KiB

Files:              1
Total size:         2.6 KiB
Estimated rows:     10   (merged: 10)
Level histogram:    L0=1
Deletion files:     0
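
The funnel and size lines in the sample above can be derived from a handful of counters. The sketch below is only an illustration of that arithmetic, not the code in this PR: `FunnelStage`, `nearest_rank`, and `human_size` are hypothetical names, and the nearest-rank percentile is an assumed definition for p50 / p95.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class FunnelStage:
    """One pruning stage: candidates going in and candidates surviving (hypothetical)."""
    name: str
    before: int
    after: int

    def render(self) -> str:
        pruned = self.before - self.after
        # Pad the label to a fixed 20-character column, as in the sample layout.
        return f"{self.name + ':':<20}{self.before} -> {self.after}  (pruned {pruned})"


def nearest_rank(sorted_sizes: List[int], pct: float) -> int:
    """Nearest-rank percentile over an ascending list (assumed definition)."""
    if not sorted_sizes:
        raise ValueError("no splits to summarise")
    rank = max(1, math.ceil(pct / 100 * len(sorted_sizes)))
    return sorted_sizes[rank - 1]


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary units (B, KiB, MiB, ...)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{int(value)} B" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} {units[-1]}"  # unreachable; keeps type checkers happy


print(FunnelStage("Partition pruning", 20, 4).render())
# Partition pruning:  20 -> 4  (pruned 16)

split_sizes = sorted([2662])  # a single split, with an illustrative byte count
print("p50 =", human_size(nearest_rank(split_sizes, 50)))
```

With a single split, min, p50, p95, and max all collapse to the same value, which is why the sample shows 2.6 KiB four times.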

Tests

pypaimon/tests/read_builder_explain_test.py covers 7 scenarios:
append-only baseline, PK/partition/bucket pruning funnel, predicate
rendering, verbose splits, empty snapshot, split-level signals, pretty-print
smoke. The existing read regression tests pass.

API / format impact

New API only: ReadBuilder.explain(verbose=False) -> ExplainResult. Hot
read path untouched — ScanStats is opt-in and only enabled by explain().
No data / wire format change. No Java-side change.

Follow-up

A follow-up patch will surface explain through the pypaimon CLI
(alongside cli_sql / cli_table) so users can inspect a query plan from
the command line without writing any Python. A # TODO next to
ReadBuilder.explain marks the entry point.

Generative AI usage

Drafted with the help of Claude Code; reviewed and tested locally by the
author.

Introduce ReadBuilder.explain() returning a structured ExplainResult that
summarises the target snapshot, the pushed-down predicate / projection /
limit, the partition / bucket / file-stats pruning funnel, and split-
level execution signals (raw-convertible ratio, deletion-vector ratio,
level histogram, files-per-split and split-size distribution).

A new opt-in ScanStats counter set is wired through FileScanner via
TableScan.scan_with_stats(). The regular read hot path is unaffected
when scan_stats is None. To produce accurate before/after counters,
explain() suppresses the manifest reader's early bucket filter and
forces single-threaded manifest decoding for the one pass that drives
it. The order of partition and bucket checks in _filter_manifest_entry
is rearranged so each pruning stage maps cleanly to one counter; both
filters remain pure AND tests and the final survivor set is identical.
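
The claim that reordering the checks changes the counters but not the survivors can be shown on a toy model. This is a minimal sketch under stated assumptions: `ScanStats` here is a stand-in with made-up fields, `filter_entries` is not the real `_filter_manifest_entry`, and it counts manifest entries rather than distinct partitions or buckets.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Entry = Tuple[str, int]                      # (partition, bucket): a simplified manifest entry
Stage = Tuple[str, Callable[[Entry], bool]]  # (stage name, AND-ed check)


@dataclass
class ScanStats:
    """Hypothetical opt-in counters; field names are illustrative."""
    seen: int = 0
    survived: Dict[str, int] = field(default_factory=dict)


def filter_entries(entries: Sequence[Entry],
                   stages: Sequence[Stage],
                   stats: Optional[ScanStats] = None) -> List[Entry]:
    """Apply the stages in order; count survivors per stage only when stats is given."""
    kept = []
    for entry in entries:
        if stats is not None:
            stats.seen += 1
        for name, check in stages:
            if not check(entry):
                break  # rejected at this stage; later stages never see the entry
            if stats is not None:
                stats.survived[name] = stats.survived.get(name, 0) + 1
        else:
            kept.append(entry)  # passed every stage
    return kept


entries = [(f"dt={d}", b) for d in range(3) for b in range(4)]
by_partition: Stage = ("partition", lambda e: e[0] == "dt=1")
by_bucket: Stage = ("bucket", lambda e: e[1] == 2)

stats = ScanStats()
partition_first = filter_entries(entries, [by_partition, by_bucket], stats)
bucket_first = filter_entries(entries, [by_bucket, by_partition])  # stats=None: no counting

print(partition_first == bucket_first)  # same survivor set in either order
print(stats.seen, stats.survived)
```

Because each stage is a pure AND test, the order only decides which counter a rejected entry is charged to. Running the partition check first makes the funnel read as partition survivors, then bucket survivors.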

Predicate rendering lives in a standalone helper so Predicate itself
stays rendering-agnostic.
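
As an illustration of keeping rendering outside the predicate type, the sketch below walks a small predicate tree and produces the string form shown in the sample output. `Leaf`, `Compound`, and `render` are hypothetical stand-ins, not PyPaimon's `Predicate` class or the helper added in this PR.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """Hypothetical leaf predicate: a column, an operator name, and its literals."""
    column: str
    op: str  # "equal", "in", "between", or "isNull"
    literals: Tuple = ()


@dataclass(frozen=True)
class Compound:
    """Hypothetical compound predicate joining children with "and" / "or"."""
    op: str
    children: Tuple[Union["Leaf", "Compound"], ...]


def render(node: Union[Leaf, Compound]) -> str:
    """Render a predicate tree to text without adding methods to the tree classes."""
    if isinstance(node, Compound):
        joiner = f" {node.op.upper()} "
        # Parenthesise every child so nested and/or stay unambiguous.
        return joiner.join(f"({render(child)})" for child in node.children)
    if node.op == "equal":
        return f"{node.column} = {node.literals[0]!r}"
    if node.op == "in":
        return f"{node.column} IN ({', '.join(repr(v) for v in node.literals)})"
    if node.op == "between":
        low, high = node.literals
        return f"{node.column} BETWEEN {low!r} AND {high!r}"
    if node.op == "isNull":
        return f"{node.column} IS NULL"
    raise ValueError(f"unsupported operator: {node.op}")


predicate = Compound("and", (
    Leaf("dt", "equal", ("2026-05-12",)),
    Leaf("id", "equal", (7,)),
))
print(render(predicate))
# (dt = '2026-05-12') AND (id = 7)
```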

Cover the seven scenarios called out in the design: append-only baseline,
PK partitioned + HASH_FIXED with predicate that triggers both partition
and bucket pruning, predicate rendering shapes (equal, in, between,
isNull, and/or), verbose split detail alignment with plan().splits(),
empty snapshot path, split-level signals (raw-convertible / DV / L0)
across append-only and DV-on PK tables, and pretty-print smoke for the
compact layout anchors.
@TheR1sing3un TheR1sing3un requested a review from JingsongLi May 16, 2026 05:46