@Dilli-Babu-Godari
Contributor

@Dilli-Babu-Godari Dilli-Babu-Godari commented Jan 29, 2026

Description

Java scan operators report raw input incorrectly.

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add new methods getDecompressedBytes and getDecompressedPositions to ConnectorPageSource, and add three-tier metrics to correct raw input and processed input accounting for Java-based scan operators, especially for uncompressed sources and lazy block sources.

@Dilli-Babu-Godari Dilli-Babu-Godari requested review from a team and sdruzkin as code owners January 29, 2026 07:25
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Jan 29, 2026
@prestodb-ci prestodb-ci requested review from a team, anandamideShakyan and wanglinsong and removed request for a team January 29, 2026 07:26
@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 29, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Dilli-Babu-Godari / name: Dilli-Babu-Godari (f8821d4)

@sourcery-ai
Contributor

sourcery-ai bot commented Jan 29, 2026

Reviewer's Guide

Introduces three-tier scan metrics (raw/compressed, decompressed/processed input, and output) for Java scan operators by extending ConnectorPageSource with decompressed counters, wiring those through Hive page sources and table/scan operators, and adding tests to validate correct metric relationships for uncompressed and lazy/compressed sources.

Sequence diagram for three-tier scan metrics recording in TableScanOperator

sequenceDiagram
  participant T as TableScanOperator
  participant S as ConnectorPageSource
  participant C as OperatorContext

  T->>S: getNextPage()
  S-->>T: Page

  T->>T: recordInputStats()

  T->>S: getCompletedBytes()
  S-->>T: endCompletedBytes

  T->>S: getCompletedPositions()
  S-->>T: endCompletedPositions

  T->>S: getDecompressedBytes()
  S-->>T: endDecompressedBytes

  T->>S: getReadTimeNanos()
  S-->>T: endReadTimeNanos

  T->>T: compute inputBytes, decompressedInputBytes, positionCount, inputBytesReadTime

  T->>C: recordRawInputWithTiming(inputBytes, positionCount, inputBytesReadTime)
  T->>C: recordProcessedInput(decompressedInputBytes, positionCount)

  T->>T: update completedBytes, completedPositions, decompressedBytes, readTimeNanos
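
The delta-based flow in the sequence diagram can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Source`, `FakeSource`, `Recorder`, and `ScanSketch` are hypothetical, simplified stand-ins for `ConnectorPageSource`, `OperatorContext`, and `TableScanOperator`, and only the method names shown in the diagram are taken from the source.

```java
// Sketch only: every type here is a hypothetical stand-in for the real Presto classes.
final class DeltaStatsSketch {
    private DeltaStatsSketch() {}

    /** Cumulative counters, as a page source would expose them. */
    interface Source {
        long getCompletedBytes();
        long getCompletedPositions();
        long getDecompressedBytes();
        long getReadTimeNanos();
    }

    /** Mutable fake source so the counters can be advanced by hand. */
    static final class FakeSource implements Source {
        long bytes;
        long positions;
        long decompressed;
        long nanos;

        @Override public long getCompletedBytes() { return bytes; }
        @Override public long getCompletedPositions() { return positions; }
        @Override public long getDecompressedBytes() { return decompressed; }
        @Override public long getReadTimeNanos() { return nanos; }
    }

    /** Accumulates what the operator context would be told. */
    static final class Recorder {
        long rawBytes;
        long rawPositions;
        long rawReadNanos;
        long processedBytes;
        long processedPositions;

        void recordRawInputWithTiming(long bytes, long positions, long readTimeNanos) {
            rawBytes += bytes;
            rawPositions += positions;
            rawReadNanos += readTimeNanos;
        }

        void recordProcessedInput(long bytes, long positions) {
            processedBytes += bytes;
            processedPositions += positions;
        }
    }

    static final class ScanSketch {
        private final Source source;
        private final Recorder recorder;
        // Last values seen from the source; deltas are measured against these.
        private long completedBytes;
        private long completedPositions;
        private long decompressedBytes;
        private long readTimeNanos;

        ScanSketch(Source source, Recorder recorder) {
            this.source = source;
            this.recorder = recorder;
        }

        void recordInputStats() {
            long endCompletedBytes = source.getCompletedBytes();
            long endCompletedPositions = source.getCompletedPositions();
            long endDecompressedBytes = source.getDecompressedBytes();
            long endReadTimeNanos = source.getReadTimeNanos();

            // Only the growth since the previous call is reported.
            long inputBytes = endCompletedBytes - completedBytes;
            long decompressedInputBytes = endDecompressedBytes - decompressedBytes;
            long positionCount = endCompletedPositions - completedPositions;
            long inputBytesReadTime = endReadTimeNanos - readTimeNanos;

            recorder.recordRawInputWithTiming(inputBytes, positionCount, inputBytesReadTime);
            recorder.recordProcessedInput(decompressedInputBytes, positionCount);

            completedBytes = endCompletedBytes;
            completedPositions = endCompletedPositions;
            decompressedBytes = endDecompressedBytes;
            readTimeNanos = endReadTimeNanos;
        }
    }
}
```

Because each call reports only the difference from the previously stored counters, calling `recordInputStats()` twice without the source advancing adds zero the second time in this sketch.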

Class diagram for updated ConnectorPageSource metrics and Java scan operators

classDiagram

class ConnectorPageSource {
  <<interface>>
  +long getCompletedBytes()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
  +long getReadTimeNanos()
  +RuntimeStats getRuntimeStats()
}

class PageFilePageSource {
  -long completedBytes
  -long decompressedBytes
  -long decompressedPositions
  -long readTimeNanos
  -long memoryUsageBytes
  +Page getNextPage()
  +long getCompletedBytes()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
}

class AggregatedParquetPageSource {
  -boolean completed
  -long completedBytes
  -long decompressedBytes
  -long decompressedPositions
  -long readTimeNanos
  +Page getNextPage()
  +long getCompletedBytes()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
}

class RcFilePageSource {
  -int pageId
  -long completedPositions
  -long decompressedBytes
  -long decompressedPositions
  +Page getNextPage()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
}

class AggregatedOrcPageSource {
  -boolean completed
  -long completedBytes
  -long decompressedBytes
  -long readTimeNanos
  +Page getNextPage()
  +long getCompletedBytes()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
}

class OrcBatchPageSource {
  -int batchId
  -long completedPositions
  -long decompressedBytes
  -boolean closed
  +Page getNextPage()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
}

class ParquetPageSource {
  -int batchId
  -long completedPositions
  -long decompressedBytes
  -boolean closed
  +Page getNextPage()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
}

class OrcSelectivePageSource {
  -RowIDCoercer coercer
  -boolean supplyRowIDs
  -OptionalInt rowIDColumnIndex
  -long decompressedBytes
  -boolean closed
  +Page getNextPage()
  +long getCompletedPositions()
  +long getDecompressedBytes()
  +long getDecompressedPositions()
}

class TableScanOperator {
  -ConnectorPageSource source
  -OperatorContext operatorContext
  -long completedBytes
  -long completedPositions
  -long decompressedBytes
  -long readTimeNanos
  +Page getOutput()
  -void recordInputStats()
}

class PageSourceOperator {
  -ConnectorPageSource pageSource
  -OperatorContext operatorContext
  -long completedBytes
  -long decompressedBytes
  -long readTimeNanos
  +Page getOutput()
}

class ScanFilterAndProjectOperator {
  -ConnectorPageSource pageSource
  -OperatorContext operatorContext
  -long completedBytes
  -long completedPositions
  -long readTimeNanos
  -Page recordProcessedInput(Page page)
  -void recordInputStats()
}

class OperatorContext {
  +void recordRawInputWithTiming(long bytes, long positions, long readTimeNanos)
  +void recordProcessedInput(long bytes, long positions)
}

ConnectorPageSource <|.. PageFilePageSource
ConnectorPageSource <|.. AggregatedParquetPageSource
ConnectorPageSource <|.. RcFilePageSource
ConnectorPageSource <|.. AggregatedOrcPageSource
ConnectorPageSource <|.. OrcBatchPageSource
ConnectorPageSource <|.. ParquetPageSource
ConnectorPageSource <|.. OrcSelectivePageSource

TableScanOperator o-- ConnectorPageSource
TableScanOperator o-- OperatorContext
PageSourceOperator o-- ConnectorPageSource
PageSourceOperator o-- OperatorContext
ScanFilterAndProjectOperator o-- ConnectorPageSource
ScanFilterAndProjectOperator o-- OperatorContext

File-Level Changes

Change Details Files
Extend ConnectorPageSource API to expose decompressed byte/position metrics alongside existing completed metrics, with backwards‑compatible defaults.
  • Clarified Javadoc for getCompletedBytes/getCompletedPositions to represent raw input from storage.
  • Added default methods getDecompressedBytes/getDecompressedPositions that by default delegate to completed metrics so existing connectors remain valid.
presto-spi/src/main/java/com/facebook/presto/spi/ConnectorPageSource.java
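
As a rough illustration of the backwards-compatible defaults described for this change, the sketch below uses a hypothetical, cut-down interface (`PageSourceMetricsSketch`) rather than the real `ConnectorPageSource`. The two implementing classes and their constant values are invented for the example.

```java
// Simplified, hypothetical mirror of the metric accessors on ConnectorPageSource.
interface PageSourceMetricsSketch {
    long getCompletedBytes();

    long getCompletedPositions();

    // Assumed default: with no separate accounting, decompressed equals completed.
    default long getDecompressedBytes() {
        return getCompletedBytes();
    }

    default long getDecompressedPositions() {
        return getCompletedPositions();
    }
}

/** A connector that predates the new methods and overrides nothing. */
final class LegacySourceSketch implements PageSourceMetricsSketch {
    @Override public long getCompletedBytes() { return 128; }
    @Override public long getCompletedPositions() { return 4; }
}

/** A source that reads compressed data and reports the decompressed size itself. */
final class CompressedSourceSketch implements PageSourceMetricsSketch {
    @Override public long getCompletedBytes() { return 128; }
    @Override public long getCompletedPositions() { return 4; }
    @Override public long getDecompressedBytes() { return 512; }
}
```

In this sketch `LegacySourceSketch` compiles unchanged and reports identical raw and decompressed values, while `CompressedSourceSketch` overrides only the byte counter and inherits the default for positions.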
Track decompressed metrics in Hive page sources so operators can distinguish compressed raw input from decompressed processed input.
  • Added decompressedBytes/decompressedPositions fields to Hive page sources (pagefile, Parquet, RCFile, ORC batch/selective, aggregated ORC/Parquet).
  • Increment decompressed metrics when constructing result Pages (after decompression / before filtering).
  • Implemented ConnectorPageSource.getDecompressedBytes/getDecompressedPositions in these classes to return tracked values.
presto-hive/src/main/java/com/facebook/presto/hive/pagefile/PageFilePageSource.java
presto-hive/src/main/java/com/facebook/presto/hive/parquet/AggregatedParquetPageSource.java
presto-hive/src/main/java/com/facebook/presto/hive/rcfile/RcFilePageSource.java
presto-hive/src/main/java/com/facebook/presto/hive/orc/AggregatedOrcPageSource.java
presto-hive/src/main/java/com/facebook/presto/hive/orc/OrcBatchPageSource.java
presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSource.java
presto-hive/src/main/java/com/facebook/presto/hive/orc/OrcSelectivePageSource.java
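
A page source that tracks both tiers while it builds pages might look something like the sketch below. It assumes a made-up `Chunk` type carrying a compressed size and already-decoded values, and it uses a `long[]` in place of a `Page`; none of this is taken from the Hive page sources in the PR.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: Chunk and long[] stand in for a compressed stripe and a Page.
final class ChunkedSourceSketch {
    static final class Chunk {
        final long compressedSizeInBytes;
        final long[] values;

        Chunk(long compressedSizeInBytes, long[] values) {
            this.compressedSizeInBytes = compressedSizeInBytes;
            this.values = values;
        }
    }

    private final Iterator<Chunk> chunks;
    private long completedBytes;
    private long completedPositions;
    private long decompressedBytes;
    private long decompressedPositions;

    ChunkedSourceSketch(List<Chunk> chunks) {
        this.chunks = chunks.iterator();
    }

    /** Returns the next decoded page, or null when the source is exhausted. */
    long[] getNextPage() {
        if (!chunks.hasNext()) {
            return null;
        }
        Chunk chunk = chunks.next();

        // Raw tier: bytes as they were read from storage.
        completedBytes += chunk.compressedSizeInBytes;
        completedPositions += chunk.values.length;

        // The copy stands in for decompression and decoding.
        long[] page = chunk.values.clone();

        // Decompressed tier: size of the page that was constructed, before any filtering.
        decompressedBytes += (long) page.length * Long.BYTES;
        decompressedPositions += page.length;
        return page;
    }

    long getCompletedBytes() { return completedBytes; }
    long getCompletedPositions() { return completedPositions; }
    long getDecompressedBytes() { return decompressedBytes; }
    long getDecompressedPositions() { return decompressedPositions; }
}
```

The point of the sketch is only the placement of the two increments: the raw counters move when data is read, and the decompressed counters move when the result page exists.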
Wire three-tier metrics (raw input, decompressed processed input, and output) into Java scan operators so stats report raw vs processed sizes correctly, especially for uncompressed sources.
  • In TableScanOperator, added decompressedBytes tracking and changed recordInputStats to record raw input from getCompletedBytes and processed input from getDecompressedBytes deltas.
  • In PageSourceOperator, added decompressedBytes tracking and changed accounting to record raw input from completed bytes and processed input from decompressed bytes instead of page size alone.
  • In ScanFilterAndProjectOperator, moved processed-input accounting into recordProcessedInput to use actual block sizes after filtering/projection, and removed processed-input accounting from recordInputStats; also now calls recordInputStats from recordProcessedInput to maintain raw input accounting.
presto-main-base/src/main/java/com/facebook/presto/operator/TableScanOperator.java
presto-main-base/src/main/java/com/facebook/presto/operator/PageSourceOperator.java
presto-main-base/src/main/java/com/facebook/presto/operator/ScanFilterAndProjectOperator.java
Update test page sources and add regression tests to validate decompressed and raw-input metrics behavior for uncompressed and lazy sources.
  • Made SinglePagePageSource return realistic completed bytes/positions and added decompressed* methods that equal completed metrics for uncompressed sources.
  • Enhanced CountingLazyPageSource to track decompressedBytes and expose them via new SPI methods.
  • Added tests covering decompressed metrics tracking on uncompressed sources, three-tier metrics with lazy blocks (completed <= decompressed, output <= decompressed), and equality of raw vs processed metrics for uncompressed sources in scan operators.
presto-main-base/src/test/java/com/facebook/presto/operator/TestScanFilterAndProjectOperator.java
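
The metric relationships listed for the new tests can be expressed as a small predicate. The helper below is a hypothetical sketch, not part of `TestScanFilterAndProjectOperator`; it assumes the relationships exactly as summarized above (raw equals processed for uncompressed sources; otherwise raw <= processed and output <= processed).

```java
// Hypothetical helper: encodes the relationships summarized in the review guide.
final class ThreeTierCheckSketch {
    private ThreeTierCheckSketch() {}

    /**
     * @param rawBytes       bytes read from storage (completed)
     * @param processedBytes bytes after decompression (decompressed)
     * @param outputBytes    bytes emitted by the operator
     * @param uncompressed   whether the source stores data uncompressed
     */
    static boolean holds(long rawBytes, long processedBytes, long outputBytes, boolean uncompressed) {
        if (uncompressed) {
            // Uncompressed sources: raw and processed sizes are expected to match.
            return rawBytes == processedBytes;
        }
        // Lazy or compressed sources: completed <= decompressed and output <= decompressed.
        return rawBytes <= processedBytes && outputBytes <= processedBytes;
    }
}
```

For example, `ThreeTierCheckSketch.holds(100, 250, 80, false)` is true, while `holds(300, 250, 80, false)` is false because the raw size exceeds the decompressed size.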

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In ScanFilterAndProjectOperator, recordProcessedInput(Page) now calls recordInputStats(), so please double-check that recordInputStats() is not also invoked elsewhere on the same operator lifecycle to avoid double-counting raw input bytes/positions.
  • For the new decompressed metrics, consider tracking decompressedPositions consistently wherever you track decompressedBytes (e.g., AggregatedOrcPageSource currently always returns 0 for decompressed positions), so consumers of the API can rely on both values being meaningful when decompression is accounted for.

## Individual Comments

### Comment 1
<location> `presto-main-base/src/test/java/com/facebook/presto/operator/TestScanFilterAndProjectOperator.java:599-608` </location>
<code_context>
+        // Verify three-tier metrics:
</code_context>

<issue_to_address>
**suggestion (testing):** Also assert positions-related metrics to cover edge cases around decompressed vs completed positions

In `testThreeTierMetricsWithLazyBlocks`, you currently only validate byte-level metrics. Since positions are also tracked (on the page source and via operator stats), please assert those as well to ensure they stay consistent, e.g.: (1) `assertEquals(pageSource.getCompletedPositions(), inputPage.getPositionCount())`, (2) `assertEquals(pageSource.getDecompressedPositions(), inputPage.getPositionCount())`, and (3) `assertEquals(stats.getRawInputPositions(), stats.getInputPositions())` (or the appropriate operator-level accessors). This will help catch regressions where byte accounting is correct but position accounting is not, particularly with lazy loading.
</issue_to_address>

@Dilli-Babu-Godari Dilli-Babu-Godari force-pushed the oss-json-metrics-fix branch 6 times, most recently from 40158bb to c75648b Compare January 29, 2026 14:37
@steveburnett
Contributor

Please edit the release note to follow the general level of detail used in previous releases - see the 0.296 release notes for examples.

What would a reader need to know about the change in this PR?

@Dilli-Babu-Godari
Contributor Author

> Please edit the release note to follow the general level of detail used in previous releases - see the 0.296 release notes for examples.
>
> What would a reader need to know about the change in this PR?

Updated, Thanks!
