Commit fd80d6c

[Kernel] [Pagination] New Pagination API (delta-io#4836)
## 🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta/pull/4836/files) to review incremental changes.

- [**stack/pagination-api**](delta-io#4836) [[Files changed](https://github.com/delta-io/delta/pull/4836/files)]
- [stack/read-page1](delta-io#4837) [[Files changed](https://github.com/delta-io/delta/pull/4837/files/9d95b4dc82006d2883eb11cee98caeb6899aa4e1..ab5ebd100df54e220cabfbc89eb94a520f3c2db3)]
- [stack/skipJSON](delta-io#4844) [[Files changed](https://github.com/delta-io/delta/pull/4844/files/ab5ebd100df54e220cabfbc89eb94a520f3c2db3..971a98d84f12031099aa35a77e39a8e87a69b05f)]
- [stack/multi-CP](delta-io#4845) [[Files changed](https://github.com/delta-io/delta/pull/4845/files/971a98d84f12031099aa35a77e39a8e87a69b05f..40c4d7d7e0924a7d8225b6a44db7ddbf60b70de6)]
- [stack/sidecar](delta-io#4846) [[Files changed](https://github.com/delta-io/delta/pull/4846/files/40c4d7d7e0924a7d8225b6a44db7ddbf60b70de6..8cbb470df1ab57e1fd8c95c3ecd4c9f0db466390)]

---------

#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [x] Kernel
- [ ] Other (fill in here)

## Description

This PR introduces a new API for Kernel pagination support.

## How was this patch tested?

No implementation yet, no tests.

## Does this PR introduce _any_ user-facing changes?

Yes.
1 parent 4b2a0e6 commit fd80d6c

4 files changed: +111 -0 lines changed
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+/*
+ * Copyright (2025) The Delta Lake Project Authors.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package io.delta.kernel;
+
+import io.delta.kernel.data.FilteredColumnarBatch;
+import io.delta.kernel.engine.Engine;
+
+/**
+ * Extension of {@link Scan} that supports pagination.
+ *
+ * <p>This interface allows consumers to retrieve scan results in discrete, ordered pages rather
+ * than all at once. This is particularly useful for large datasets where materializing the full
+ * result set would be expensive in terms of memory or compute resources.
+ *
+ * <p>Pagination is achieved via a combination of {@code pageSize} and {@code pageToken}. The {@code
+ * pageSize} controls how many Scan files are returned in each page, while the {@code pageToken}
+ * encodes the location of the next batch to read and is used to resume the scan exactly where the
+ * last page ended. For the first page, the {@code pageToken} should be {@code Optional.empty()}.
+ *
+ * <p>Consumers typically use {@link PaginatedScan} in a loop: they call {@code getScanFiles()} to
+ * retrieve an iterator over the current page's scan files. After consuming the iterator, users
+ * should call {@link PaginatedScanFilesIterator#getCurrentPageToken} to retrieve a token to pass
+ * into the next page request. This allows users to scan the dataset incrementally, resuming from
+ * where they left off.
+ */
+public interface PaginatedScan extends Scan {
+
+  /**
+   * Get a paginated iterator of Scan files for the current page.
+   *
+   * @param engine {@link Engine} instance to use in Delta Kernel.
+   * @return iterator of {@link FilteredColumnarBatch}s for the current page.
+   */
+  @Override
+  PaginatedScanFilesIterator getScanFiles(Engine engine);
+}
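The consumer loop described in the `PaginatedScan` Javadoc can be sketched with in-memory stand-ins. This is a minimal illustration, not Kernel code: `PaginationLoopSketch`, `PageIterator`, `getPage`, and the integer token are hypothetical names and simplifications (the real token is a `Row`, and the real iterator yields `FilteredColumnarBatch` objects rather than file names). It assumes the token is simply the index of the next unread file.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;

/** Hypothetical in-memory stand-ins; none of these types are part of Delta Kernel. */
class PaginationLoopSketch {

  /** Stand-in for PaginatedScanFilesIterator: yields file names and exposes a resume token. */
  static final class PageIterator implements Iterator<String> {
    private final List<String> files;
    private final int end; // exclusive end index of this page
    private int next; // index of the next unconsumed file

    PageIterator(List<String> files, int start, int pageSize) {
      this.files = files;
      this.next = start;
      this.end = Math.min(files.size(), start + pageSize);
    }

    @Override
    public boolean hasNext() {
      return next < end;
    }

    @Override
    public String next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      return files.get(next++);
    }

    /** Assumed token: index of the next unconsumed file; empty once the scan is exhausted. */
    Optional<Integer> getCurrentPageToken() {
      return next < files.size() ? Optional.of(next) : Optional.empty();
    }
  }

  /** Stand-in for buildPaginated(pageSize, pageToken).getScanFiles(engine). */
  static PageIterator getPage(List<String> files, int pageSize, Optional<Integer> token) {
    return new PageIterator(files, token.orElse(0), pageSize);
  }

  /** Reads every page in order, threading the token from one request into the next. */
  static List<List<String>> readAllPages(List<String> files, int pageSize) {
    List<List<String>> pages = new ArrayList<>();
    Optional<Integer> token = Optional.empty(); // first page: no token
    do {
      PageIterator it = getPage(files, pageSize, token);
      List<String> page = new ArrayList<>();
      while (it.hasNext()) {
        page.add(it.next());
      }
      pages.add(page);
      token = it.getCurrentPageToken(); // ask for the token only after consuming the page
    } while (token.isPresent());
    return pages;
  }

  public static void main(String[] args) {
    // prints [[f1, f2], [f3, f4], [f5]]
    System.out.println(readAllPages(List.of("f1", "f2", "f3", "f4", "f5"), 2));
  }
}
```

The loop ends when the token comes back empty, which mirrors the documented contract that an empty token means no further non-empty pages remain.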
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+/*
+ * Copyright (2025) The Delta Lake Project Authors.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package io.delta.kernel;
+
+import io.delta.kernel.data.FilteredColumnarBatch;
+import io.delta.kernel.data.Row;
+import io.delta.kernel.utils.CloseableIterator;
+
+/**
+ * An iterator over {@link FilteredColumnarBatch}, each representing a batch of Scan Files in a
+ * paginated scan. This iterator also exposes the page token that a subsequent request can use to
+ * resume the scan from the exact position where the current page ends.
+ *
+ * <p>This interface extends {@link CloseableIterator} and should be closed when the iteration is
+ * complete.
+ */
+public interface PaginatedScanFilesIterator extends CloseableIterator<FilteredColumnarBatch> {
+  /**
+   * Returns a page token representing the starting position of the next page. A subsequent request
+   * uses this token to resume the scan from the exact position where the current page ends. The
+   * page token also contains metadata for validation purposes, such as detecting changes in query
+   * parameters or the underlying log files.
+   *
+   * <p>The page token represents the position of the current iterator at the time it's called. If
+   * the iterator is only partially consumed, the returned token will always point to the beginning
+   * of the next unconsumed {@link FilteredColumnarBatch}. This method will return
+   * {@code Optional.empty()} if all data in the Scan is consumed (no more non-empty pages remain).
+   */
+  Row getCurrentPageToken();
+}
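The partial-consumption rule in the `getCurrentPageToken` Javadoc can be illustrated with a toy iterator. The sketch below is an assumption-laden model, not the Kernel implementation: `TokenPositionSketch` and `BatchIterator` are hypothetical, a batch is a plain list of file names, the token is just the index of the next unconsumed batch, and a single iterator stands in for the whole scan.

```java
import java.io.Closeable;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Optional;

/** Hypothetical toy model of token positions; not the Kernel implementation. */
class TokenPositionSketch {

  /** Iterates over batches of file names and tracks the next unconsumed batch. */
  static final class BatchIterator implements Iterator<List<String>>, Closeable {
    private final List<List<String>> batches;
    private int nextBatch = 0;
    private boolean closed = false;

    BatchIterator(List<List<String>> batches) {
      this.batches = batches;
    }

    @Override
    public boolean hasNext() {
      return !closed && nextBatch < batches.size();
    }

    @Override
    public List<String> next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      return batches.get(nextBatch++);
    }

    /** Assumed token: index of the next unconsumed batch, or empty when nothing remains. */
    Optional<Integer> getCurrentPageToken() {
      return nextBatch < batches.size() ? Optional.of(nextBatch) : Optional.empty();
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  public static void main(String[] args) {
    List<List<String>> batches = List.of(List.of("f1", "f2"), List.of("f3"), List.of("f4"));
    // try-with-resources closes the iterator once iteration is complete.
    try (BatchIterator it = new BatchIterator(batches)) {
      it.next(); // consume only the first batch
      System.out.println(it.getCurrentPageToken()); // prints Optional[1]
      while (it.hasNext()) {
        it.next();
      }
      System.out.println(it.getCurrentPageToken()); // prints Optional.empty
    }
  }
}
```

After one of three batches is consumed the token points at batch index 1, the first unconsumed batch. Once everything is consumed the token is empty.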

kernel/kernel-api/src/main/java/io/delta/kernel/ScanBuilder.java

Lines changed: 12 additions & 0 deletions
@@ -17,9 +17,11 @@
 package io.delta.kernel;
 
 import io.delta.kernel.annotation.Evolving;
+import io.delta.kernel.data.Row;
 import io.delta.kernel.engine.Engine;
 import io.delta.kernel.expressions.Predicate;
 import io.delta.kernel.types.StructType;
+import java.util.Optional;
 
 /**
  * Builder to construct {@link Scan} object.
@@ -65,4 +67,14 @@ public interface ScanBuilder {
 
   /** @return Build the {@link Scan instance} */
   Scan build();
+
+  /**
+   * Build a Paginated Scan with a required page size and an optional page token.
+   *
+   * @param pageSize Maximum number of Scan Files to return in this page.
+   * @param pageToken Optional page token representing the current scan position; empty to start
+   *     from the beginning.
+   * @return A {@link PaginatedScan} configured for pagination.
+   */
+  PaginatedScan buildPaginated(long pageSize, Optional<Row> pageToken);
 }
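This commit does not define how `buildPaginated` validates its arguments. The sketch below shows one way an implementation might check them, under stated assumptions: `PageTokenValidationSketch`, `Token`, `resolveStart`, and the fingerprint field are all hypothetical, and the checks (positive page size, token issued for the same query) are guesses based on the Javadoc's mention of validation metadata.

```java
import java.util.Optional;

/** Hypothetical validation sketch; the token layout and the checks are assumptions. */
class PageTokenValidationSketch {

  /** Toy token: a resume position plus a fingerprint of the query that produced it. */
  static final class Token {
    final long position;
    final int queryFingerprint;

    Token(long position, int queryFingerprint) {
      this.position = position;
      this.queryFingerprint = queryFingerprint;
    }
  }

  /** Returns the position to start reading from, or throws if the request is invalid. */
  static long resolveStart(long pageSize, Optional<Token> token, int currentFingerprint) {
    if (pageSize <= 0) {
      throw new IllegalArgumentException("pageSize must be positive: " + pageSize);
    }
    if (!token.isPresent()) {
      return 0L; // no token: start from the beginning
    }
    Token t = token.get();
    if (t.queryFingerprint != currentFingerprint) {
      throw new IllegalArgumentException("page token was issued for a different query");
    }
    return t.position;
  }

  public static void main(String[] args) {
    int fingerprint = "id > 10".hashCode(); // stand-in for hashing the query parameters
    System.out.println(resolveStart(100, Optional.empty(), fingerprint)); // prints 0
    System.out.println(
        resolveStart(100, Optional.of(new Token(42, fingerprint)), fingerprint)); // prints 42
  }
}
```

Rejecting a mismatched token up front keeps a caller from silently resuming a scan whose predicate or schema has changed since the token was issued.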

kernel/kernel-api/src/main/java/io/delta/kernel/internal/ScanBuilderImpl.java

Lines changed: 7 additions & 0 deletions
@@ -16,8 +16,10 @@
 
 package io.delta.kernel.internal;
 
+import io.delta.kernel.PaginatedScan;
 import io.delta.kernel.Scan;
 import io.delta.kernel.ScanBuilder;
+import io.delta.kernel.data.Row;
 import io.delta.kernel.expressions.Predicate;
 import io.delta.kernel.internal.actions.Metadata;
 import io.delta.kernel.internal.actions.Protocol;
@@ -85,4 +87,9 @@ public Scan build() {
         dataPath,
         snapshotReport);
   }
+
+  @Override
+  public PaginatedScan buildPaginated(long pageSize, Optional<Row> pageToken) {
+    throw new UnsupportedOperationException("not implemented");
+  }
 }
