feat: basic table scan planning #112
Merged
Changes from 8 commits
Commits (28)
- e971cc4 feat: basic table scan planning (gty404)
- 5fc6971 fix cpp lint (gty404)
- 6a2cb74 fix build fail on windows (gty404)
- d71c26a fix lint (gty404)
- c6c1a1f fix some comments (gty404)
- cd07a0c fix clang format (gty404)
- b7becc2 fix some comments (gty404)
- abfdfcd Merge branch 'main' into table-scan (gty404)
- 28043b1 Update src/iceberg/table_scan.h (gty404)
- fa25891 Update src/iceberg/table_scan.h (gty404)
- 428651f Merge branch 'main' into table-scan (gty404)
- 812a545 fix comments (gty404)
- 0f79c7c Merge branch 'main' into table-scan (gty404)
- 85802e9 Abstract TableScan and ScanTask (gty404)
- c7621b3 fix lint (gty404)
- e1267fc fix lint (gty404)
- 5248e22 fix lint (gty404)
- 368e268 Merge branch 'main' into table-scan (gty404)
- 29e8865 resolve some comments (gty404)
- ae560f3 remove Snapshot::kInitialSequenceNumber (gty404)
- 0ff952b resolve some comments (gty404)
- 1b5d123 resolve some comments (gty404)
- 3dc2b38 resolve some comments (gty404)
- 702d0f4 resolve comments (gty404)
- 4342887 resolve some comments (gty404)
- 0d9b89e Trigger CI (gty404)
- e4af0e7 Trigger CI (gty404)
- 8705a82 resolve some comments (gty404)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| #include "iceberg/table_scan.h" | ||
|
|
||
| #include "iceberg/manifest_entry.h" | ||
| #include "iceberg/manifest_list.h" | ||
| #include "iceberg/manifest_reader.h" | ||
| #include "iceberg/schema.h" | ||
| #include "iceberg/schema_field.h" | ||
| #include "iceberg/snapshot.h" | ||
| #include "iceberg/table.h" | ||
| #include "iceberg/util/macros.h" | ||
|
|
||
| namespace iceberg { | ||
|
|
||
| TableScanBuilder::TableScanBuilder(const Table& table) : table_(table) {} | ||
|
|
||
| TableScanBuilder& TableScanBuilder::WithColumnNames( | ||
| const std::vector<std::string>& column_names) { | ||
| column_names_ = column_names; | ||
| return *this; | ||
| } | ||
|
|
||
| TableScanBuilder& TableScanBuilder::WithSnapshotId(int64_t snapshot_id) { | ||
| snapshot_id_ = snapshot_id; | ||
| return *this; | ||
| } | ||
|
|
||
| TableScanBuilder& TableScanBuilder::WithFilter( | ||
| const std::shared_ptr<Expression>& filter) { | ||
| filter_ = filter; | ||
| return *this; | ||
| } | ||
|
|
||
| Result<std::unique_ptr<TableScan>> TableScanBuilder::Build() { | ||
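| std::shared_ptr<Snapshot> snapshot; | ||
| if (snapshot_id_) { | ||
| ICEBERG_ASSIGN_OR_RAISE(snapshot, table_.snapshot(*snapshot_id_)); | ||
| } else { | ||
| snapshot = table_.current_snapshot(); | ||
| } | ||
| // A table without a current snapshot has nothing to scan; fail before dereferencing. | ||
| if (snapshot == nullptr) { | ||
| return InvalidArgument("No snapshot found for table scan"); | ||
| } | ||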
|
|
||
| std::shared_ptr<Schema> schema; | ||
| if (snapshot->schema_id) { | ||
| const auto& schemas = table_.schemas(); | ||
| if (auto it = schemas.find(*snapshot->schema_id); it != schemas.end()) { | ||
| schema = it->second; | ||
| } else { | ||
| return InvalidData("Schema {} in snapshot {} is not found", *snapshot->schema_id, | ||
| snapshot->snapshot_id); | ||
| } | ||
| } else { | ||
| schema = table_.schema(); | ||
| } | ||
|
|
||
| std::vector<int32_t> field_ids; | ||
| field_ids.reserve(column_names_.size()); | ||
| for (const auto& column_name : column_names_) { | ||
| auto field_opt = schema->GetFieldByName(column_name); | ||
| if (!field_opt) { | ||
| return InvalidArgument("Column {} not found in schema", column_name); | ||
| } | ||
| field_ids.emplace_back(field_opt.value().get().field_id()); | ||
| } | ||
|
|
||
| TableScan::ScanContext context{.snapshot = std::move(snapshot), | ||
| .schema = std::move(schema), | ||
| .field_ids = std::move(field_ids), | ||
| .filter = std::move(filter_)}; | ||
| return std::make_unique<TableScan>(std::move(context), table_.io()); | ||
| } | ||
|
|
||
| TableScan::TableScan(ScanContext context, std::shared_ptr<FileIO> file_io) | ||
| : context_(std::move(context)), file_io_(std::move(file_io)) {} | ||
|
|
||
| Result<std::vector<std::shared_ptr<FileScanTask>>> TableScan::PlanFiles() const { | ||
| ICEBERG_ASSIGN_OR_RAISE(auto manifest_list_reader, | ||
| CreateManifestListReader(context_.snapshot->manifest_list)); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto manifest_files, manifest_list_reader->Files()); | ||
|
|
||
| std::vector<std::shared_ptr<FileScanTask>> tasks; | ||
| for (const auto& manifest_file : manifest_files) { | ||
| ICEBERG_ASSIGN_OR_RAISE(auto manifest_reader, | ||
| CreateManifestReader(manifest_file->manifest_path)); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto manifest_entries, manifest_reader->Entries()); | ||
|
|
||
||
| for (const auto& entry : manifest_entries) { | ||
| const auto& data_file = entry->data_file; | ||
| tasks.emplace_back(std::make_shared<FileScanTask>( | ||
| data_file.file_path, 0, data_file.file_size_in_bytes, data_file.record_count, | ||
| data_file.content, data_file.file_format, context_.schema, context_.field_ids, | ||
| context_.filter)); | ||
| } | ||
| } | ||
| return tasks; | ||
| } | ||
|
|
||
| Result<std::unique_ptr<ManifestListReader>> TableScan::CreateManifestListReader( | ||
| const std::string& file_path) const { | ||
| return NotImplemented("manifest list reader"); | ||
| } | ||
|
|
||
| Result<std::unique_ptr<ManifestReader>> TableScan::CreateManifestReader( | ||
| const std::string& file_path) const { | ||
| return NotImplemented("manifest reader"); | ||
| } | ||
|
|
||
| } // namespace iceberg | ||
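Because both reader factories still return `NotImplemented`, the planning loop in `PlanFiles()` cannot be exercised yet. The following is a minimal sketch of the same traversal (manifest list, then manifests, then one whole-file task per data file) over in-memory stand-ins. Every type and function name ending in `Stub` or `Sketch` is hypothetical and is not part of iceberg-cpp; skipping an unknown manifest path is also an assumption, where the real code propagates an error.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the iceberg types; not the library's definitions.
struct DataFileStub {
  std::string file_path;
  int64_t file_size_in_bytes;
};

struct ScanTaskStub {
  std::string file_path;
  uint64_t start;   // start byte offset
  uint64_t length;  // number of bytes to scan
};

// Manifest list: ordered manifest paths. Manifest store: path -> data files.
using ManifestListStub = std::vector<std::string>;
using ManifestStoreStub = std::map<std::string, std::vector<DataFileStub>>;

// Mirrors the two nested loops in PlanFiles(): each data file entry becomes
// one task that covers the whole file, i.e. bytes [0, file_size_in_bytes).
std::vector<std::shared_ptr<ScanTaskStub>> PlanFilesSketch(
    const ManifestListStub& manifest_list, const ManifestStoreStub& manifests) {
  std::vector<std::shared_ptr<ScanTaskStub>> tasks;
  for (const auto& manifest_path : manifest_list) {
    auto it = manifests.find(manifest_path);
    if (it == manifests.end()) continue;  // assumption: real code returns an error
    for (const auto& data_file : it->second) {
      assert(data_file.file_size_in_bytes >= 0);
      tasks.emplace_back(std::make_shared<ScanTaskStub>(ScanTaskStub{
          data_file.file_path, 0,
          static_cast<uint64_t>(data_file.file_size_in_bytes)}));
    }
  }
  return tasks;
}
```

Tasks come out in manifest-list order, then in entry order within each manifest. No filter or partition pruning is applied here, matching the basic planning in this change.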
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,116 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| #pragma once | ||
|
|
||
| #include <cstdint> | ||
| #include <memory> | ||
| #include <optional> | ||
| #include <string> | ||
| #include <vector> | ||
|
|
||
| #include "iceberg/manifest_entry.h" | ||
| #include "iceberg/type_fwd.h" | ||
|
|
||
| namespace iceberg { | ||
|
|
||
| /// \brief Builder class for creating TableScan instances. | ||
| class ICEBERG_EXPORT TableScanBuilder { | ||
| public: | ||
| /// \brief Constructs a TableScanBuilder for the given table. | ||
| /// \param table Reference to the table to scan. | ||
| explicit TableScanBuilder(const Table& table); | ||
|
|
||
| /// \brief Sets the snapshot ID to scan. | ||
| /// \param snapshot_id The ID of the snapshot. | ||
| /// \return Reference to the builder. | ||
| TableScanBuilder& WithSnapshotId(int64_t snapshot_id); | ||
|
|
||
| /// \brief Selects columns to include in the scan. | ||
| /// If no column names are set, all columns are selected. | ||
| /// \param column_names A list of column names. | ||
||
| /// \return Reference to the builder. | ||
| TableScanBuilder& WithColumnNames(const std::vector<std::string>& column_names); | ||
|
|
||
| /// \brief Applies a filter expression to the scan. | ||
| /// \param filter Filter expression to use. | ||
| /// \return Reference to the builder. | ||
| TableScanBuilder& WithFilter(const std::shared_ptr<Expression>& filter); | ||
|
|
||
| /// \brief Builds and returns a TableScan instance. | ||
| /// \return A Result containing the TableScan or an error. | ||
| Result<std::unique_ptr<TableScan>> Build(); | ||
|
|
||
| private: | ||
| const Table& table_; | ||
| std::vector<std::string> column_names_; | ||
| std::optional<int64_t> snapshot_id_; | ||
| std::shared_ptr<Expression> filter_; | ||
| }; | ||
|
|
||
| /// \brief Represents a configured scan operation on a table. | ||
| class ICEBERG_EXPORT TableScan { | ||
| public: | ||
| /// \brief Scan context holding snapshot and scan-specific metadata. | ||
| struct ScanContext { | ||
| std::shared_ptr<Snapshot> snapshot; ///< Snapshot to scan. | ||
| std::shared_ptr<Schema> schema; ///< Table schema. | ||
| std::vector<int32_t> field_ids; ///< Field IDs of selected columns. | ||
| std::shared_ptr<Expression> filter; ///< Filter expression to apply. | ||
| }; | ||
|
|
||
| /// \brief Constructs a TableScan with the given context and file I/O. | ||
| /// \param context Scan context including snapshot, schema, and filter. | ||
| /// \param file_io File I/O instance for reading manifests and data files. | ||
| TableScan(ScanContext context, std::shared_ptr<FileIO> file_io); | ||
|
|
||
| /// \brief Plans the scan tasks by resolving manifests and data files. | ||
| /// | ||
| /// Returns a list of file scan tasks if successful. | ||
| /// \return A Result containing scan tasks or an error. | ||
| Result<std::vector<std::shared_ptr<FileScanTask>>> PlanFiles() const; | ||
|
|
||
| private: | ||
| /// \brief Creates a reader for the manifest list. | ||
| /// \param file_path Path to the manifest list file. | ||
| /// \return A Result containing the reader or an error. | ||
| Result<std::unique_ptr<ManifestListReader>> CreateManifestListReader( | ||
| const std::string& file_path) const; | ||
|
|
||
| /// \brief Creates a reader for a manifest file. | ||
| /// \param file_path Path to the manifest file. | ||
| /// \return A Result containing the reader or an error. | ||
| Result<std::unique_ptr<ManifestReader>> CreateManifestReader( | ||
| const std::string& file_path) const; | ||
|
|
||
| ScanContext context_; | ||
| std::shared_ptr<FileIO> file_io_; | ||
| }; | ||
|
|
||
| /// \brief Represents a task to scan a portion of a data file. | ||
| struct ICEBERG_EXPORT FileScanTask { | ||
| std::string file_path; ///< Path to the data file. | ||
| uint64_t start; ///< Start byte offset. | ||
| uint64_t length; ///< Length in bytes to scan. | ||
| std::optional<uint64_t> record_count; ///< Optional number of records. | ||
| DataFile::Content file_content; ///< Type of file content. | ||
| FileFormatType file_format; ///< Format of the data file. | ||
| std::shared_ptr<Schema> schema; ///< Table schema. | ||
| std::vector<int32_t> field_ids; ///< Field IDs to project. | ||
| std::shared_ptr<Expression> filter; ///< Filter expression to apply. | ||
| }; | ||
|
|
||
| } // namespace iceberg | ||
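To illustrate the builder contract declared above (chained setters that return the builder, then a `Build()` that resolves column names to field IDs and fails on an unknown column), here is a minimal self-contained sketch. `SchemaStub`, `ScanStub` and `ScanBuilderStub` are hypothetical stand-ins, and `std::optional` replaces the project's `Result` type purely to keep the example standalone.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a schema: column name -> field ID.
using SchemaStub = std::unordered_map<std::string, int32_t>;

struct ScanStub {
  std::optional<int64_t> snapshot_id;  // empty means "use the current snapshot"
  std::vector<int32_t> field_ids;      // empty means "select all columns"
};

class ScanBuilderStub {
 public:
  explicit ScanBuilderStub(const SchemaStub& schema) : schema_(schema) {}

  // Each setter returns *this so that calls can be chained.
  ScanBuilderStub& WithSnapshotId(int64_t snapshot_id) {
    snapshot_id_ = snapshot_id;
    return *this;
  }

  ScanBuilderStub& WithColumnNames(std::vector<std::string> column_names) {
    column_names_ = std::move(column_names);
    return *this;
  }

  // Resolves names to field IDs in the requested order.
  // Returns std::nullopt if any requested column is missing from the schema.
  std::optional<ScanStub> Build() const {
    ScanStub scan{snapshot_id_, {}};
    scan.field_ids.reserve(column_names_.size());
    for (const auto& name : column_names_) {
      auto it = schema_.find(name);
      if (it == schema_.end()) return std::nullopt;
      scan.field_ids.push_back(it->second);
    }
    return scan;
  }

 private:
  const SchemaStub& schema_;  // must outlive the builder, like `const Table&` above
  std::optional<int64_t> snapshot_id_;
  std::vector<std::string> column_names_;
};
```

A call such as `ScanBuilderStub(schema).WithSnapshotId(42).WithColumnNames({"name", "id"}).Build()` yields field IDs in the order the names were given. Holding the schema by reference keeps the builder cheap to construct, at the cost of a lifetime requirement on the caller.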