Add coordinator implementation of CLP connector #24868
f240d86 to 0566292
steveburnett left a comment
Thanks for the doc, great work! Some nits, suggestions, and questions. Let me know what you think!
    @Inject
    public ClpSplitManager(ClpSplitProvider clpSplitProvider)
    {
        this.clpSplitProvider = clpSplitProvider;
Suggested change:
-        this.clpSplitProvider = clpSplitProvider;
+        this.clpSplitProvider = requireNonNull(clpSplitProvider, "clpSplitProvider is null");
Thanks, also fixed it in ClpPlanOptimizerProvider
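For readers unfamiliar with the pattern requested above, here is a minimal sketch of a constructor null check using only the standard library. SplitSource and SplitManagerSketch are hypothetical stand-ins for illustration, not classes from this PR.

```java
import static java.util.Objects.requireNonNull;

// Hypothetical stand-in for an injected dependency; not the real ClpSplitProvider.
interface SplitSource {}

final class SplitManagerSketch
{
    private final SplitSource splitSource;

    SplitManagerSketch(SplitSource splitSource)
    {
        // Fail at construction time with a descriptive message instead of
        // hitting a bare NullPointerException later, when the field is first used.
        this.splitSource = requireNonNull(splitSource, "splitSource is null");
    }

    SplitSource getSplitSource()
    {
        return splitSource;
    }
}
```

The check costs nothing at runtime in the normal case and makes a miswired dependency visible at the point where it was injected.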
    @JsonCreator
    public ClpTableHandle(@JsonProperty("schemaTableName") SchemaTableName schemaTableName)
    {
        this.schemaTableName = schemaTableName;
requireNonNull?
    TableScanNode newTableScanNode = new TableScanNode(
            tableScanNode.getSourceLocation(),
            idAllocator.getNextId(),
            new TableHandle(
why do we need to create a new table handle here? can we just use the tableHandle? is it just for the kqlQuery?
Yeah, that's because we create a new ClpTableLayoutHandle with the KQL query. Similar to https://github.com/prestodb/presto/blob/master/presto-base-jdbc/src/main/java/com/facebook/presto/plugin/jdbc/optimization/JdbcComputePushdown.java#L128 and presto/presto-druid/src/main/java/com/facebook/presto/druid/DruidPlanOptimizer.java, line 162 in c875296:
    TableHandle newTableHandle = new TableHandle(
    @JsonCreator
    public ClpSplit(@JsonProperty("schemaTableName") @Nullable SchemaTableName schemaTableName,
            @JsonProperty("archivePath") @Nullable String archivePath,
            @JsonProperty("query") Optional<String> query)
is this query just the kqlQuery? if so, I think it might be better to use the name kqlQuery instead for better consistency
Yes, will rename it
Also renamed it in ClpTableLayoutHandle.
steveburnett left a comment
LGTM! (docs)
Pull updated branch, review new local doc build, resolved comments from my previous review. Thanks for your responses! The refactor of the doc in this update looks great.
Thanks for the release note entry! You can link directly to the new documentation with this release note entry.
Hi @steveburnett, thanks for the review!
Thanks, I'll review the docs in a minute. I'll talk about the release note here:
Let me know if you have any questions!
steveburnett left a comment
LGTM! (docs)
Pull updated branch, new local doc build, looks good. Thanks!
86eccb1 to d2fdfe9
For For Both are not related to the changes in this PR.
…ataProvider and ClpMySqlSplitProvider
…mizer and unit tests for a future PR
tdcmeehan left a comment
A couple of high level questions:
- I don't see the connector optimizer mentioned in the RFC: is that to be done in a later phase? What is functional in this phase?
- Is there any way to add end to end tests? Most connectors run our abstract test suite, which uses canned queries against TPCH and DS, would that be possible here?
@tdcmeehan Thanks for the question! Here are the details:
Description
This PR introduces a CLP connector. The background and proposed implementation details are outlined in the associated RFC.
This PR implements one part of phase 1 of the proposed implementation, namely the Java implementation for the coordinator. The worker implementation will leverage Velox as the default query engine, so once the Velox PR is merged, we will submit another PR to this repo to add the necessary changes to presto-native-execution.
Like other connectors, we have created a presto-clp module and implemented all required connector interfaces. The plan optimizer will be a future PR.
The important classes in the connector are described below.
Core Classes in Java
ClpConfig
The configuration class for CLP. Currently, we support the following properties:
- clp.metadata-expire-interval: Defines the time interval after which metadata entries are considered expired and removed from the cache.
- clp.metadata-refresh-interval: Specifies how frequently metadata should be refreshed from the source to ensure up-to-date information.
- clp.polymorphic-type-enabled: Enables or disables support for polymorphic types within CLP. This determines whether dynamic type resolution is allowed.
- clp.metadata-provider-type: Defines the type of the metadata provider. It could be a database, a file-based store, or another external system. By default, we use MySQL.
- clp.metadata-db-url: The connection URL for the metadata database, used when clp.metadata-provider-type is configured to use a database.
- clp.metadata-db-name: The name of the metadata database.
- clp.metadata-db-user: The database user with access to the metadata database.
- clp.metadata-db-password: The password for the metadata database user.
- clp.metadata-table-prefix: A prefix applied to table names in the metadata database.
- clp.split-provider-type: Defines the type of split provider for query execution. By default, we use MySQL, and the connection parameters are the same as those for the metadata database.
ClpSchemaTree
A helper class for constructing a nested schema representation from CLP’s column definitions. It supports hierarchical column names (e.g., a.b.c), handles name/type conflicts when the clp.polymorphic-type-enabled option is enabled, and maps serialized CLP types to Presto types. The schema tree produces a flat list of ClpColumnHandle instances, including RowType for nested structures, making it suitable for dynamic or semi-structured data formats.
When polymorphic types are enabled, conflicting fields are given unique names by appending a type-specific suffix to the column name. For instance, if an integer field named "a" and a Varstring (CLP type) field named "a" coexist in CLP’s schema tree, they are represented as a_bigint and a_varchar in Presto. This approach ensures that such fields remain queryable while adhering to Presto’s constraints.
ClpMetadataProvider
An interface responsible for retrieving metadata from a specified source.
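The suffixing rule described for ClpSchemaTree above can be illustrated with a small standalone sketch. This is not the code from the PR: PolymorphicNamingSketch, Field, and resolveNames are hypothetical names, and Presto type names are passed as plain strings on the assumption that the CLP-to-Presto type mapping has already happened.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PolymorphicNamingSketch
{
    // Hypothetical (column name, Presto type name) pair; not a class from the PR.
    static final class Field
    {
        final String name;
        final String prestoType;

        Field(String name, String prestoType)
        {
            this.name = name;
            this.prestoType = prestoType;
        }
    }

    // Returns one resolved column name per input field, in input order.
    static List<String> resolveNames(List<Field> fields)
    {
        // Collect the distinct types seen under each column name.
        Map<String, Set<String>> typesByName = new LinkedHashMap<>();
        for (Field field : fields) {
            typesByName.computeIfAbsent(field.name, key -> new LinkedHashSet<>()).add(field.prestoType);
        }
        List<String> resolved = new ArrayList<>();
        for (Field field : fields) {
            // A name is conflicting only when it appears with more than one type.
            boolean conflicting = typesByName.get(field.name).size() > 1;
            resolved.add(conflicting ? field.name + "_" + field.prestoType : field.name);
        }
        return resolved;
    }

    public static void main(String[] args)
    {
        List<Field> fields = Arrays.asList(
                new Field("a", "bigint"),
                new Field("a", "varchar"),
                new Field("b", "double"));
        System.out.println(resolveNames(fields));
    }
}
```

For the example in the description, the two "a" fields come out as a_bigint and a_varchar, while the unambiguous "b" keeps its name. Nested names such as a.b.c and RowType construction are deliberately left out of this sketch.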
We provide a default implementation called ClpMySqlMetadataProvider, which uses two MySQL tables. One of these is the datasets table, defined with the schema shown below. Currently, we support only a single Presto schema named default, and this metadata table stores all table names, paths, and storage types associated with that Presto schema.

| Column | Type | Constraints |
| --- | --- | --- |
| name | VARCHAR(255) | PRIMARY KEY |
| archive_storage_type | VARCHAR(4096) | NOT NULL |
| archive_storage_directory | VARCHAR(4096) | NOT NULL |

The second MySQL table contains column metadata, defined by the schema shown below. Each Presto table is associated with a corresponding MySQL table that stores metadata about its columns.

| Column | Type | Constraints |
| --- | --- | --- |
| name | VARCHAR(512) | NOT NULL |
| type | TINYINT | NOT NULL |

Key: (name, type)
ClpSplitProvider
In CLP, an archive is the fundamental unit for searching, and we treat each archive as a Presto Split. This allows independent parallel searches across archives. The ClpSplitProvider interface defines how to retrieve split information from a specified source.
We provide a default implementation called ClpMySqlSplitProvider. It uses an archive table to store archive IDs associated with each table. The table below shows part of the schema (some irrelevant fields are omitted).

| Column | Type | Constraints |
| --- | --- | --- |
| pagination_id | BIGINT | AUTO_INCREMENT PRIMARY KEY |
| id | VARCHAR(128) | NOT NULL |

By concatenating the table path (archive_storage_directory) and the archive ID (id), we can retrieve all split paths for a table.
ClpMetadata
This interface enables Presto to access various metadata. All requests are delegated to ClpMetadataProvider. For metadata management, it also maintains two caches and periodically refreshes the metadata.
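As a rough picture of what an expiring, loader-backed cache does, here is a standard-library-only sketch. It is an assumption-laden stand-in, not the PR's code: the connector uses Guava's LoadingCache, whereas ExpiringCacheSketch and its constructor parameters are hypothetical, and only expiry on read is modeled (no background refresh).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

final class ExpiringCacheSketch<K, V>
{
    private static final class Entry<V>
    {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis)
        {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final Function<K, V> loader;
    private final long expireAfterMillis;
    private final LongSupplier clockMillis;

    // The clock is injected so that expiry can be exercised without sleeping.
    ExpiringCacheSketch(Function<K, V> loader, long expireAfterMillis, LongSupplier clockMillis)
    {
        this.loader = loader;
        this.expireAfterMillis = expireAfterMillis;
        this.clockMillis = clockMillis;
    }

    synchronized V get(K key)
    {
        long now = clockMillis.getAsLong();
        Entry<V> entry = entries.get(key);
        // Load on first access, and reload once the entry is older than the expiry interval.
        if (entry == null || now - entry.loadedAtMillis >= expireAfterMillis) {
            entry = new Entry<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }

    public static void main(String[] args)
    {
        // Hypothetical loader standing in for a metadata lookup keyed by schema name.
        ExpiringCacheSketch<String, String> cache = new ExpiringCacheSketch<>(
                schema -> "tables of " + schema, 60_000, System::currentTimeMillis);
        System.out.println(cache.get("default"));
    }
}
```

In this sketch the loader plays the role that the metadata provider plays in the connector: the cache calls it only when an entry is missing or stale, so repeated lookups within the expiry interval do not hit the metadata source.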
- columnHandleCache: A LoadingCache<SchemaTableName, List<ClpColumnHandle>> that maps a SchemaTableName to its corresponding list of ClpColumnHandle objects.
- tableHandleCache: A LoadingCache<String, List<ClpTableHandle>> that maps a schema name (String) to a list of ClpTableHandle objects.
Motivation and Context
See the associated RFC.
Impact
This module is independent from other modules and will not affect any existing functionality.
Test Plan
Unit tests are included in this PR, and we have also done end-to-end tests.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.