Skip to content

Conversation

@wraymo
Copy link
Contributor

@wraymo wraymo commented Apr 4, 2025

Description

This PR introduces a CLP connector. The background and proposed implementation details are outlined in the associated RFC.

This PR implements one part of phase 1 of the proposed implementation, namely the Java implementation for the coordinator. The worker implementation will leverage Velox as the default query engine, so once the Velox PR is merged, we will submit another PR to this repo to add the necessary changes to presto-native-execution.

Like other connectors, we have created a presto-clp module and implemented all required connector interfaces. The plan optimizer will be a future PR.

The important classes in the connector are described below.


Core Classes in Java

ClpConfig

The configuration class for CLP. Currently, we support the following properties:

  • clp.metadata-expire-interval: Defines the time interval after which metadata entries are considered expired and removed from the cache.
  • clp.metadata-refresh-interval: Specifies how frequently metadata should be refreshed from the source to ensure up-to-date information.
  • clp.polymorphic-type-enabled: Enables or disables support for polymorphic types within CLP. This determines whether dynamic type resolution is allowed.
  • clp.metadata-provider-type: Defines the type of the metadata provider. It could be a database, a file-based store, or another external system. By default, we use MySQL.
  • clp.metadata-db-url: The connection URL for the metadata database, used when clp.metadata-provider-type is configured to use a database.
  • clp.metadata-db-name: The name of the metadata database.
  • clp.metadata-db-user: The database user with access to the metadata database.
  • clp.metadata-db-password: The password for the metadata database user.
  • clp.metadata-table-prefix: A prefix applied to table names in the metadata database.
  • clp.split-provider-type: Defines the type of split provider for query execution. By default, we use MySQL, and the connection parameters are the same as those for the metadata database.

ClpSchemaTree

A helper class for constructing a nested schema representation from CLP’s column definitions. It supports hierarchical column names (e.g., a.b.c), handles name/type conflicts when the clp.polymorphic-type-enabled option is enabled, and maps serialized CLP types to Presto types. The schema tree produces a flat list of ClpColumnHandle instances, including RowType for nested structures, making it suitable for dynamic or semi-structured data formats.

When polymorphic types are enabled, conflicting fields are given unique names by appending a type-specific suffix to the column name. For instance, if an integer field named "a" and a Varstring (CLP type) field named "a" coexist in CLP’s schema tree, they are represented as a_bigint and a_varchar in Presto. This approach ensures that such fields remain queryable while adhering to Presto’s constraints.


ClpMetadataProvider

An interface responsible for retrieving metadata from a specified source.

public interface ClpMetadataProvider {
    List<ClpColumnHandle> listColumnHandles(SchemaTableName schemaTableName);  
    List<ClpTableHandle> listTableNames(String schema);  
}

We provide a default implementation called ClpMySqlMetadataProvider, which uses two MySQL tables. One of these is the datasets table, defined with the schema shown below. Currently, we support only a single Presto schema named default, and this metadata table stores all table names, paths, and storage types associated with that Presto schema.

Column Name Data Type Constraints
name VARCHAR(255) PRIMARY KEY
archive_storage_type VARCHAR(4096) NOT NULL
archive_storage_directory VARCHAR(4096) NOT NULL

The second MySQL table contains column metadata, defined by the schema shown below. Each Presto table is associated with a corresponding MySQL table that stores metadata about its columns.

Column Name Data Type Constraints
name VARCHAR(512) NOT NULL
type TINYINT NOT NULL
Primary Key (name, type)

ClpSplitProvider

In CLP, an archive is the fundamental unit for searching, and we treat each archive as a Presto Split. This allows independent parallel searches across archives. The ClpSplitProvider interface, shown below, defines how to retrieve split information from a specified source:

public interface ClpSplitProvider {
    List<ClpSplit> listSplits(ClpTableLayoutHandle clpTableLayoutHandle);  
}

We provide a default implementation called ClpMySqlSplitProvider. It uses an archive table to store archive IDs associated with each table. The table below shows part of the schema (some irrelevant fields are omitted).

Column Name Data Type Constraints
pagination_id BIGINT AUTO_INCREMENT PRIMARY KRY
id VARCHAR(128) NOT NULL
... ... ...

By concatenating the table path (archive_storage_directory) and the archive ID (id), we can retrieve all split paths for a table.


ClpMetadata

This interface enables Presto to access various metadata. All requests are delegated to ClpMetadataProvider

For metadata management, it also maintains two caches and periodically refreshes the metadata.

  • columnHandleCache: A LoadingCache<SchemaTableName, List<ClpColumnHandle>> that maps a SchemaTableName to its corresponding list of ClpColumnHandle objects.
  • tableHandleCache: A LoadingCache<String, List<ClpTableHandle>> that maps a schema name (String) to a list of ClpTableHandle objects

Motivation and Context

See the associated RFC.

Impact

This module is independent from other modules and will not affect any existing functionality.

Test Plan

Unit tests are included in this PR, and we have also done end-to-end tests.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==
CLP Connector
* Add coordinator support for the CLP connector :pr:`24868`
* Add documentation for the CLP connector :doc:`../connector/clp` :pr:`24868`

@wraymo wraymo requested review from a team, elharo and steveburnett as code owners April 4, 2025 15:59
@wraymo wraymo requested a review from presto-oss April 4, 2025 15:59
@wraymo wraymo changed the title Add CLP connector Add coordinator implementation of CLP connector Apr 4, 2025
@wraymo wraymo force-pushed the clp_integration branch 2 times, most recently from f240d86 to 0566292 Compare April 4, 2025 16:22
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc, great work! Some nits, suggestions, and questions. Let me know what you think!

@Inject
public ClpSplitManager(ClpSplitProvider clpSplitProvider)
{
this.clpSplitProvider = clpSplitProvider;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.clpSplitProvider = clpSplitProvider;
this.clpSplitProvider = requireNonNull(clpSplitProvider, "clpSplitProvider is null");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, also fixed it in ClpPlanOptimizerProvider

@JsonCreator
public ClpTableHandle(@JsonProperty("schemaTableName") SchemaTableName schemaTableName)
{
this.schemaTableName = schemaTableName;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireNonNull?

Copy link
Contributor Author

@wraymo wraymo Apr 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need to do check here, because it is called here and SchemaTableName is created here. Seems it 's unlikely to be a null

TableScanNode newTableScanNode = new TableScanNode(
tableScanNode.getSourceLocation(),
idAllocator.getNextId(),
new TableHandle(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to create a new table handle here? can we just used the tableHandle? is it just for the kqlQuery?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JsonCreator
public ClpSplit(@JsonProperty("schemaTableName") @Nullable SchemaTableName schemaTableName,
@JsonProperty("archivePath") @Nullable String archivePath,
@JsonProperty("query") Optional<String> query)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this query just the kqlQuery? if so, I think it might be better to use the name kqlQuery instead for better consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will rename it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also renamed it in ClpTableLayoutHandle.

@wraymo wraymo requested review from beinan and steveburnett April 5, 2025 01:42
steveburnett
steveburnett previously approved these changes Apr 7, 2025
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, review new local doc build, resolved comments from my previous review. Thanks for your responses! The refactor of the doc in this update looks great.

@steveburnett
Copy link
Contributor

Thanks for the release note entry! You can link directly to the new documentation with this release note entry.

== RELEASE NOTES ==

CLP Connector Changes
* Add :doc:`../connector/clp`.

@wraymo
Copy link
Contributor Author

wraymo commented Apr 17, 2025

Hi @steveburnett, thanks for the review!
When you have a moment, could you take a look at the updated release note to see if it looks good to you? I’ve also updated the docs in this commit: e7b3bc6

@steveburnett
Copy link
Contributor

Hi @steveburnett, thanks for the review! When you have a moment, could you take a look at the updated release note to see if it looks good to you? I’ve also updated the docs in this commit: e7b3bc6

Thanks, I'll review the docs in a minute. I'll talk about the release note here:

  • We've automated the PR link in the release notes so you don't need to manually add the PR link in the release note entry now.
  • The link to the doc displays the title, so you don't need to write "CLP connector" which repeats it twice. The edit I have here for the second item compiles in a local doc build - I tested this in a local doc build to make sure - to display "Add documentation for the CLP Connector." with "CLP Connector" linking to the new page.
  • Should specify which connector is being affected in the Connector Changes heading.
== RELEASE NOTES ==

CLP Connector Changes
* Add coordinator support for the CLP connector.
* Add documentation for the :doc:`../connector/clp`.

Let me know if you have any questions!

steveburnett
steveburnett previously approved these changes Apr 17, 2025
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

@steveburnett
Copy link
Contributor

@beinan, when you have time would you PTAL at @wraymo's responses to your review?

@wraymo
Copy link
Contributor Author

wraymo commented May 23, 2025

For test other modules. There's a failure in presto-delta module and the error message was something like

2025-05-22T18:34:47.597-0600 WARNING Failed to load checkpoint metadata from file file:/home/runner/work/presto/presto/presto-delta/target/test-classes/delta_v3/data-reader-nested-struct/_delta_log/_last_checkpoint`

For test, there's a timeout error

TestDistributedQueuesSchedulingPolicy.testWeightedScheduling » ThreadTimeout Method com.facebook.presto.execution.resourceGroups.db.TestDistributedQueuesSchedulingPolicy.testWeightedScheduling() didn't finish within the time-out 60000

Both are not related to the changes in this PR.

@wraymo wraymo force-pushed the clp_integration branch from d2fdfe9 to 290e5b8 Compare May 26, 2025 01:42
@steveburnett
Copy link
Contributor

@beinan, when you have time would you PTAL at @wraymo's responses to your review?

@wraymo wraymo force-pushed the clp_integration branch from 290e5b8 to ed55acc Compare June 16, 2025 17:49
@steveburnett
Copy link
Contributor

@beinan, when you have time would you PTAL at @wraymo's responses to your review?

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of high level questions:

  1. I don't see the connector optimizer mentioned in the RFC: is that to be done in a later phase? What is functional in this phase?
  2. Is there any way to add end to end tests? Most connectors run our abstract test suite, which uses canned queries against TPCH and DS, would that be possible here?

@wraymo
Copy link
Contributor Author

wraymo commented Aug 29, 2025

@tdcmeehan Thanks for the question! Here are the details:

  1. The connector optimizer is mentioned here. It will be added in a later phase. In this phase, the coordinator side is already usable with all required classes implemented, but we can’t run end-to-end tests with just this PR since the worker implementation is in Velox.

  2. We do have an open PR in our fork to add end-to-end tests for very basic queries: https://github.com/y-scope/presto/pull/54. I think it's doable to run TPCH and DS, but we need to prepare our dataset (archive) first since we don't support insertion.

@tdcmeehan tdcmeehan self-assigned this Aug 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants