[#9429] improvement(core): Use join to improve query performance on table, topic and so on. #9430
[#9429] improvement(core): Use join to improve query performance on table, topic and so on. #9430yuqi1129 merged 53 commits intoapache:mainfrom
Conversation
…e, catalog, schema, and table tables
There was a problem hiding this comment.
Pull request overview
This PR introduces performance optimizations for table metadata queries through database indexing and a new query method, plus adds Caffeine-based caching for Metalake entities. However, there are several critical issues that must be addressed before merging.
Key changes:
- Added composite indexes on
(name, deleted_at)columns across metalake, catalog, schema, and table metadata tables - Introduced
selectTableByFullQualifiedNamemethod to query tables using fully qualified names with JOIN operations - Added Caffeine cache for
BaseMetalakeobjects inMetalakeManagerwith automatic expiry and cleanup
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/h2/schema-1.1.0-h2.sql | Adds composite indexes on name and deleted_at columns; contains critical bug on line 66 |
| scripts/h2/upgrade-1.0.0-to-1.1.0-h2.sql | Migration script to add indexes for existing H2 databases |
| scripts/mysql/schema-1.1.0-mysql.sql | Adds composite indexes on name and deleted_at columns; contains critical SQL syntax error on line 83 |
| scripts/mysql/upgrade-1.0.0-to-1.1.0-mysql.sql | Migration script to add indexes for existing MySQL databases |
| scripts/postgresql/schema-1.1.0-postgresql.sql | Adds composite indexes on name and deleted_at columns |
| scripts/postgresql/upgrade-1.0.0-to-1.1.0-postgresql.sql | Migration script to add indexes for existing PostgreSQL databases |
| core/src/main/java/org/apache/gravitino/metalake/MetalakeManager.java | Implements Caffeine cache for metalakes; has error handling and resource cleanup issues |
| core/src/main/java/org/apache/gravitino/storage/relational/mapper/TableMetaMapper.java | Adds new mapper method for querying by fully qualified name |
| core/src/main/java/org/apache/gravitino/storage/relational/mapper/TableMetaSQLProviderFactory.java | Adds factory method for the new query |
| core/src/main/java/org/apache/gravitino/storage/relational/mapper/provider/base/TableMetaBaseSQLProvider.java | Implements JOIN-based query; contains critical SQL injection vulnerability |
| core/src/main/java/org/apache/gravitino/storage/relational/service/TableMetaService.java | Updates table lookup to use new optimized query method |
Comments suppressed due to low confidence (1)
core/src/main/java/org/apache/gravitino/metalake/MetalakeManager.java:90
- The
close()method does not clean up themetalakeCacheor shut down the scheduled executor service used by the cache's scheduler. This can lead to resource leaks when the MetalakeManager is closed. AddmetalakeCache.invalidateAll()and shutdown the scheduler's executor service in the close method.
@Override
public void close() {
// do nothing
}
core/src/main/java/org/apache/gravitino/metalake/MetalakeManager.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/gravitino/storage/relational/service/TableMetaService.java
Show resolved
Hide resolved
...a/org/apache/gravitino/storage/relational/mapper/provider/base/TableMetaBaseSQLProvider.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/gravitino/metalake/MetalakeManager.java
Outdated
Show resolved
Hide resolved
...a/org/apache/gravitino/storage/relational/mapper/provider/base/TableMetaBaseSQLProvider.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/gravitino/storage/TestEntityStorage.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/gravitino/storage/relational/service/SchemaMetaService.java
Show resolved
Hide resolved
mchades
left a comment
There was a problem hiding this comment.
no more comments.
would you like to review again?
|
|
||
| testImplementation(libs.awaitility) | ||
| testImplementation(libs.slf4j.api) | ||
| testImplementation(libs.awaitility) |
There was a problem hiding this comment.
This seems to be the same as line 152. Is it intentional?
There was a problem hiding this comment.
It should be a mistake. Could u raise a pull request to fix this issue?
There was a problem hiding this comment.
I am still going over the code.
This change can be bundled with other changes.
| AND sm.schema_name = #{schemaName} | ||
| AND sm.deleted_at = 0 | ||
| LEFT JOIN | ||
| %s fm ON sm.schema_id = fm.schema_id |
There was a problem hiding this comment.
Do we support schema-less or schema-optional namespaces ?
In those cases, we shouldn't join using schema id.
Also, there should be fm.fileset_name = #{filesetName} clause to narrow the return values.
There was a problem hiding this comment.
Schema is required for table, topic and fileset.
| AND fm.current_version = vi.version | ||
| AND vi.deleted_at = 0 | ||
| WHERE | ||
| mm.metalake_name = #{metalakeName} |
There was a problem hiding this comment.
If we want to be flexible on schema, we can add:
( sm.schema_name = #{schemaName} OR #{schemaName} IS NULL )
fm.fileset_name = #{filesetName} narrows the selection quickly.
…e on table, topic and so on. (apache#9430) ### What changes were proposed in this pull request? This pull request introduces major performance and functionality improvements to the Metalake and table metadata management system. The most significant change is the addition of a Caffeine-based cache for Metalake entities, which reduces redundant storage access and improves efficiency. Additionally, it adds a new method for querying tables by their fully qualified names and optimizes database schema indexing for faster lookups. ### Metalake caching and management improvements * Introduced a static Caffeine cache (`metalakeCache`) in `MetalakeManager` to store `BaseMetalake` objects, with automatic expiry and scheduled cleanup. This cache is now used in `metalakeInUse`, `loadMetalake`, and is invalidated on alter, drop, enable, and disable operations to ensure consistency. [[1]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R23-R36) [[2]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R74-R86) [[3]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0L87-R107) [[4]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R138-R155) [[5]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R194-R209) [[6]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R301-R302) [[7]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R353-R354) [[8]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R381) [[9]](diffhunk://#diff-21ed56786d4199a9ebee9969aae3195dff5037135c738ff4b8ba2072fa72d6a0R416) ### Table metadata querying enhancements * Added a new method `selectTableByFullQualifiedName` to `TableMetaMapper`, its SQL provider, and the service layer, allowing retrieval of table metadata using metalake, catalog, schema, and table names directly, improving query flexibility and performance. [[1]](diffhunk://#diff-464816dcd4652ac061d317d0ed9ab3a98db358f2db517ce70bbf61fa35624db9R59-R67) [[2]](diffhunk://#diff-e160fd550934e3ac047abde64991c6ab5edc44602bf8cfd8587596f10b0b58efR72-R80) [[3]](diffhunk://#diff-19aff10a8b33dded8f085dc93250ba296ca800b5a68630a22537e83b0bd9b730R236-R292) [[4]](diffhunk://#diff-c0bc429f18cd462c3a3c1fe65c7da0966c338a96737d69b6e32b17257d97a52cL88-L92) [[5]](diffhunk://#diff-c0bc429f18cd462c3a3c1fe65c7da0966c338a96737d69b6e32b17257d97a52cR347-R363) ### Database schema and indexing optimizations * Added composite indexes (`idx_name_da`) on the `metalake_name`, `catalog_name`, `schema_name`, and `table_name` columns (with `deleted_at`) in both H2 and MySQL schema scripts, significantly improving query performance for lookups by name. [[1]](diffhunk://#diff-3ca2c363f5538fb4838a31233c4c858b4928e9ec65b5b86bf35a8c5f653133bfR31) [[2]](diffhunk://#diff-3ca2c363f5538fb4838a31233c4c858b4928e9ec65b5b86bf35a8c5f653133bfR49) [[3]](diffhunk://#diff-3ca2c363f5538fb4838a31233c4c858b4928e9ec65b5b86bf35a8c5f653133bfR66) [[4]](diffhunk://#diff-3ca2c363f5538fb4838a31233c4c858b4928e9ec65b5b86bf35a8c5f653133bfR83) [[5]](diffhunk://#diff-6fac45e3da22ec7d24ca5ec0b78c5a03d653fabd65dd8040b2c59404fc49fb21R33-R38) [[6]](diffhunk://#diff-ac6c56c7547bd70d3951716f2b543beea23964b906792b527b833e6229404f70R31) [[7]](diffhunk://#diff-ac6c56c7547bd70d3951716f2b543beea23964b906792b527b833e6229404f70R48) [[8]](diffhunk://#diff-ac6c56c7547bd70d3951716f2b543beea23964b906792b527b833e6229404f70R64) [[9]](diffhunk://#diff-ac6c56c7547bd70d3951716f2b543beea23964b906792b527b833e6229404f70R83) ### Why are the changes needed? To improve performance Fix: apache#9429 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Existing test



What changes were proposed in this pull request?
This pull request introduces major performance and functionality improvements to the Metalake and table metadata management system. The most significant change is the addition of a Caffeine-based cache for Metalake entities, which reduces redundant storage access and improves efficiency. Additionally, it adds a new method for querying tables by their fully qualified names and optimizes database schema indexing for faster lookups.
Metalake caching and management improvements
metalakeCache) inMetalakeManagerto storeBaseMetalakeobjects, with automatic expiry and scheduled cleanup. This cache is now used inmetalakeInUse,loadMetalake, and is invalidated on alter, drop, enable, and disable operations to ensure consistency. [1] [2] [3] [4] [5] [6] [7] [8] [9]Table metadata querying enhancements
selectTableByFullQualifiedNametoTableMetaMapper, its SQL provider, and the service layer, allowing retrieval of table metadata using metalake, catalog, schema, and table names directly, improving query flexibility and performance. [1] [2] [3] [4] [5]Database schema and indexing optimizations
idx_name_da) on themetalake_name,catalog_name,schema_name, andtable_namecolumns (withdeleted_at) in both H2 and MySQL schema scripts, significantly improving query performance for lookups by name. [1] [2] [3] [4] [5] [6] [7] [8] [9]Why are the changes needed?
To improve performance
Fix: #9429
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
Existing test