HIVE-28952: TableFetcher to return Table objects instead of names #6020
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -20,9 +20,11 @@ | |
| import com.google.common.annotations.VisibleForTesting; | ||
| import org.apache.hadoop.hive.common.TableName; | ||
| import org.apache.hadoop.hive.metastore.IMetaStoreClient; | ||
| import org.apache.hadoop.hive.metastore.TableIterable; | ||
| import org.apache.hadoop.hive.metastore.TableType; | ||
| import org.apache.hadoop.hive.metastore.Warehouse; | ||
| import org.apache.hadoop.hive.metastore.api.Database; | ||
| import org.apache.hadoop.hive.metastore.api.Table; | ||
| import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants; | ||
| import org.slf4j.Logger; | ||
| import org.slf4j.LoggerFactory; | ||
|
|
@@ -90,7 +92,7 @@ private void buildTableFilter(String tablePattern, List<String> conditions) { | |
| this.tableFilter = String.join(" and ", conditions); | ||
| } | ||
|
|
||
| public List<TableName> getTables() throws Exception { | ||
| public List<TableName> getTableNames() throws Exception { | ||
| List<TableName> candidates = new ArrayList<>(); | ||
|
|
||
| // if tableTypes is empty, then a list with single empty string has to specified to scan no tables. | ||
|
|
@@ -102,21 +104,47 @@ public List<TableName> getTables() throws Exception { | |
| List<String> databases = client.getDatabases(catalogName, dbPattern); | ||
|
|
||
| for (String db : databases) { | ||
| Database database = client.getDatabase(catalogName, db); | ||
| if (MetaStoreUtils.checkIfDbNeedsToBeSkipped(database)) { | ||
| LOG.debug("Skipping table under database: {}", db); | ||
| continue; | ||
| } | ||
| if (MetaStoreUtils.isDbBeingPlannedFailedOver(database)) { | ||
| LOG.info("Skipping table that belongs to database {} being failed over.", db); | ||
| continue; | ||
| } | ||
| List<String> tablesNames = client.listTableNamesByFilter(catalogName, db, tableFilter, -1); | ||
| List<String> tablesNames = getTableNamesForDatabase(catalogName, db); | ||
| tablesNames.forEach(tablesName -> candidates.add(TableName.fromString(tablesName, catalogName, db))); | ||
| } | ||
| return candidates; | ||
| } | ||
|
|
||
| public List<Table> getTables(int maxBatchSize) throws Exception { | ||
| List<Table> candidates = new ArrayList<>(); | ||
Neer393 marked this conversation as resolved.
|
||
|
|
||
| // if tableTypes is empty, then a list with single empty string has to specified to scan no tables. | ||
| if (tableTypes.isEmpty()) { | ||
| LOG.info("Table fetcher returns empty list as no table types specified"); | ||
| return candidates; | ||
| } | ||
|
|
||
| List<String> databases = client.getDatabases(catalogName, dbPattern); | ||
|
|
||
| for (String db : databases) { | ||
Neer393 marked this conversation as resolved.
|
||
| List<String> tablesNames = getTableNamesForDatabase(catalogName, db); | ||
|
Member
@Neer393 I don't understand what you have optimized here. Also, have you considered the memory impact of loading everything into the heap? You could have iterated over TableIterable instead. I don't think this is a robust solution; it can potentially lead to OOM.
Contributor
Author
The earlier implementation made one msc call to get the table names and then one msc call per table name to get the HMS table object. The newer implementation reduces the msc calls: one msc call is made to get all table names, and then, using TableIterable, the number of msc calls for getting table objects becomes So in the older implementation
Contributor
Author
In my earlier implementation I had the same proposal of directly getting table objects, where I had implemented a direct HMS API endpoint like
Member
In order to use batching, you need to have the table list to fetch; that's OK. However, instead of working with the batches, you load everything into memory.
Contributor
Author
Okay, so for this fix, should I create a new JIRA? Or, since I am working on https://issues.apache.org/jira/browse/HIVE-28974, which is related to IcebergHouseKeeperService only, should I attach the fix in this JIRA?
Contributor
+1 Use the
Contributor
Let me summarize the points.
Contributor
Author
AFAIK, it's a tradeoff between the number of msc calls and the space.
Member
My concern was related to the fetch logic, where we load all Hive table objects into memory instead of using the batch iterator. We also make O(num-tables) calls to load an Iceberg table. Can we optimize here? Then we put those into a separate cache. Maybe we could use CachingCatalog instead? |
||
| for (Table table : new TableIterable(client, db, tablesNames, maxBatchSize)) { | ||
| candidates.add(table); | ||
| } | ||
| } | ||
| return candidates; | ||
| } | ||
|
|
||
| private List<String> getTableNamesForDatabase(String catalogName, String dbName) throws Exception { | ||
| List<String> tableNames = new ArrayList<>(); | ||
| Database database = client.getDatabase(catalogName, dbName); | ||
| if (MetaStoreUtils.checkIfDbNeedsToBeSkipped(database)) { | ||
| LOG.debug("Skipping table under database: {}", dbName); | ||
| return tableNames; | ||
| } | ||
| if (MetaStoreUtils.isDbBeingPlannedFailedOver(database)) { | ||
| LOG.info("Skipping table that belongs to database {} being failed over.", dbName); | ||
| return tableNames; | ||
| } | ||
| tableNames = client.listTableNamesByFilter(catalogName, dbName, tableFilter, -1); | ||
| return tableNames; | ||
| } | ||
|
|
||
| public static class Builder { | ||
| private final IMetaStoreClient client; | ||
| private final String catalogName; | ||
|
|
||
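The memory concern raised in the review thread above can be illustrated with a small, self-contained sketch. It does not use Hive classes: `BatchIterable` and the `fetchBatch` function are hypothetical stand-ins for a batch iterator such as `TableIterable` and for whatever metastore call resolves a list of names to table objects. The assumption is that each `fetchBatch` invocation corresponds to one remote call. The point is that the caller consumes objects as they arrive, so at most one batch is held at a time, rather than copying every object into a result list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical illustration only; this is not Hive's TableIterable. */
final class BatchedFetchSketch {

  /** Lazily resolves names to objects, keeping at most one fetched batch in memory. */
  static final class BatchIterable<N, T> implements Iterable<T> {
    private final List<N> names;
    private final int batchSize;
    // Assumption: one invocation of this function equals one remote call.
    private final Function<List<N>, List<T>> fetchBatch;

    BatchIterable(List<N> names, int batchSize, Function<List<N>, List<T>> fetchBatch) {
      if (batchSize <= 0) {
        throw new IllegalArgumentException("batchSize must be positive");
      }
      this.names = names;
      this.batchSize = batchSize;
      this.fetchBatch = fetchBatch;
    }

    @Override
    public Iterator<T> iterator() {
      return new Iterator<T>() {
        private int nextIndex = 0; // first name that has not been fetched yet
        private Iterator<T> current = Collections.emptyIterator();

        @Override
        public boolean hasNext() {
          // Fetch the next batch only when the current one is exhausted.
          while (!current.hasNext() && nextIndex < names.size()) {
            int end = Math.min(nextIndex + batchSize, names.size());
            current = fetchBatch.apply(names.subList(nextIndex, end)).iterator();
            nextIndex = end;
          }
          return current.hasNext();
        }

        @Override
        public T next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          return current.next();
        }
      };
    }
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("t1", "t2", "t3", "t4", "t5");
    int[] calls = {0};
    BatchIterable<String, String> tables = new BatchIterable<>(names, 2, batch -> {
      calls[0]++; // count simulated remote calls
      List<String> out = new ArrayList<>();
      for (String n : batch) {
        out.add("Table(" + n + ")");
      }
      return out;
    });
    // Process each object as it arrives instead of collecting them all first.
    for (String t : tables) {
      System.out.println(t);
    }
    System.out.println("fetch calls: " + calls[0]); // 5 names, batch size 2: 3 calls
  }
}
```

Under these assumptions, a caller that needs every table object at once gains nothing from the iterator, which is the tradeoff between call count and space mentioned in the thread.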
I recalled that we would like to retain the try-catch. We intentionally added it to avoid skipping everything when a single expiration fails.
See also: #5786 (comment)
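A minimal sketch of the pattern this comment asks to retain: wrapping the per-item work in its own try-catch so that one failure does not abort the rest of the loop. The class and method names here (`PerTableFailureIsolationSketch`, `processEach`) are hypothetical and are not taken from the Hive code base; the action passed in stands in for the expiration step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration only; names are not from the Hive code base. */
final class PerTableFailureIsolationSketch {

  /** Applies the action to every item and returns the items whose action threw. */
  static <T> List<T> processEach(Iterable<T> items, Consumer<T> action) {
    List<T> failed = new ArrayList<>();
    for (T item : items) {
      try {
        action.accept(item);
      } catch (RuntimeException e) {
        // Record the failure and continue with the remaining items.
        System.err.println("Skipping " + item + " after failure: " + e.getMessage());
        failed.add(item);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    List<String> tables = Arrays.asList("db.t1", "db.t2", "db.t3");
    List<String> failed = processEach(tables, name -> {
      if (name.equals("db.t2")) {
        throw new IllegalStateException("simulated expiration failure");
      }
      System.out.println("Expired snapshots for " + name);
    });
    System.out.println("failed: " + failed);
  }
}
```

Without the inner try-catch, the exception from the second item would propagate out of the loop and the third item would never be processed.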