Skip to content

Conversation

@DanielZhu58
Copy link
Contributor

@DanielZhu58 DanielZhu58 commented Nov 18, 2025

What changes were proposed in this pull request?

To add a new StatisticsManagementTask.java to automatically delete the old stats.

Why are the changes needed?

To help reduce the old or stale stats.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual tests and unit tests.

For reviewers: What this PR does

This PR introduces a new “Statistics Management Task” in the Hive metastore which periodically auto-deletes stale column statistics, plus configuration knobs.

In MetastoreConf.java
Three new configuration variables are added:
STATISTICS_MANAGEMENT_TASK_FREQUENCY
Meaning: Controls how often the StatisticsManagementTask runs, for tables that have statistics.auto.deletion=true in their table properties.
STATISTICS_RETENTION_PERIOD
Meaning: The retention period for stats. If a table/partition’s stats are older than this, they become candidates for auto deletion.
STATISTICS_AUTO_DELETION

In StatisticsManagementTask.java
Defines a new StatisticsManagementTask implementing MetastoreTaskThread. Its purpose is to:
Fetch STATISTICS_RETENTION_PERIOD and STATISTICS_AUTO_DELETION from conf. If retention <= 0 or auto deletion is disabled, log and return.
Compute lastAnalyzedThreshold = (now - retentionMillis) / 1000 (in seconds).
Use HMSHandler.getMSForConf(conf) to get RawStore and a PersistenceManager, then query MTableColumnStatistics rows where lastAnalyzed < threshold.
In short, this class implements a background cleanup task that scans MTableColumnStatistics for stale entries and deletes them via the metastore client.

In BenchmarkTool.java
BenchmarkTool can now benchmark the new statistics management task for different numbers of tables.

In HMSBenchmarks.java Test
Constructs a dedicated database name and table prefix based on tableCount and BenchData.
Gets an HMSClient and instantiates a StatisticsManagementTask.
Configures the client Hadoop conf:
hive.metastore.uris = metastore URI
metastore.statistics.management.database.pattern = dbName (so the task focuses on this DB)
Sets the task’s conf and creates the database and tableCount tables:
Simulates old stats:
For each partition, sets lastAnalyzed to now - 400 days in the partition parameters and alters the partition.
Post-run assertion:
Re-scans all partitions; if any partition parameters still contain lastAnalyzed, it throws an AssertionError("Partition stats not deleted for table: " + tableName).
In other words, this is an end-to-end microbenchmark for the new StatisticsManagementTask that both measures performance and verifies that “old” partition stats are actually cleaned up.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Dec 4, 2025

@Override
public void setConf(Configuration configuration) {
// we modify conf in setupConf(), so we make a copy
conf = new Configuration(configuration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why create a new conf object? Why not directly assign it? IMO, it is a costly operation to create conf object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Changed.

// delete all column stats
@Override
public void run() {
if (LOG.isDebugEnabled()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you don't need this if condition, log.debug() will print to debug logs if debug logging is enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

PersistenceManager pm = ((ObjectStore) ms).getPersistenceManager();
Query q = pm.newQuery(MTableColumnStatistics.class);
q.setFilter(filter);
q.declareParameters(paramStr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also set result fields before querying like this: https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1933? That way we can avoid fetching heavy weight 'Mtable' object and other useless stats in the MTableColumnStatistics class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Changed.

String tblName = stat.getTable().getTableName();
Map<String, String> tblParams = stat.getTable().getParameters();
if (tblParams != null && tblParams.getOrDefault(STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, null) != null) {
LOG.info("Skipping table {}.{} due to exclude property.", dbName, tblName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make this message more intuitive.
nit: "Skipping auto deletion of stats for table {}.{} due to {} property being set on the table.", dbName, tblName

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

final StatisticsManagementTask statsTask = new StatisticsManagementTask();
final FileSystem fs;
try {
fs = FileSystem.get(client.getHadoopConf());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unused variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

}

Runnable preRun = () -> {
System.out.println("Preparing for benchmark...");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: replace this with log.info or log.debug

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

try {
fs = FileSystem.get(client.getHadoopConf());
client.getHadoopConf().set("hive.metastore.uris", client.getServerURI().toString());
client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this config? we don't have it in our metastoreconf right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this config key is not in MetastoreConf .
It was intended as a DB filter for StatisticsManagementTask. We can remove it from the HMSBenchmarks.java

Map<String, String> params = partition.getParameters();
// to manually change the "lastAnalyzed" to an old time, ex. 400 days
params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)));
client.alterPartition(dbName, tableName, partition);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could improve this by calling alterPartitions() and changing all partitions in one go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged.

} catch (Exception e) {
throw new RuntimeException(e);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can also test a scenario where auto stats are skip by setting the table property and verify that stats exists even after running stats management task.
I think it would be better to test it out in a different test class instead of benchmark test

@sonarqubecloud
Copy link

Copy link
Contributor

@saihemanth-cloudera saihemanth-cloudera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect rebase. The latest PR inadvertently pulled in 32 commits into your patch. Please correct it.

@DanielZhu58
Copy link
Contributor Author

During the rebase and sync process, I accidentally closed this PR.
This is the new PR link: #6264

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants