refactor: analyze table #18514


Open · wants to merge 9 commits into base: main

Conversation

@zhyass (Member) commented Aug 10, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR improves the ANALYZE TABLE process by directly merging pre-collected block-level HyperLogLog (HLL) sketches instead of relying on query-based calculations.

  • Previously, HLL statistics were computed dynamically during ANALYZE TABLE by running queries over the data.

  • Now, the block-level HLL sketches are merged directly into the table-level statistics, reducing calculation time.

  • This lowers the cost of ANALYZE TABLE by reusing existing statistics, improving performance, especially for large tables.
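The merge step described above can be sketched as follows. This is a minimal, self-contained illustration of HLL register merging (element-wise maximum of per-block registers), not Databend's actual types or function names:

```rust
// Hypothetical sketch: each block carries an HLL sketch as a fixed-size
// register array; merging sketches is an element-wise max, after which the
// merged registers can feed the usual HLL cardinality estimator.
fn merge_hlls(blocks: &[Vec<u8>]) -> Vec<u8> {
    let m = blocks.first().map_or(0, |b| b.len());
    let mut merged = vec![0u8; m];
    for block in blocks {
        // All sketches must share the same register count `m`.
        for (dst, &src) in merged.iter_mut().zip(block.iter()) {
            *dst = (*dst).max(src);
        }
    }
    merged
}

fn main() {
    let block_a = vec![1, 0, 3, 2];
    let block_b = vec![0, 2, 1, 4];
    // Element-wise max of the two register arrays.
    println!("{:?}", merge_hlls(&[block_a, block_b])); // prints [1, 2, 3, 4]
}
```

Because the merge only reads already-persisted registers, no scan of the table data is needed, which is where the cost reduction comes from.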

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@zhyass zhyass marked this pull request as draft August 10, 2025 18:14
@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Aug 10, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Aug 11, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Aug 11, 2025
Contributor

Docker Image for PR

  • tag: pr-18514-858657c-1754943720

note: this image tag is only available for internal use.

@zhyass zhyass marked this pull request as ready for review August 12, 2025 03:12
Contributor

Docker Image for PR

  • tag: pr-18514-858657c-1754970108

note: this image tag is only available for internal use.

@databendlabs databendlabs deleted a comment from github-actions bot Aug 12, 2025
@sundy-li (Member) left a comment


LGTM

@BohuTANG (Member) commented

NDV (Number of Distinct Values) Statistics Accuracy Comparison Report

Data Source and Methodology

  • Estimated NDV: From system.columns.ndv field, collected via ANALYZE TABLE operations
  • Actual NDV: Calculated using COUNT(DISTINCT column_name) queries against real data
  • Comparison Scope: 21 key columns across major TPC-DS tables
  • Database Versions: tpcds_100 vs tpcds_100_v2 (different statistics collection methods)
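The error percentages in the table below follow the standard relative-error formula, |estimated − actual| / actual × 100. A minimal sketch (the helper name is hypothetical, chosen for illustration):

```rust
// Relative error of an NDV estimate against the true distinct count,
// expressed as a percentage: |estimated - actual| / actual * 100.
fn ndv_error_pct(actual: f64, estimated: f64) -> f64 {
    ((estimated - actual).abs() / actual) * 100.0
}

fn main() {
    // cs_bill_customer_sk from the table: actual 1,999,335,
    // tpcds_100_v2 estimate 1,785,814.
    let err = ndv_error_pct(1_999_335.0, 1_785_814.0);
    println!("{:.2}%", err); // prints 10.68%
}
```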

Complete NDV Accuracy Comparison Table

| No. | Table.Column | Actual NDV | tpcds_100 Estimated | tpcds_100_v2 Estimated | tpcds_100 Error % | tpcds_100_v2 Error % | tpcds_100 Grade | tpcds_100_v2 Grade |
|-----|--------------|-----------|--------------------|-----------------------|------------------|---------------------|-----------------|--------------------|
| 1 | catalog_sales.cs_bill_customer_sk | 1,999,335 | 2,000,000 | 1,785,814 | 0.03% | 10.68% | Excellent | Poor |
| 2 | catalog_sales.cs_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 3 | catalog_sales.cs_quantity | 100 | 92 | 100 | 8.00% | 0.00% | Fair | Perfect |
| 4 | customer.c_birth_country | 211 | 219 | 216 | 3.79% | 2.37% | Good | Good |
| 5 | customer.c_birth_month | 12 | 12 | 12 | 0.00% | 0.00% | Perfect | Perfect |
| 6 | customer.c_birth_year | 69 | 64 | 64 | 7.25% | 7.25% | Fair | Fair |
| 7 | customer.c_customer_id | 2,000,000 | 1,858,387 | 2,275,552 | 7.08% | 13.78% | Fair | Poor |
| 8 | customer.c_customer_sk | 2,000,000 | 2,000,000 | 1,785,814 | 0.00% | 10.71% | Perfect | Poor |
| 9 | date_dim.d_date | 73,049 | 73,049 | 66,817 | 0.00% | 8.53% | Perfect | Fair |
| 10 | date_dim.d_date_sk | 73,049 | 68,193 | 73,049 | 6.65% | 0.00% | Fair | Perfect |
| 11 | date_dim.d_month_seq | 2,401 | 2,401 | 2,401 | 0.00% | 0.00% | Perfect | Perfect |
| 12 | date_dim.d_year | 201 | 201 | 184 | 0.00% | 8.46% | Perfect | Fair |
| 13 | item.i_brand | 712 | 698 | 782 | 1.97% | 9.83% | Good | Fair |
| 14 | item.i_category | 10 | 10 | 9 | 0.00% | 10.00% | Perfect | Poor |
| 15 | item.i_class | 99 | 102 | 110 | 3.03% | 11.11% | Good | Poor |
| 16 | item.i_item_id | 102,000 | 106,922 | 110,991 | 4.83% | 8.81% | Good | Fair |
| 17 | item.i_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 18 | store_sales.ss_customer_sk | 1,999,984 | 2,000,000 | 1,785,814 | 0.00% | 10.71% | Excellent | Poor |
| 19 | store_sales.ss_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 20 | store_sales.ss_quantity | 100 | 92 | 100 | 8.00% | 0.00% | Fair | Perfect |
| 21 | store_sales.ss_store_sk | 201 | 201 | 211 | 0.00% | 4.98% | Perfect | Good |

Key Findings and Analysis

Significant Performance Differences Detected

1. tpcds_100 Shows Superior Accuracy Overall

  • Perfect/Excellent Grades: 11 out of 21 columns (52.4%)
  • Average Error Rate: 2.37%
  • Poor Grades: none

2. tpcds_100_v2 Shows Mixed Performance

  • Perfect/Excellent Grades: 4 out of 21 columns (19.0%)
  • Average Error Rate: 6.71%
  • Poor Grades: 6 columns (28.6%) - Significant concern

Critical Problem Areas in tpcds_100_v2

  1. Customer Keys: Major underestimation in customer-related surrogate keys

    • cs_bill_customer_sk: 10.68% error
    • c_customer_sk: 10.71% error
    • ss_customer_sk: 10.71% error
  2. Item Classifications: Poor estimation of categorical data

    • i_category: 10.00% error (9 vs 10 actual)
    • i_class: 11.11% error (110 vs 99 actual)
  3. Customer Identifiers: Significant overestimation

    • c_customer_id: 13.78% error (2.28M vs 2M actual)

Areas Where tpcds_100_v2 Performs Better

  • Quantity Fields: Perfect accuracy on discrete value ranges
  • Some Date Keys: Better estimation on certain temporal dimensions

Statistical Quality Impact

  1. Query Optimization Reliability: tpcds_100 provides more reliable cardinality estimates
  2. Join Cost Estimation: tpcds_100_v2's poor customer key estimates could lead to suboptimal join ordering
  3. Memory Planning: Underestimated NDV values may cause inadequate hash table sizing
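To illustrate the hash-table-sizing point above: a planner-style capacity computed from an underestimated NDV falls short of what the real distinct count requires, forcing growth or rehashing at execution time. The function name and the 0.75 load factor are hypothetical, chosen only for illustration:

```rust
// Size a hash table for an expected number of distinct keys, leaving
// headroom according to a target load factor.
fn hash_table_capacity(ndv_estimate: usize, load_factor: f64) -> usize {
    ((ndv_estimate as f64) / load_factor).ceil() as usize
}

fn main() {
    // ss_customer_sk: tpcds_100_v2 estimates ~1.79M distinct keys,
    // but the actual count is ~2.0M.
    let planned = hash_table_capacity(1_785_814, 0.75);
    let needed = hash_table_capacity(1_999_984, 0.75);
    // The planned capacity is short by the difference, so the table
    // would have to grow mid-join.
    println!("planned={planned}, needed={needed}, shortfall={}", needed - planned);
}
```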

Conclusion

tpcds_100 demonstrates significantly superior NDV estimation accuracy compared to tpcds_100_v2, with:

  • 2.8x better average accuracy (2.37% vs 6.71% error rate)
  • Zero poor-grade estimations vs 6 poor-grade columns in v2
  • More reliable cardinality foundation for query optimization

Comment on lines +139 to +142
let prev_snapshot_id = snapshot.prev_snapshot_id.map(|(id, _)| id);
// Return early when the statistics were collected at the previous snapshot.
if Some(table_statistics.snapshot_id) == prev_snapshot_id {
    return Ok(PipelineBuildResult::create());
}
Member

Shall we compare with the current snapshot id?

Labels
ci-benchmark Benchmark: run all test pr-refactor this PR changes the code base without new features or bugfix

4 participants