refactor: analyze table #18514


Open · wants to merge 9 commits into base: main

Conversation

@zhyass (Member) commented Aug 10, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR improves the ANALYZE TABLE process by directly merging pre-collected block-level HyperLogLog (HLL) sketches instead of relying on query-based calculations.

  • Previously, HLL statistics were computed dynamically during ANALYZE TABLE by running queries over the data.

  • Now, the block-level HLL sketches are merged directly into the table-level statistics, reducing calculation time.

  • This lowers the cost of ANALYZE TABLE by reusing existing statistics, improving performance, especially for large tables.
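The merge step described above can be sketched as follows. This is a minimal, self-contained illustration of HLL register merging (element-wise maximum of per-block registers), not Databend's actual types or function names:

```rust
// Hypothetical sketch: each block carries an HLL sketch as a fixed-size
// register array; merging sketches is an element-wise max, after which the
// merged registers can feed the usual HLL cardinality estimator.
fn merge_hlls(blocks: &[Vec<u8>]) -> Vec<u8> {
    let m = blocks.first().map_or(0, |b| b.len());
    let mut merged = vec![0u8; m];
    for block in blocks {
        // All sketches must share the same register count `m`.
        for (dst, &src) in merged.iter_mut().zip(block.iter()) {
            *dst = (*dst).max(src);
        }
    }
    merged
}

fn main() {
    let block_a = vec![1, 0, 3, 2];
    let block_b = vec![0, 2, 1, 4];
    // Element-wise max of the two register arrays.
    println!("{:?}", merge_hlls(&[block_a, block_b])); // prints [1, 2, 3, 4]
}
```

Because the merge only reads already-persisted registers, no scan of the table data is needed, which is where the cost reduction comes from.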

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@zhyass zhyass marked this pull request as draft August 10, 2025 18:14
@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Aug 10, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Aug 11, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Aug 11, 2025
Contributor

Docker Image for PR

  • tag: pr-18514-858657c-1754943720

note: this image tag is only available for internal use.

@zhyass zhyass marked this pull request as ready for review August 12, 2025 03:12
Contributor

Docker Image for PR

  • tag: pr-18514-858657c-1754970108

note: this image tag is only available for internal use.

@databendlabs databendlabs deleted a comment from github-actions bot Aug 12, 2025
@sundy-li (Member) left a comment


LGTM

@BohuTANG (Member) commented

NDV (Number of Distinct Values) Statistics Accuracy Comparison Report

Data Source and Methodology

  • Estimated NDV: From system.columns.ndv field, collected via ANALYZE TABLE operations
  • Actual NDV: Calculated using COUNT(DISTINCT column_name) queries against real data
  • Comparison Scope: 21 key columns across major TPC-DS tables
  • Database Versions: tpcds_100 vs tpcds_100_v2 (different statistics collection methods)
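The error percentages in the table below follow the standard relative-error formula, |estimated − actual| / actual × 100. A minimal sketch (the helper name is hypothetical, chosen for illustration):

```rust
// Relative error of an NDV estimate against the true distinct count,
// expressed as a percentage: |estimated - actual| / actual * 100.
fn ndv_error_pct(actual: f64, estimated: f64) -> f64 {
    ((estimated - actual).abs() / actual) * 100.0
}

fn main() {
    // cs_bill_customer_sk from the table: actual 1,999,335,
    // tpcds_100_v2 estimate 1,785,814.
    let err = ndv_error_pct(1_999_335.0, 1_785_814.0);
    println!("{:.2}%", err); // prints 10.68%
}
```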

Complete NDV Accuracy Comparison Table

| No. | Table.Column | Actual NDV | tpcds_100 Estimated | tpcds_100_v2 Estimated | tpcds_100 Error % | tpcds_100_v2 Error % | tpcds_100 Grade | tpcds_100_v2 Grade |
|-----|--------------|-----------|--------------------|-----------------------|------------------|---------------------|-----------------|--------------------|
| 1 | catalog_sales.cs_bill_customer_sk | 1,999,335 | 2,000,000 | 1,785,814 | 0.03% | 10.68% | Excellent | Poor |
| 2 | catalog_sales.cs_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 3 | catalog_sales.cs_quantity | 100 | 92 | 100 | 8.00% | 0.00% | Fair | Perfect |
| 4 | customer.c_birth_country | 211 | 219 | 216 | 3.79% | 2.37% | Good | Good |
| 5 | customer.c_birth_month | 12 | 12 | 12 | 0.00% | 0.00% | Perfect | Perfect |
| 6 | customer.c_birth_year | 69 | 64 | 64 | 7.25% | 7.25% | Fair | Fair |
| 7 | customer.c_customer_id | 2,000,000 | 1,858,387 | 2,275,552 | 7.08% | 13.78% | Fair | Poor |
| 8 | customer.c_customer_sk | 2,000,000 | 2,000,000 | 1,785,814 | 0.00% | 10.71% | Perfect | Poor |
| 9 | date_dim.d_date | 73,049 | 73,049 | 66,817 | 0.00% | 8.53% | Perfect | Fair |
| 10 | date_dim.d_date_sk | 73,049 | 68,193 | 73,049 | 6.65% | 0.00% | Fair | Perfect |
| 11 | date_dim.d_month_seq | 2,401 | 2,401 | 2,401 | 0.00% | 0.00% | Perfect | Perfect |
| 12 | date_dim.d_year | 201 | 201 | 184 | 0.00% | 8.46% | Perfect | Fair |
| 13 | item.i_brand | 712 | 698 | 782 | 1.97% | 9.83% | Good | Fair |
| 14 | item.i_category | 10 | 10 | 9 | 0.00% | 10.00% | Perfect | Poor |
| 15 | item.i_class | 99 | 102 | 110 | 3.03% | 11.11% | Good | Poor |
| 16 | item.i_item_id | 102,000 | 106,922 | 110,991 | 4.83% | 8.81% | Good | Fair |
| 17 | item.i_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 18 | store_sales.ss_customer_sk | 1,999,984 | 2,000,000 | 1,785,814 | 0.00% | 10.71% | Excellent | Poor |
| 19 | store_sales.ss_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 20 | store_sales.ss_quantity | 100 | 92 | 100 | 8.00% | 0.00% | Fair | Perfect |
| 21 | store_sales.ss_store_sk | 201 | 201 | 211 | 0.00% | 4.98% | Perfect | Good |

Key Findings and Analysis

Significant Performance Differences Detected

1. tpcds_100 Shows Superior Accuracy Overall

  • Perfect/Excellent Grades: 11 out of 21 columns (52.4%)
  • Average Error Rate: 2.37%
  • Poor Grades: none

2. tpcds_100_v2 Shows Mixed Performance

  • Perfect/Excellent Grades: 4 out of 21 columns (19.0%)
  • Average Error Rate: 6.71%
  • Poor Grades: 6 columns (28.6%) - Significant concern

Critical Problem Areas in tpcds_100_v2

  1. Customer Keys: Major underestimation in customer-related surrogate keys

    • cs_bill_customer_sk: 10.68% error
    • c_customer_sk: 10.71% error
    • ss_customer_sk: 10.71% error
  2. Item Classifications: Poor estimation of categorical data

    • i_category: 10.00% error (9 vs 10 actual)
    • i_class: 11.11% error (110 vs 99 actual)
  3. Customer Identifiers: Significant overestimation

    • c_customer_id: 13.78% error (2.28M vs 2M actual)

Areas Where tpcds_100_v2 Performs Better

  • Quantity Fields: Perfect accuracy on discrete value ranges
  • Some Date Keys: Better estimation on certain temporal dimensions

Statistical Quality Impact

  1. Query Optimization Reliability: tpcds_100 provides more reliable cardinality estimates
  2. Join Cost Estimation: tpcds_100_v2's poor customer key estimates could lead to suboptimal join ordering
  3. Memory Planning: Underestimated NDV values may cause inadequate hash table sizing
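To illustrate the hash-table-sizing point above: a planner-style capacity computed from an underestimated NDV falls short of what the real distinct count requires, forcing growth or rehashing at execution time. The function name and the 0.75 load factor are hypothetical, chosen only for illustration:

```rust
// Size a hash table for an expected number of distinct keys, leaving
// headroom according to a target load factor.
fn hash_table_capacity(ndv_estimate: usize, load_factor: f64) -> usize {
    ((ndv_estimate as f64) / load_factor).ceil() as usize
}

fn main() {
    // ss_customer_sk: tpcds_100_v2 estimates ~1.79M distinct keys,
    // but the actual count is ~2.0M.
    let planned = hash_table_capacity(1_785_814, 0.75);
    let needed = hash_table_capacity(1_999_984, 0.75);
    // The planned capacity is short by the difference, so the table
    // would have to grow mid-join.
    println!("planned={planned}, needed={needed}, shortfall={}", needed - planned);
}
```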

Conclusion

tpcds_100 demonstrates significantly superior NDV estimation accuracy compared to tpcds_100_v2, with:

  • 2.8x better average accuracy (2.37% vs 6.71% error rate)
  • Zero poor-grade estimations vs 6 poor-grade columns in v2
  • More reliable cardinality foundation for query optimization

Comment on lines +139 to +142
let prev_snapshot_id = snapshot.prev_snapshot_id.map(|(id, _)| id);
// Return early when the statistics were collected at the previous snapshot.
if Some(table_statistics.snapshot_id) == prev_snapshot_id {
    return Ok(PipelineBuildResult::create());
}
Member

Shall we compare with the current snapshot id?

Labels
ci-benchmark Benchmark: run all test pr-refactor this PR changes the code base without new features or bugfix

4 participants