@zhyass zhyass commented Aug 10, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR improves the ANALYZE TABLE process by directly merging pre-collected block-level HyperLogLog (HLL) data, instead of relying on query-based calculations.

  • Previously, ANALYZE TABLE computed HLL statistics dynamically by running queries.

  • Now, the pre-collected block-level HLL data is merged directly, which avoids those queries and shortens calculation time.

  • Reusing existing statistics lowers the cost of ANALYZE TABLE, especially for large tables.
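
The merge step can be illustrated with a toy HyperLogLog. This is a minimal sketch under stated assumptions, not Databend's implementation: the register count, the hash function, and every name (`new_sketch`, `add`, `merge`, `estimate`) are hypothetical. It shows the property the change relies on: the register-wise maximum of per-block sketches equals the sketch obtained by scanning all values.

```python
import hashlib
import math

P = 12                 # assumed precision: 2**12 = 4096 registers
M = 1 << P
TAIL_BITS = 64 - P     # bits left over after the register index


def _hash64(value) -> int:
    """Deterministic 64-bit hash (blake2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(repr(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def new_sketch() -> list:
    return [0] * M


def add(sketch: list, value) -> None:
    h = _hash64(value)
    idx = h >> TAIL_BITS                      # top P bits pick the register
    tail = h & ((1 << TAIL_BITS) - 1)         # remaining bits
    rank = TAIL_BITS - tail.bit_length() + 1  # 1-based position of the first 1 bit
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(a: list, b: list) -> list:
    """Merging two sketches is a register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch: list) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:              # small-range correction
        return M * math.log(M / zeros)
    return raw


# Three overlapping "blocks" covering 50,000 distinct values in total.
blocks = [range(0, 20_000), range(15_000, 35_000), range(30_000, 50_000)]

# Block-level sketches, as if collected when each block was written.
merged = new_sketch()
for block in blocks:
    block_sketch = new_sketch()
    for v in block:
        add(block_sketch, v)
    merged = merge(merged, block_sketch)

# Sketch built by scanning every value, as a query-based approach would.
whole = new_sketch()
for block in blocks:
    for v in block:
        add(whole, v)

print(merged == whole, round(estimate(merged)))
```

Merging only touches the fixed-size registers, so in this sketch its cost grows with the number of blocks rather than the number of rows.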

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@zhyass zhyass marked this pull request as draft August 10, 2025 18:14
@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Aug 10, 2025
@zhyass zhyass added ci-benchmark Benchmark: run all test and removed ci-benchmark Benchmark: run all test labels Aug 11, 2025
@databendlabs databendlabs deleted a comment from github-actions bot Aug 11, 2025
Docker Image for PR

  • tag: pr-18514-858657c-1754943720

note: this image tag is only available for internal use.

@zhyass zhyass marked this pull request as ready for review August 12, 2025 03:12
Docker Image for PR

  • tag: pr-18514-858657c-1754970108

note: this image tag is only available for internal use.

@databendlabs databendlabs deleted a comment from github-actions bot Aug 12, 2025
@sundy-li sundy-li left a comment

LGTM

@BohuTANG
NDV (Number of Distinct Values) Statistics Accuracy Comparison Report

Data Source and Methodology

  • Estimated NDV: From system.columns.ndv field, collected via ANALYZE TABLE operations
  • Actual NDV: Calculated using COUNT(DISTINCT column_name) queries against real data
  • Comparison Scope: 21 key columns across major TPC-DS tables
  • Database Versions: tpcds_100 vs tpcds_100_v2 (different statistics collection methods)
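
The error percentages and grades in the report can be reproduced with a small helper. This is a minimal sketch: the grade thresholds (below 1%, 5%, and 10%) are an assumption inferred from the values in the comparison table, since the report does not state them, and the function names are hypothetical.

```python
def error_pct(actual: int, estimated: int) -> float:
    """Relative NDV error as a percentage of the actual NDV."""
    return 100.0 * abs(estimated - actual) / actual


def grade(actual: int, estimated: int) -> str:
    """Map an estimate to a grade; thresholds are inferred, not documented."""
    if estimated == actual:
        return "Perfect"
    err = error_pct(actual, estimated)
    if err < 1.0:
        return "Excellent"
    if err < 5.0:
        return "Good"
    if err < 10.0:
        return "Fair"
    return "Poor"


# Values taken from the catalog_sales.cs_bill_customer_sk row of the report.
actual, est_v1, est_v2 = 1_999_335, 2_000_000, 1_785_814
print(round(error_pct(actual, est_v1), 2), grade(actual, est_v1))  # 0.03 Excellent
print(round(error_pct(actual, est_v2), 2), grade(actual, est_v2))  # 10.68 Poor
```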

Complete NDV Accuracy Comparison Table

| No. | Table.Column | Actual NDV | tpcds_100 Estimated | tpcds_100_v2 Estimated | tpcds_100 Error % | tpcds_100_v2 Error % | tpcds_100 Grade | tpcds_100_v2 Grade |
|---|---|---|---|---|---|---|---|---|
| 1 | catalog_sales.cs_bill_customer_sk | 1,999,335 | 2,000,000 | 1,785,814 | 0.03% | 10.68% | Excellent | Poor |
| 2 | catalog_sales.cs_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 3 | catalog_sales.cs_quantity | 100 | 92 | 100 | 8.00% | 0.00% | Fair | Perfect |
| 4 | customer.c_birth_country | 211 | 219 | 216 | 3.79% | 2.37% | Good | Good |
| 5 | customer.c_birth_month | 12 | 12 | 12 | 0.00% | 0.00% | Perfect | Perfect |
| 6 | customer.c_birth_year | 69 | 64 | 64 | 7.25% | 7.25% | Fair | Fair |
| 7 | customer.c_customer_id | 2,000,000 | 1,858,387 | 2,275,552 | 7.08% | 13.78% | Fair | Poor |
| 8 | customer.c_customer_sk | 2,000,000 | 2,000,000 | 1,785,814 | 0.00% | 10.71% | Perfect | Poor |
| 9 | date_dim.d_date | 73,049 | 73,049 | 66,817 | 0.00% | 8.53% | Perfect | Fair |
| 10 | date_dim.d_date_sk | 73,049 | 68,193 | 73,049 | 6.65% | 0.00% | Fair | Perfect |
| 11 | date_dim.d_month_seq | 2,401 | 2,401 | 2,401 | 0.00% | 0.00% | Perfect | Perfect |
| 12 | date_dim.d_year | 201 | 201 | 184 | 0.00% | 8.46% | Perfect | Fair |
| 13 | item.i_brand | 712 | 698 | 782 | 1.97% | 9.83% | Good | Fair |
| 14 | item.i_category | 10 | 10 | 9 | 0.00% | 10.00% | Perfect | Poor |
| 15 | item.i_class | 99 | 102 | 110 | 3.03% | 11.11% | Good | Poor |
| 16 | item.i_item_id | 102,000 | 106,922 | 110,991 | 4.83% | 8.81% | Good | Fair |
| 17 | item.i_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 18 | store_sales.ss_customer_sk | 1,999,984 | 2,000,000 | 1,785,814 | 0.00% | 10.71% | Excellent | Poor |
| 19 | store_sales.ss_item_sk | 204,000 | 203,827 | 190,915 | 0.08% | 6.41% | Excellent | Fair |
| 20 | store_sales.ss_quantity | 100 | 92 | 100 | 8.00% | 0.00% | Fair | Perfect |
| 21 | store_sales.ss_store_sk | 201 | 201 | 211 | 0.00% | 4.98% | Perfect | Good |

Key Findings and Analysis

Significant Performance Differences Detected

1. tpcds_100 Shows Superior Accuracy Overall

  • Perfect/Excellent Grades: 12 out of 21 columns (57.1%)
  • Average Error Rate: 2.42%
  • Poor Grades: 0 columns

2. tpcds_100_v2 Shows Mixed Performance

  • Perfect/Excellent Grades: 5 out of 21 columns (23.8%)
  • Average Error Rate: 6.50%
  • Poor Grades: 6 columns (28.6%), a significant concern
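
The summary figures in this section can be recomputed from the comparison table. The following is a minimal sketch, assuming "Perfect/Excellent" means an error below 1% and "Poor" means an error of 10% or more (thresholds inferred from the table, not stated in the report). The helper names are hypothetical.

```python
# (column, actual NDV, tpcds_100 estimate, tpcds_100_v2 estimate), copied from the table.
ROWS = [
    ("catalog_sales.cs_bill_customer_sk", 1_999_335, 2_000_000, 1_785_814),
    ("catalog_sales.cs_item_sk", 204_000, 203_827, 190_915),
    ("catalog_sales.cs_quantity", 100, 92, 100),
    ("customer.c_birth_country", 211, 219, 216),
    ("customer.c_birth_month", 12, 12, 12),
    ("customer.c_birth_year", 69, 64, 64),
    ("customer.c_customer_id", 2_000_000, 1_858_387, 2_275_552),
    ("customer.c_customer_sk", 2_000_000, 2_000_000, 1_785_814),
    ("date_dim.d_date", 73_049, 73_049, 66_817),
    ("date_dim.d_date_sk", 73_049, 68_193, 73_049),
    ("date_dim.d_month_seq", 2_401, 2_401, 2_401),
    ("date_dim.d_year", 201, 201, 184),
    ("item.i_brand", 712, 698, 782),
    ("item.i_category", 10, 10, 9),
    ("item.i_class", 99, 102, 110),
    ("item.i_item_id", 102_000, 106_922, 110_991),
    ("item.i_item_sk", 204_000, 203_827, 190_915),
    ("store_sales.ss_customer_sk", 1_999_984, 2_000_000, 1_785_814),
    ("store_sales.ss_item_sk", 204_000, 203_827, 190_915),
    ("store_sales.ss_quantity", 100, 92, 100),
    ("store_sales.ss_store_sk", 201, 201, 211),
]


def error_pct(actual: int, estimated: int) -> float:
    """Relative NDV error as a percentage of the actual NDV."""
    return 100.0 * abs(estimated - actual) / actual


def summarize(errors: list) -> dict:
    """Mean error plus counts at the assumed grade thresholds."""
    return {
        "mean_error_pct": sum(errors) / len(errors),
        "perfect_or_excellent": sum(e < 1.0 for e in errors),
        "poor": sum(e >= 10.0 for e in errors),
    }


v1 = summarize([error_pct(actual, est) for _, actual, est, _ in ROWS])
v2 = summarize([error_pct(actual, est) for _, actual, _, est in ROWS])
print("tpcds_100:   ", v1)
print("tpcds_100_v2:", v2)
```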

Critical Problem Areas in tpcds_100_v2

  1. Customer Keys: Major underestimation in customer-related surrogate keys

    • cs_bill_customer_sk: 10.68% error
    • c_customer_sk: 10.71% error
    • ss_customer_sk: 10.71% error
  2. Item Classifications: Poor estimation of categorical data

    • i_category: 10.00% error (9 vs 10 actual)
    • i_class: 11.11% error (110 vs 99 actual)
  3. Customer Identifiers: Significant overestimation

    • c_customer_id: 13.78% error (2.28M vs 2M actual)

Areas Where tpcds_100_v2 Performs Better

  • Quantity Fields: Exact on cs_quantity and ss_quantity (100 estimated vs 92 in tpcds_100)
  • Some Date Keys: Exact on d_date_sk (73,049 estimated vs 68,193 in tpcds_100)

Statistical Quality Impact

  1. Query Optimization Reliability: tpcds_100 provides more reliable cardinality estimates
  2. Join Cost Estimation: tpcds_100_v2's poor customer key estimates could lead to suboptimal join ordering
  3. Memory Planning: Underestimated NDV values may cause inadequate hash table sizing
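
To make the join-cost point concrete, the sketch below applies the textbook equi-join cardinality formula, |R ⋈ S| ≈ |R| · |S| / max(NDV_R, NDV_S). This is an illustration only: it is not taken from Databend's optimizer, the function name is hypothetical, and the row counts are assumed round numbers rather than measurements. The NDV values come from the report.

```python
def estimated_join_rows(rows_left: int, rows_right: int,
                        ndv_left: int, ndv_right: int) -> float:
    """Textbook equi-join estimate: |R| * |S| / max(NDV_R, NDV_S)."""
    return rows_left * rows_right / max(ndv_left, ndv_right)


# Assumed (hypothetical) row counts for store_sales and customer.
STORE_SALES_ROWS = 288_000_000
CUSTOMER_ROWS = 2_000_000

# Join on ss_customer_sk = c_customer_sk, using NDV values from the report.
with_actual_ndv = estimated_join_rows(
    STORE_SALES_ROWS, CUSTOMER_ROWS, 1_999_984, 2_000_000)
with_v2_ndv = estimated_join_rows(
    STORE_SALES_ROWS, CUSTOMER_ROWS, 1_785_814, 1_785_814)

# An NDV underestimate of about 10.7% inflates the join estimate by about 12%.
inflation = with_v2_ndv / with_actual_ndv
print(f"{with_actual_ndv:,.0f} rows vs {with_v2_ndv:,.0f} rows ({inflation:.3f}x)")
```

Under this formula the estimate scales inversely with the NDV, so an underestimated key NDV directly inflates the predicted join output.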

Conclusion

tpcds_100 demonstrates significantly superior NDV estimation accuracy compared to tpcds_100_v2, with:

  • About 2.7x lower average error (2.42% vs 6.50% error rate)
  • Zero poor-grade estimations vs 6 poor-grade columns in v2
  • More reliable cardinality foundation for query optimization

@dantengsky dantengsky merged commit 2ba98f2 into databendlabs:main Aug 13, 2025
147 of 153 checks passed