
Conversation

@yjhjstz yjhjstz commented Jul 17, 2025

The first commit, "fast-analyze: implement fast ANALYZE for append-optimized tables", is cherry-picked from GPDB — thanks to Greenplum.

The second commit, "Fix fast analyze for PAX tables and simplify acquisition function selection", is the follow-up fix.

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel


@my-ship-it my-ship-it self-requested a review July 18, 2025 01:39
@my-ship-it
Contributor

Does the new algorithm support PAX?

@yjhjstz
Member Author

yjhjstz commented Jul 18, 2025

Yeah, but it needs some work.

@yjhjstz yjhjstz marked this pull request as ready for review July 18, 2025 01:47
@my-ship-it
Contributor

Do we have test results for comparison?

@yjhjstz
Member Author

yjhjstz commented Jul 28, 2025

Test: TPC-DS 100GB

origin cbdb

********************************************************************************
Analyze
********************************************************************************
schema_name     seconds
tpcds   28.000000
(2 rows)

fast analyze cbdb

Analyze
********************************************************************************
schema_name     seconds
tpcds   16.000000
(2 rows)

gp7

Analyze
********************************************************************************
schema_name     seconds
tpcds   16
(2 rows)

ANALYZE performance on TPC-DS aligns with GP7.

@my-ship-it (Contributor) left a comment

LGTM

@yjhjstz yjhjstz requested review from gfphoenix78 and jiaqizho July 30, 2025 14:04
@yjhjstz yjhjstz force-pushed the fast-analyze branch 2 times, most recently from 6f018ed to ada640c on July 31, 2025 18:49
@yjhjstz yjhjstz requested a review from jiaqizho August 4, 2025 16:45
@yjhjstz yjhjstz requested a review from gfphoenix78 August 6, 2025 13:51
@jiaqizho (Contributor) left a comment

LGTM

Haolin Wang and others added 2 commits August 7, 2025 21:29
Prior to this patch, GPDB ANALYZE on large AO/CO tables was a time-consuming process,
because PostgreSQL's two-stage sampling method didn't work well on AO/CO tables:
GPDB had to unpack all varblocks up to the target tuples, which could easily amount to
a nearly full table scan if the sampled tuples fell near the end of the table.

Denis Smirnov <[email protected]>'s PR greenplum-db#11190 introduced
a `logical` block concept containing a fixed number of tuples to support PG's two-stage
sampling mechanism; it also sped up fetching target tuples by skipping decompression of
varblock content. Thanks to Denis Smirnov for the great contribution!

Also, thanks to Ashwin Agrawal <[email protected]> for the advice on leveraging the AO Block Directory
to locate the target sample row without scanning unnecessary varblocks, which brings another
significant performance improvement once the cache is warmed up.

In addition:
- GPDB has an AO/CO-specific feature that stores the total tuple count in an auxiliary table,
  where it can be obtained without much overhead.
- GPDB has `fetch` facilities that can find a varblock based on an AOTupleId without
  decompressing unnecessary varblocks.

Based on the above work and properties, we re-implemented AO/CO ANALYZE sampling in this patch
by combining Knuth's Algorithm S with varblock skipping, to address the time-consuming problem.

We didn't implement two-stage sampling for AO/CO because the total size of the data set (the total
tuple count) is known in advance, so Algorithm S is sufficient to satisfy the sampling requirement.

Special thanks to Zhenghua Lyu (https://kainwen.com/) for his detailed analysis of Algorithm S:
[Analysis of Algorithm S](https://kainwen.com/2022/11/06/analysis-of-algorithm-s) and the
follow-up [discussion](https://stackoverflow.com/questions/74345921/performance-comparsion-algorithm-s-and-algorithm-z?noredirect=1#comment131292564_74345921)

Here is a simple example to show the optimization effect:

[AO with compression, with Fast Analyze enabled]

create table ao (a int, b inet, c inet)
  with (appendonly=true, orientation=row, compresstype=zlib, compresslevel=3);
insert into ao
select i,
       ((i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text)::inet,
       ((i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text)::inet
from generate_series(1,10000000) i;

insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;

select count(*) from ao;
   count
------------
 1280000000
(1 row)

gpadmin=# analyze ao;
ANALYZE
Time: 2814.939 ms (00:02.815)
gpadmin=#

[with block directory and caching warmed]

gpadmin=# analyze ao;
ANALYZE
Time: 1605.342 ms (00:01.605)
gpadmin=#

[Legacy Analyze]

gpadmin=# analyze ao;
ANALYZE
Time: 59711.905 ms (00:59.712)
gpadmin=#

[Heap without compression]

create table heap (a int, b inet, c inet);
insert same data set

gpadmin=# analyze heap;
ANALYZE
Time: 2087.694 ms (00:02.088)
gpadmin=#

Co-authored-by: Soumyadeep Chakraborty <[email protected]>
Reviewed by: Ashwin Agrawal, Soumyadeep Chakraborty, Zhenglong Li, Qing Ma
Fix fast analyze for PAX tables and simplify acquisition function selection

This commit addresses several issues with fast analyze:

1. For PAX tables, we now estimate the number of blocks using
   table_relation_estimate_size() rather than RelationGetNumberOfBlocks(),
   since PAX uses a non-fixed block layout. This provides more accurate
   sampling for PAX tables.

2. Simplified the acquisition-function selection logic by always using
   gp_acquire_sample_rows_func for regular tables, removing the conditional
   check for rd_tableam->relation_acquire_sample_rows. This makes the code
   more straightforward and consistent.

3. Fixed an issue in datumstream.c by resetting blockRowCount when
   closing a file during analyze operations.
@yjhjstz yjhjstz merged commit 6e33101 into apache:main Aug 7, 2025
27 checks passed
