Fast analyze: implement fast ANALYZE for append-optimized tables #1241
Merged
Conversation
Contributor: Does the new algorithm support PAX?
Member (Author): Yes, but it needs some work.
Contributor: Do we have test results for comparison?
my-ship-it reviewed (Jul 24, 2025)
Member (Author): Test results, TPC-DS 100GB:

origin cbdb ANALYZE:
  schema_name | seconds
  tpcds       | 28.000000
  (2 rows)

fast analyze cbdb ANALYZE:
  schema_name | seconds
  tpcds       | 16.000000
  (2 rows)

gp7 ANALYZE:
  schema_name | seconds
  tpcds       | 16
  (2 rows)

ANALYZE performance on tpcds now aligns with GP7.
my-ship-it approved these changes (Jul 29, 2025)
Contributor my-ship-it left a comment: LGTM
jiaqizho reviewed (Jul 31, 2025)
Force-pushed: 6f018ed → ada640c → 619b845
gfphoenix78 reviewed (Aug 5, 2025)
jiaqizho approved these changes (Aug 7, 2025)
Contributor jiaqizho left a comment: LGTM
Prior to this patch, ANALYZE on large AO/CO tables in GPDB was a time-consuming process, because PostgreSQL's two-stage sampling method does not work well on AO/CO tables: GPDB had to unpack all varblocks up to the target tuples, which could easily amount to a nearly full table scan when the sampled tuples fall near the end of the table.

Denis Smirnov <[email protected]>'s PR greenplum-db#11190 introduced a `logical` block concept containing a fixed number of tuples to support PostgreSQL's two-stage sampling mechanism; it also sped up fetching target tuples by skipping decompression of varblock content. Thanks to Denis Smirnov for the great contribution! Thanks also to Ashwin Agrawal <[email protected]> for the advice on leveraging the AO Block Directory to locate target sample rows without scanning unnecessary varblocks, which brings another significant performance improvement once the cache is warmed up.

In addition:
- GPDB stores the total tuple count of an AO/CO table in an auxiliary table, so it can be obtained without much overhead.
- GPDB has `fetch` facilities that can find a varblock from an AOTupleId without decompressing unnecessary varblocks.

Based on the above work and properties, this patch re-implements AO/CO ANALYZE sampling by combining Knuth's Algorithm S with varblock skipping, to address the time-consuming problem. We did not implement two-stage sampling for AO/CO, because the total size of the data set (the total tuple count) is known in advance, so Algorithm S is sufficient to satisfy the sampling requirement.
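For readers unfamiliar with it, the sampling step can be sketched as follows. This is a minimal Python illustration of Knuth's Algorithm S (selection sampling), not the patch's actual C implementation; the function and variable names are our own. The key property the patch relies on is that the total record count N must be known up front, which is exactly what the AO/CO auxiliary table provides:

```python
import random

def algorithm_s(records, n, rng=random.random):
    """Knuth's Algorithm S: select exactly n items uniformly at
    random from a sequence of known total length N, in one
    sequential pass and in the original order."""
    records = list(records)
    N = len(records)
    sample = []
    selected = 0
    for t, rec in enumerate(records):
        # Select this record with probability (n - selected) / (N - t),
        # i.e. remaining picks over remaining records.
        if (N - t) * rng() < (n - selected):
            sample.append(rec)
            selected += 1
            if selected == n:
                break  # all n picks made; skip the rest of the scan
    return sample

# Pick 10 of 100 records; the result is always exactly 10 items,
# in ascending position order.
print(algorithm_s(range(100), 10))
```

Because the selected positions come out in ascending order, the scan can skip forward between picks, which is what makes combining Algorithm S with varblock skipping (jumping directly to the varblock holding the next target tuple) effective.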
Special thanks to Zhenghua Lyu (https://kainwen.com/) for the detailed analysis of Algorithm S: [Analysis of Algorithm S](https://kainwen.com/2022/11/06/analysis-of-algorithm-s) and the follow-up [discussion](https://stackoverflow.com/questions/74345921/performance-comparsion-algorithm-s-and-algorithm-z?noredirect=1#comment131292564_74345921).

Here is a simple example showing the optimization effect.

[AO with compression, with Fast Analyze enabled]

```sql
create table ao (a int, b inet, c inet)
  with (appendonly=true, orientation=row, compresstype=zlib, compresslevel=3);
insert into ao
  select i,
         (select ((i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text))::inet,
         (select ((i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text || '.' || (i%255)::text))::inet
  from generate_series(1,10000000) i;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;

select count(*) from ao;
   count
------------
 1280000000
(1 row)

gpadmin=# analyze ao;
ANALYZE
Time: 2814.939 ms (00:02.815)
```

[With block directory and caching warmed]

```sql
gpadmin=# analyze ao;
ANALYZE
Time: 1605.342 ms (00:01.605)
```

[Legacy Analyze]

```sql
gpadmin=# analyze ao;
ANALYZE
Time: 59711.905 ms (00:59.712)
```

[Heap without compression]

```sql
create table heap (a int, b inet, c inet);
-- insert the same data set
gpadmin=# analyze heap;
ANALYZE
Time: 2087.694 ms (00:02.088)
```

Co-authored-by: Soumyadeep Chakraborty <[email protected]>
Reviewed by: Ashwin Agrawal, Soumyadeep Chakraborty, Zhenglong Li, Qing Ma
Fix fast analyze for PAX tables and simplify acquisition function selection

This commit addresses several issues with fast analyze:
1. For PAX tables, we now estimate the number of blocks using table_relation_estimate_size() rather than RelationGetNumberOfBlocks(), since PAX uses a non-fixed block layout. This provides more accurate sampling for PAX tables.
2. Simplified the acquisition function selection logic by always using gp_acquire_sample_rows_func for regular tables, removing the conditional check for rd_tableam->relation_acquire_sample_rows. This makes the code more straightforward and consistent.
3. Fixed an issue in datumstream.c by resetting blockRowCount when closing a file during analyze operations.
The first commit, "fast-analyze: implement fast ANALYZE for append-optimized tables", is cherry-picked from GPDB. Thanks to Greenplum.
The second commit, "Fix fast analyze for PAX tables and simplify acquisition function selection", is the follow-up fix patch.
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions