
New additions to PostgreSQL 17 Internals book since v.14

PostgreSQL Internals is an excellent book for those who want to understand how PostgreSQL works internally. The book is written by Egor Rogov.

The original version of the book is in Russian, but an older edition, covering PostgreSQL 14, was excellently translated from Russian into English by Liudmila Mantrova. The Russian version has since been updated to cover newer releases of PostgreSQL, up to v.17.

Both the Russian and English versions are freely available online.

Since the book annotates the changes for each version in the margin using marks like v. 17, I searched the Russian book for all marks newer than 14 (i.e., v.15, v.16, v.17) and translated the related text using Gemini. I've also occasionally provided context notes to help the reader understand the changes.

This should give enthusiasts a quick overview of the internal changes since v.14 that are mentioned in the book. Of course, changes in some paragraphs may be missed if the author did not annotate them in the margin.

Table of Contents

Russian book page | Version
------------------+---------
               19 | v.17
               35 | v.16
               36 | v.15
               38 | v.17
               39 | v.17
               58 | v.15
               81 | v.15
               85 | v.17
              116 | v.16
              122 | v.16
              131 | v.17
              147 | v.17
              156 | v.16
              159 | v.15
              173 | v.17
              180 | v.16
              187 | v.17
              191 | v.16
              192 | v.17
              193 | v.16
              196 | v.16
              200 | v.16
              202 | v.15
              207 | v.15
              211 | v.17
              221 | v.17
              223 | v.16
              224 | v.17
              234 | v.15
              236 | v.15
              238 | v.15
              285 | v.16
              301 | v.17
              301 | v.17
              317 | v.17
              329 | v.16
              347 | v.17
              358 | v.15
              359 | v.15
              376 | v.16
              377 | v.17
              396 | v.17
              415 | v.15
              467 | v.16
              492 | v.15
              643 | v.17

Page 19 – v.17

The text of the book and all examples are up to date for PostgreSQL 17. Some paragraphs have a version number marked in the margins. This means that the statement is valid for versions of PostgreSQL starting from the one indicated, while earlier versions either did not have the described feature at all or were designed differently. Such marks may be useful for those who have not yet upgraded their system to the latest release.

Page 35 – v.16

Sometimes it may be useful to change the strategy for certain columns. If it is known in advance that the data in a column is not compressible (for example, JPEG images are stored in the column), you can set the external strategy for it — this will save on useless compression attempts. You can change the strategy both when creating a table (using the STORAGE clause of the CREATE TABLE command) and subsequently:

ALTER TABLE t
    ALTER COLUMN d SET STORAGE external;
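
[In v.16, the strategy can also be set directly in CREATE TABLE. A minimal sketch, assuming a table t whose column d stores already-compressed data:]

CREATE TABLE t (
    id integer,
    d  bytea STORAGE external  -- no compression attempts; stored out of line when large
);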

Page 36 – v.15

Two algorithms are available for compression: the traditional PGLZ and the more modern LZ4 (both are variants of the Lempel-Ziv dictionary compression algorithm). By default, compression is performed by the algorithm specified in the default_toast_compression parameter (default = pglz); the value can be set for individual columns in the COMPRESSION clause. Benchmarks show that LZ4 consumes fewer CPU resources at a compression level similar to PGLZ.
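
[A hedged sketch of both ways to select the algorithm; table and column names are illustrative:]

-- at table creation time
CREATE TABLE t2 (d text COMPRESSION lz4);

-- for an existing column (applies only to newly written values)
ALTER TABLE t2
    ALTER COLUMN d SET COMPRESSION lz4;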

Page 38 – v.17

This sequence cannot be compressed. The chunk_id identifier is written to the table row:

SELECT pg_column_toast_chunk_id(c) -- starting from v.17
FROM t;
 pg_column_toast_chunk_id
--------------------------
                    16390
(1 row)

Page 39 – v.17

The costs of decompressing values processed by TOAST can be estimated using the EXPLAIN command with the analyze and serialize parameters.
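
[For example (v.17+; the table name is illustrative) — the serialize parameter makes the executor actually detoast and convert the result rows, so the reported time includes decompression:]

EXPLAIN (analyze, serialize)
SELECT * FROM t;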

Page 58 – v.15

[MERGE is introduced in PostgreSQL 15]:

PostgreSQL has enough capabilities to solve complex tasks with a single SQL statement. In particular, it offers common table expressions (CTE) that can contain operators like INSERT, UPDATE, DELETE, and MERGE, as well as the INSERT ON CONFLICT operator that implements the following logic: insert the row if it does not exist, otherwise perform an update.
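
[For reference, a minimal MERGE sketch (v.15+), assuming hypothetical tables target(id, val) and source(id, val):]

MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET val = s.val
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (s.id, s.val);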

Page 81 – v.15

[MERGE is introduced in PostgreSQL 15]:

With a certain degree of abstraction, the UPDATE command can be regarded as two separate operations: DELETE and INSERT. First, the xmax value of the current row version is set to the transaction ID of the UPDATE command. Then a new version of this row is created; its xmin value will be the same as the xmax value of the previous version. In exactly the same way, the MERGE command "breaks down" into elementary inserts and deletes.
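
[The xmin and xmax system columns can be observed directly; a minimal sketch, assuming any table t:]

-- after an UPDATE, the xmax of the old version equals the xmin of the new one
SELECT xmin, xmax, ctid, * FROM t;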

Page 85 – v.17

The clog is split into several files solely for convenience. Work with these files is conducted page by page, through buffers in the server's shared memory. The size of the clog cache is configured by the transaction_buffers parameter, and by default, it is set to shared_buffers / 512.

Page 116 – v.16

[Only the sentence regarding BRIN indexes refers to v.16, the rest is for the context]

The optimization called HOT-update (Heap-Only Tuple update) avoids the appearance of unnecessary index entries. Such an update is possible in three cases:

  • No indexes are created on the table (in this case, any update will be a HOT-update);
  • There are indexes on the table, but the columns being updated are not used in any of them;
  • The columns being updated are used in an index, but their values have not changed.

"Used in an index" here refers to the presence of an index (full or partial) on a column or an expression containing a column, as well as participation as a non-key column in an INCLUDE index — essentially any possible reference to a column in the CREATE INDEX command. The only exceptions are BRIN indexes: they do not contain references to specific table rows and therefore do not interfere with HOT updates in any way.

In a HOT-update scenario, the index page holds just one entry, pointing to the initial version of the table row. Later versions of that same row, residing on the table page, are connected sequentially through pointers within their version headers.

Versions that are not referenced from the index are marked with the HEAP_ONLY_TUPLE flag ("table-only row version"). If a version is part of a chain, it is marked with the HEAP_HOT_UPDATED flag.

If, during an index scan, the server lands on a table page and discovers a version with the HEAP_HOT_UPDATED flag, it understands that it should not stop and continues further through the entire update chain. Of course, visibility is checked for all row versions read this way before the client receives the result.

Page 122 – v.16

Breaks in HOT chains are tracked separately in the statistics:

SELECT n_tup_upd, n_tup_hot_upd, n_tup_newpage_upd
FROM pg_stat_all_tables
WHERE relid = 'hot'::regclass;
 n_tup_upd | n_tup_hot_upd | n_tup_newpage_upd
-----------+---------------+-------------------
        15 |            10 |                 1
(1 row)

Here, out of fifteen UPDATE operations, eleven were candidates for a HOT update: ten had enough space on the page (n_tup_hot_upd), and one did not (n_tup_newpage_upd), which led to a break in the chain.

With frequent updates to columns that are not included in indexes, it may make sense to decrease the fillfactor parameter to reserve some space on the page for updates. Of course, it must be taken into account that the lower the fillfactor, the more empty space remains on the page and, consequently, the physical size of the table increases.
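
[For example (the value is illustrative; the new setting applies to newly filled pages):]

ALTER TABLE hot SET (fillfactor = 75);  -- reserve about a quarter of each page for updates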

Page 131 – v.17

[Starting from v.17, the radix tree is introduced to store TIDs]

6.3 Vacuum Phases → Heap Scan

In the first phase, a heap scan is performed. The scanning process takes the visibility map into account: all pages tracked in this map are skipped because they are sure to contain no outdated tuples. If a tuple is beyond the horizon and is not required anymore, its TID is added to a special memory storage. These row versions cannot be deleted immediately because indexes may still reference them.

The storage is a prefix tree (radix tree) [backend/access/common/tidstore.c]. The keys are page numbers, and the nodes store bitmaps of row versions. The size of this structure is limited by the value of the maintenance_work_mem parameter (default = 64MB); memory is allocated for it as needed.

Compared to the regular array used prior to PostgreSQL 17, the prefix tree stores identifiers an order of magnitude more compactly, and its total size can exceed 1 GB — the maximum size of a memory fragment allocated at one time.

Page 147 – v.17

[In explaining the columns of pg_stat_progress_vacuum view: In v.17, some columns are removed (max_dead_tuples and num_dead_tuples), while some are added (max_dead_tuple_bytes, dead_tuple_bytes, num_dead_item_ids, indexes_total, and indexes_processed). The text explains some of these]

SELECT *
FROM pg_stat_progress_vacuum
\gx

-[ RECORD 1 ]--------+--------------
pid                  | 14639
datid                | 16391
datname              | internals
relid                | 16482
phase                | scanning heap
heap_blks_total      | 172414
heap_blks_scanned    | 90517
heap_blks_vacuumed   | 0
index_vacuum_count   | 0
max_dead_tuple_bytes | 1048576
dead_tuple_bytes     | 0
num_dead_item_ids    | 0
indexes_total        | 0
indexes_processed    | 0



SELECT *
FROM pg_stat_progress_vacuum
\gx

-[ RECORD 1 ]--------+------------------
pid                  | 14639
datid                | 16391
datname              | internals
relid                | 16482
phase                | vacuuming indexes
heap_blks_total      | 172414
heap_blks_scanned    | 32733
heap_blks_vacuumed   | 0
index_vacuum_count   | 0
max_dead_tuple_bytes | 1048576
dead_tuple_bytes     | 1049600
num_dead_item_ids    | 1898514
indexes_total        | 1
indexes_processed    | 0

  • dead_tuple_bytes and num_dead_item_ids — the volume and number of dead row versions collected during this pass.

The storage efficiency can be estimated by the ratio of dead_tuple_bytes to num_dead_item_ids: in this example, approximately half a byte is spent per single TID.

Page 156 – v.16

[From the English book] The previous UPDATE command has cleared the visibility bit of page zero, so the tuple with a suitable xmin age on this page will get frozen. But the first page will be skipped altogether:

=> VACUUM tfreeze;
=> SELECT * FROM heap_page('tfreeze',0,1);
 ctid  |     state     | xmin  | xmin_age | xmax
-------+---------------+-------+----------+------
 (0,1) | redirect to 3 |       |          |
 (0,2) | normal        | 856 f |        2 | 0 a
 (0,3) | normal        | 857 c |        1 | 0 a
 (1,1) | normal        | 856 c |        2 | 0 a
 (1,2) | normal        | 856 c |        2 | 0 a
(5 rows)

[Now note the change in the new version: in the xmin column, row (0,3) is marked f rather than c]

=> VACUUM tfreeze;
VACUUM
=> SELECT * FROM heap_page('tfreeze',0,1);
 ctid  |     state     | xmin  | xmin_age | xmax
-------+---------------+-------+----------+------
 (0,1) | redirect to 3 |       |          |
 (0,2) | normal        | 869 f |        2 | 0 a
 (0,3) | normal        | 870 f |        1 | 0 a
 (1,1) | normal        | 869 c |        2 | 0 a
 (1,2) | normal        | 869 c |        2 | 0 a
(5 rows)

Please note that on page zero, both row versions were frozen, even though the age of the xmin transaction in the second version does not exceed the value of the vacuum_freeze_min_age parameter (default = 50,000,000). Since the page has to be modified anyway, all row versions beyond the database horizon are frozen on it — this practically does not increase the cost of the operation, but it can save effort in the future.

Page 159 – v.15

Aggressive freezing of all pages in a table is performed no more frequently than once every vacuum_freeze_table_age - vacuum_freeze_min_age transactions (100 million transactions with default parameter values). It will be triggered less often if vacuuming in non-aggressive mode can also advance the relfrozenxid value. In any case, it must be kept in mind that too large a value for the vacuum_freeze_min_age parameter, instead of saving resources, can lead to excessive triggering of aggressive freezing and an increase in overhead costs.

Page 173 – v.17

[transaction_timeout is new in v.17]

Situations can arise where long transactions are caused not by necessity, but by errors in the application or driver. If the problem cannot be solved in a civilized way, the administrator has a couple of parameters at their disposal: idle_in_transaction_session_timeout (default = 0 ms), which determines the maximum lifetime of an idle transaction, and transaction_timeout (default = 0 ms), which limits the total duration of a transaction. After the specified time has elapsed (if it is not equal to zero), the transaction is terminated.
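
[A minimal sketch of enabling both limits; the values are illustrative:]

ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
ALTER SYSTEM SET transaction_timeout = '1h';  -- v.17+
SELECT pg_reload_conf();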

Page 180 – v.16

[Previously] The PostgreSQL community has already started this major effort, but it will take a long time for the actual results to appear.

This extensive work is already underway in the community [1], and the debug_io_direct parameter has been added for debugging, but practical results will not appear anytime soon.

[1] https://www.postgresql.org/message-id/flat/20210223100344.llw5an2aklengrmn%40alap3.anarazel.de

Page 187 – v.17

[Previously] Let’s restart the instance to clear its buffer cache:

postgres$ pg_ctl restart -l /home/postgres/logfile

Let's clear the buffer of the page (alternatively, the entire buffer cache could be cleared by restarting the instance):

=> SELECT pg_buffercache_evict(281);
 pg_buffercache_evict 
----------------------
 t
(1 row)

Page 191 – v.16

[Before] Vacuuming strategy is used by the process of vacuuming when it performs a full table scan without taking the visibility map into account. The ring buffer is assigned 256 kB of RAM (32 standard pages).

Vacuuming strategy is applied by the vacuuming and analysis processes during a full table scan without considering the visibility map. The size of the ring is determined by the vacuum_buffer_usage_limit parameter (default = 2MB), but it can be overridden in the VACUUM or ANALYZE command.
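
[For example (v.16+; the table name and size are illustrative):]

-- use a larger ring for this run only
VACUUM (BUFFER_USAGE_LIMIT '16MB') t;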

Page 192 – v.17

[Before] Restart the server to clear the cache, as now it contains some heap pages that have been read during analysis.

postgres$ pg_ctl restart -l /home/postgres/logfile

Once the server is restarted, let’s read the whole table:

=> EXPLAIN (analyze, costs off, timing off, summary off)
SELECT id FROM big;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Seq Scan on big (actual rows=4097 loops=1)
(1 row)

Let's clear the cache of the table pages read during the analysis, after which we will read the entire table:

=> SELECT pg_buffercache_evict(bufferid)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('big'::regclass);

=> EXPLAIN (analyze, costs off, timing off, summary off)
SELECT * FROM big;

                 QUERY PLAN
---------------------------------------------
 Seq Scan on big (actual rows=4097 loops=1)
(1 row)

Page 193 – v.16

Detailed information regarding I/O passing through the buffer cache is accumulated in the pg_stat_io view. For example, you can examine I/O volume broken down by buffer ring usage strategies (the normal context corresponds to regular operations without a ring):

=> SELECT context, sum(reads) reads, sum(writes) writes,
   sum(hits) hits, sum(evictions) evictions, sum(reuses) reuses
FROM pg_stat_io
WHERE object = 'relation'
GROUP BY context;

  context  | reads  | writes |   hits   | evictions | reuses
-----------+--------+--------+----------+-----------+--------
 vacuum    | 658529 | 544152 |   168505 |      2649 | 653199
 normal    | 103844 | 373088 | 71936923 |    277869 |
 bulkwrite |      0 |      0 |        8 |         2 |      0
 bulkread  | 256194 | 108324 |    94895 |    152228 | 105566
(4 rows)

In addition to the number of pages read and written (reads and writes), this shows the number of cache hits (hits) and cache evictions: "normal" evictions are accounted for in the evictions field, while evictions of pages from a buffer ring are in the reuses field.

Page 196 – v.16

[Context] Do not run such queries non-stop because the pg_buffercache extension locks the viewed buffers, even if only briefly.

Therefore, for monitoring purposes, it is better to use other functions of the extension — pg_buffercache_summary() and pg_buffercache_usage_counts() — which require fewer locks and therefore allow quickly assessing the general (though perhaps slightly inconsistent) picture of what is happening.
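
[For example — both functions aggregate over the whole cache instead of returning one row per buffer:]

=> SELECT * FROM pg_buffercache_summary();
=> SELECT * FROM pg_buffercache_usage_counts();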

Page 200 – v.16

The pg_stat_io view allows for obtaining data on I/O associated with temporary objects in local process caches:

=> SELECT context, sum(reads) reads, sum(writes) writes,
   sum(hits) hits, sum(evictions) evictions, sum(reuses) reuses
FROM pg_stat_io
WHERE object = 'temp relation'
GROUP BY context;

context | reads | writes | hits | evictions | reuses
--------+-------+--------+------+-----------+--------
normal  |     0 |      0 |    1 |         0 |
(1 row)

Page 202 – v.15

[Only the 2nd bullet point, unlogged sequences, is new in v.15]

The following actions are not logged:

  • operations on UNLOGGED tables;
  • operations on unlogged sequences, typically associated with unlogged tables (see the example below);
  • operations on temporary tables, since their lifetime is limited by the session that creates them anyway.
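
[A minimal sketch of the second item (v.15+; the sequence name is illustrative):]

-- not WAL-logged; like unlogged tables, reset after a crash
CREATE UNLOGGED SEQUENCE s;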

Page 207 – v.15

[Previously, the book examined the log with the pg_waldump utility; now it also uses an extension]

You can look into the headers of the created log entries either with the pg_waldump utility or by using the pg_walinspect extension:

=> CREATE EXTENSION pg_walinspect;

=> SELECT start_lsn, resource_manager, xid, record_type,
   left(description,44) description, block_ref
FROM pg_get_wal_records_info('2/7CEE6290', '2/7CEE6300') \gx

-[ RECORD 1 ]----+-------------------------------------------
start_lsn        | 2/7CEE6290
resource_manager | Heap
xid              | 900
record_type      | HOT_UPDATE
description      | old_xmax: 900, old_off: 1, old_infobits: [],
block_ref        | blkref #0: rel 1663/16391/16572 fork main blk 0
-[ RECORD 2 ]----+-------------------------------------------
start_lsn        | 2/7CEE62D8
resource_manager | Transaction
xid              | 900
record_type      | COMMIT
description      | 2025-01-12 14:45:30.734485+03
block_ref        |

Here we see the headers of two records.

Page 211 – v.17

[Previously, the book examined the log with the pg_waldump utility; now it also uses an extension]

[Also, in v.17, the function pg_get_wal_records_info_till_end_of_wal() was removed; pg_get_wal_records_info() is used instead]

Let's see how a checkpoint is reflected in the log. The CHECKPOINT_REDO record marks the start of a checkpoint, and CHECKPOINT_ONLINE marks its completion:

=> SELECT start_lsn,
   resource_manager,
   record_type,
   left(description,46)||'...' description
FROM pg_get_wal_records_info('2/7D7796D8', 'FFFFFFFF/FFFFFFFF')
   -- pg_get_wal_records_info_till_end_of_wal up to v.16
WHERE record_type LIKE 'CHECKPOINT%'
\gx

-[ RECORD 1 ]----+----------------------------------------------
start_lsn        | 2/7D7796D8
resource_manager | XLOG
record_type      | CHECKPOINT_REDO
description      | wal_level replica...
-[ RECORD 2 ]----+----------------------------------------------
start_lsn        | 2/7D779730
resource_manager | XLOG
record_type      | CHECKPOINT_ONLINE
description      | redo 2/7D7796D8; tli 1; prev tli 1; fpw true; ...

Between these two positions in the log, other records generated by the continuing work of the server will be located.

Page 221 – v.17

But the most useful information for tuning purposes is the operation statistics of the checkpoint process in the pg_stat_checkpointer view.

Prior to version 17, the checkpointer and the background writer shared a single view, pg_stat_bgwriter, because both tasks were originally performed by the background writer process; in version 9.2, checkpointing was moved into a separate process, but the view remained shared for a long time.

=> SELECT * FROM pg_stat_checkpointer \gx

-[ RECORD 1 ]-------+------------------------------
num_timed           | 0
num_requested       | 26
restartpoints_timed | 0
restartpoints_req   | 0
restartpoints_done  | 0
write_time          | 313461
sync_time           | 21508
buffers_written     | 101888
stats_reset         | 2025-01-12 14:38:15.939618+03

Among other things, the view shows the number of checkpoints performed:

  • num_timed — scheduled (upon reaching the checkpoint_timeout interval);
  • num_requested — on demand (including upon reaching the max_wal_size log volume).

A high value in the num_requested column (compared to num_timed) indicates that, in reality, checkpoints are occurring more frequently than intended.

Fields starting with restartpoints are only relevant for replicas.

The pg_stat_bgwriter view shows background writer statistics:

=> SELECT * FROM pg_stat_bgwriter \gx
-[ RECORD 1 ]-----+------------------------------
buffers_clean     | 85694
maxwritten_clean  | 816
buffers_alloc     | 493103
stats_reset       | 2025-01-12 14:38:15.939618+03

For tuning purposes, the maxwritten_clean value is useful — this number shows how many times the background writer process stopped a scan cycle because it exceeded bgwriter_lru_maxpages (default = 100).

In a well-tuned system, backend processes perform almost no writes; the bulk of writing is handled by the checkpoint process (buffers_written) and the background writer (buffers_clean).

Page 223 – v.16

The most detailed information about I/O, broken down by process types (including background processes), can be obtained from the pg_stat_io view:

=> SELECT backend_type,
   sum(reads) reads, sum(writes) writes, sum(fsyncs) fsyncs
FROM pg_stat_io
GROUP BY backend_type
ORDER BY backend_type;

     backend_type     | reads  | writes | fsyncs
----------------------+--------+--------+--------
 autovacuum launcher  |      5 |      0 |      0
 autovacuum worker    | 395549 | 292555 |      0
 background worker    |   4243 |      0 |      0
 background writer    |        |  85694 |      0
 checkpointer         |        | 101886 |    602
 client backend       | 631385 | 548682 |      0
 slotsync worker      |      0 |      0 |      0
 standalone backend   |    546 |    979 |      0
 startup              |      0 |      0 |      0
 wal sender           |      0 |      0 |      0
(10 rows)

Page 224 – v.17

There are a couple of situations where log files need to be read as well as written. The first is the obvious case of disaster recovery. The second arises when using streaming replication. The walsender process attempts to read WAL records from the WAL buffer, but if the replica fails to receive the log records while the required pages are still in the primary server's memory, the data will be read from disk. However, even in this case, access will be sequential, not random.

[Comparison in the book: before v.17 vs. v.17+]

Page 234 – v.15

Full page images in our case constitute a significant portion of the generated volume, which can be verified by looking at statistics with the pg_waldump utility using the --stats flag or the pg_get_wal_stats() function of the pg_walinspect extension.
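
[For example, with the extension (the LSN boundaries are illustrative):]

=> SELECT * FROM pg_get_wal_stats('2/84E33C08', '2/857D7498');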

Page 236 – v.15

In addition to the traditional PGLZ compression algorithm, two others are available: LZ4 and Zstandard (zstd). The LZ4 algorithm usually requires fewer CPU resources while showing approximately the same compression ratio. The Zstandard algorithm combines dictionary compression with entropy coding; it puts a heavier load on the processor but compresses data better.

=> ALTER SYSTEM SET wal_compression = zstd;
=> SELECT pg_reload_conf();
=> CHECKPOINT;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 2/84E33C08
(1 row)

postgres$ /usr/local/pgsql/bin/pgbench -t 20000 internals
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn 
---------------------------
 2/857D7498
(1 row)

Size of log entries with Zstandard compression:

=> SELECT pg_size_pretty('2/857D7498'::pg_lsn - '2/84E33C08'::pg_lsn);
 pg_size_pretty 
----------------
 9870 kB
(1 row)

Conclusion: when there is a large number of full-page images (due to checksums or full_page_writes, which is to say almost always), it makes sense to use compression. The choice of algorithm allows finding a compromise between additional CPU load and the reduction in the volume of log records.

Page 238 – v.15

With the help of the already familiar pg_walinspect extension, let's look at the log contents.

=> SELECT start_lsn,
   resource_manager AS rmgr,
   record_type,
   (regexp_match(block_ref, '[0-9]+\/[0-9]+\/[0-9]+'))[1] AS rel
FROM -- pg_get_wal_records_info_till_end_of_wal up to v.16
        pg_get_wal_records_info('2/857D9460','FFFFFFFF/FFFFFFFF');

 start_lsn  |    rmgr     | record_type |       rel
------------+-------------+-------------+-----------------
 2/857D9460 | Storage     | CREATE      |
 2/857D9490 | Heap        | UPDATE      | 1663/16391/1259
 2/857D9510 | Btree       | INSERT_LEAF | 1663/16391/2662
 2/857D9550 | Btree       | INSERT_LEAF | 1663/16391/2663
 2/857D9590 | Btree       | INSERT_LEAF | 1663/16391/3455
 2/857D95D0 | Transaction | COMMIT      |
(6 rows)

The first record (CREATE) indicates the creation of a new file for the relation (since the table is effectively overwritten during truncation).

Next are several records related to the system catalog. They reflect the modification of the pg_class table (UPDATE) and the addition of records to three indexes built on this table (INSERT_LEAF). This can be verified by comparing the file path shown in the rel column with the name of the relation.

Finally, there is a record for the transaction commit. The insertion of data into the table is not logged.

Page 285 – v.16

[Context: Section 14.2, Relation Extension Locks]

Previously: To minimize this risk, heap files are extended by several pages at once (in proportion to the number of processes awaiting the lock, but by not more than 512 pages per operation). An exception to this rule is B-tree index files, which are extended by one page at a time.

For efficiency, table files are expanded not by a single page, but by several at once (proportional to the number of processes waiting for a lock, but no more than 64 pages at a time)[*]. However, the expansion of B-tree-based index files occurs only one page at a time[**].

[*] backend/access/heap/hio.c, function RelationAddBlocks.

[**] backend/access/nbtree/nbtpage.c, function _bt_allocbuf.

Page 301 – v.17

[Context: Section 15.4, Monitoring Waits]

But processes can be waiting for other kinds of events too: IO, Client, IPC, and Extension. v.17 adds one more:

InjectionPoint — execution of code at an injection point (injection points are used for testing PostgreSQL and do not occur during normal operation).

Page 301 – v.17

[Context] There are situations when a process simply does not perform any useful work. The following waits belong to this category:

  • Activity — background processes in their main cycle;
  • Timeout — a timer.

As a rule, such waits are "normal" and do not indicate any problems. However, for example, a SpinDelay event of the Timeout type indicates a wait for a spin-lock.
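
[Current waits can be sampled from the pg_stat_activity view, for example:]

=> SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;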

Page 317 – v.17

[Context: Section 16.2, Simple Query Protocol] Planning the join order: consider a query that does not specify any explicit joins for the tables listed in the FROM clause:

SELECT ...
FROM a, b, c, d, e
WHERE ...

Here the planner will have to consider all the possible pairs of joins.

The book shows a graph of how the amount of memory required for planning such a query depends on the number of tables in it (data obtained with the EXPLAIN command and its MEMORY parameter).

Planning time also approximately doubles with each added table.
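
[A sketch of obtaining such data (v.17+; tables a through e are hypothetical); the memory parameter adds planner memory usage to the output:]

=> EXPLAIN (memory, costs off)
SELECT * FROM a, b, c, d, e;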

Page 329 – v.16

[Context: Section 16.3, Extended Query Protocol, subsection "Parameter Binding"] Before a prepared statement gets executed, the actual parameter values have to be bound.

=> EXECUTE plane('733');

Perhaps the binding stage is illustrated more clearly when using the extended protocol directly in psql:

=> \bind '733'
=> SELECT * FROM aircrafts WHERE aircraft_code = $1;
 aircraft_code |      model      | range 
---------------+-----------------+-------
 733           | Boeing 737-300  |  4200
(1 row)

Page 347 – v.17

[Context: Section 17.6, Statistics for Non-Scalar Data Types] Previously:

  • For range types, PostgreSQL builds distribution histograms for range length and for the lower and upper boundaries of the range. These histograms are used for estimating the selectivity of various operations on these types, but the pg_stats view does not display them.
  • For non-empty ranges, histograms are collected for the distribution of lower and upper bounds (the range_bounds_histogram array) and length (the range_length_histogram array). The fraction of empty ranges is noted separately (the range_empty_frac column). This data is used for selectivity estimation of various operations with range types. Similar statistics are also used for multirange types.

Page 358 – v.15

[Context: Section 18.1, Pluggable Storage Engines]

You can specify the engine to use when creating a table (CREATE TABLE ... USING); otherwise, the default engine listed in the default_table_access_method (default = heap) parameter will be applied.

Subsequently, the engine can be changed (ALTER TABLE ... SET ACCESS METHOD), but, of course, at the cost of rewriting all the data.
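
[A minimal sketch of both commands (heap is the default engine; the table definition is illustrative):]

=> CREATE TABLE t (id integer) USING heap;
=> ALTER TABLE t SET ACCESS METHOD heap;  -- v.15+; rewrites the table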

Page 359 – v.15

[Context] The following core components can be used by all table access methods:

  • transaction manager, including ACID and snapshot isolation support
  • buffer manager
  • I/O subsystem
  • TOAST
  • optimizer and executor
  • index support

[Added in v.15]:

  • the WAL, including support for specific operations using the Generic WAL Records mechanism [1] or Custom WAL Resource Managers [2];

[1] https://www.postgresql.org/docs/17/generic-wal.html

[2] https://www.postgresql.org/docs/17/custom-rmgr.html

Page 376 – v.16

[Context: Section 18.5, subsection "Non-Parallelizable Queries"] Previously: To check whether a query can be parallelized at all, you can temporarily switch on the force_parallel_mode parameter.

This parameter was deprecated in v.15 and renamed to debug_parallel_query in v.16. The text replaces the previous paragraph with the following:

To check whether a query can be parallelized in principle, you can temporarily enable the debug_parallel_query parameter. In this case, the planner will build parallel plans in all instances where it is possible:

=> EXPLAIN SELECT * FROM flights;
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=63)
(1 row)

=> SET debug_parallel_query = on;
=> EXPLAIN SELECT * FROM flights;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=1000.00..27259.37 rows=214867 width=63)
   Workers Planned: 1
   Single Copy: true
   ->  Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=63)
(4 rows)

Now we understand that the planner considered a parallel plan but chose the more efficient sequential scan in this particular case.

Page 377 – v.17

[Context: Section 18.5, subsection "Parallel Restricted Queries"]

Note, however, that a CTE itself can be computed in parallel if that turns out to be less expensive.

Correlated subqueries. Uncorrelated subqueries, evaluated once and represented in the plan by the InitPlan node, can participate in a parallel plan:

=> EXPLAIN (costs off)
SELECT * FROM flights f1
WHERE f1.aircraft_code = (
   SELECT '733' -- InitPlan 1
)
UNION ALL
SELECT * FROM flights f2
WHERE f2.aircraft_code = (
   SELECT '320' -- InitPlan 2
);

                          QUERY PLAN
---------------------------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Append
         ->  Parallel Seq Scan on flights f2
               Filter: ((aircraft_code)::text = (InitPlan 2).col1)
               InitPlan 2
                 ->  Result
         ->  Parallel Seq Scan on flights f1
               Filter: ((aircraft_code)::text = (InitPlan 1).col1)
               InitPlan 1
                 ->  Result
(11 rows)

The Parallel Append node here implements the parallel execution of the UNION ALL command. Worker processes distribute the subqueries among themselves (if there are more processes than subqueries, several can work on a single task). Once all data from a subquery has been received, the freed-up process connects to another subquery until all data has been merged.

Page 396 – v.17

[Context: Section 19.3, Indexing Engine Interface; under the subsection "Access Method Properties"]

There are actually many more properties, but many are not accessible at the SQL level.[1] Here are some of them:

  • CAN PRED LOCKS: Support for predicate locks. Missing in spgist and brin access methods; when scanning indexes of these types at the Serializable isolation level, predicate locks will be set on the entire table, as with a sequential scan, which will lead to a decrease in performance.

  • CAN BUILD PARALLEL: The ability to build an index with multiple processes in parallel. Supported by btree and brin access methods.

[1] include/access/amapi.h, struct IndexAmRoutine.

Page 415 – v.15

[Context: Section 20.3, Bitmap Scans]

The text is explaining "Bitmap Heap Scan," and how it uses prefetching to speed up the scan. The following explanation is added at the end:

Asynchronous prefetching is also used by some other internal processes:

  • for index pages when heap rows are being deleted
  • for heap pages during analysis (ANALYZE)

The prefetch depth is defined by the maintenance_io_concurrency parameter.

In addition to the two bullet points above, v.15 adds the following:

  • When restoring, WAL pages are prefetched.

Page 467 – v.16

[Context: Section 22.1, Hash Joins]

Previously: Parallel hash joins are currently not supported for right and full joins.

As of v.16:

Parallel hash join is supported for all types of joins: not only for inner and left outer joins, but also for right and full outer joins.

Page 492 – v.15

[Context: Section 23.3, Distinct Values and Grouping]

The Unique node can participate in a parallel plan, removing duplicates from the list received from worker processes:

=> EXPLAIN (costs off) SELECT DISTINCT fare_conditions
FROM ticket_flights;

              QUERY PLAN
------------------------------------------------------
 Unique
   -> Gather Merge
        Workers Planned: 2
        -> Sort
             Sort Key: fare_conditions
             -> HashAggregate
                  Group Key: fare_conditions
                  -> Parallel Seq Scan on ticket_flights
(8 rows)

In this example, worker processes extract unique values from their portions of the data using hashing in the HashAggregate node; each worker's list is sorted in the Sort node, the sorted streams are combined by the Gather Merge node, and the Unique node performs the final removal of duplicates.

Page 643 – v.17

[Related; see Page 396 – v.17]

Along with B-trees, BRIN indexes for large tables can be created in parallel. The table is scanned by several processes that calculate summary information for their respective portions of the data. This information is sorted by zones (memory of size maintenance_work_mem is allocated), then the lead process aggregates it and forms the index.[1]

[1] backend/access/brin/brin.c, function brinbuild.
