PostgreSQL Internals, by Egor Rogov, is an excellent book for anyone who wants to understand how PostgreSQL works internally.
The book was originally written in Russian; an older edition, covering PostgreSQL 14, was very well translated into English by Liudmila Mantrova. The Russian version has since been updated to cover newer releases of PostgreSQL, currently v.17.
Both the Russian and English versions are freely available online.
Since the book annotates version-specific changes in the margin with marks like "v. 17", I searched the Russian book for all versions newer than 14 (i.e., v.15, v.16, v.17) and translated the related text using Gemini. I have also occasionally added context notes to help the reader understand the changes.
This should give enthusiasts a quick overview of the internal changes since v.14 that are mentioned in the book. Of course, changes in some paragraphs may have been missed if the author did not annotate them in the margin.
| Russian Book page | Version |
|---|---|
| 19 | v.17 |
| 35 | v.16 |
| 36 | v.15 |
| 38 | v.17 |
| 39 | v.17 |
| 58 | v.15 |
| 81 | v.15 |
| 85 | v.17 |
| 116 | v.16 |
| 122 | v.16 |
| 131 | v.17 |
| 147 | v.17 |
| 156 | v.16 |
| 159 | v.15 |
| 173 | v.17 |
| 180 | v.16 |
| 187 | v.17 |
| 191 | v.16 |
| 192 | v.17 |
| 193 | v.16 |
| 196 | v.16 |
| 200 | v.16 |
| 202 | v.15 |
| 207 | v.15 |
| 211 | v.17 |
| 221 | v.17 |
| 223 | v.16 |
| 224 | v.17 |
| 234 | v.15 |
| 236 | v.15 |
| 238 | v.15 |
| 285 | v.16 |
| 301 | v.17 |
| 301 | v.17 |
| 317 | v.17 |
| 329 | v.16 |
| 347 | v.17 |
| 358 | v.15 |
| 359 | v.15 |
| 376 | v.16 |
| 377 | v.17 |
| 396 | v.17 |
| 415 | v.15 |
| 467 | v.16 |
| 492 | v.15 |
| 643 | v.17 |
The text of the book and all examples are up to date for PostgreSQL 17. Some paragraphs have a version number marked in the margins. This means that the statement is valid for versions of PostgreSQL starting from the one indicated, while earlier versions either did not have the described feature at all or were designed differently. Such marks may be useful for those who have not yet upgraded their system to the latest release.
Sometimes it may be useful to change the strategy for certain columns. If it is known in advance that the data in a
column is not compressible (for example, JPEG images are stored in the column), you can set the external strategy for
it — this will save on useless compression attempts. You can change the strategy both when creating a table (using the
STORAGE clause of the CREATE TABLE command) and subsequently:
```sql
ALTER TABLE t
ALTER COLUMN d SET STORAGE external;
```

Two algorithms are available for compression: the traditional PGLZ and the more modern LZ4 (both are variants of the Lempel-Ziv dictionary compression algorithm). By default, compression is performed by the algorithm specified in the `default_toast_compression` parameter (default = `pglz`); the value can be set for individual columns in the `COMPRESSION` clause. Benchmarks show that LZ4 consumes fewer CPU resources at a compression level similar to PGLZ.
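As a hedged sketch of both knobs (the table and column names are hypothetical, and the `STORAGE` clause in `CREATE TABLE` requires PostgreSQL 16 or later):

```sql
-- Hypothetical tables; adjust names to your schema.
CREATE TABLE images (
    id   integer,
    data bytea STORAGE EXTERNAL   -- JPEGs are already compressed: skip compression attempts
);

CREATE TABLE docs (
    id   integer,
    body text COMPRESSION lz4     -- use LZ4 instead of the default_toast_compression value
);

-- Both settings can also be changed later:
ALTER TABLE docs ALTER COLUMN body SET COMPRESSION lz4;
```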
This sequence cannot be compressed. The `chunk_id` identifier is written to the table row:

```sql
SELECT pg_column_toast_chunk_id(c) -- starting from v.17
FROM t;
 pg_column_toast_chunk_id
−−−−−−−−−−−−−−−−−−−−−−−−−−
                    16390
(1 row)
```
The costs of decompressing values processed by TOAST can be estimated using the EXPLAIN command with the analyze
and serialize parameters.
[`MERGE` is introduced in PostgreSQL 15]:
PostgreSQL has enough capabilities to solve complex tasks with a single SQL statement. In particular, it offers common
table expressions (CTE) that can contain operators like INSERT, UPDATE, DELETE, and MERGE, as well as the
INSERT ON CONFLICT operator that implements the following logic: insert the row if it does not exist, otherwise
perform an update.
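As a hedged illustration of the `MERGE` logic mentioned above (PostgreSQL 15+), with hypothetical table names:

```sql
-- Upsert-like logic expressed with MERGE; accounts and new_balances are hypothetical.
MERGE INTO accounts a
USING new_balances n ON a.id = n.id
WHEN MATCHED THEN
    UPDATE SET balance = n.balance
WHEN NOT MATCHED THEN
    INSERT (id, balance) VALUES (n.id, n.balance);
```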
[`MERGE` is introduced in PostgreSQL 15]:
With a certain degree of abstraction, the UPDATE command can be regarded as two separate operations: DELETE and
INSERT. First, the xmax value of the current row version is set to the transaction ID of the UPDATE command.
Then a new version of this row is created; its xmin value will be the same as the xmax value of the previous
version. In exactly the same way, the MERGE command "breaks down" into elementary inserts and deletes.
The clog is split into several files solely for convenience. Work with these files is conducted page by page, through
buffers in the server's shared memory. The size of the clog cache is configured by the transaction_buffers
parameter, and by default, it is set to shared_buffers / 512.
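For example (v.17+; a sketch, and note that changing the parameter requires a server restart):

```sql
SHOW transaction_buffers;   -- 0 means the size is derived from shared_buffers

-- Changing it requires a restart to take effect:
ALTER SYSTEM SET transaction_buffers = '16MB';
```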
[Only the sentence regarding `BRIN` indexes refers to v.16; the rest is for context]
The optimization called HOT-update (Heap-Only Tuple update) avoids the appearance of unnecessary index entries. Such an update is possible in three cases:
- No indexes are created on the table (in this case, any update will be a HOT-update);
- There are indexes on the table, but the columns being updated are not used in any of them;
- The columns being updated are used in an index, but their values have not changed.
"Used in an index" here refers to the presence of an index (full or partial) on a column or an expression containing a
column, as well as participation as a non-key column in an INCLUDE index — essentially any possible reference to a
column in the CREATE INDEX command. The only exceptions are BRIN indexes: they do not contain references to specific
table rows and therefore do not interfere with HOT updates in any way.
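A minimal sketch of these cases (a hypothetical setup; the statistics query further below uses a table named `hot`, so the same name is used here):

```sql
CREATE TABLE hot(id integer, s text);
CREATE INDEX ON hot(id);
INSERT INTO hot VALUES (1, 'A');

UPDATE hot SET s = 'B' WHERE id = 1;  -- HOT: the updated column is not in any index
UPDATE hot SET id = 2 WHERE id = 1;   -- not HOT: an indexed column changes its value
UPDATE hot SET id = 2 WHERE id = 2;   -- HOT: the indexed column is "updated" but unchanged
```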
In a HOT-update scenario, the index page holds just one entry, pointing to the initial version of the table row. Later versions of that same row, residing on the table page, are connected sequentially through pointers within their version headers.
Versions that are not referenced from the index are marked with the HEAP_ONLY_TUPLE flag ("table-only row version").
If a version is part of a chain, it is marked with the HEAP_HOT_UPDATED flag.
If, during an index scan, the server lands on a table page and discovers a version with the HEAP_HOT_UPDATED flag, it
understands that it should not stop and continues further through the entire update chain. Of course, visibility is
checked for all row versions read this way before the client receives the result.
Breaks in HOT chains are tracked separately in the statistics:
```sql
SELECT n_tup_upd, n_tup_hot_upd, n_tup_newpage_upd
FROM pg_stat_all_tables
WHERE relid = 'hot'::regclass;
 n_tup_upd | n_tup_hot_upd | n_tup_newpage_upd
-----------+---------------+-------------------
        15 |            10 |                 1
(1 row)
```
Here, out of fifteen UPDATE operations, eleven were HOT-updates: ten had enough space on the page, and one did not,
which led to a break in the chain.
With frequent updates to columns that are not included in indexes, it may make sense to decrease the fillfactor parameter to reserve some space on the page for updates. Of course, it must be taken into account that the lower the fillfactor, the more empty space remains on the page and, consequently, the physical size of the table increases.
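For example (a hedged sketch; the table name is hypothetical):

```sql
-- Keep 10% of every page free for future HOT updates;
-- applies to newly filled pages, existing pages are not rewritten.
ALTER TABLE t SET (fillfactor = 90);
```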
[Starting from v.17, the radix tree is introduced to store TIDs]
6.3 Vacuum Phases → Heap Scan
In the first phase, a heap scan is performed. The scanning process takes the visibility map into account: all pages tracked in this map are skipped because they are sure to contain no outdated tuples. If a tuple is beyond the horizon and is not required anymore, its TID is added to a special memory storage. These row versions cannot be deleted immediately because indexes may still reference them.
The storage is a prefix tree (radix tree) [backend/access/common/tidstore.c]. The keys are page numbers, and the
nodes store bitmaps of row versions. The size of this structure is limited by the value of the maintenance_work_mem
parameter (default = 64MB); memory is allocated for it as needed.
Compared to the regular array used prior to PostgreSQL 17, the prefix tree stores identifiers an order of magnitude more compactly, and its total size can exceed 1 GB — the maximum size of a memory fragment allocated at one time.
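A hedged example of giving one vacuum run more room for the TID store (the table name is hypothetical):

```sql
SET maintenance_work_mem = '1GB';  -- session-local; memory is allocated only as needed
VACUUM big;
```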
[In explaining the columns of the `pg_stat_progress_vacuum` view: in v.17, some columns are removed (`max_dead_tuples` and `num_dead_tuples`), while some are added (`max_dead_tuple_bytes`, `dead_tuple_bytes`, `num_dead_item_ids`, `indexes_total`, and `indexes_processed`). The text explains some of these]
```sql
SELECT *
FROM pg_stat_progress_vacuum
\gx
−[ RECORD 1 ]−−−−−−−−+−−−−−−−−−−−−−−
pid                  | 14639
datid                | 16391
datname              | internals
relid                | 16482
phase                | scanning heap
heap_blks_total      | 172414
heap_blks_scanned    | 90517
heap_blks_vacuumed   | 0
index_vacuum_count   | 0
max_dead_tuple_bytes | 1048576
dead_tuple_bytes     | 0
num_dead_item_ids    | 0
indexes_total        | 0
indexes_processed    | 0
```
```sql
SELECT *
FROM pg_stat_progress_vacuum
\gx
−[ RECORD 1 ]−−−−−−−−+−−−−−−−−−−−−−−−−−−
pid                  | 14639
datid                | 16391
datname              | internals
relid                | 16482
phase                | vacuuming indexes
heap_blks_total      | 172414
heap_blks_scanned    | 32733
heap_blks_vacuumed   | 0
index_vacuum_count   | 0
max_dead_tuple_bytes | 1048576
dead_tuple_bytes     | 1049600
num_dead_item_ids    | 1898514
indexes_total        | 1
indexes_processed    | 0
```

`dead_tuple_bytes` and `num_dead_item_ids` show the volume and the number of dead row versions collected during this pass.
The storage efficiency can be estimated by the ratio of dead_tuple_bytes to num_dead_item_ids: in this example,
approximately half a byte is spent per single TID.
[From the English book] The previous `UPDATE` command has removed the visibility bit of the zero page, so the tuple with an appropriate `xmin` age in this page will be frozen. But the first page will be skipped altogether:

```sql
=> VACUUM tfreeze;
=> SELECT * FROM heap_page('tfreeze',0,1);
 ctid  |     state     | xmin  | xmin_age | xmax
−−−−−−−+−−−−−−−−−−−−−−−+−−−−−−−+−−−−−−−−−−+−−−−−−
 (0,1) | redirect to 3 |       |          |
 (0,2) | normal        | 856 f |        2 | 0 a
 (0,3) | normal        | 857 c |        1 | 0 a
 (1,1) | normal        | 856 c |        2 | 0 a
 (1,2) | normal        | 856 c |        2 | 0 a
(5 rows)
```

[Now note the change in the new version: in the `xmin` column, row (0,3) is marked `f` rather than `c`]
```sql
=> VACUUM tfreeze;
VACUUM
=> SELECT * FROM heap_page('tfreeze',0,1);
 ctid  |     state     | xmin  | xmin_age | xmax
−−−−−−−+−−−−−−−−−−−−−−−+−−−−−−−+−−−−−−−−−−+−−−−−−
 (0,1) | redirect to 3 |       |          |
 (0,2) | normal        | 869 f |        2 | 0 a
 (0,3) | normal        | 870 f |        1 | 0 a
 (1,1) | normal        | 869 c |        2 | 0 a
 (1,2) | normal        | 869 c |        2 | 0 a
(5 rows)
```
Please note that on page zero, both row versions were frozen, even though the age of the xmin transaction in the
second version does not exceed the value of the vacuum_freeze_min_age parameter (default = 50,000,000).
Since the page has to be modified anyway, all row versions beyond the database horizon are frozen on it — this
practically does not increase the cost of the operation, but it can save effort in the future.
Aggressive freezing of all pages in a table is performed no more frequently than once every
vacuum_freeze_table_age - vacuum_freeze_min_age transactions (100 million transactions with default parameter values).
It will be triggered less often if vacuuming in non-aggressive mode can also advance the relfrozenxid value. In any
case, it must be kept in mind that too large a value for the vacuum_freeze_min_age parameter, instead of saving
resources, can lead to excessive triggering of aggressive freezing and an increase in overhead costs.
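To see how close tables are to the next aggressive freeze, one can look at the age of `relfrozenxid`; a minimal sketch:

```sql
-- Tables with the oldest frozen-XID horizon come first.
SELECT relname, age(relfrozenxid)
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 5;
```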
[`transaction_timeout` is new in v.17]:
Situations can arise where long transactions are caused not by necessity, but by errors in the application or driver. If
the problem cannot be solved in a civilized way, the administrator has a couple of parameters at their disposal:
idle_in_transaction_session_timeout (default = 0 ms), which determines the maximum lifetime of an idle transaction,
and transaction_timeout (default = 0 ms), which limits the total duration of a transaction. After the specified time
has elapsed (if it is not equal to zero), the transaction is terminated.
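A hedged example of setting both parameters for the current session:

```sql
SET idle_in_transaction_session_timeout = '5min';  -- terminate transactions idling too long
SET transaction_timeout = '30min';                 -- v.17+: cap total transaction duration
```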
[Previously] The PostgreSQL community has already started this major effort, but it will take a long time for the actual results to appear.
This extensive work is already underway in the community [1], and the debug_io_direct parameter has been added for
debugging, but practical results will not appear anytime soon.
[1] https://www.postgresql.org/message-id/flat/20210223100344.llw5an2aklengrmn%40alap3.anarazel.de
[Previously] Let’s restart the instance to clear its buffer cache:

```
postgres$ pg_ctl restart -l /home/postgres/logfile
```

Let's clear the page's buffer (alternatively, the entire buffer cache could be cleared by restarting the instance):

```sql
=> SELECT pg_buffercache_evict(281);
 pg_buffercache_evict
----------------------
 t
(1 row)
```
[Before] Vacuuming strategy is used by the process of vacuuming when it performs a full table scan without taking the visibility map into account. The ring buffer is assigned 256 kB of RAM (32 standard pages).
Vacuuming strategy is applied by the vacuuming and analysis processes during a full table scan without considering
the visibility map. The size of the ring is determined by the vacuum_buffer_usage_limit parameter (default = 2MB),
but it can be overridden in the VACUUM or ANALYZE command.
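For example (v.16+; the table name is hypothetical), the ring size can be overridden for a single run:

```sql
VACUUM (BUFFER_USAGE_LIMIT '256kB') big;  -- 0 disables the ring and uses the whole cache
```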
[Before] Restart the server to clear the cache, as now it contains some heap pages that have been read during analysis:

```
postgres$ pg_ctl restart -l /home/postgres/logfile
```

Once the server is restarted, let’s read the whole table:

```sql
=> EXPLAIN (analyze, costs off, timing off, summary off)
   SELECT id FROM big;
                 QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Seq Scan on big (actual rows=4097 loops=1)
(1 row)
```

Let's clear the cache of the table pages read during the analysis, after which we will read the entire table:

```sql
=> SELECT pg_buffercache_evict(bufferid)
   FROM pg_buffercache
   WHERE relfilenode = pg_relation_filenode('big'::regclass);

=> EXPLAIN (analyze, costs off, timing off, summary off)
   SELECT * FROM big;
                 QUERY PLAN
------------------------------------------
 Seq Scan on big (actual rows=4097 loops=1)
(1 row)
```
Detailed information regarding I/O passing through the buffer cache is accumulated in the pg_stat_io view. For
example, you can examine I/O volume broken down by buffer ring usage strategies (the normal context corresponds to
regular operations without a ring):
```sql
=> SELECT context, sum(reads) reads, sum(writes) writes,
     sum(hits) hits, sum(evictions) evictions, sum(reuses) reuses
   FROM pg_stat_io
   WHERE object = 'relation'
   GROUP BY context;
  context  | reads  | writes |   hits   | evictions | reuses
-----------+--------+--------+----------+-----------+--------
 vacuum    | 658529 | 544152 |   168505 |      2649 | 653199
 normal    | 103844 | 373088 | 71936923 |    277869 |
 bulkwrite |      0 |      0 |        8 |         2 |      0
 bulkread  | 256194 | 108324 |    94895 |    152228 | 105566
(4 rows)
```
In addition to the number of pages read and written (reads and writes), this shows the number of cache hits (hits) and cache evictions: "normal" evictions are accounted for in the evictions field, while evictions of pages from a buffer ring are in the reuses field.
[Context] Do not run such queries non-stop, because the `pg_buffercache` extension locks the viewed buffers, even if only briefly. Therefore, for monitoring purposes, it is better to use other functions of the extension — `pg_buffercache_summary()` and `pg_buffercache_usage_counts()` — which require fewer locks and therefore allow quickly assessing the general (though perhaps slightly inconsistent) picture of what is happening.
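For instance (both functions appear in `pg_buffercache` 1.4, shipped with PostgreSQL 16):

```sql
=> SELECT * FROM pg_buffercache_summary();
=> SELECT * FROM pg_buffercache_usage_counts();
```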
The pg_stat_io view allows for obtaining data on I/O associated with temporary objects in local process caches:
```sql
=> SELECT context, sum(reads) reads, sum(writes) writes,
     sum(hits) hits, sum(evictions) evictions, sum(reuses) reuses
   FROM pg_stat_io
   WHERE object = 'temp relation'
   GROUP BY context;
 context | reads | writes | hits | evictions | reuses
---------+-------+--------+------+-----------+--------
 normal  |     0 |      0 |    1 |         0 |
(1 row)
```
[Only the 2nd bullet point, unlogged sequences, is new in v.15]
The following actions are not logged:
- operations on `UNLOGGED` tables;
- operations on unlogged sequences (typically associated with unlogged tables);
- operations on temporary tables, since their lifetime is in any case limited by the session that spawned them.
[Previously, the text examined the WAL with the `pg_waldump` utility; now it also uses an extension]
You can look into the headers of the created log entries either with the pg_waldump utility or by using the
pg_walinspect extension:
```sql
=> CREATE EXTENSION pg_walinspect;
=> SELECT start_lsn, resource_manager, xid, record_type,
     left(description,44) description, block_ref
   FROM pg_get_wal_records_info('2/7CEE6290', '2/7CEE6300') \gx
-[ RECORD 1 ]----+-------------------------------------------
start_lsn        | 2/7CEE6290
resource_manager | Heap
xid              | 900
record_type      | HOT_UPDATE
description      | old_xmax: 900, old_off: 1, old_infobits: [],
block_ref        | blkref #0: rel 1663/16391/16572 fork main blk 0
-[ RECORD 2 ]----+-------------------------------------------
start_lsn        | 2/7CEE62D8
resource_manager | Transaction
xid              | 900
record_type      | COMMIT
description      | 2025-01-12 14:45:30.734485+03
block_ref        |
```
Here we see the headers of two records.
[Previously, the text examined the WAL with the `pg_waldump` utility; now it uses an extension] [Also, starting in v.17, the function `pg_get_wal_records_info()` is used instead of `pg_get_wal_records_info_till_end_of_wal()`]
Let's see how a checkpoint is reflected in the log. The CHECKPOINT_REDO record marks the start of a checkpoint, and
CHECKPOINT_ONLINE marks its completion:
```sql
=> SELECT start_lsn,
     resource_manager,
     record_type,
     left(description,46)||'...' description
   FROM pg_get_wal_records_info('2/7D7796D8', 'FFFFFFFF/FFFFFFFF')
   -- pg_get_wal_records_info_till_end_of_wal up to v.16
   WHERE record_type LIKE 'CHECKPOINT%'
\gx
-[ RECORD 1 ]----+----------------------------------------------
start_lsn        | 2/7D7796D8
resource_manager | XLOG
record_type      | CHECKPOINT_REDO
description      | wal_level replica...
-[ RECORD 2 ]----+----------------------------------------------
start_lsn        | 2/7D779730
resource_manager | XLOG
record_type      | CHECKPOINT_ONLINE
description      | redo 2/7D7796D8; tli 1; prev tli 1; fpw true; ...
```
Between these two positions in the log, other records generated by the continuing work of the server will be located.
But the most useful information for tuning purposes is the operation statistics of the checkpoint process in the
pg_stat_checkpointer view.
Prior to version 17, the checkpointer and the background writer shared a single view, pg_stat_bgwriter, because both tasks were originally performed by the background writer process; in version 9.2 checkpointing was moved into a separate process, but the view remained shared for a long time.
```sql
=> SELECT * FROM pg_stat_checkpointer \gx
−[ RECORD 1 ]−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
num_timed           | 0
num_requested       | 26
restartpoints_timed | 0
restartpoints_req   | 0
restartpoints_done  | 0
write_time          | 313461
sync_time           | 21508
buffers_written     | 101888
stats_reset         | 2025−01−12 14:38:15.939618+03
```
Among other things, the view shows the number of checkpoints performed:
- `num_timed` — scheduled (upon reaching the `checkpoint_timeout` interval);
- `num_requested` — on demand (including upon reaching the `max_wal_size` log volume).

A high value in the `num_requested` column (compared to `num_timed`) indicates that checkpoints are actually occurring more frequently than intended.
Fields starting with restartpoints are only relevant for replicas.
The pg_stat_bgwriter view shows background writer statistics:
```sql
=> SELECT * FROM pg_stat_bgwriter \gx
-[ RECORD 1 ]----+------------------------------
buffers_clean    | 85694
maxwritten_clean | 816
buffers_alloc    | 493103
stats_reset      | 2025-01-12 14:38:15.939618+03
```
For tuning purposes, the maxwritten_clean value is useful — this number shows how many times the background writer
process stopped a scan cycle because it exceeded bgwriter_lru_maxpages (default = 100).
In a well-tuned system, backend processes perform almost no writes; the bulk of writing is handled by the checkpoint
process (buffers_written) and the background writer (buffers_clean).
The most detailed information about I/O, broken down by process types (including background processes), can be obtained
from the pg_stat_io view:
```sql
=> SELECT backend_type,
     sum(reads) reads, sum(writes) writes, sum(fsyncs) fsyncs
   FROM pg_stat_io
   GROUP BY backend_type
   ORDER BY backend_type;
     backend_type     | reads  | writes | fsyncs
----------------------+--------+--------+--------
 autovacuum launcher  |      5 |      0 |      0
 autovacuum worker    | 395549 | 292555 |      0
 background worker    |   4243 |      0 |      0
 background writer    |        |  85694 |      0
 checkpointer         |        | 101886 |    602
 client backend       | 631385 | 548682 |      0
 slotsync worker      |      0 |      0 |      0
 standalone backend   |    546 |    979 |      0
 startup              |      0 |      0 |      0
 wal sender           |      0 |      0 |      0
(10 rows)
```
There are a couple of situations where log files need to be read as well as written. The first is the obvious case of disaster recovery. The second arises when using streaming replication. The walsender process attempts to read WAL records from the WAL buffer, but if the replica fails to receive the log records while the required pages are still in the primary server's memory, the data will be read from disk. However, even in this case, access will be sequential, not random.
| Before v.17 | v.17+ |
|---|---|
| *(figure missing)* | *(figure missing)* |
Full page images in our case constitute a significant portion of the generated volume, which can be verified by looking
at statistics with the pg_waldump utility using the --stats flag or the pg_get_wal_stats() function of the
pg_walinspect extension.
In addition to the traditional PGLZ compression algorithm, two others are available: LZ4 and Zstandard (zstd). The LZ4
algorithm usually requires fewer CPU resources while showing approximately the same compression ratio. The Zstandard
algorithm combines dictionary compression with entropy coding; it puts a heavier load on the processor but compresses
data better.
```sql
=> ALTER SYSTEM SET wal_compression = zstd;
=> SELECT pg_reload_conf();
=> CHECKPOINT;
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 2/84E33C08
(1 row)
```

```
postgres$ /usr/local/pgsql/bin/pgbench -t 20000 internals
```

```sql
=> SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 2/857D7498
(1 row)
```

Size of log entries with Zstandard compression:

```sql
=> SELECT pg_size_pretty('2/857D7498'::pg_lsn - '2/84E33C08'::pg_lsn);
 pg_size_pretty
----------------
 9870 kB
(1 row)
```
Conclusion: When there is a large number of full page images (due to checksums or full_page_writes, which is to
say almost always), it makes sense to use compression. Several algorithms allow you to find a compromise between
an additional CPU load and the reduction of log entry volume.
With the help of the already familiar pg_walinspect extension, let's look at the log contents.
```sql
=> SELECT start_lsn,
     resource_manager AS rmgr,
     record_type,
     (regexp_match(block_ref, '[0-9]+\/[0-9]+\/[0-9]+'))[1] AS rel
   FROM -- pg_get_wal_records_info_till_end_of_wal up to v.16
     pg_get_wal_records_info('2/857D9460','FFFFFFFF/FFFFFFFF');
 start_lsn  |    rmgr     | record_type |       rel
−−−−−−−−−−−−+−−−−−−−−−−−−−+−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−
 2/857D9460 | Storage     | CREATE      |
 2/857D9490 | Heap        | UPDATE      | 1663/16391/1259
 2/857D9510 | Btree       | INSERT_LEAF | 1663/16391/2662
 2/857D9550 | Btree       | INSERT_LEAF | 1663/16391/2663
 2/857D9590 | Btree       | INSERT_LEAF | 1663/16391/3455
 2/857D95D0 | Transaction | COMMIT      |
(6 rows)
```
The first record (CREATE) indicates the creation of a new file for the relation (since the table is effectively
overwritten during truncation).
Next are several records related to the system catalog. They reflect the modification of the pg_class table (UPDATE)
and the addition of records to three indexes built on this table (INSERT_LEAF). This can be verified by comparing the
file path shown in the rel column with the name of the relation.
Finally, there is a record for the transaction commit. The insertion of data into the table is not logged.
[Context: Section 14.2, Relation Extension Locks]
Previously: To minimize this risk, heap files are extended by several pages at once (in proportion to the number of processes awaiting the lock, but by not more than 512 pages per operation). An exception to this rule is B-tree index files, which are extended by one page at a time.
For efficiency, table files are expanded not by a single page, but by several at once (proportional to the number of processes waiting for a lock, but no more than 64 pages at a time)[*]. However, the expansion of B-tree-based index files occurs only one page at a time[**].
[*] backend/access/heap/hio.c, function RelationAddBlocks.
[**] backend/access/nbtree/nbtpage.c, function _bt_allocbuf.
[Context: Section 15.4, Monitoring Waits]
But processes can be waiting for other events too: IO, Client, IPC, and Extension. PG v.17 adds yet another event type: InjectionPoint — execution of code at an injection point (an event used for testing PostgreSQL that does not occur during normal operation).
Context: There are situations when a process simply does not perform any useful work. The following waits belong to this category:
- Activity — background processes in their main cycle;
- Timeout — a timer.
As a rule, such waits are "normal" and do not indicate any problems. However, for example, a SpinDelay event of the
Timeout type indicates a wait for a spin-lock.
[Context: Section 16.2, Simple Query Protocol] Planning the order of joins: consider a query that does not specify any explicit joins for the tables listed in the `FROM` clause:

```sql
SELECT ... FROM a, b, c, d, e WHERE ...
```

Here the planner will have to consider all the possible pairs of joins. Below is a graph *(figure missing)* showing how the amount of memory required for planning such a query depends on the number of tables in it (data obtained by the `EXPLAIN` command with the `MEMORY` parameter).
Planning time also approximately doubles with each added table.
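A hedged sketch of obtaining this data (v.17+; the table names are hypothetical):

```sql
-- The MEMORY option adds a "Planner Memory" line to the output.
EXPLAIN (memory, costs off)
SELECT *
FROM t1, t2, t3
WHERE t1.id = t2.id AND t2.id = t3.id;
```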
[Context: Section 16.3, Extended Query Protocol, "Parameter Binding"] Before a prepared statement gets executed, the actual parameter values have to be bound:

```sql
=> EXECUTE plane('733');
```

Perhaps the binding stage is illustrated more clearly when using the extended protocol directly in psql:

```sql
=> \bind '733'
=> SELECT * FROM aircrafts WHERE aircraft_code = $1;
 aircraft_code |     model      | range
---------------+----------------+-------
 733           | Boeing 737-300 |  4200
(1 row)
```
[Context: Section 17.6, Statistics for Non-Scalar Data Types]

Previously: For range types, PostgreSQL builds distribution histograms for range length and for the lower and upper boundaries of the range. These histograms are used for estimating the selectivity of various operations on these types, but the `pg_stats` view does not display them.

Now: For non-empty ranges, histograms are collected for the distribution of lower and upper bounds (the `range_bounds_histogram` array) and length (the `range_length_histogram` array). The fraction of empty ranges is noted separately (the `range_empty_frac` column). This data is used for selectivity estimation of various operations with range types. Similar statistics are also used for multirange types.
[Context: Section 18.1, Pluggable Storage Engines]
You can specify the engine to use when creating a table (`CREATE TABLE ... USING`); otherwise, the default engine listed in the `default_table_access_method` parameter (default = `heap`) will be applied. Subsequently, the engine can be changed (`ALTER TABLE ... SET ACCESS METHOD`), but, of course, at the cost of rewriting all the data.
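A minimal sketch of both commands (`heap` is the only built-in method; a pluggable engine's name would go in its place):

```sql
CREATE TABLE t2 (id integer) USING heap;

-- Switching the access method rewrites the whole table (v.15+):
ALTER TABLE t2 SET ACCESS METHOD heap;
```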
[Context] The following core components can be used by all table access methods:
- transaction manager, including `ACID` and snapshot isolation support;
- buffer manager;
- `I/O` subsystem;
- `TOAST`;
- optimizer and executor;
- index support.
[Added in v.15]:
- the WAL, including support for specific operations using the Generic WAL Records mechanism [1] or Custom WAL Resource Managers [2];
[1] https://www.postgresql.org/docs/17/generic-wal.html
[2] https://www.postgresql.org/docs/17/custom-rmgr.html
[Context: Section 18.5, subsection "Non-Parallelizable Queries"] Previously: To check whether a query can be parallelized at all, you can temporarily switch on the `force_parallel_mode` parameter.

This parameter was deprecated in v.15 and renamed to `debug_parallel_query` in v.16. The text replaces the previous paragraph with the following:
To check whether a query can be parallelized in principle, you can temporarily enable the debug_parallel_query
parameter. In this case, the planner will build parallel plans in all instances where it is possible:
```sql
=> EXPLAIN SELECT * FROM flights;
                           QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=63)
(1 row)

=> SET debug_parallel_query = on;
=> EXPLAIN SELECT * FROM flights;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=1000.00..27259.37 rows=214867 width=63)
   Workers Planned: 1
   Single Copy: true
   ->  Seq Scan on flights  (cost=0.00..4772.67 rows=214867 width=63)
(4 rows)
```
Now we understand that the planner considered a parallel plan but chose the more efficient sequential scan in this particular case.
[Context: Section 18.5, subsection "Parallel Restricted Queries"]
Note, however, that a CTE itself can be computed in the parallel mode if it turns out to be less expensive.
Correlated subqueries. Uncorrelated subqueries, evaluated once and represented in the plan by the InitPlan node,
can participate in a parallel plan:
```sql
=> EXPLAIN (costs off)
   SELECT * FROM flights f1
   WHERE f1.aircraft_code = (
     SELECT '733' -- InitPlan 1
   )
   UNION ALL
   SELECT * FROM flights f2
   WHERE f2.aircraft_code = (
     SELECT '320' -- InitPlan 2
   );
                          QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Gather
   Workers Planned: 2
   ->  Parallel Append
         ->  Parallel Seq Scan on flights f2
               Filter: ((aircraft_code)::text = (InitPlan 2).col1)
               InitPlan 2
                 ->  Result
         ->  Parallel Seq Scan on flights f1
               Filter: ((aircraft_code)::text = (InitPlan 1).col1)
               InitPlan 1
                 ->  Result
(11 rows)
```
The Parallel Append node here implements the parallel execution of the UNION ALL command. Worker processes
distribute the subqueries among themselves (if there are more processes than subqueries, several can work on a single
task). Once all data from a subquery has been received, the freed-up process connects to another subquery until all data
has been merged.
[Context: Section 19.3, Indexing Engine Interface; under the subsection "Access Method Properties"]
There are actually many more properties, but many are not accessible at the SQL level.[1] Here are some of them:

- `CAN PRED LOCKS`: support for predicate locks. Missing in the `spgist` and `brin` access methods; when scanning indexes of these types at the Serializable isolation level, predicate locks will be set on the entire table, as with a sequential scan, which will lead to a decrease in performance.
- `CAN BUILD PARALLEL`: the ability to build an index with multiple processes in parallel. Supported by the `btree` and `brin` access methods (parallel `brin` build was added in v.17).

[1] include/access/amapi.h, struct IndexAmRoutine.
[Context: Section 20.3, Bitmap Scans]
The text is explaining "Bitmap Heap Scan," and how it uses prefetching to speed up the scan. The following explanation is added at the end:
Asynchronous prefetching is also used by some other internal processes:
- for index pages when heap rows are being deleted;
- for heap pages during analysis (`ANALYZE`).

The prefetch depth is defined by the `maintenance_io_concurrency` parameter.

In addition to the two bullet points above, v.15 adds the following:
- during recovery, WAL pages are prefetched (see the sketch below).
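A sketch of the knobs involved in WAL prefetching during recovery (v.15+), assuming a server that is actually in recovery:

```sql
SHOW recovery_prefetch;                    -- 'try' by default
SELECT * FROM pg_stat_recovery_prefetch;   -- prefetching statistics during recovery
```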
[Context: Section 22.1, Hash Joins]
Previously: Parallel hash joins are currently not supported for right and full joins.

As of v.16: Parallel hash join is supported for all types of joins: not only inner and left outer joins, but also right and full outer joins.
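A hedged sketch using tables from the demo database the book works with; whether the planner actually chooses a parallel hash plan depends on table sizes and settings:

```sql
EXPLAIN (costs off)
SELECT *
FROM bookings b
FULL JOIN tickets t ON b.book_ref = t.book_ref;
```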
[Context: Section 23.3, Distinct Values and Grouping]
The Unique node can participate in a parallel plan, removing duplicates from the list received from worker processes:
```sql
=> EXPLAIN (costs off) SELECT DISTINCT fare_conditions
   FROM ticket_flights;
                      QUERY PLAN
------------------------------------------------------
 Unique
   ->  Gather Merge
         Workers Planned: 2
         ->  Sort
               Sort Key: fare_conditions
               ->  HashAggregate
                     Group Key: fare_conditions
                     ->  Parallel Seq Scan on ticket_flights
(8 rows)
```
In this example, worker processes extract unique values in their data sets using hashing in the HashAggregate node; the general list, collected by the Gather node, is sorted in the Sort node and passed to the Unique node for final duplicate removal.
[Related; see Page 396 – v.17]
Along with B-trees, BRIN indexes for large tables can be created in parallel. The table is scanned by several
processes that calculate summary information for their respective portions of the data. This information is sorted by
zones (memory of size maintenance_work_mem is allocated), then the lead process aggregates it and forms the index.[1]
[1] backend/access/brin/brin.c, function brinbuild.
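A minimal sketch (the table name is hypothetical):

```sql
SET max_parallel_maintenance_workers = 4;  -- allow parallel workers for index builds
CREATE INDEX ON big USING brin(id);        -- v.17+: BRIN builds can run in parallel
```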




