Commit be02aa6

PS-9703 "Upstream 8.0.41 release does not fully fix PS-9144"
https://perconadev.atlassian.net/browse/PS-9703

Problem:
--------
ALTER TABLE which rebuilds an InnoDB table using the INPLACE algorithm might sometimes lead to row loss if a concurrent purge happens on the table being ALTERed.

Analysis:
---------
This issue was introduced in upstream version 8.0.41 as an unwanted side effect of the fixes for bug#115608 (PS-9144), in which a similar problem is observed but in a different scenario, and bug#115511 (PS-9214). It was propagated to Percona Server 8.0.41-32, in which we opted to revert our versions of the fixes for PS-9144 and PS-9214 in favour of the upstream ones.

The new implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in MySQL 8.0.27. Its code is used for online table rebuild even in the single-thread case. This implementation iterates over all the rows in the table, in the general case handling different subtrees of a B-tree in different threads. This iteration over table rows needs to be paused from time to time to commit the InnoDB mini-transaction/release the page latches it holds. This is necessary to give way to concurrent actions on the B-tree being scanned (this happens when switching to the next page) or before flushing rows of the new version of the table from the in-memory buffer to the B-tree. In order to resume iteration after such a pause, the persistent cursor position saved before the pause is restored.

The problem described above occurs when we try to save and then restore the position of a cursor pointing to the page supremum, before switching to the next page. In post-8.0.41 code this is done by simply calling the btr_pcur_t::store_position()/restore_position() methods for a cursor that points to the supremum. In 8.0.42-based code this is done in the PCursor::save_previous_user_record_as_last_processed() and PCursor::restore_to_first_unprocessed() pair of methods.

However, this does not work correctly in the scenario when, after we have saved the cursor position and then committed the mini-transaction/released the latches on the current page, the next page is merged into the current one (after purge removes some records from it). In this case the cursor position is still restored as pointing to the page supremum, and thus rows which were moved over by the merge are erroneously skipped.

***

Let us take a look at an example. Assume that we have two pages, p1 : [inf, a, b, c, sup] and the next one, p2 : [inf, d, e, f, sup]. Our thread, which is one of the parallel ALTER TABLE worker threads, has completed the scan of p1, so its cursor is positioned on the p1 : 'sup' record. Now it needs to switch to page p2, but also give way to threads concurrently updating the table. So it needs to make a cursor savepoint, commit the mini-transaction and release the latches.

If, as in post-8.0.41 code, we simply do btr_pcur_t::store_position()/restore_position() with the cursor positioned on the p1 : 'sup' record, the following might happen: a concurrent purge on page p2 might delete some record from it (e.g. 'f') and decide to merge this page into page p1. If this happens while the latches are released, the merge goes through, resulting in page p1 with the following contents: p1 : [inf, a, b, c, d, e, sup]. The savepoint for p1 : 'sup' won't be invalidated (one can say that savepoints for sup and inf are not safe against concurrent merges in this respect), and after restoration of the cursor the iteration will continue on the next page, skipping records 'd' and 'e'.

***

Fix:
----
This patch solves the problem by working around the issue with saving/restoring a cursor pointing to the supremum. Instead of storing the position of the supremum record, PCursor::save_previous_user_record_as_last_processed() now stores the position of the record that precedes it.

PCursor::restore_to_first_unprocessed() then does the restore in two steps: 1) it restores the position of this preceding record (or its closest predecessor if it was purged meanwhile), and then 2) moves one step forward, assuming that it will get to the supremum record at which the cursor pointed originally. If this is not true, i.e. there is a user record added to the page by a merge (or a simple concurrent insert), we assume that this record and the following ones are unprocessed. The caller of PCursor::restore_to_first_unprocessed() detects this situation by checking whether the cursor is positioned on the supremum, and if not, handles it by resuming processing from the record under the cursor.

***

Let us return to the above example to explain how the fix works. PCursor::save_previous_user_record_as_last_processed() does a step back before calling btr_pcur_t::store_position(), so for a cursor positioned on p1 : 'sup' it is actually the position corresponding to p1 : 'c' that is saved. If the merge happens while the latches are released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint is not invalidated. PCursor::restore_to_first_unprocessed() calls btr_pcur_t::restore_position(), gets a cursor pointing to p1 : 'c' as the result, and then tries to compensate for the step back by moving the cursor one step forward, making it point to p1 : 'd'. The code which does the scanning detects that saving/restoring resulted in a jump from the supremum record to a user record and resumes iteration from p1 : 'd' without skipping any records.

***

Thanks to the Bytedance team for bringing this issue to our attention! The test case for this bug is based on the one that they reported upstream and later pointed out to us.
1 parent 512384a commit be02aa6

File tree

3 files changed: +222, -10 lines


mysql-test/suite/innodb/r/percona_alter_debug.result

Lines changed: 88 additions & 0 deletions
@@ -22,3 +22,91 @@ connection default;
 # Cleanup.
 SET DEBUG_SYNC= 'RESET';
 DROP TABLE t1;
+#
+# PS-9703 "Upstream 8.0.41 release does not fully fix PS-9144"/
+# Bug#117436 "PCursor::move_to_next_block may skip records incorrectly".
+#
+# Execution of ALTER TABLE INPLACE might sometimes result in row
+# loss when merge of pages in table's B-tree is happening concurrently.
+#
+# Rows of the table should be big enough to easily cause page split,
+# OTOH they should be small enough to be easily merged back as well.
+# In our case:
+# 5 * 3300 > available space for rows in 16k page,
+# 2 * 3300 < 8k which is merge threshold for 16k page.
+CREATE TABLE t1(id INT PRIMARY KEY, val VARCHAR(3300));
+INSERT INTO t1 VALUES (1, REPEAT('1', 3300));
+INSERT INTO t1 VALUES (2, REPEAT('2', 3300));
+INSERT INTO t1 VALUES (3, REPEAT('3', 3300));
+INSERT INTO t1 VALUES (4, REPEAT('4', 3300));
+INSERT INTO t1 VALUES (5, REPEAT('5', 3300));
+INSERT INTO t1 VALUES (7, REPEAT('7', 3300));
+#
+# At this point leaf pages of B-tree for our table look like:
+#
+# [1,2] - [3,4,5,7]
+#
+# The below insert will cause split of the second leaf page, so
+# we end up with the following picture of leaf pages after it:
+#
+# [1,2] - [3,4] - [5,6,7]
+INSERT INTO t1 VALUES (6, REPEAT('6', 3300));
+#
+# Delete marks middle row in the third leaf page as deleted.
+# Once purge is enabled, the row will be removed from the page
+# for real, causing the contents of the third leaf page to
+# be merged into second:
+#
+# [1,2] - [3,4] - [5,*6,7] => [1,2] - [3,4,5,7]
+SET GLOBAL innodb_purge_stop_now = ON;
+DELETE FROM t1 WHERE id=6;
+connect con1, localhost, root,,;
+# Make ALTER TABLE to process all leaf pages as a single scan.
+SET GLOBAL DEBUG="+d,parallel_reader_force_single_range";
+# Ensure that we always release latches when moving from
+# page to page.
+SET GLOBAL DEBUG="+d,pcursor_move_to_next_block_release_latches";
+SET DEBUG_SYNC="pcursor_move_to_next_block_latches_released SIGNAL latches_released WAIT_FOR continue EXECUTE 2";
+# Send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE;
+connection default;
+# Wait until we reach end of the first leaf page:
+# [1,2 <we are here>] - [3, 4] - ...
+# And signal ALTER TABLE to proceed to the next leaf page.
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+SET DEBUG_SYNC="now SIGNAL continue";
+# Wait until we reach end of the second leaf page:
+# [1,2] - [3,4 <we are here>] - [5,*6,7]
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+# Unleash the purge and wait till it completes, so row 6 is
+# removed for real and the third leaf page is merged into the
+# second one.
+SET GLOBAL innodb_purge_run_now = ON;
+#
+# We should end up with:
+# [1,2] - [3,4, <we are here> 5,7]
+#
+# While before the fix we got:
+# [1,2] - [3,4,5,7 <we are here>]
+#
+# Resume ALTER TABLE.
+SET DEBUG_SYNC="now SIGNAL continue";
+connection con1;
+# Reap ALTER TABLE.
+# Check that all rows are present in table (except deleted row 6).
+# Before the fix rows 5 and 7 were missing as well.
+SELECT id FROM t1;
+id
+1
+2
+3
+4
+5
+7
+# Clean up.
+SET GLOBAL DEBUG="-d,pcursor_move_to_next_block_release_latches";
+SET GLOBAL DEBUG="-d,parallel_reader_force_single_range";
+SET DEBUG_SYNC='RESET';
+disconnect con1;
+connection default;
+DROP TABLE t1;

mysql-test/suite/innodb/t/percona_alter_debug.test

Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -36,4 +36,106 @@ SET DEBUG='-d,ddl_buf_add_two';
 SET DEBUG_SYNC= 'RESET';
 DROP TABLE t1;
 
+--echo #
+--echo # PS-9703 "Upstream 8.0.41 release does not fully fix PS-9144"/
+--echo # Bug#117436 "PCursor::move_to_next_block may skip records incorrectly".
+--echo #
+--echo # Execution of ALTER TABLE INPLACE might sometimes result in row
+--echo # loss when merge of pages in table's B-tree is happening concurrently.
+
+--echo #
+--echo # Rows of the table should be big enough to easily cause page split,
+--echo # OTOH they should be small enough to be easily merged back as well.
+--echo # In our case:
+--echo # 5 * 3300 > available space for rows in 16k page,
+--echo # 2 * 3300 < 8k which is merge threshold for 16k page.
+CREATE TABLE t1(id INT PRIMARY KEY, val VARCHAR(3300));
+INSERT INTO t1 VALUES (1, REPEAT('1', 3300));
+INSERT INTO t1 VALUES (2, REPEAT('2', 3300));
+INSERT INTO t1 VALUES (3, REPEAT('3', 3300));
+INSERT INTO t1 VALUES (4, REPEAT('4', 3300));
+INSERT INTO t1 VALUES (5, REPEAT('5', 3300));
+INSERT INTO t1 VALUES (7, REPEAT('7', 3300));
+
+--echo #
+--echo # At this point leaf pages of B-tree for our table look like:
+--echo #
+--echo # [1,2] - [3,4,5,7]
+
+--echo #
+--echo # The below insert will cause split of the second leaf page, so
+--echo # we end up with the following picture of leaf pages after it:
+--echo #
+--echo # [1,2] - [3,4] - [5,6,7]
+
+INSERT INTO t1 VALUES (6, REPEAT('6', 3300));
+
+--echo #
+--echo # Delete marks middle row in the third leaf page as deleted.
+--echo # Once purge is enabled, the row will be removed from the page
+--echo # for real, causing the contents of the third leaf page to
+--echo # be merged into second:
+--echo #
+--echo # [1,2] - [3,4] - [5,*6,7] => [1,2] - [3,4,5,7]
+
+SET GLOBAL innodb_purge_stop_now = ON;
+DELETE FROM t1 WHERE id=6;
+
+--connect(con1, localhost, root,,)
+--echo # Make ALTER TABLE to process all leaf pages as a single scan.
+SET GLOBAL DEBUG="+d,parallel_reader_force_single_range";
+--echo # Ensure that we always release latches when moving from
+--echo # page to page.
+SET GLOBAL DEBUG="+d,pcursor_move_to_next_block_release_latches";
+SET DEBUG_SYNC="pcursor_move_to_next_block_latches_released SIGNAL latches_released WAIT_FOR continue EXECUTE 2";
+
+--echo # Send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+--send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+
+--connection default
+--echo # Wait until we reach end of the first leaf page:
+--echo # [1,2 <we are here>] - [3, 4] - ...
+--echo # And signal ALTER TABLE to proceed to the next leaf page.
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+SET DEBUG_SYNC="now SIGNAL continue";
+
+--echo # Wait until we reach end of the second leaf page:
+--echo # [1,2] - [3,4 <we are here>] - [5,*6,7]
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+
+--echo # Unleash the purge and wait till it completes, so row 6 is
+--echo # removed for real and the third leaf page is merged into the
+--echo # second one.
+SET GLOBAL innodb_purge_run_now = ON;
+--source include/wait_innodb_all_purged.inc
+
+--echo #
+--echo # We should end up with:
+--echo # [1,2] - [3,4, <we are here> 5,7]
+--echo #
+--echo # While before the fix we got:
+--echo # [1,2] - [3,4,5,7 <we are here>]
+--echo #
+--echo # Resume ALTER TABLE.
+SET DEBUG_SYNC="now SIGNAL continue";
+
+--connection con1
+--echo # Reap ALTER TABLE.
+--reap
+
+--echo # Check that all rows are present in table (except deleted row 6).
+--echo # Before the fix rows 5 and 7 were missing as well.
+SELECT id FROM t1;
+
+--echo # Clean up.
+SET GLOBAL DEBUG="-d,pcursor_move_to_next_block_release_latches";
+SET GLOBAL DEBUG="-d,parallel_reader_force_single_range";
+SET DEBUG_SYNC='RESET';
+
+--disconnect con1
+--source include/wait_until_disconnected.inc
+
+--connection default
+DROP TABLE t1;
+
 --disable_connect_log

storage/innobase/row/row0pread.cc

Lines changed: 32 additions & 10 deletions
@@ -232,7 +232,7 @@ class PCursor {
 
   /** This method must be called after all records on a page are processed and
   cursor is positioned at supremum. Under this assumption, it stores the
-  position BTR_PCUR_AFTER the last user record on the page.
+  position of the last user record on the page.
   This method must be paired with restore_to_first_unprocessed() to restore to
   a record which comes right after the value of the stored last processed
   record @see restore_to_first_unprocessed for details. */
@@ -434,13 +434,24 @@ void PCursor::restore_to_last_processed_user_record() noexcept {
 void PCursor::save_previous_user_record_as_last_processed() noexcept {
   ut_a(m_pcur->is_after_last_on_page());
   ut_ad(m_read_level == 0);
+  /*
+    Instead of simply taking savepoint for cursor pointing to supremum we
+    are doing one step back. This is necessary to prevent situation when
+    concurrent merge from the next page, which happens after we commit
+    mini-transaction/release latches, moves records over supremum to the
+    current page. In this case the optimistic restore of cursor pointing
+    to supremum will result in cursor pointing to supremum, which means
+    moved records will be incorrectly skipped by the scan.
+
+    So, effectively, we are saving position of last user record which we
+    have processed/which corresponds to last key value we have processed!
+  */
+  m_pcur->move_to_prev_on_page();
   m_pcur->store_position(m_mtr);
-  ut_a(m_pcur->m_rel_pos == BTR_PCUR_AFTER);
   m_mtr->commit();
 }
 
 void PCursor::restore_to_first_unprocessed() noexcept {
-  ut_a(m_pcur->m_rel_pos == BTR_PCUR_AFTER);
   ut_ad(m_read_level == 0);
   m_mtr->start();
   m_mtr->set_log_mode(MTR_LOG_NO_REDO);
@@ -449,13 +460,21 @@ void PCursor::restore_to_first_unprocessed() noexcept {
   /* Restored cursor is positioned on the page at the level intended before */
   ut_ad(m_read_level == btr_page_get_level(m_pcur->get_page()));
 
-  /* The BTR_PCUR_IS_POSITIONED_OPTIMISTIC only happens in case of a successful
-  optimistic restoration in which the cursor points to a user record after
-  restoration. But, in save_previous_user_record_as_last_processed() the cursor
-  pointed to SUPREMUM before calling m_ptr->store_position(mtr), so it would
-  also point there if optimistic restoration succeeded, which is not a user
-  record. */
-  ut_ad(m_pcur->m_pos_state != BTR_PCUR_IS_POSITIONED_OPTIMISTIC);
+  /*
+    The cursor points to last processed record, or, if it has been purged
+    meanwhile, its closest non-purged predecessor.
+    By moving to the successor of the saved record we position the cursor
+    either to supremum record (which means we restored the original cursor
+    position and can continue switch to the next page as usual) or to
+    some user record which our scan have not processed yet (for example,
+    this record might have been moved from the next page due to page
+    merge or simply inserted to our page concurrently).
+
+    The latter case is detected by caller by doing !is_after_last_on_page()
+    check and instead of doing switch to the next page we continue processing
+    from the restored user record.
+  */
+  m_pcur->move_to_next_on_page();
 }
 
 bool PCursor::restore_to_largest_le_position_saved() noexcept {
@@ -1326,6 +1345,9 @@ dberr_t Parallel_reader::Scan_ctx::create_ranges(const Scan_range &scan_range,
       start = nullptr;
 
       page_cur_move_to_next(&page_cursor);
+
+      DBUG_EXECUTE_IF("parallel_reader_force_single_range",
+                      page_cur_set_after_last(block, &page_cursor););
     }
 
     savepoints.push_back(savepoint);
