Commit be02aa6

PS-9703 "Upstream 8.0.41 release does not fully fix PS-9144"
https://perconadev.atlassian.net/browse/PS-9703

Problem:
--------
ALTER TABLE which rebuilds an InnoDB table using the INPLACE algorithm might sometimes lead to row loss if a concurrent purge happens on the table being ALTERed.

Analysis:
---------
This issue was introduced in upstream version 8.0.41 as an unwanted side effect of the fixes for bug#115608 (PS-9144), in which a similar problem is observed but in a different scenario, and bug#115511 (PS-9214). It was propagated to Percona Server 8.0.41-32, in which we opted to revert our versions of the fixes for PS-9144 and PS-9214 in favour of the upstream ones.

The new implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in MySQL 8.0.27. Its code is used for online table rebuild even in the single-thread case. This implementation iterates over all the rows in the table, in the general case handling different subtrees of a B-tree in different threads. This iteration over table rows needs to be paused from time to time to commit the InnoDB mini-transaction/release the page latches it holds. This is necessary to give way to concurrent actions on the B-tree being scanned (this happens when switching to the next page) or before flushing rows of the new version of the table from the in-memory buffer to the B-tree. In order to resume iteration after such a pause, the persistent cursor position saved before the pause is restored.

The problem described above occurs when we try to save and then restore the position of a cursor pointing to the page supremum, before switching to the next page. In post-8.0.41 code this is done by simply calling the btr_pcur_t::store_position()/restore_position() methods for a cursor that points to the supremum. In 8.0.42-based code this is done in the PCursor::save_previous_user_record_as_last_processed() and PCursor::restore_to_first_unprocessed() pair of methods.

However, this does not work correctly in the scenario when, after we have saved the cursor position and then committed the mini-transaction/released the latches on the current page, the next page is merged into the current one (after purge removes some records from it). In this case the cursor position is still restored as pointing to the page supremum, and thus rows which were moved over by the merge are erroneously skipped.

***

Let us take a look at an example. Assume that we have two pages, p1 : [inf, a, b, c, sup] and the next one, p2 : [inf, d, e, f, sup]. Our thread, which is one of the parallel ALTER TABLE worker threads, has completed the scan of p1, so its cursor is positioned on the p1 : 'sup' record. Now it needs to switch to page p2, but also give way to threads concurrently updating the table. So it needs to make a cursor savepoint, commit the mini-transaction and release the latches.

If, as in post-8.0.41 code, we simply do btr_pcur_t::store_position()/restore_position() with the cursor positioned on the p1 : 'sup' record, the following might happen: a concurrent purge on page p2 might delete some record from it (e.g. 'f') and decide to merge this page into page p1. If this happens while the latches are released, the merge goes through, resulting in page p1 with the following contents: p1 : [inf, a, b, c, d, e, sup]. The savepoint for p1 : 'sup' won't be invalidated (one can say that savepoints for sup and inf are not safe against concurrent merges in this respect), and after restoration of the cursor the iteration will continue on the next page, skipping records 'd' and 'e'.

***

Fix:
----
This patch solves the problem by working around the issue with saving/restoring a cursor pointing to the supremum. Instead of storing the position of the supremum record, PCursor::save_previous_user_record_as_last_processed() now stores the position of the record that precedes it.

PCursor::restore_to_first_unprocessed() then does the restore in two steps: 1) it restores the position of this preceding record (or its closest predecessor if it was purged meanwhile), and then 2) moves one step forward, assuming that it will get to the supremum record at which the cursor pointed originally. If this is not true, i.e. there is a user record added to the page by a merge (or a simple concurrent insert), we assume that this record and the following ones are unprocessed. The caller of PCursor::restore_to_first_unprocessed() detects this situation by checking whether the cursor is positioned on the supremum, and if not, handles it by resuming processing from the record under the cursor.

***

Let us return to the above example to explain how the fix works. PCursor::save_previous_user_record_as_last_processed() does a step back before calling btr_pcur_t::store_position(), so for a cursor positioned on p1 : 'sup' it is actually the position corresponding to p1 : 'c' that is saved. If the merge happens while the latches are released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint is not invalidated. PCursor::restore_to_first_unprocessed() calls btr_pcur_t::restore_position(), gets a cursor pointing to p1 : 'c' as the result, and then tries to compensate for the step back by moving the cursor one step forward, making it point to p1 : 'd'. The code which does the scanning detects that saving/restoring resulted in a jump from the supremum record to a user record and resumes iteration from p1 : 'd' without skipping any records.

***

Thanks to the Bytedance team for bringing this issue to our attention! The test case for this bug is based on the one that they reported upstream and later pointed out to us.
1 parent 512384a commit be02aa6

File tree

3 files changed: +222, -10 lines


mysql-test/suite/innodb/r/percona_alter_debug.result

Lines changed: 88 additions & 0 deletions
@@ -22,3 +22,91 @@ connection default;
 # Cleanup.
 SET DEBUG_SYNC= 'RESET';
 DROP TABLE t1;
+#
+# PS-9703 "Upstream 8.0.41 release does not fully fix PS-9144"/
+# Bug#117436 "PCursor::move_to_next_block may skip records incorrectly".
+#
+# Execution of ALTER TABLE INPLACE might sometimes result in row
+# loss when merge of pages in table's B-tree is happening concurrently.
+#
+# Rows of the table should be big enough to easily cause page split,
+# OTOH they should be small enough to be easily merged back as well.
+# In our case:
+# 5 * 3300 > available space for rows in 16k page,
+# 2 * 3300 < 8k which is merge threshold for 16k page.
+CREATE TABLE t1(id INT PRIMARY KEY, val VARCHAR(3300));
+INSERT INTO t1 VALUES (1, REPEAT('1', 3300));
+INSERT INTO t1 VALUES (2, REPEAT('2', 3300));
+INSERT INTO t1 VALUES (3, REPEAT('3', 3300));
+INSERT INTO t1 VALUES (4, REPEAT('4', 3300));
+INSERT INTO t1 VALUES (5, REPEAT('5', 3300));
+INSERT INTO t1 VALUES (7, REPEAT('7', 3300));
+#
+# At this point leaf pages of B-tree for our table look like:
+#
+# [1,2] - [3,4,5,7]
+#
+# The below insert will cause split of the second leaf page, so
+# we end up with the following picture of leaf pages after it:
+#
+# [1,2] - [3,4] - [5,6,7]
+INSERT INTO t1 VALUES (6, REPEAT('6', 3300));
+#
+# Delete marks middle row in the third leaf page as deleted.
+# Once purge is enabled, the row will be removed from the page
+# for real, causing the contents of the third leaf page to
+# be merged into second:
+#
+# [1,2] - [3,4] - [5,*6,7] => [1,2] - [3,4,5,7]
+SET GLOBAL innodb_purge_stop_now = ON;
+DELETE FROM t1 WHERE id=6;
+connect con1, localhost, root,,;
+# Make ALTER TABLE to process all leaf pages as a single scan.
+SET GLOBAL DEBUG="+d,parallel_reader_force_single_range";
+# Ensure that we always release latches when moving from
+# page to page.
+SET GLOBAL DEBUG="+d,pcursor_move_to_next_block_release_latches";
+SET DEBUG_SYNC="pcursor_move_to_next_block_latches_released SIGNAL latches_released WAIT_FOR continue EXECUTE 2";
+# Send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE;
+connection default;
+# Wait until we reach end of the first leaf page:
+# [1,2 <we are here>] - [3, 4] - ...
+# And signal ALTER TABLE to proceed to the next leaf page.
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+SET DEBUG_SYNC="now SIGNAL continue";
+# Wait until we reach end of the second leaf page:
+# [1,2] - [3,4 <we are here>] - [5,*6,7]
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+# Unleash the purge and wait till it completes, so row 6 is
+# removed for real and the third leaf page is merged into the
+# second one.
+SET GLOBAL innodb_purge_run_now = ON;
+#
+# We should end up with:
+# [1,2] - [3,4, <we are here> 5,7]
+#
+# While before the fix we got:
+# [1,2] - [3,4,5,7 <we are here>]
+#
+# Resume ALTER TABLE.
+SET DEBUG_SYNC="now SIGNAL continue";
+connection con1;
+# Reap ALTER TABLE.
+# Check that all rows are present in table (except deleted row 6).
+# Before the fix rows 5 and 7 were missing as well.
+SELECT id FROM t1;
+id
+1
+2
+3
+4
+5
+7
+# Clean up.
+SET GLOBAL DEBUG="-d,pcursor_move_to_next_block_release_latches";
+SET GLOBAL DEBUG="-d,parallel_reader_force_single_range";
+SET DEBUG_SYNC='RESET';
+disconnect con1;
+connection default;
+DROP TABLE t1;

mysql-test/suite/innodb/t/percona_alter_debug.test

Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -36,4 +36,106 @@ SET DEBUG='-d,ddl_buf_add_two';
 SET DEBUG_SYNC= 'RESET';
 DROP TABLE t1;
 
+--echo #
+--echo # PS-9703 "Upstream 8.0.41 release does not fully fix PS-9144"/
+--echo # Bug#117436 "PCursor::move_to_next_block may skip records incorrectly".
+--echo #
+--echo # Execution of ALTER TABLE INPLACE might sometimes result in row
+--echo # loss when merge of pages in table's B-tree is happening concurrently.
+
+--echo #
+--echo # Rows of the table should be big enough to easily cause page split,
+--echo # OTOH they should be small enough to be easily merged back as well.
+--echo # In our case:
+--echo # 5 * 3300 > available space for rows in 16k page,
+--echo # 2 * 3300 < 8k which is merge threshold for 16k page.
+CREATE TABLE t1(id INT PRIMARY KEY, val VARCHAR(3300));
+INSERT INTO t1 VALUES (1, REPEAT('1', 3300));
+INSERT INTO t1 VALUES (2, REPEAT('2', 3300));
+INSERT INTO t1 VALUES (3, REPEAT('3', 3300));
+INSERT INTO t1 VALUES (4, REPEAT('4', 3300));
+INSERT INTO t1 VALUES (5, REPEAT('5', 3300));
+INSERT INTO t1 VALUES (7, REPEAT('7', 3300));
+
+--echo #
+--echo # At this point leaf pages of B-tree for our table look like:
+--echo #
+--echo # [1,2] - [3,4,5,7]
+
+--echo #
+--echo # The below insert will cause split of the second leaf page, so
+--echo # we end up with the following picture of leaf pages after it:
+--echo #
+--echo # [1,2] - [3,4] - [5,6,7]
+
+INSERT INTO t1 VALUES (6, REPEAT('6', 3300));
+
+--echo #
+--echo # Delete marks middle row in the third leaf page as deleted.
+--echo # Once purge is enabled, the row will be removed from the page
+--echo # for real, causing the contents of the third leaf page to
+--echo # be merged into second:
+--echo #
+--echo # [1,2] - [3,4] - [5,*6,7] => [1,2] - [3,4,5,7]
+
+SET GLOBAL innodb_purge_stop_now = ON;
+DELETE FROM t1 WHERE id=6;
+
+--connect(con1, localhost, root,,)
+--echo # Make ALTER TABLE to process all leaf pages as a single scan.
+SET GLOBAL DEBUG="+d,parallel_reader_force_single_range";
+--echo # Ensure that we always release latches when moving from
+--echo # page to page.
+SET GLOBAL DEBUG="+d,pcursor_move_to_next_block_release_latches";
+SET DEBUG_SYNC="pcursor_move_to_next_block_latches_released SIGNAL latches_released WAIT_FOR continue EXECUTE 2";
+
+--echo # Send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+--send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
+
+--connection default
+--echo # Wait until we reach end of the first leaf page:
+--echo # [1,2 <we are here>] - [3, 4] - ...
+--echo # And signal ALTER TABLE to proceed to the next leaf page.
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+SET DEBUG_SYNC="now SIGNAL continue";
+
+--echo # Wait until we reach end of the second leaf page:
+--echo # [1,2] - [3,4 <we are here>] - [5,*6,7]
+SET DEBUG_SYNC="now WAIT_FOR latches_released";
+
+--echo # Unleash the purge and wait till it completes, so row 6 is
+--echo # removed for real and the third leaf page is merged into the
+--echo # second one.
+SET GLOBAL innodb_purge_run_now = ON;
+--source include/wait_innodb_all_purged.inc
+
+--echo #
+--echo # We should end up with:
+--echo # [1,2] - [3,4, <we are here> 5,7]
+--echo #
+--echo # While before the fix we got:
+--echo # [1,2] - [3,4,5,7 <we are here>]
+--echo #
+--echo # Resume ALTER TABLE.
+SET DEBUG_SYNC="now SIGNAL continue";
+
+--connection con1
+--echo # Reap ALTER TABLE.
+--reap
+
+--echo # Check that all rows are present in table (except deleted row 6).
+--echo # Before the fix rows 5 and 7 were missing as well.
+SELECT id FROM t1;
+
+--echo # Clean up.
+SET GLOBAL DEBUG="-d,pcursor_move_to_next_block_release_latches";
+SET GLOBAL DEBUG="-d,parallel_reader_force_single_range";
+SET DEBUG_SYNC='RESET';
+
+--disconnect con1
+--source include/wait_until_disconnected.inc
+
+--connection default
+DROP TABLE t1;
+
 --disable_connect_log

storage/innobase/row/row0pread.cc

Lines changed: 32 additions & 10 deletions
@@ -232,7 +232,7 @@ class PCursor {
 
   /** This method must be called after all records on a page are processed and
   cursor is positioned at supremum. Under this assumption, it stores the
-  position BTR_PCUR_AFTER the last user record on the page.
+  position of the last user record on the page.
   This method must be paired with restore_to_first_unprocessed() to restore to
   a record which comes right after the value of the stored last processed
   record @see restore_to_first_unprocessed for details. */
@@ -434,13 +434,24 @@ void PCursor::restore_to_last_processed_user_record() noexcept {
 void PCursor::save_previous_user_record_as_last_processed() noexcept {
   ut_a(m_pcur->is_after_last_on_page());
   ut_ad(m_read_level == 0);
+  /*
+    Instead of simply taking savepoint for cursor pointing to supremum we
+    are doing one step back. This is necessary to prevent situation when
+    concurrent merge from the next page, which happens after we commit
+    mini-transaction/release latches, moves records over supremum to the
+    current page. In this case the optimistic restore of cursor pointing
+    to supremum will result in cursor pointing to supremum, which means
+    moved records will be incorrectly skipped by the scan.
+
+    So, effectively, we are saving position of last user record which we
+    have processed/which corresponds to last key value we have processed!
+  */
+  m_pcur->move_to_prev_on_page();
   m_pcur->store_position(m_mtr);
-  ut_a(m_pcur->m_rel_pos == BTR_PCUR_AFTER);
   m_mtr->commit();
 }
 
 void PCursor::restore_to_first_unprocessed() noexcept {
-  ut_a(m_pcur->m_rel_pos == BTR_PCUR_AFTER);
   ut_ad(m_read_level == 0);
   m_mtr->start();
   m_mtr->set_log_mode(MTR_LOG_NO_REDO);
@@ -449,13 +460,21 @@ void PCursor::restore_to_first_unprocessed() noexcept {
   /* Restored cursor is positioned on the page at the level intended before */
   ut_ad(m_read_level == btr_page_get_level(m_pcur->get_page()));
 
-  /* The BTR_PCUR_IS_POSITIONED_OPTIMISTIC only happens in case of a successful
-  optimistic restoration in which the cursor points to a user record after
-  restoration. But, in save_previous_user_record_as_last_processed() the cursor
-  pointed to SUPREMUM before calling m_ptr->store_position(mtr), so it would
-  also point there if optimistic restoration succeeded, which is not a user
-  record. */
-  ut_ad(m_pcur->m_pos_state != BTR_PCUR_IS_POSITIONED_OPTIMISTIC);
+  /*
+    The cursor points to last processed record, or, if it has been purged
+    meanwhile, its closest non-purged predecessor.
+    By moving to the successor of the saved record we position the cursor
+    either to supremum record (which means we restored the original cursor
+    position and can continue switch to the next page as usual) or to
+    some user record which our scan have not processed yet (for example,
+    this record might have been moved from the next page due to page
+    merge or simply inserted to our page concurrently).
+
+    The latter case is detected by caller by doing !is_after_last_on_page()
+    check and instead of doing switch to the next page we continue processing
+    from the restored user record.
+  */
+  m_pcur->move_to_next_on_page();
 }
 
 bool PCursor::restore_to_largest_le_position_saved() noexcept {
@@ -1326,6 +1345,9 @@ dberr_t Parallel_reader::Scan_ctx::create_ranges(const Scan_range &scan_range,
       start = nullptr;
 
       page_cur_move_to_next(&page_cursor);
+
+      DBUG_EXECUTE_IF("parallel_reader_force_single_range",
+                      page_cur_set_after_last(block, &page_cursor););
     }
 
     savepoints.push_back(savepoint);
