Skip to content

Commit e331f9f

Browse files
committed
PS-9214 : Alter table online results in "duplicate key" error on the primary key (only index).
https://perconadev.atlassian.net/browse/PS-9214 Problem: -------- ALTER TABLE with rebuilds InnoDB table using INPLACE algorithm occasionally might fail with unwarranted duplicate primary key error if there are concurrent insertions into the table, even though these insertions do not icause any PK conflict. Analysis: --------- New implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced in MySQL 8.0.27. Its code is used for online table rebuild even in a single-thread case. This implementation iterates over all the rows in the table, in general case, handling different subtrees of a B-tree in different threads. This iteration over table rows needs to be paused, from time to time, to commit InnoDB MTR/ release page latches it holds. This is necessary to give a way to concurrent actions on the B-tree scanned or before flushing rows of new version of table from in-memory buffer to the B-tree. In order to resume iteration after such pause persistent cursor position saved before pause is restored. The cause of the problem described above lies in how PCursor::savepoint() and PCursor::resume() methods perform this saving and restore of cursor position. Instead of storing position of current record pointed by cursor savepoint() stores the position of record that precedes it, and then resume() does restore in two steps - 1) restores position of this preceding record (or its closest precedessor if it was purged meanwhile) and then 2) moves one step forward assuming that will get to the record at which cursor pointed originally. Such approach makes sense when we try to save/restore cursor pointing to page's supremum record before switching to a new page, as it allows to avoid problems with records being skipped when the next page is merged into the current one while latches are released (so records which we have not scanned yet are moved over supremum record, but not over the record which originally preceded supremum). *** Let us take look at an example. Let us assume that we have two pages p1 : [inf, a, b, c, sup] and the next one p2 : [inf, d, e, f, sup]. Out thread which is one of the parallel ALTER TABLE worker threads has completed scan of p1, so its cursor positioned on p1:'sup' record. Now it needs to switch to page p2, but also give a way to threads concurrently updating the table. So it needs to make cursor savepoint, commit mini-transaction and release the latches. If naive approach to making savepoint is used and we simply do btr_pcur_t::store_position()/restore_position() with the cursor positioned on p1 : 'sup' record, then the following might happen: concurrent purge on page p2 might delete some record from it (e.g. 'f') and decide to merge of this page into the page p1. If this happens while latches are released this merge would go through and and resulting in page p1 with the following contents p1 : [inf, a, b, c, d, e, sup]. Savepoint for p1 : 'sup' won't be invalidated (one can say that savepoints for sup and inf are not safe against concurrent merges in this respect) and after restoration of cursor the iteration will continue, on the next page, skipping records 'd' and 'e'. With non-naive approach implemented at the moment, PCursor::savepoint() does a step back before calling btr_pcur_t::store_position(), so for cursor positioned on p1 : 'sup' it is actually position corresponding to p1 : 'c' what is saved. If the merge happens when latches are released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint is not invalidated. PCursor::resume() calls btr_pcur_t::restore_position() gets cursor pointing to p1 : 'c' as result, and then it tries to compensate for step-back in PCursor::savepoint() and moves cursor one step forward making it to point to p1 : 'd'. Code which does scanning detects the situation that we savepoint()/resume() resulted in jump from supremum record to user record and resume iteration from p1 : 'd' without skipping any records. *** However, it is not necessary and becomes problematic we try to save/restore cursor pointing to user record from within record processing callback. In this case record which position we are trying to save when savepoint() method is called can be considered already processed as corresponding value already will be inserted into output buffer soon after restore. When a concurrent insert adds a new record between the record which position we have inteded to save by calling savepoint() and its precedessor which position this call stored internally, the later call to resume() will position cursor at this newly inserted record. This will lead to the resumed scan revisiting original record once again. As result the code will attempt to add this original record into the output buffer one more time and get duplicate key error. *** Let us take a look at an example once again. Let us assume that parallel ALTER TABLE thread is scanning page p1 with the following contents - p1 : [inf, a, b, c, d, sup]. It has processed record 'c' (so cursor points to p1: 'c') and decides that it needs take savepoint/ commit mini-transaction and release latches in order to flush in-memory buffer with new versions of the records. PCursor::savepoint() does a step back and calls btr_pcur_t::store_position() which saves position corresponding to p1 : 'b'. After the latches are released concurrent insert might happen into the page adding record 'b1', between records 'b' and 'c', resulting in page looking like p1: [inf, a, b, b1, c, d, sup]. After that PCursor::resume() will call restore_position() which will restore cursor pointing to p1 : 'b' and then will try to compensate for step-back in savepoint() by moving cursor one step forward to p1 : 'b1'. The scanning code will continue its iteration using this cursor, by moving to the next record - p1 :'c' and trying to process it once again, resulting in duplicate key error. *** Fix: --- This patch solves the problem by adjusting PCursors::savepoint()/resume() logic not to do this step back on save/step forward on restore if we are trying to save/restore cursor pointing to a user record (in which case it is not necessary). This process still used when we are trying to save/ restore cursor pointing to page infimum (where it is useful). It also adds some comments explaining how this code works and a few debug asserts enforcing its invariants.
1 parent d9d50ad commit e331f9f

File tree

3 files changed

+166
-14
lines changed

3 files changed

+166
-14
lines changed

mysql-test/suite/innodb/r/percona_alter_debug.result

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,3 +31,27 @@ connection default;
3131
# Cleanup.
3232
SET DEBUG_SYNC= 'RESET';
3333
DROP TABLE t1;
34+
#
35+
# Test for PS-9214 : Alter table online results in "duplicate key"
36+
# error on the primary key (only index).
37+
#
38+
CREATE TABLE t1 (pk CHAR(5) PRIMARY KEY);
39+
INSERT INTO t1 VALUES ('aaaaa'), ('bbbbb'), ('ccccc'), ('ddddd'), ('eeeee');
40+
connect con1, localhost, root,,;
41+
SET DEBUG='+d,ddl_buf_add_two';
42+
SET DEBUG_SYNC='ddl_bulk_inserter_latches_released SIGNAL latches_released WAIT_FOR go';
43+
# Send ALTER TABLE INPLACE which rebuilds table.
44+
ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE;
45+
connection default;
46+
SET DEBUG_SYNC='now WAIT_FOR latches_released';
47+
INSERT INTO t1 VALUES ('ccaaa');
48+
SET DEBUG_SYNC='now SIGNAL go';
49+
connection con1;
50+
# Reap ALTER TABLE
51+
# Before fix it failed with duplicate key error.
52+
SET DEBUG='-d,ddl_buf_add_two';
53+
disconnect con1;
54+
connection default;
55+
# Cleanup.
56+
SET DEBUG_SYNC= 'RESET';
57+
DROP TABLE t1;

mysql-test/suite/innodb/t/percona_alter_debug.test

Lines changed: 35 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,12 @@
11
# Tests in this file use both debug-only facilities as well as debug sync point.
22
--source include/have_debug.inc
33
--source include/have_debug_sync.inc
4+
--enable_connect_log
45

56
--echo #
67
--echo # Test for PS-9144 : Missing rows after ALGORITHM=INPLACE ALTER under
78
--echo # same workload as PS-9092.
89
--echo #
9-
--enable_connect_log
1010
CREATE TABLE t1 (pk CHAR(5) PRIMARY KEY);
1111
INSERT INTO t1 VALUES ('aaaaa'), ('bbbbb'), ('bbbcc'), ('ccccc'), ('ddddd'), ('eeeee');
1212

@@ -41,4 +41,38 @@ SET DEBUG='-d,ddl_buf_add_two';
4141
--echo # Cleanup.
4242
SET DEBUG_SYNC= 'RESET';
4343
DROP TABLE t1;
44+
45+
--echo #
46+
--echo # Test for PS-9214 : Alter table online results in "duplicate key"
47+
--echo # error on the primary key (only index).
48+
--echo #
49+
CREATE TABLE t1 (pk CHAR(5) PRIMARY KEY);
50+
INSERT INTO t1 VALUES ('aaaaa'), ('bbbbb'), ('ccccc'), ('ddddd'), ('eeeee');
51+
52+
--connect (con1, localhost, root,,)
53+
SET DEBUG='+d,ddl_buf_add_two';
54+
SET DEBUG_SYNC='ddl_bulk_inserter_latches_released SIGNAL latches_released WAIT_FOR go';
55+
--echo # Send ALTER TABLE INPLACE which rebuilds table.
56+
--send ALTER TABLE t1 ENGINE=InnoDB, ALGORITHM=INPLACE
57+
58+
--connection default
59+
SET DEBUG_SYNC='now WAIT_FOR latches_released';
60+
INSERT INTO t1 VALUES ('ccaaa');
61+
SET DEBUG_SYNC='now SIGNAL go';
62+
63+
--connection con1
64+
--echo # Reap ALTER TABLE
65+
--echo # Before fix it failed with duplicate key error.
66+
--reap
67+
68+
SET DEBUG='-d,ddl_buf_add_two';
69+
70+
--disconnect con1
71+
--source include/wait_until_disconnected.inc
72+
73+
--connection default
74+
--echo # Cleanup.
75+
SET DEBUG_SYNC= 'RESET';
76+
DROP TABLE t1;
77+
4478
--disable_connect_log

storage/innobase/row/row0pread.cc

Lines changed: 107 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -251,12 +251,13 @@ class PCursor {
251251
return m_pcur->get_page_cur();
252252
}
253253

254-
/** Restore from a saved position.
254+
/** Restore from a saved position. Sets position the next user record
255+
in the tree if it is cursor pointing to supremum that was saved.
255256
@return DB_SUCCESS or error code. */
256257
[[nodiscard]] dberr_t restore_from_savepoint() noexcept;
257258

258-
/** Move to the first user rec on the restored page. */
259-
[[nodiscard]] dberr_t move_to_user_rec() noexcept;
259+
/** Move to the first user rec on the next page. */
260+
[[nodiscard]] dberr_t move_to_next_block_user_rec() noexcept;
260261

261262
/** @return true if cursor is after last on page. */
262263
[[nodiscard]] bool is_after_last_on_page() const noexcept {
@@ -276,6 +277,9 @@ class PCursor {
276277
/** Level where the cursor is positioned or need to be positioned in case of
277278
restore. */
278279
size_t m_read_level{};
280+
281+
/** Indicates that savepoint created for cursor corresponds to supremum. */
282+
bool m_supremum_saved{};
279283
};
280284

281285
buf_block_t *Parallel_reader::Scan_ctx::block_get_s_latched(
@@ -293,9 +297,42 @@ buf_block_t *Parallel_reader::Scan_ctx::block_get_s_latched(
293297
}
294298

295299
void PCursor::savepoint() noexcept {
296-
/* Store the cursor position on the previous user record on the page. */
297-
m_pcur->move_to_prev_on_page();
300+
/* The below code will not work correctly in all cases if we are to take
301+
savepoint of cursor pointing to infimum.
302+
303+
If savepoint is taken for such a cursor, in optimistic restore case we
304+
might not detect situation when after mini-trancation commit/latches
305+
release concurrent merge from the previous page or concurrent update of
306+
the last record on the previous page, which doesn't fit into it, has
307+
moved records over infimum (see MySQL bugs #114248 and #115518).
308+
As result scan will revisit these records, which can cause, for example,
309+
ALTER TABLE failing with a duplicate key error.
310+
311+
Luckily, at the moment, savepoint is taken only when cursor points
312+
to a user record (when we do save/restore from Parallel_reader::Ctx::m_f
313+
callback which processes records) or to the supremum (when we do save/
314+
restore from Parallel_reader::m_finish_callback end-of-page callback or
315+
before switching to next page in move_to_next_block()). */
316+
ut_ad(!m_pcur->is_before_first_on_page());
317+
318+
/* If we are taking savepoint for cursor pointing to supremum we are doing
319+
one step back. This is necessary to prevent situation when concurrent
320+
merge from the next page, which happens after we commit mini-transaction/
321+
release latches moves records over supremum to the current page.
322+
In this case the optimistic restore of cursor pointing to supremum will
323+
result in cursor pointing to supremum, which means moved records will
324+
be incorrectly skipped by the scan. */
325+
m_supremum_saved = m_pcur->is_after_last_on_page();
326+
if (m_supremum_saved) {
327+
m_pcur->move_to_prev_on_page();
328+
// Only root page can be empty and we are not supposed to get here
329+
// in this case.
330+
ut_ad(!m_pcur->is_before_first_on_page());
331+
}
298332

333+
/* Thanks to the above we can say that we are saving position of last user
334+
record which have processed/ which corresponds to last key value we have
335+
processed. */
299336
m_pcur->store_position(m_mtr);
300337

301338
m_mtr->commit();
@@ -307,17 +344,53 @@ void PCursor::resume() noexcept {
307344
m_mtr->set_log_mode(MTR_LOG_NO_REDO);
308345

309346
/* Restore position on the record, or its predecessor if the record
310-
was purged meanwhile. */
347+
was purged meanwhile. In extreme case this can be infimum record of
348+
the first page in the tree. However, this can never be a supremum
349+
record on a page, since we always move cursor one step back when
350+
savepoint() is called for infimum record.
351+
352+
We can be in either of the two situations:
353+
354+
1) We do save/restore from Parallel_reader:::Ctx::m_f callback for
355+
record processing. In this case we saved position of user-record for
356+
which callback was invoked originally, and which we already consider
357+
as processed. In theory, such record is not supposed be purged as our
358+
transaction holds read view on it. But in practice, it doesn't matter
359+
- even if the record was purged then our cursor will still point to
360+
last processed user record in the tree after its restoration.
361+
And Parallel_reader:::Ctx::m_f callback execution is always followed by
362+
page_cur_move_to_next() call, which moves our cursor to the next record
363+
and this record has not been processed yet.
364+
365+
2) We do save/restore from the Parallel_reader::m_finish_callback
366+
end-of-page callback or from move_to_next_block() before switching
367+
the page. In this case our cursor was positioned on supremum record
368+
before savepoint() call.
369+
Th savepoint() call raised m_supremum_saved flag and saved the position
370+
of the user record which preceded the supremum, i.e. the last user record
371+
which was processed.
372+
After the below call to restore_position() the cursor will point to
373+
the latter record, or, if it has been purged meanwhile, its closest
374+
non-purged predecessor (in extreme case this can be infimum of the first
375+
page in the tree!).
376+
By moving to the successor of the saved record we position the cursor
377+
either to supremum record (which means we restored original cursor
378+
position and can continue switch to the next page as usual) or to
379+
some user record which our scan have not processed yet (for example,
380+
this record might have been moved from the next page due to page
381+
merge or simply inserted to our page concurrently).
382+
The latter case is detected by caller by doing !is_after_last_on_page()
383+
check and instead of doing switch to the next page we continue processing
384+
from the restored user record. */
311385

312386
m_pcur->restore_position(BTR_SEARCH_LEAF, m_mtr, UT_LOCATION_HERE);
313387

314-
if (!m_pcur->is_after_last_on_page()) {
315-
/* Move to the successor of the saved record. */
316-
m_pcur->move_to_next_on_page();
317-
}
388+
ut_ad(!m_pcur->is_after_last_on_page());
389+
390+
if (m_supremum_saved) m_pcur->move_to_next_on_page();
318391
}
319392

320-
dberr_t PCursor::move_to_user_rec() noexcept {
393+
dberr_t PCursor::move_to_next_block_user_rec() noexcept {
321394
auto cur = m_pcur->get_page_cur();
322395
const auto next_page_no = btr_page_get_next(page_cur_get_page(cur), m_mtr);
323396

@@ -370,7 +443,28 @@ dberr_t PCursor::move_to_user_rec() noexcept {
370443

371444
dberr_t PCursor::restore_from_savepoint() noexcept {
372445
resume();
373-
return m_pcur->is_on_user_rec() ? DB_SUCCESS : move_to_user_rec();
446+
447+
/* We should not get cursor pointing to supremum when restoring cursor
448+
which has pointed to user record from the Parallel_reader:::Ctx::m_f
449+
callback. Otherwise the below call to move_to_next_block_user_rec()
450+
will position the cursor to the first user record on the next page
451+
and then calling code will do page_cur_move_to_next() right after
452+
that. Which means that one record will be incorrectly skipped by
453+
our scan. Luckily, this always holds with current code. */
454+
ut_ad(m_supremum_saved || !m_pcur->is_after_last_on_page());
455+
456+
/* Move to the next user record in the index tree if we are at supremum.
457+
Even though this looks awkward from the proper layering point of view,
458+
and ideally should have been done on the same code level as iteration
459+
over records, doing this here, actually, a) allows to avoid scenario in
460+
which savepoint/mini-transaction commit and latch release/restore happen
461+
for the same page twice - first time in page-end callback and then in
462+
move_to_next_block() call, b) handles a hypotetical situation when
463+
the index was reduced to the empty root page while latches were released
464+
by detecting it and bailing out early (it is important for us to bail out
465+
and not try to call savepoint() method in this case). */
466+
return (!m_pcur->is_after_last_on_page()) ? DB_SUCCESS
467+
: move_to_next_block_user_rec();
374468
}
375469

376470
dberr_t Parallel_reader::Thread_ctx::restore_from_savepoint() noexcept {
@@ -402,7 +496,7 @@ dberr_t PCursor::move_to_next_block(dict_index_t *index) {
402496

403497
err = restore_from_savepoint();
404498
} else {
405-
err = move_to_user_rec();
499+
err = move_to_next_block_user_rec();
406500
}
407501

408502
int n_retries [[maybe_unused]] = 0;

0 commit comments

Comments
 (0)