Commit e331f9f
PS-9214 : Alter table online results in "duplicate key" error on the primary key (only index).
https://perconadev.atlassian.net/browse/PS-9214
Problem:
--------
ALTER TABLE which rebuilds an InnoDB table using the INPLACE algorithm might
occasionally fail with an unwarranted duplicate primary key error if there
are concurrent insertions into the table, even though these insertions do
not cause any PK conflict.
Analysis:
---------
A new implementation of parallel ALTER TABLE INPLACE in InnoDB was introduced
in MySQL 8.0.27. Its code is used for online table rebuild even in the
single-threaded case.
This implementation iterates over all the rows in the table, in the general
case handling different subtrees of the B-tree in different threads. This
iteration over table rows needs to be paused from time to time to commit the
InnoDB MTR and release the page latches it holds. This is necessary to give
way to concurrent actions on the B-tree being scanned, or before flushing
rows of the new version of the table from the in-memory buffer to the B-tree.
To resume iteration after such a pause, the persistent cursor position saved
before the pause is restored.
The cause of the problem described above lies in how the PCursor::savepoint()
and PCursor::resume() methods save and restore the cursor position.
Instead of storing the position of the record the cursor currently points
to, savepoint() stores the position of the record that precedes it. resume()
then restores in two steps: 1) it restores the position of this preceding
record (or of its closest predecessor if it was purged meanwhile), and 2) it
moves one step forward, assuming that this gets it to the record at which
the cursor originally pointed.
Such an approach makes sense when we save/restore a cursor pointing to the
page's supremum record before switching to a new page, as it avoids records
being skipped when the next page is merged into the current one while
latches are released (records which we have not scanned yet are then moved
before the supremum record, but after the record which originally preceded
the supremum).
***
Let us take a look at an example. Let us assume that we have two pages,
p1 : [inf, a, b, c, sup] and the next one p2 : [inf, d, e, f, sup].
Our thread, which is one of the parallel ALTER TABLE worker threads,
has completed the scan of p1, so its cursor is positioned on the p1 : 'sup'
record. Now it needs to switch to page p2, but also give way to threads
concurrently updating the table. So it needs to make a cursor savepoint,
commit the mini-transaction and release the latches.
If a naive approach to making the savepoint is used and we simply do
btr_pcur_t::store_position()/restore_position() with the cursor
positioned on the p1 : 'sup' record, then the following might happen:
concurrent purge on page p2 might delete some record from it
(e.g. 'f') and decide to merge this page into page p1.
If this happens while latches are released, the merge goes through,
resulting in page p1 with the following contents:
p1 : [inf, a, b, c, d, e, sup]. The savepoint for p1 : 'sup' won't
be invalidated (one can say that savepoints for sup and inf are not
safe against concurrent merges in this respect), and after restoration
of the cursor the iteration will continue on the next page, skipping
records 'd' and 'e'.
With the non-naive approach implemented at the moment, PCursor::savepoint()
does a step back before calling btr_pcur_t::store_position(), so for a
cursor positioned on p1 : 'sup' it is actually the position corresponding
to p1 : 'c' that is saved. If the merge happens while latches are
released, we still get p1 : [inf, a, b, c, d, e, sup] and the savepoint
is not invalidated. PCursor::resume() calls btr_pcur_t::restore_position(),
gets a cursor pointing to p1 : 'c' as a result, and then compensates for the
step back in PCursor::savepoint() by moving the cursor one step forward,
making it point to p1 : 'd'. The scanning code detects that
savepoint()/resume() resulted in a jump from the supremum record to a user
record and resumes iteration from p1 : 'd' without skipping any records.
***
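The page-merge example above can be replayed with a small toy model. This is
only a sketch under simplifying assumptions: pages are plain vectors of
strings, a "savepoint" is just the remembered record value, and the names
Page, find_rec and scan_across_merge are hypothetical, not InnoDB code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy page: user records bracketed by the "inf" and "sup" pseudo-records.
using Page = std::vector<std::string>;

// Hypothetical helper: index of `rec` on `page` (assumed to be present).
std::size_t find_rec(const Page &page, const std::string &rec) {
  return static_cast<std::size_t>(
      std::find(page.begin(), page.end(), rec) - page.begin());
}

// Returns every record the scan visits when it pauses on p1 : 'sup' and
// p2 is merged into p1 while the latches are released.
std::vector<std::string> scan_across_merge(bool step_back_on_save) {
  Page p1{"inf", "a", "b", "c", "sup"};
  Page p2{"inf", "d", "e", "f", "sup"};
  std::vector<std::string> seen{"a", "b", "c"};  // p1 is already scanned
  std::size_t pos = find_rec(p1, "sup");         // cursor on p1 : 'sup'

  // Savepoint: either the naive one ('sup') or the stepped-back one ('c').
  const std::string saved = step_back_on_save ? p1[pos - 1] : p1[pos];

  // Latches released: purge deletes 'f', then p2 is merged into p1.
  p2.erase(p2.begin() + static_cast<std::ptrdiff_t>(find_rec(p2, "f")));
  p1.insert(p1.end() - 1, p2.begin() + 1, p2.end() - 1);
  p2.clear();  // p2 is gone, so p1 has no next page left to visit

  // Resume: restore the saved position, then undo the step back if any.
  pos = find_rec(p1, saved);
  if (step_back_on_save) ++pos;

  // Continue with whatever is left on p1 before its supremum.
  for (; p1[pos] != "sup"; ++pos) seen.push_back(p1[pos]);
  return seen;
}
```

In this model scan_across_merge(false) never visits 'd' and 'e', while
scan_across_merge(true) resumes on 'd' and visits both.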
However, the step back is not necessary, and becomes problematic, when we
save/restore a cursor pointing to a user record from within the record
processing callback. In this case the record whose position we are trying to
save when savepoint() is called can be considered already processed, as the
corresponding value will be inserted into the output buffer soon after the
restore. When a concurrent insert adds a new record between the record whose
position we intended to save by calling savepoint() and its predecessor,
whose position this call stored internally, the later call to resume() will
position the cursor at this newly inserted record. This leads to the resumed
scan revisiting the original record once again. As a result, the code will
attempt to add this original record to the output buffer one more time and
get a duplicate key error.
***
Let us take a look at an example once again. Let us assume that a
parallel ALTER TABLE thread is scanning page p1 with the following
contents: p1 : [inf, a, b, c, d, sup]. It has processed record 'c'
(so the cursor points to p1 : 'c') and decides that it needs to take a
savepoint, commit the mini-transaction and release latches in order to
flush the in-memory buffer with new versions of the records.
PCursor::savepoint() does a step back and calls
btr_pcur_t::store_position(), which saves the position corresponding to
p1 : 'b'. After the latches are released, a concurrent insert into the page
might add record 'b1' between records 'b' and 'c', resulting in the page
looking like p1 : [inf, a, b, b1, c, d, sup].
After that PCursor::resume() will call restore_position(), which will
restore the cursor pointing to p1 : 'b', and then will try to compensate for
the step back in savepoint() by moving the cursor one step forward to
p1 : 'b1'. The scanning code will continue its iteration using this cursor
by moving to the next record, p1 : 'c', and trying to process it once again,
resulting in a duplicate key error.
***
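The duplicate-key example above can be replayed the same way. Again this is
a toy sketch, not InnoDB code: the page is a vector of strings, the output
buffer is a vector of the rows handed over by the scan, and all names
(Page, find_rec, scan_with_concurrent_insert, has_duplicate_key) are
hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy page: user records bracketed by the "inf" and "sup" pseudo-records.
using Page = std::vector<std::string>;

// Hypothetical helper: index of `rec` on `page` (assumed to be present).
std::size_t find_rec(const Page &page, const std::string &rec) {
  return static_cast<std::size_t>(
      std::find(page.begin(), page.end(), rec) - page.begin());
}

// Rows added to the output buffer by a scan that pauses right after
// processing 'c', using the pre-fix step-back/step-forward logic.
std::vector<std::string> scan_with_concurrent_insert() {
  Page p1{"inf", "a", "b", "c", "d", "sup"};
  std::vector<std::string> out{"a", "b", "c"};  // 'c' is already processed
  std::size_t pos = find_rec(p1, "c");          // cursor on p1 : 'c'

  // Pre-fix savepoint(): step back and store the predecessor, 'b'.
  const std::string saved = p1[pos - 1];

  // Latches released: a concurrent insert adds 'b1' between 'b' and 'c'.
  p1.insert(p1.begin() + static_cast<std::ptrdiff_t>(pos), std::string("b1"));

  // Pre-fix resume(): restore to 'b', then step forward, landing on 'b1'.
  pos = find_rec(p1, saved) + 1;

  // The scan moves to the next record and keeps processing from there.
  for (++pos; p1[pos] != "sup"; ++pos) out.push_back(p1[pos]);
  return out;
}

// True if the same key was handed to the output buffer more than once.
bool has_duplicate_key(std::vector<std::string> rows) {
  std::sort(rows.begin(), rows.end());
  return std::adjacent_find(rows.begin(), rows.end()) != rows.end();
}
```

In this model the output buffer receives 'c' twice, which corresponds to
the unwarranted duplicate key error.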
Fix:
----
This patch solves the problem by adjusting the PCursor::savepoint()/resume()
logic not to do this step back on save and step forward on restore if we are
saving/restoring a cursor pointing to a user record (in which case it
is not necessary). This process is still used when we are saving/restoring
a cursor pointing to the page supremum (where it is useful).
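The adjusted logic can be sketched with a toy cursor that steps back only
when it sits on the supremum. This is an illustration under simplifying
assumptions, not the patched InnoDB code: ToyCursor, its members and the two
scenario functions are hypothetical, and the case where the saved record is
purged while latches are released is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy page: user records bracketed by the "inf" and "sup" pseudo-records.
using Page = std::vector<std::string>;

// Hypothetical stand-in for PCursor; member names are illustrative only.
struct ToyCursor {
  Page *page;
  std::size_t pos;
  std::string saved;
  bool stepped_back;

  void savepoint() {
    // Step back only for the supremum; a user record is stored as-is.
    stepped_back = ((*page)[pos] == "sup");
    if (stepped_back) --pos;
    saved = (*page)[pos];
  }

  void resume() {
    pos = static_cast<std::size_t>(
        std::find(page->begin(), page->end(), saved) - page->begin());
    if (stepped_back) ++pos;  // compensate only if savepoint() stepped back
  }
};

// User-record case: pause after processing 'c', 'b1' is inserted meanwhile.
std::vector<std::string> fixed_scan_with_concurrent_insert() {
  Page p1{"inf", "a", "b", "c", "d", "sup"};
  std::vector<std::string> out{"a", "b", "c"};
  ToyCursor cur{&p1, 3, "", false};  // cursor on p1 : 'c'
  cur.savepoint();                   // stores 'c' itself
  p1.insert(p1.begin() + 3, std::string("b1"));
  cur.resume();                      // back on 'c'
  for (++cur.pos; p1[cur.pos] != "sup"; ++cur.pos) out.push_back(p1[cur.pos]);
  return out;
}

// Supremum case: pause on p1 : 'sup', records 'd' and 'e' are merged in.
std::vector<std::string> fixed_scan_across_merge() {
  Page p1{"inf", "a", "b", "c", "sup"};
  const Page merged{"d", "e"};
  std::vector<std::string> out{"a", "b", "c"};
  ToyCursor cur{&p1, 4, "", false};  // cursor on p1 : 'sup'
  cur.savepoint();                   // steps back and stores 'c'
  p1.insert(p1.end() - 1, merged.begin(), merged.end());
  cur.resume();                      // restores 'c', steps forward to 'd'
  for (; p1[cur.pos] != "sup"; ++cur.pos) out.push_back(p1[cur.pos]);
  return out;
}
```

In this model the user-record case hands over a, b, c, d with no repeated
'c', and the supremum case still picks up 'd' and 'e' after the merge.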
The patch also adds some comments explaining how this code works and a few
debug asserts enforcing its invariants.
1 parent d9d50ad commit e331f9f
File tree
3 files changed, +166 -14 lines changed
- mysql-test/suite/innodb
  - r
  - t
- storage/innobase/row