Commit 18e3a9d
fix: fix CS kill/crash when writing data
Recent tests show chunks unavailable when performing the following
test:
- start writing small files in ec(6,2) in the background.
- kill two chunkservers.
- wait for the writes of the files to finish.
- bring the two chunkservers back.
- wait for the data to be replicated.
- stop some other two chunkservers.
- validate data is available.
In the last step, there are six chunkservers available and no chunk
parts missing, so there should be no chunks unavailable. The error
observed was a CRC error in the killed and restarted chunkservers.
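The expectation above follows from the ec(6,2) arithmetic. A minimal sketch, assuming eight chunkservers that each hold one part of a chunk (the function and constant names are hypothetical, not taken from the code base):

```python
DATA_PARTS = 6    # ec(6,2): six data parts
PARITY_PARTS = 2  # ec(6,2): two parity parts


def chunk_is_readable(valid_parts: int) -> bool:
    """A chunk can be rebuilt from any DATA_PARTS of its parts."""
    return valid_parts >= DATA_PARTS


total_parts = DATA_PARTS + PARITY_PARTS  # 8 parts, one per chunkserver (assumed)
stopped_servers = 2                      # last step of the test

# Expected: all parts on the six running chunkservers are valid.
expected_valid = total_parts - stopped_servers
print(chunk_is_readable(expected_valid))

# Observed: a part on a previously killed chunkserver fails its CRC check,
# so fewer than six usable parts remain for that chunk.
observed_valid = expected_valid - 1
print(chunk_is_readable(observed_valid))
```

With two chunkservers stopped there is no redundancy left, so a single part with a bad CRC is enough to make the chunk unavailable.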
The issue found is the following:
- some chunk gets its data parts successfully written to the drive.
- the client gets to know this (chunk write finished OK) and sends
WRITE_END packet to the CSs.
- the CS gets killed after receiving the WRITE_END but before doing
the job_close (hddClose), which is the function responsible for syncing
the metadata parts to the drive. Therefore, the data parts of those
chunks are fine, but the CRC of the blocks is incorrect.
- the client unlocks the chunk in the master side (WRITE_END packet)
without noticing any issue and without retrying the write (since it
finished everything it had to write).
- there is no version increase in the other chunk parts and after the
CS is restarted, its chunk parts are registered as good ones, despite
the previously mentioned CRC error (which no component knows about).
- after stopping other CSs and trying to write, the issue emerges.
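The sequence above can be reproduced with a toy model. This is only a sketch: ToyChunkPart and its methods are hypothetical and do not mirror the real chunkserver classes; it assumes the CRC stays in memory until close, as described for job_close (hddClose).

```python
import zlib


class ToyChunkPart:
    """Toy chunk part: data goes to the drive on write, CRC only on close."""

    def __init__(self) -> None:
        self.disk_data = b""
        self.disk_crc = zlib.crc32(b"")  # CRC currently stored on the drive
        self.pending_crc = None          # CRC held in memory until close

    def write(self, data: bytes) -> None:
        self.disk_data = data            # data part reaches the drive
        self.pending_crc = zlib.crc32(data)

    def close(self) -> None:
        # Stand-in for job_close (hddClose): sync the metadata to the drive.
        if self.pending_crc is not None:
            self.disk_crc = self.pending_crc
            self.pending_crc = None

    def kill(self) -> None:
        # Process dies: everything that was only in memory is lost.
        self.pending_crc = None

    def crc_ok(self) -> bool:
        return zlib.crc32(self.disk_data) == self.disk_crc


# Killed after WRITE_END but before close: data is fine, CRC is stale.
killed = ToyChunkPart()
killed.write(b"payload")
killed.kill()

# Close completes before the kill: data and CRC agree.
closed = ToyChunkPart()
closed.write(b"payload")
closed.close()
closed.kill()

print(killed.crc_ok(), closed.crc_ok())
```

In the model, as in the report, nothing outside the killed part knows that its stored CRC no longer matches its data.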
The solution so far is to move the endChunkLock call to after the
job_close is processed and to increase the priority of the close
operations. This way we make sure that the master is notified of the
write end only after all operations related to that chunk part are
completed.
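The effect of the reordering can be checked by enumerating every point at which the CS could be killed. A minimal sketch with hypothetical step names ("close" stands for job_close, "unlock" for endChunkLock); it ignores the priority change and any concurrency:

```python
def completed_steps(close_before_unlock: bool, kill_after: int) -> set:
    """Return the steps that finished before the CS was killed."""
    if close_before_unlock:
        steps = ["close", "unlock"]  # ordering after this commit
    else:
        steps = ["unlock", "close"]  # ordering before this commit
    return set(steps[:kill_after])


def inconsistent(done: set) -> bool:
    """Master saw the write end, but the CRC never reached the drive."""
    return "unlock" in done and "close" not in done


# Kill the CS after 0, 1 or 2 completed steps under each ordering.
old_order = [inconsistent(completed_steps(False, k)) for k in range(3)]
new_order = [inconsistent(completed_steps(True, k)) for k in range(3)]
print(old_order, new_order)
```

Under the old ordering there is one kill point that leaves the chunk unlocked with a stale CRC; under the new ordering a kill at any point leaves the chunk either still locked or fully closed.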
This solution does not solve the case when
USE_CHUNKSERVER_SIDE_CHUNK_LOCK option is disabled.
A test was added to check the previously mentioned scenario.
Signed-off-by: Dave <dave@leil.io>
1 parent 238a9c8 commit 18e3a9d
5 files changed (+145, -9) in:
- src/chunkserver
- tests/test_suites/LongSystemTests
Lines changed: 100 additions & 0 deletions