Commit f0b6652
Bug#37163647 SR recovery issues
Backport to 7.6
Test
testSystemRestart is extended with a new test GCPSaveLagLcpSR
which exercises a multi-node-group (multi-NG) system with:
- All nodes but one suffering GCP_SAVE lag
- A subsequent LCP being triggered
- A subsequent System Restart
This exposes the problem mentioned in the bug
with CopyGCIReq copying state which leads to
the SR being unrecoverable.
The symptom is that the SR does not complete.
Problem
The DIH block in the Master/President role controls
the GCP_SAVE and COPY_GCI protocols.
The GCP_SAVE protocol results in updates to each
participating node's lastCommittedGCI metadata,
and the COPY_GCI protocol propagates this information
to all nodes.
System Restart code effectively assumes that:
- The newestRestorableGCI is the max of the stored
per-node lastCommittedGCIs
- The max of the lastCommittedGCIs is restorable
The recoverability-robustness of the system is improved
by moving the checks of these assumptions back from
the distributed System Restart phase to the CopyGCIREQ
propagation phase where a live President instructs each
node to write new state to disk.
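
The assumptions above can be illustrated with a minimal sketch of a
pre-propagation check. This is not the code from the commit: GciInfo and
isSafeToPropagate are hypothetical names, and the rule used here (no
per-node lastCommittedGCI may exceed newestRestorableGCI) is one
plausible reading of the assumptions, not necessarily the exact check
the patch performs.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the GCI metadata that
// CopyGCIREQ propagates. Not the real NDB SYSFILE layout.
struct GciInfo {
  std::uint32_t newestRestorableGCI = 0;
  std::vector<std::uint32_t> lastCommittedGCI;  // one entry per node
};

// Returns true when the state is consistent with what System Restart
// assumes: the max of the per-node lastCommittedGCI values must not
// claim a GCI newer than the one recorded as restorable.
bool isSafeToPropagate(const GciInfo& info) {
  if (info.lastCommittedGCI.empty()) {
    return false;  // nothing to restore from
  }
  const std::uint32_t maxCommitted = *std::max_element(
      info.lastCommittedGCI.begin(), info.lastCommittedGCI.end());
  return maxCommitted <= info.newestRestorableGCI;
}
```

In this sketch a President would evaluate such a predicate before
instructing nodes to write the state to disk, rather than discovering
the inconsistency later during a System Restart.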
Improvement 1
If the checks detect a logic problem that could threaten
future recoverability of the cluster, the President
halts immediately.
While this can cause an immediate service outage, it
surfaces the problem and avoids the risk of unrecoverability.
The situation where a non-recoverable set of GCI info is
distributed should be rare. However if a running cluster is
upgraded to a version containing these checks then there is a
chance that a new version President will fail as a result of
inheriting inconsistent data from the previous old version
President.
This scenario is risky because it is precisely the situation in
which the cluster may not be SR recoverable, so we do not want
to risk needing an SR.
For this reason, as part of Master GCP takeover, all nodes will
align their inherited GCI info to ensure that it does not result
in an immediate failure.
Improvement 2
The President's logic in GCP_SAVEREQ is modified to avoid
directly updating the in-memory 'SYSFILE' as part of processing
GCP_SAVECONF signals from participating nodes.
The set of nodes which sent a CONF is instead stored in
a bitmap, leaving the lastCommittedGCI values intact.
When the GCP_SAVEREQ round is complete, the bitmap is used
to update the lastCommittedGCI values atomically with the
newestRestorableGCI, so that any subsequent CopyGCIREQ
invocation will propagate them together.
This avoids an intermediate CopyGCIREQ (e.g. triggered
by the start of an LCP) attempting to propagate values
which would not be recoverable.
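
The deferred update described in Improvement 2 can be sketched as
follows. This is an illustration under stated assumptions, not the DIH
implementation: SysfileSketch, GcpSaveRound, kMaxNodes and the method
names are hypothetical, and signal handling is reduced to plain
function calls.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxNodes = 8;  // hypothetical small cluster

// Hypothetical stand-in for the President's in-memory 'SYSFILE'.
struct SysfileSketch {
  std::uint32_t newestRestorableGCI = 0;
  std::uint32_t lastCommittedGCI[kMaxNodes] = {};
};

// Tracks one GCP_SAVEREQ round for a given GCI.
class GcpSaveRound {
 public:
  explicit GcpSaveRound(std::uint32_t gci) : gci_(gci) {}

  // A node confirmed the save: remember it in the bitmap only.
  // The sysfile is deliberately left untouched at this point.
  void onSaveConf(std::size_t nodeId) { confirmed_.set(nodeId); }

  // The round is complete: apply the per-node values and the
  // newestRestorableGCI in one step, so that a later copy of the
  // sysfile always sees them together.
  void complete(SysfileSketch& sysfile) const {
    for (std::size_t n = 0; n < kMaxNodes; ++n) {
      if (confirmed_.test(n)) {
        sysfile.lastCommittedGCI[n] = gci_;
      }
    }
    sysfile.newestRestorableGCI = gci_;
  }

 private:
  std::uint32_t gci_;
  std::bitset<kMaxNodes> confirmed_;
};
```

With this shape, anything that copies the sysfile between the first CONF
and the end of the round still sees the previous, mutually consistent
values.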
Change-Id: Ib2d5bf9dee5ae9c05670d02488adb39678ef3ac81
parent cbf32f0
commit f0b6652
5 files changed, +237 -3 lines changed
- storage/ndb
  - src/kernel/blocks/dbdih
  - test
    - ndbapi
    - run-test
    - src