Skip to content

Commit f0b6652

Browse files
committed
Bug#37163647 SR recovery issues
Backport to 7.6 Test testSystemRestart is extended with a new test GCPSaveLagLcpSR which exercises a multi-NG system with : - All nodes but one suffering GCP_SAVE lag - A subsequent LCP being triggered - A subsequent System Restart This exposes the problem mentioned in the bug with CopyGCIReq copying state which leads to the SR being unrecoverable. The symptom is that the SR does not complete. Problem The DIH block in the Master/President role controls the GCP_SAVE and COPY_GCI protocols. The GCP_SAVE protocol results in updates to each participating node's lastCommittedGCI metadata, and the COPY_GCI protocol propagates this information to all nodes. System Restart code effectively assumes that : - The newestRestorableGCI is the max of the stored per-node lastCommittedGCIs - The max of the lastCommittedGCIs is restorable The recoverability-robustness of the system is improved by moving the checks of these assumptions back from the distributed System Restart phase to the CopyGCIREQ propagation phase where a live President instructs each node to write new state to disk. Improvement 1 If there is some logic problem that could threaten future recoverability of the cluster, it causes the President to halt immediately. While this can cause an immediate service outage, it surfaces + avoids the risk of unrecoverability. The situation where a non recoverable set of GCI info is distributed should be rare. However if a running cluster is upgraded to a version containing these checks then there is a chance that a new version President will fail as a result of inheriting inconsistent data from the previous old version President. This scenario is risky as it is in precisely this situation that it is possible that the cluster is not SR recoverable, so we do not want to risk needing an SR. For this reason, as part of Master GCP takeover, all nodes will align their inherited GCI info to ensure that it does not result in an immediate failure. Improvement 2 The President's logic in GCP_SAVEREQ is modified to avoid directly updating the in-memory 'SYSFILE' as part of processing GCP_SAVECONF signals from participating nodes. The set of nodes which sent a CONF is instead stored in a bitmap, leaving the lastCommittedGCI values intact. When the GCP_SAVEREQ round is complete, the bitmap is used to update the lastCommittedGci values atomically with the newestRestorableGCI, so that any subsequent CopyGCIREQ invocation will propagate them together. This avoids an intermediate CopyGCIREQ (e.g. triggered by the start of an LCP) attempting to propagate values which would not be recoverable. Change-Id: Ib2d5bf9dee5ae9c05670d02488adb39678ef3ac8
1 parent cbf32f0 commit f0b6652

File tree

5 files changed

+237
-3
lines changed

5 files changed

+237
-3
lines changed

storage/ndb/src/kernel/blocks/dbdih/Dbdih.hpp

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1523,6 +1523,8 @@ class Dbdih: public SimulatedBlock {
15231523
void copyTabReq_complete(Signal* signal, TabRecordPtr tabPtr);
15241524

15251525
void gcpcommitreqLab(Signal *);
1526+
void validateCopyGci(Signal *);
1527+
void upgradeAlignCopyGci();
15261528
void copyGciLab(Signal *, CopyGCIReq::CopyReason reason);
15271529
void storeNewLcpIdLab(Signal *);
15281530
void startLcpRoundLoopLab(Signal *, Uint32 startTableId, Uint32 startFragId);
@@ -2162,6 +2164,7 @@ class Dbdih: public SimulatedBlock {
21622164
Uint32 m_new_gci;
21632165
Uint32 m_time_between_gcp; /* Delay between global checkpoints */
21642166
NDB_TICKS m_start_time;
2167+
NdbNodeBitmask m_saveConfNodes;
21652168
} m_master;
21662169
} m_gcp_save;
21672170

storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp

Lines changed: 120 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10523,6 +10523,14 @@ bool Dbdih::handle_master_take_over_copy_gci(Signal *signal, NodeId new_master_n
1052310523
signal, 10, MasterGCPReq::SignalLength);
1052410524
return true;
1052510525
}
10526+
10527+
/**
10528+
* We are more strict than older versions, align inherited
10529+
* info at Master takeover before the info has a chance
10530+
* to be checked if we become new Master
10531+
*/
10532+
upgradeAlignCopyGci();
10533+
1052610534
c_handled_master_take_over_copy_gci = new_master_node_id;
1052710535
return false;
1052810536
}
@@ -10920,6 +10928,7 @@ void Dbdih::MASTER_GCPhandling(Signal* signal, Uint32 failedNodeId)
1092010928
/**
1092110929
* Restart GCP_SAVE_REQ
1092210930
*/
10931+
m_gcp_save.m_master.m_saveConfNodes.clear();
1092310932
sendLoopMacro(GCP_SAVEREQ, sendGCP_SAVEREQ, RNIL);
1092410933
break;
1092510934
}
@@ -17100,6 +17109,7 @@ void Dbdih::execGCP_NODEFINISH(Signal* signal)
1710017109
Uint32 saveGCI = old_hi;
1710117110
m_gcp_save.m_master.m_state = GcpSave::GCP_SAVE_REQ;
1710217111
m_gcp_save.m_master.m_new_gci = saveGCI;
17112+
m_gcp_save.m_master.m_saveConfNodes.clear();
1710317113

1710417114
#ifdef ERROR_INSERT
1710517115
if (ERROR_INSERTED(7188))
@@ -17223,9 +17233,14 @@ void Dbdih::execGCP_SAVECONF(Signal* signal)
1722317233
return;
1722417234
}
1722517235

17236+
/* Master */
1722617237
ndbrequire(saveConf->gci == m_gcp_save.m_master.m_new_gci);
1722717238
ndbrequire(saveConf->nodeId == saveConf->dihPtr);
17228-
SYSFILE->lastCompletedGCI[saveConf->nodeId] = saveConf->gci;
17239+
17240+
/* Record CONF received in this round */
17241+
ndbrequire(!m_gcp_save.m_master.m_saveConfNodes.get(saveConf->nodeId));
17242+
m_gcp_save.m_master.m_saveConfNodes.set(saveConf->nodeId);
17243+
1722917244
GCP_SAVEhandling(signal, saveConf->nodeId);
1723017245
}//Dbdih::execGCP_SAVECONF()
1723117246

@@ -17248,6 +17263,7 @@ void Dbdih::execGCP_SAVEREF(Signal* signal)
1724817263

1724917264
ndbrequire(saveRef->gci == m_gcp_save.m_master.m_new_gci);
1725017265
ndbrequire(saveRef->nodeId == saveRef->dihPtr);
17266+
ndbrequire(!m_gcp_save.m_master.m_saveConfNodes.get(saveRef->nodeId));
1725117267

1725217268
/**
1725317269
* Only allow reason not to save
@@ -17275,6 +17291,22 @@ void Dbdih::GCP_SAVEhandling(Signal* signal, Uint32 nodeId)
1727517291
* RESTART.
1727617292
*------------------------------------------------------------------------*/
1727717293
SYSFILE->newestRestorableGCI = m_gcp_save.m_gci;
17294+
17295+
/**
17296+
* Set lastCompletedGci values for all CONFed participants in
17297+
* the GCP Save round atomically with the newestRestorableGci
17298+
* now.
17299+
* This avoids any intermediate CopyGCIReq rounds propagating
17300+
* transition states prior to a GCI being fully restorable.
17301+
*/
17302+
{
17303+
Uint32 participant = 0;
17304+
while ((participant = m_gcp_save.m_master.m_saveConfNodes.find_next(
17305+
participant + 1)) != BitmaskImpl::NotFound) {
17306+
SYSFILE->lastCompletedGCI[participant] = m_gcp_save.m_gci;
17307+
}
17308+
m_gcp_save.m_master.m_saveConfNodes.clear();
17309+
}
1727817310
if(Sysfile::getInitialStartOngoing(SYSFILE->systemRestartBits) &&
1727917311
getNodeState().startLevel == NodeState::SL_STARTED){
1728017312
jam();
@@ -17765,7 +17797,90 @@ void Dbdih::execDIHNDBTAMPER(Signal* signal)
1776517797
/*****************************************************************************/
1776617798
/* ********** FILE HANDLING MODULE *************/
1776717799
/*****************************************************************************/
17768-
void Dbdih::copyGciLab(Signal* signal, CopyGCIReq::CopyReason reason)
17800+
void Dbdih::validateCopyGci(Signal *signal) {
17801+
jam();
17802+
/**
17803+
* Before we (Master) copy our GCI info to all other
17804+
* nodes, let's check it for sanity
17805+
*/
17806+
bool newestRestorableGCIIsMax = true;
17807+
const Uint32 newestRestorableGCI = SYSFILE->newestRestorableGCI;
17808+
17809+
for (Uint32 i = 0; i < MAX_NDB_NODES; i++) {
17810+
const Uint32 nodeLastCompletedGci = SYSFILE->lastCompletedGCI[i];
17811+
17812+
if (nodeLastCompletedGci > newestRestorableGCI) {
17813+
jam();
17814+
newestRestorableGCIIsMax = false;
17815+
}
17816+
}
17817+
17818+
if (unlikely(!newestRestorableGCIIsMax)) {
17819+
jam();
17820+
g_eventLogger->error("DIH : newestRestorableGCI %u", newestRestorableGCI);
17821+
for (Uint32 i = 0; i < MAX_NDB_NODES; i++) {
17822+
if (Sysfile::getNodeStatus(i, SYSFILE->nodeStatus) != Sysfile::NS_NotDefined)
17823+
g_eventLogger->error("DIH : Node %u lastCompletedGCI %u", i,
17824+
SYSFILE->lastCompletedGCI[i]);
17825+
}
17826+
17827+
if (!newestRestorableGCIIsMax) {
17828+
jam();
17829+
/**
17830+
* Require that the newestRestorableGci number is the max of
17831+
* the per-node lastCommittedGci numbers
17832+
* Otherwise one of those can be chosen as President in a subsequent
17833+
* System Restart making it unrecoverable.
17834+
*/
17835+
g_eventLogger->error(
17836+
"DIH : Invalid CopyGCIREQ attempted, newestRestorableGCI is not max");
17837+
}
17838+
17839+
ndbrequire(newestRestorableGCIIsMax);
17840+
}
17841+
}
17842+
17843+
void Dbdih::upgradeAlignCopyGci() {
17844+
jam();
17845+
17846+
/**
17847+
* We now require that the newestRestorableGci is the
17848+
* highest Gci recorded in the per-node lastCompletedGci
17849+
* array.
17850+
*
17851+
* This is enforced in new code, but may not have been
17852+
* the case previously.
17853+
*
17854+
* In case we inherit invalid GCI info from the previous
17855+
* Master, lets filter it here to avoid e.g. a cascading
17856+
* failure as part of RR, which would in fact require
17857+
* a subsequent SR exactly when the SR is problematic.
17858+
*
17859+
* The assumption here is that any case where the
17860+
* lastCompletedGci is > newestRestorableGci is due to
17861+
* exposure of a transient state prior to GCP_SAVE round
17862+
* completion, which can be safely treated as though the
17863+
* round were not completed.
17864+
*
17865+
* When we are no longer likely to upgrade from a system
17866+
* which may send invalid CopyGci data, this realignment
17867+
* logic can be removed.
17868+
*/
17869+
for (Uint32 i = 0; i < MAX_NDB_NODES; i++) {
17870+
if (SYSFILE->lastCompletedGCI[i] > SYSFILE->newestRestorableGCI) {
17871+
jam();
17872+
g_eventLogger->warning(
17873+
"DIH : Aligning lastCompletedGCI of node %u from %u to %u", i,
17874+
SYSFILE->lastCompletedGCI[i], SYSFILE->newestRestorableGCI);
17875+
/* This is only intended for one specific upgrade scenario */
17876+
ndbrequire(SYSFILE->lastCompletedGCI[i] ==
17877+
SYSFILE->newestRestorableGCI + 1);
17878+
SYSFILE->lastCompletedGCI[i] = SYSFILE->newestRestorableGCI;
17879+
}
17880+
}
17881+
}
17882+
17883+
void Dbdih::copyGciLab(Signal* signal, CopyGCIReq::CopyReason reason)
1776917884
{
1777017885
if(c_copyGCIMaster.m_copyReason != CopyGCIReq::IDLE)
1777117886
{
@@ -17836,6 +17951,9 @@ void Dbdih::copyGciLab(Signal* signal, CopyGCIReq::CopyReason reason)
1783617951
}
1783717952
}
1783817953

17954+
/* Check integrity of GCI info before propagating */
17955+
validateCopyGci(signal);
17956+
1783917957
sendLoopMacro(COPY_GCIREQ, sendCOPY_GCIREQ, RNIL);
1784017958

1784117959
}//Dbdih::copyGciLab()

storage/ndb/test/ndbapi/testSystemRestart.cpp

Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4323,6 +4323,106 @@ int runCrashNodeDuringSR(NDBT_Context *ctx, NDBT_Step *step) {
43234323
return result;
43244324
}
43254325

4326+
int checkMultipleNGs(NDBT_Context *ctx, NDBT_Step *step) {
4327+
NdbRestarter restarter;
4328+
4329+
if (restarter.getNumNodeGroups() < 2) {
4330+
g_err << "Need multiple nodegroups for test" << endl;
4331+
return NDBT_SKIPPED;
4332+
}
4333+
4334+
return NDBT_OK;
4335+
}
4336+
4337+
int runGCPSaveLagLcpSR(NDBT_Context *ctx, NDBT_Step *step) {
4338+
/**
4339+
* Test behaviour with sequence :
4340+
* Multiple nodegroups
4341+
* GCP Save stalled on every node except one
4342+
* ...
4343+
* LCP triggered
4344+
* ...
4345+
* System failure
4346+
* System restart
4347+
*
4348+
* This exercises code which copies the DIH 'recoverable GCI'
4349+
* metadata around the system.
4350+
* A partially complete GCP is not yet restorable - the execution
4351+
* of an LCP should not cause a subsequent SR to attempt to
4352+
* restore it.
4353+
*/
4354+
NdbRestarter restarter;
4355+
4356+
const Uint32 startLcpDelaySecs = 10;
4357+
const Uint32 postLcpDelaySecs = 5;
4358+
const Uint32 stallGCPSaveEICode = 7237;
4359+
4360+
const int node = restarter.getNode(NdbRestarter::NS_RANDOM);
4361+
4362+
/* Inject error on all nodes other than the node */
4363+
g_err << "Injecting error " << stallGCPSaveEICode
4364+
<< " to stall GCP_SAVE on all nodes "
4365+
<< " other than " << node << endl;
4366+
if (restarter.insertErrorInAllNodes(stallGCPSaveEICode) != 0) {
4367+
g_err << "Error inserting error in all" << endl;
4368+
restarter.insertErrorInAllNodes(0);
4369+
return NDBT_FAILED;
4370+
}
4371+
if (restarter.insertErrorInNode(node, 0) != 0) {
4372+
g_err << "Error clearing error in node " << node << endl;
4373+
restarter.insertErrorInAllNodes(0);
4374+
return NDBT_FAILED;
4375+
}
4376+
4377+
g_err << "Delay " << startLcpDelaySecs << " seconds before triggering LCP"
4378+
<< endl;
4379+
NdbSleep_SecSleep(startLcpDelaySecs);
4380+
4381+
g_err << "Triggering LCP" << endl;
4382+
const int startLcpDumpCode = 7099;
4383+
if (restarter.dumpStateAllNodes(&startLcpDumpCode, 1) != 0) {
4384+
restarter.insertErrorInAllNodes(0);
4385+
return NDBT_FAILED;
4386+
}
4387+
4388+
g_err << "Waiting for " << postLcpDelaySecs
4389+
<< " seconds after triggering Lcp." << endl;
4390+
NdbSleep_SecSleep(postLcpDelaySecs);
4391+
4392+
g_err << "Triggering immediate System Restart" << endl;
4393+
if (restarter.restartAll(false, // initial
4394+
true, // nostart
4395+
true, // abort
4396+
false) // force
4397+
!= 0) {
4398+
restarter.insertErrorInAllNodes(0);
4399+
g_err << "Triggering SR failed" << endl;
4400+
return NDBT_FAILED;
4401+
}
4402+
4403+
g_err << "Waiting for NoStart state" << endl;
4404+
if (restarter.waitClusterNoStart() != 0) {
4405+
g_err << "Failed waiting for nodes to enter NoStart state" << endl;
4406+
return NDBT_FAILED;
4407+
}
4408+
4409+
g_err << "Starting cluster" << endl;
4410+
if (restarter.startAll() != 0) {
4411+
g_err << "Failed to request start of all nodes" << endl;
4412+
return NDBT_FAILED;
4413+
}
4414+
4415+
g_err << "Waiting for cluster to recover" << endl;
4416+
if (restarter.waitClusterStarted() != 0) {
4417+
g_err << "Cluster failed to start" << endl;
4418+
return NDBT_FAILED;
4419+
}
4420+
4421+
g_err << "Cluster recovered successfully" << endl;
4422+
4423+
return NDBT_OK;
4424+
}
4425+
43264426
/**************************************************************************/
43274427

43284428
NDBT_TESTSUITE(testSystemRestart);
@@ -4856,6 +4956,13 @@ TESTCASE("SystemDownDuringSR", "Check recoverability when system goes down "
48564956
"during system restart") {
48574957
STEP(runCrashNodeDuringSR);
48584958
}
4959+
TESTCASE("GCPSaveLagLcpSR",
4960+
"Check recoverability when system fails during "
4961+
"GCP Save lag with an intermediate LCP start") {
4962+
INITIALIZER(checkMultipleNGs);
4963+
INITIALIZER(runWaitStarted);
4964+
STEP(runGCPSaveLagLcpSR);
4965+
}
48594966

48604967
NDBT_TESTSUITE_END(testSystemRestart);
48614968

storage/ndb/test/run-test/daily-devel--07-tests.txt

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -308,3 +308,7 @@ max-time: 180
308308
cmd: testScan
309309
args: -n ScanErrorHandling T1
310310
max-time: 480
311+
312+
cmd: testSystemRestart
313+
args: -n GCPSaveLagLcpSR T1
314+
max-time: 240

storage/ndb/test/src/NdbRestarter.cpp

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -219,6 +219,7 @@ NdbRestarter::getNodeGroups(Vector<int>& node_groups, int * max_alive_replicas_p
219219
return -1;
220220
}
221221

222+
int n_groups = 0;
222223
Vector<int> node_group_replicas;
223224
for (unsigned i = 0; i < ndbNodes.size(); i++)
224225
{
@@ -238,6 +239,7 @@ NdbRestarter::getNodeGroups(Vector<int>& node_groups, int * max_alive_replicas_p
238239
if (node_group_replicas[node_group] == 0)
239240
{
240241
node_groups.push_back(node_group);
242+
n_groups++;
241243
}
242244

243245
node_group_replicas[node_group]++;
@@ -256,7 +258,7 @@ NdbRestarter::getNodeGroups(Vector<int>& node_groups, int * max_alive_replicas_p
256258
}
257259
*max_alive_replicas_ptr = max_alive_replicas;
258260
}
259-
return 0;
261+
return n_groups;
260262
}
261263

262264
int NdbRestarter::getNumNodeGroups() {

0 commit comments

Comments
 (0)