Commit 0d50abb

Bug#37023549 (1/5) Improve NdbApi SPJ scan protocol timeout handling
Backport to 7.6

Add error insert to SPJ to allow a 'hard' timeout + failure to close to be simulated.

Change-Id: Ie2479be2496ff4eb8ebe32547d8a8cb513708d26

Bug#37023549 (2/5) Improve NdbApi SPJ scan protocol timeout handling

Add SPJ scan testing

SPJ scans are added to the test as they have a different implementation in the API and so need to be covered separately.

Due to bugs in SPJ, this test causes data node failure without further fixes.

Change-Id: I1b7dd00a3bbaaa88aaf689de2ffc617afd041f62

Bug#37023549 (3/5) Improve NdbApi SPJ scan protocol timeout handling

Modify TC to disconnect bad APIs

TC detects in many cases when an API sends a signal which is unexpected due to the state discovered on an ApiConnectRecord. Some cases are understood and expected, and others are not, resulting in assertion failures in TC -> data node failure. These can be useful for finding logic errors but can have a large impact on users.

This patch modifies TC to handle some of these situations differently:
 - Log information about the mismatch
 - Disconnect the API(s) involved (API sending the signal, API owning the ConnectRecord)

This will result in:
 - The API(s) being disconnected by all data nodes
 - API failure handling on all data nodes performing cleanup of the API's transaction objects, continuing with commit or rollback + then releasing them
 - APIs having to reinitialise their connection state to the cluster.

API disconnect + reconnect is generally less work, quicker and less risky for system availability than a data node restart.

This change in functionality is tested to some extent by the SPJ timeout scenario in ndb_scan_protocol_timeout, but other situations leading to state mismatch may exist.

Change-Id: Id88df14141983b4fef71e74ce275d6b4e26c6c58

Bug#37023549 (4/5) Improve NdbApi SPJ scan protocol timeout handling

On timeout, make SPJ API mark scans as needing close

SPJ API is aligned to the (fixed) behaviour of NdbApi on protocol timeout handling - marking scans as needing close at the kernel. This causes close() to attempt to close the scan, which will succeed in cases where the scan has timed out due to LOAD.

Additionally, SPJ API is modified to log when these protocol timeout handling paths are used, and what the result is.

The SPJ tests in ndb_scan_protocol_timeout for the 'LOAD' case now pass with no leaks, but the BUG case needs a further fix.

Change-Id: I90e4421228f3db9c2de1de7ebb48ae81123abf68

Bug#37023549 (5/5) Improve NdbApi SPJ scan protocol timeout handling

SPJ Timeout Release on Close timeout

In the case where there is a timeout attempting to close an SPJ scan at the kernel, this patch sets the ReleaseOnClose variable so that the kernel side ApiConnectRecord will not be reused. This aligns with the normal scan behaviour in the close-failed-timeout case.

The ndb_scan_protocol_timeout testcase result is updated with the expected results for SPJ scan timeouts due to load + bugs

Change-Id: I04b199759415c3e0d8f6f6546551c0271fac768c
1 parent ec67f49 commit 0d50abb
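
Reviewer note: parts (4/5) and (5/5) of the commit message describe an API-side decision flow: a protocol timeout marks the scan as needing close at the kernel, close() then attempts that close, and a timeout during close sets ReleaseOnClose. The sketch below models only that flow. All names (`ScanModel`, `onProtocolTimeout`, `closeScan`, `CloseResult`) are hypothetical and are not NdbApi identifiers; the real SPJ API code is not shown in this excerpt.

```cpp
#include <utility>

// Hypothetical outcome of asking the kernel to close a scan.
enum class CloseResult { Closed, TimedOut };

// Hypothetical stand-in for the API-side scan state; not an NdbApi type.
struct ScanModel {
  bool needsKernelClose = false;  // set when a protocol timeout is seen
  bool closedAtKernel = false;    // set when the kernel confirmed the close
  bool releaseOnClose = false;    // set when the close itself timed out
};

// A protocol timeout does not abandon the scan; it records that the
// kernel-side scan still has to be closed.
inline void onProtocolTimeout(ScanModel& scan) {
  scan.needsKernelClose = true;
}

// close(): attempt the kernel close if one is pending. `tryKernelClose`
// is any callable returning CloseResult (an assumption of this sketch).
template <typename CloseFn>
void closeScan(ScanModel& scan, CloseFn&& tryKernelClose) {
  if (!scan.needsKernelClose) {
    return;  // nothing outstanding at the kernel
  }
  if (std::forward<CloseFn>(tryKernelClose)() == CloseResult::Closed) {
    // 'LOAD' case: the scan was only slow, so the close succeeds.
    scan.closedAtKernel = true;
    scan.needsKernelClose = false;
  } else {
    // 'BUG' case: the close also timed out, so the kernel-side
    // record must not be reused.
    scan.releaseOnClose = true;
  }
}
```

In this model the 'LOAD' case ends with `closedAtKernel` set and nothing held back, while the 'BUG' case ends with `releaseOnClose` set, which is consistent with the `Leaks 0` and `Leaks 1` results recorded in the .result file.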

7 files changed, +325 -52 lines changed

mysql-test/suite/ndb/r/ndb_scan_protocol_timeout.result

Lines changed: 64 additions & 0 deletions
@@ -304,6 +304,70 @@ Leaks
 select count(1) as ops from ndbinfo.cluster_operations;
 ops
 0
+
+-------------------------------
+SPJ request timeout due to load
+-------------------------------
+Standalone
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b where t1.a > 70;
+ERROR HY000: Got error 4008 'Receive from NDB failed' from NDBCLUSTER
+Check pk lookups
+Clear error condition
+Check transaction leaks
+Leaks
+0
+select count(1) as ops from ndbinfo.cluster_operations;
+ops
+0
+
+-------------------------------
+SPJ request timeout due to load
+-------------------------------
+As part of a stateful transaction
+begin;
+insert into test.t1 values (54,54);
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b where t1.a > 70;
+ERROR HY000: Got error 4008 'Receive from NDB failed' from NDBCLUSTER
+Clear error condition
+Check pk lookups
+Check transaction leaks
+Leaks
+0
+select count(1) as ops from ndbinfo.cluster_operations;
+ops
+0
+
+------------------------------
+SPJ request timeout due to bug
+------------------------------
+Standalone
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b;
+ERROR HY000: Got error 4008 'Receive from NDB failed' from NDBCLUSTER
+Check pk lookups
+Clear error condition
+Check transaction leaks
+Leaks
+1
+select count(1) as ops from ndbinfo.cluster_operations;
+ops
+0
+
+-------------------------------
+SPJ request timeout due to bug
+------------------------------
+As part of a stateful transaction
+begin;
+insert into test.t1 values (54,54);
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b;
+ERROR HY000: Got error 4008 'Receive from NDB failed' from NDBCLUSTER
+Check pk lookups
+Clear error condition
+Check transaction leaks
+Leaks
+1
+select count(1) as ops from ndbinfo.cluster_operations;
+ops
+0
 SET SESSION debug=@save_debug;
 Normal requests
 ---------------

mysql-test/suite/ndb/t/ndb_scan_protocol_timeout.test

Lines changed: 173 additions & 4 deletions
@@ -2,6 +2,7 @@
 --echo # We are using some debug-only features in this test
 -- source include/have_debug.inc
 -- source have_ndb_error_insert.inc
+-- source include/have_query_cache_disabled.inc
 
 #
 # Test of scan protocol timeout behaviour
@@ -37,6 +38,12 @@
 # the ability of scan + transaction
 # close() to cleanup a scan
 #
+# 17123 (SPJ)
+# SPJ blocks SCAN_FRAGCONF processing
+# This stalls nextResult() and also
+# the ability of scan + transaction
+# close() to cleanup
+#
 # Timeout is expected to be mostly a result of
 # operations taking too long, and occasionally a
 # result of bugs
@@ -45,9 +52,9 @@
 # 'scan taking too long' and 8124 to represent
 # 'problem closing scan / bugs'.
 #
-# Ordered (e.g. by API) and unordered scans are taken
-# separately as they have separate implementations
-# inside NdbApi
+# Ordered (e.g. by API) and unordered scans and SPJ
+# scans are taken separately as they have separate
+# implementations inside NdbApi
 #
 # A .cnf file is used to avoid TDDT getting in the
 # way of API timeout testing
@@ -441,6 +448,169 @@ eval select $api_conn_count - $start_api_conn_count as Leaks;
 select count(1) as ops from ndbinfo.cluster_operations;
 #select * from ndbinfo.cluster_operations;
 
+--echo
+--echo -------------------------------
+--echo SPJ request timeout due to load
+--echo -------------------------------
+--echo Standalone
+
+# Error insert 5112 causes SPJ request timeout, but close is unaffected
+--exec $NDB_MGM -e "all error 5112" >> $NDB_TOOLS_OUTPUT
+
+--sorted_result
+--error 1296
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b where t1.a > 70;
+
+# Checking pk lookups checks that all usable ApiConnectRecords can
+# be used for a different transaction
+--echo Check pk lookups
+--disable_query_log
+--disable_result_log
+--let $i=0
+while ($i < $keycount)
+{
+  --eval select * from test.t1 where a=$i;
+  --inc $i
+}
+--enable_result_log
+--enable_query_log
+
+--echo Clear error condition
+--exec $NDB_MGM -e "all error 0" >> $NDB_TOOLS_OUTPUT
+--sleep $STABILISATION_SECS
+
+--echo Check transaction leaks
+--source ndb_get_api_connect_count.inc
+--disable_query_log
+eval select $api_conn_count - $start_api_conn_count as Leaks;
+--let $start_api_conn_count = $api_conn_count
+--enable_query_log
+select count(1) as ops from ndbinfo.cluster_operations;
+
+--echo
+--echo -------------------------------
+--echo SPJ request timeout due to load
+--echo -------------------------------
+--echo As part of a stateful transaction
+
+# Error insert 5112 causes SPJ request timeout, but close is unaffected
+--exec $NDB_MGM -e "all error 5112" >> $NDB_TOOLS_OUTPUT
+
+begin;
+insert into test.t1 values (54,54);
+
+--sorted_result
+--error 1296
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b where t1.a > 70;
+
+--echo Clear error condition
+--exec $NDB_MGM -e "all error 0" >> $NDB_TOOLS_OUTPUT
+--sleep $STABILISATION_SECS
+
+# Checking pk lookups checks that all usable ApiConnectRecords can
+# be used for a different transaction
+--echo Check pk lookups
+--disable_query_log
+--disable_result_log
+--let $i=0
+while ($i < $keycount)
+{
+  --eval select * from test.t1 where a=$i;
+  --inc $i
+}
+--enable_result_log
+--enable_query_log
+
+--echo Check transaction leaks
+--source ndb_get_api_connect_count.inc
+--disable_query_log
+eval select $api_conn_count - $start_api_conn_count as Leaks;
+--let $start_api_conn_count = $api_conn_count
+--enable_query_log
+select count(1) as ops from ndbinfo.cluster_operations;
+
+--echo
+--echo ------------------------------
+--echo SPJ request timeout due to bug
+--echo ------------------------------
+--echo Standalone
+
+# Error insert 17123 causes SPJ request and close() to timeout
+--exec $NDB_MGM -e "all error 17123" >> $NDB_TOOLS_OUTPUT
+
+--sorted_result
+--error 1296
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b;
+
+# Checking pk lookups checks that all usable ApiConnectRecords can
+# be used for a different transaction
+--echo Check pk lookups
+--disable_query_log
+--disable_result_log
+--let $i=0
+while ($i < $keycount)
+{
+  --eval select * from test.t1 where a=$i;
+  --inc $i
+}
+--enable_result_log
+--enable_query_log
+
+--echo Clear error condition
+--exec $NDB_MGM -e "all error 0" >> $NDB_TOOLS_OUTPUT
+--sleep $STABILISATION_SECS
+
+--echo Check transaction leaks
+--source ndb_get_api_connect_count.inc
+--disable_query_log
+eval select $api_conn_count - $start_api_conn_count as Leaks;
+--let $start_api_conn_count = $api_conn_count
+--enable_query_log
+select count(1) as ops from ndbinfo.cluster_operations;
+
+
+--echo
+--echo -------------------------------
+--echo SPJ request timeout due to bug
+--echo ------------------------------
+--echo As part of a stateful transaction
+# Error insert 17123 causes SPJ request and close() to timeout
+--exec $NDB_MGM -e "all error 17123" >> $NDB_TOOLS_OUTPUT
+
+begin;
+insert into test.t1 values (54,54);
+--sorted_result
+--error 1296
+select t1.b, t2.b from test.t1 join test.t2 on t2.a = t1.b;
+
+# Checking pk lookups checks that all usable ApiConnectRecords can
+# be used for a different transaction
+--echo Check pk lookups
+--disable_query_log
+--disable_result_log
+--let $i=0
+while ($i < $keycount)
+{
+  --eval select * from test.t1 where a=$i;
+  --inc $i
+}
+--enable_result_log
+--enable_query_log
+
+--echo Clear error condition
+--exec $NDB_MGM -e "all error 0" >> $NDB_TOOLS_OUTPUT
+--sleep $STABILISATION_SECS
+
+--echo Check transaction leaks
+--source ndb_get_api_connect_count.inc
+--disable_query_log
+eval select $api_conn_count - $start_api_conn_count as Leaks;
+--let $start_api_conn_count = $api_conn_count
+--enable_query_log
+select count(1) as ops from ndbinfo.cluster_operations;
+
+
+
 SET SESSION debug=@save_debug;
 
 # Show that after all of the above, with error insertions disabled,
@@ -469,4 +639,3 @@ select count(1) as ops from ndbinfo.cluster_operations;
 drop table test.t1;
 drop table test.t2;
 
---remove_file $NDB_TOOLS_OUTPUT
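
Reviewer note: each "Check transaction leaks" step in the test reports `$api_conn_count - $start_api_conn_count` and then resets `$start_api_conn_count`, so a record held back in one section is not counted again in the next. A minimal sketch of that rebasing arithmetic follows; `LeakCounter` is a hypothetical name used only for illustration, and the counts themselves come from `ndb_get_api_connect_count.inc`, which is not shown in this diff.

```cpp
// Hypothetical model of the per-section leak check in the test:
// report the growth since the previous check, then rebase.
struct LeakCounter {
  long baseline;  // plays the role of $start_api_conn_count

  explicit LeakCounter(long start) : baseline(start) {}

  // `current` plays the role of $api_conn_count.
  long check(long current) {
    const long leaks = current - baseline;
    baseline = current;  // the next section starts from here
    return leaks;
  }
};
```

With this bookkeeping, two consecutive sections that each hold back one record both report 1, rather than 1 and then 2, which matches the two `Leaks 1` entries expected for the "due to bug" cases.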

storage/ndb/src/kernel/blocks/ERROR_codes.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,7 @@ Next DBTUX 12010
3838
Next SUMA 13060
3939
Next LGMAN 15002
4040
Next TSMAN 16002
41-
Next DBSPJ 17000
41+
Next DBSPJ 17124
4242
Next TRIX 18004
4343
Next DBUTIL 19002
4444

storage/ndb/src/kernel/blocks/dbspj/DbspjMain.cpp

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
/*
2-
Copyright (c) 2012, 2023, Oracle and/or its affiliates.
2+
Copyright (c) 2012, 2024, Oracle and/or its affiliates.
33
44
This program is free software; you can redistribute it and/or modify
55
it under the terms of the GNU General Public License, version 2.0,
@@ -3758,6 +3758,15 @@ Dbspj::execSCAN_FRAGCONF(Signal* signal)
37583758
DBLQH);
37593759
#endif
37603760

3761+
if (ERROR_INSERTED(17123)) {
3762+
jam();
3763+
g_eventLogger->info(
3764+
"Dbspj %u : Error insert stalling SCAN_FRAGCONF for 0.5s", instance());
3765+
sendSignalWithDelay(reference(), GSN_SCAN_FRAGCONF, signal, 500,
3766+
signal->getLength());
3767+
return;
3768+
}
3769+
37613770
Ptr<ScanFragHandle> scanFragHandlePtr;
37623771
ndbrequire(getGuardedPtr(scanFragHandlePtr, conf->senderData));
37633772
Ptr<TreeNode> treeNodePtr;
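
Reviewer note: the error insert above stalls a scan by re-sending SCAN_FRAGCONF to the same block with a 500 ms delay for as long as the insert is active, so the signal is never processed until the error is cleared. The standalone sketch below illustrates that re-queue pattern with a toy scheduler. `MiniScheduler` and `MiniBlock` are hypothetical and do not reproduce the NDB kernel's signal machinery; only the 500 ms delay is taken from the hunk.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical miniature scheduler with simulated time; it models only
// "deliver this callback again after N milliseconds".
struct MiniScheduler {
  struct Pending {
    std::uint64_t dueMs;
    std::uint64_t seq;  // tie-breaker keeps delivery order stable
    std::function<void()> deliver;
  };
  struct Later {
    bool operator()(const Pending& a, const Pending& b) const {
      if (a.dueMs != b.dueMs) return a.dueMs > b.dueMs;
      return a.seq > b.seq;
    }
  };

  std::uint64_t nowMs = 0;
  std::uint64_t nextSeq = 0;
  std::priority_queue<Pending, std::vector<Pending>, Later> pending;

  void sendWithDelay(std::uint64_t delayMs, std::function<void()> deliver) {
    pending.push(Pending{nowMs + delayMs, nextSeq++, std::move(deliver)});
  }

  // Advance simulated time, delivering everything that becomes due.
  void runUntil(std::uint64_t untilMs) {
    while (!pending.empty() && pending.top().dueMs <= untilMs) {
      Pending p = pending.top();
      pending.pop();
      nowMs = p.dueMs;
      p.deliver();
    }
    nowMs = untilMs;
  }
};

// Hypothetical block whose handler mirrors the shape of the hunk:
// while the error insert is set, re-queue the signal and return early.
struct MiniBlock {
  MiniScheduler& sched;
  bool errorInserted = false;  // stands in for ERROR_INSERTED(17123)
  int stalls = 0;
  int processed = 0;

  explicit MiniBlock(MiniScheduler& s) : sched(s) {}

  void onScanFragConf() {
    if (errorInserted) {
      ++stalls;
      sched.sendWithDelay(500, [this] { onScanFragConf(); });
      return;
    }
    ++processed;  // normal processing would happen here
  }
};
```

Because the signal is re-queued rather than dropped, clearing the error insert (as the test does with "all error 0") lets the pending signal through on its next delivery.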

storage/ndb/src/kernel/blocks/dbtc/Dbtc.hpp

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2051,10 +2051,11 @@ class Dbtc
20512051
// Generated statement blocks
20522052
void warningHandlerLab(Signal* signal, int line);
20532053
void systemErrorLab(Signal* signal, int line);
2054-
void sendSignalErrorRefuseLab(Signal* signal);
2054+
void handleSignalStateProblem(Signal *signal,
2055+
ApiConnectRecordPtr apiConnectptr,
2056+
NodeId signalNodeId, Uint32 context);
20552057
void scanTabRefLab(Signal* signal, Uint32 errCode);
20562058
void diFcountReqLab(Signal* signal, ScanRecordPtr);
2057-
void signalErrorRefuseLab(Signal* signal);
20582059
void abort080Lab(Signal* signal);
20592060
void sendKeyInfoTrain(Signal* signal,
20602061
BlockReference TBRef,
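
Reviewer note: only the declaration of `handleSignalStateProblem` appears in this excerpt; its body in Dbtc is not shown. Going by the commit message for part (3/5), the handler logs the mismatch and disconnects the API sending the signal and the API owning the connect record. The sketch below is a guess at that policy under stated assumptions. The types `ApiConnectRecordModel` and `StateProblemOutcome`, the field names, and the log format are all hypothetical.

```cpp
#include <cstdint>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using NodeId = std::uint32_t;

// Hypothetical stand-in for the kernel-side connect record.
struct ApiConnectRecordModel {
  NodeId ownerNodeId;   // API node that owns the record
  std::uint32_t state;  // state found when the signal arrived
};

// What the policy decided: lines to log and API nodes to disconnect.
struct StateProblemOutcome {
  std::vector<std::string> logLines;
  std::set<NodeId> disconnected;
};

// Model of the policy from the commit message: log the mismatch, then
// disconnect the sender and the owner instead of asserting.
inline StateProblemOutcome handleSignalStateProblemModel(
    const ApiConnectRecordModel& record, NodeId signalNodeId,
    std::uint32_t context) {
  StateProblemOutcome outcome;

  std::ostringstream msg;  // illustrative format, not the real log text
  msg << "state mismatch: context=" << context
      << " sender=" << signalNodeId << " owner=" << record.ownerNodeId
      << " state=" << record.state;
  outcome.logLines.push_back(msg.str());

  // A set collapses the common case where sender and owner are the
  // same API node, so that node is disconnected once.
  outcome.disconnected.insert(signalNodeId);
  outcome.disconnected.insert(record.ownerNodeId);
  return outcome;
}
```

The design point carried over from the commit message is that the failure is confined to the API node(s) involved, which then reconnect, rather than taking down the data node through an assertion.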
