Commit 3f58c32

Author: Wen Lin
Use ExclusiveLock for accessing the table pg_resgroupcapability when CREATE/ALTER resource group.

In some scenarios, holding AccessExclusiveLock on the table pg_resgroupcapability can leave database setup or recovery pending. Below is why this lock needs to be changed from AccessExclusiveLock to ExclusiveLock.

The lock on pg_resgroupcapability serializes concurrent updates to the table when a "CREATE/ALTER RESOURCE GROUP" statement runs: because of the CPU limit, after modifying one resource group we must check that the total CPU usage of all resource groups does not exceed 100%. Before this fix, AccessExclusiveLock was used for that purpose.

Suppose a user runs an "ALTER RESOURCE GROUP" statement. The QD dispatches the statement to all QEs, so it is a two-phase commit (2PC) transaction. When the QD dispatches the statement, each QE acquires AccessExclusiveLock on pg_resgroupcapability, and a QE can only release that lock once the distributed transaction has committed. In the second phase, the QD calls doNotifyingCommitPrepared to broadcast the "commit prepared" command to all QEs; at that point the QEs have already finished preparing, so the transaction is a prepared transaction.

Suppose that at this moment a primary segment goes down and its mirror is promoted to primary. The mirror gets the "promote" message from the coordinator and recovers by replaying the xlog it received from the primary; to recover the prepared transaction, it reads the prepared-transaction log entry and re-acquires AccessExclusiveLock on pg_resgroupcapability. The call stack is:

#0  lock_twophase_recover (xid=, info=, recdata=, len=) at lock.c:4697
#1  ProcessRecords (callbacks=, xid=2933, bufptr=0x1d575a8 "") at twophase.c:1757
#2  RecoverPreparedTransactions () at twophase.c:2214
#3  StartupXLOG () at xlog.c:8013
#4  StartupProcessMain () at startup.c:231
#5  AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff84b94a70) at bootstrap.c:459
#6  StartChildProcess (type=StartupProcess) at postmaster.c:5917
#7  PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1d555b0) at postmaster.c:1581
#8  main (argc=7, argv=0x1d555b0) at main.c:240

After that, the database instance starts up and the related initialization functions are called. One of them, InitResGroups, opens pg_resgroupcapability with AccessShareLock to do its initialization work. The call stack is:

#6  WaitOnLock (locallock=locallock@entry=0x1c7f248, owner=owner@entry=0x1ca0a40) at lock.c:1999
#7  LockAcquireExtended (locktag=locktag@entry=0x7ffd15d18d90, lockmode=lockmode@entry=1, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x7ffd15d18d88) at lock.c:1192
#8  LockRelationOid (relid=6439, lockmode=1) at lmgr.c:126
#9  relation_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at relation.c:56
#10 table_open (relationId=relationId@entry=6439, lockmode=lockmode@entry=1) at table.c:47
#11 InitResGroups () at resgroup.c:581
#12 InitResManager () at resource_manager.c:83
#13 initPostgres (in_dbname=, dboid=dboid@entry=0, username=username@entry=0x1c5b730 "linw", useroid=useroid@entry=0, out_dbname=out_dbname@entry=0x0, override_allow_connections=override_allow_connections@entry=false) at postinit.c:1284
#14 PostgresMain (argc=1, argv=argv@entry=0x1c8af78, dbname=0x1c89e70 "postgres", username=0x1c5b730 "linw") at postgres.c:4812
#15 BackendRun (port=, port=) at postmaster.c:4922
#16 BackendStartup (port=0x1c835d0) at postmaster.c:4607
#17 ServerLoop () at postmaster.c:1963
#18 PostmasterMain (argc=argc@entry=7, argv=argv@entry=0x1c595b0) at postmaster.c:1589
#19 in main (argc=7, argv=0x1c595b0) at main.c:240

The AccessExclusiveLock has not been released, and it is not compatible with any other lock mode, so the startup process blocks on this lock and the mirror cannot successfully become the primary. Even if users run "gprecoverseg" to recover the failed primary segment, the result is similar: the primary segment recovers from xlog, recovers the prepared transactions, and acquires AccessExclusiveLock on pg_resgroupcapability, so the startup process again blocks on this lock. Only if users switch the resource manager type to "queue" is InitResGroups not called at all; nothing blocks, and the primary segment can start up normally.

After this fix, ExclusiveLock is acquired when altering a resource group. In the case above, the startup process acquires AccessShareLock, and ExclusiveLock and AccessShareLock are compatible, so the startup process can complete successfully. After startup, the QE receives the RECOVERY_COMMIT_PREPARED command from the QD, finishes the second phase of the distributed transaction, and releases the ExclusiveLock on pg_resgroupcapability. The call stack is:

#0  lock_twophase_postcommit (xid=, info=, recdata=0x3303458, len=) at lock.c:4758
#1  ProcessRecords (callbacks=, xid=, bufptr=0x3303458 "") at twophase.c:1757
#2  FinishPreparedTransaction (gid=gid@entry=0x323caf5 "25", isCommit=isCommit@entry=true, raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at twophase.c:1704
#3  in performDtxProtocolCommitPrepared (gid=gid@entry=0x323caf5 "25", raiseErrorIfNotFound=raiseErrorIfNotFound@entry=false) at cdbtm.c:2107
#4  performDtxProtocolCommand (dtxProtocolCommand=dtxProtocolCommand@entry=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED, gid=gid@entry=0x323caf5 "25", contextInfo=contextInfo@entry=0x10e1820 ) at cdbtm.c:2279
#5  exec_mpp_dtx_protocol_command (contextInfo=0x10e1820 , gid=0x323caf5 "25", loggingStr=0x323cad8 "Recovery Commit Prepared", dtxProtocolCommand=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED) at postgres.c:1570
#6  PostgresMain (argc=, argv=argv@entry=0x3268f98, dbname=0x3267e90 "postgres", username=) at postgres.c:5482

The test case in this commit simulates a reproduction of this bug.
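To make the fix concrete, here is a minimal sketch of the post-fix locking pattern. It is not the actual Greenplum source: the function name and the commented-out validation/update steps are illustrative, while table_open()/table_close(), the lock-mode constants, and the catalog OID macro match the identifiers shown in the diff below (the catalog header name is assumed).

/*
 * Minimal illustrative sketch of the locking pattern after this fix.
 * The function name is hypothetical; the validation and update steps are
 * shown only as comments.
 */
#include "postgres.h"
#include "access/table.h"                  /* table_open()/table_close() */
#include "catalog/pg_resgroupcapability.h" /* assumed to define ResGroupCapabilityRelationId */
#include "storage/lockdefs.h"              /* ExclusiveLock, NoLock */

static void
alter_resgroup_capability_sketch(void)
{
	Relation	rel;

	/*
	 * ExclusiveLock makes concurrent CREATE/ALTER RESOURCE GROUP statements
	 * conflict with each other, but it does not conflict with the
	 * AccessShareLock taken by InitResGroups() during startup/recovery.
	 */
	rel = table_open(ResGroupCapabilityRelationId, ExclusiveLock);

	/* ... scan all groups and verify the summed cpu_rate_limit stays <= 100 ... */
	/* ... update the capability rows of the altered group ... */

	/*
	 * Close the relation but keep the lock (NoLock); in the 2PC case the
	 * lock is released only when the prepared transaction finally commits.
	 */
	table_close(rel, NoLock);
}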
1 parent 2fa7c06 commit 3f58c32

4 files changed: +187 -5 lines changed


src/backend/commands/resgroupcmds.c

Lines changed: 12 additions & 5 deletions
@@ -136,10 +136,10 @@ CreateResourceGroup(CreateResourceGroupStmt *stmt)
 	/*
 	 * both CREATE and ALTER resource group need check the sum of cpu_rate_limit
 	 * and memory_limit and make sure the sum don't exceed 100. To make it simple,
-	 * acquire AccessExclusiveLock lock on pg_resgroupcapability at the beginning
+	 * acquire ExclusiveLock lock on pg_resgroupcapability at the beginning
 	 * of CREATE and ALTER
 	 */
-	pg_resgroupcapability_rel = table_open(ResGroupCapabilityRelationId, AccessExclusiveLock);
+	pg_resgroupcapability_rel = table_open(ResGroupCapabilityRelationId, ExclusiveLock);
 	pg_resgroup_rel = table_open(ResGroupRelationId, RowExclusiveLock);
 
 	/* Check if MaxResourceGroups limit is reached */
@@ -428,11 +428,18 @@ AlterResourceGroup(AlterResourceGroupStmt *stmt)
 	/*
 	 * In validateCapabilities() we scan all the resource groups
 	 * to check whether the total cpu_rate_limit exceed 100 or not.
-	 * We need to use AccessExclusiveLock here to prevent concurrent
-	 * increase on different resource group.
+	 * We use ExclusiveLock here to prevent concurrent
+	 * increase on different resource group.
+	 * We can't use AccessExclusiveLock here, the reason is that,
+	 * if there is a database recovery happened when run "alter resource group"
+	 * and acquire this kind of lock, the initialization of resource group
+	 * in function InitResGroups will be pending during database startup,
+	 * since this function will open this table with AccessShareLock,
+	 * AccessExclusiveLock is not compatible with any other lock.
+	 * ExclusiveLock and AccessShareLock are compatible.
 	 */
 	pg_resgroupcapability_rel = heap_open(ResGroupCapabilityRelationId,
-										  AccessExclusiveLock);
+										  ExclusiveLock);
 
 	/* Load current resource group capabilities */
 	GetResGroupCapabilities(pg_resgroupcapability_rel, groupid, &oldCaps);
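The compatibility claim in the new comment above can be checked against PostgreSQL's lock conflict table in src/backend/storage/lmgr/lock.c. The excerpt below is an abridged paraphrase for illustration only and is not part of this commit: AccessShareLock conflicts only with AccessExclusiveLock, ExclusiveLock conflicts with every mode except AccessShareLock, and AccessExclusiveLock conflicts with every mode.

/*
 * Abridged paraphrase of the LockConflicts[] table in lock.c, shown only to
 * illustrate why the lock downgrade works; not code added by this commit.
 */
#include "postgres.h"
#include "storage/lock.h"	/* LOCKMASK, LOCKBIT_ON(), lock mode constants */

static const LOCKMASK lock_conflicts_excerpt[] = {
	0,							/* NoLock: conflicts with nothing */

	/* AccessShareLock: conflicts only with AccessExclusiveLock */
	LOCKBIT_ON(AccessExclusiveLock),

	/* ... intermediate lock modes omitted in this excerpt ... */

	/* ExclusiveLock: conflicts with everything except AccessShareLock */
	LOCKBIT_ON(RowShareLock) | LOCKBIT_ON(RowExclusiveLock) |
	LOCKBIT_ON(ShareUpdateExclusiveLock) | LOCKBIT_ON(ShareLock) |
	LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) |
	LOCKBIT_ON(AccessExclusiveLock),

	/* AccessExclusiveLock: conflicts with every mode, including AccessShareLock */
	LOCKBIT_ON(AccessShareLock) | LOCKBIT_ON(RowShareLock) |
	LOCKBIT_ON(RowExclusiveLock) | LOCKBIT_ON(ShareUpdateExclusiveLock) |
	LOCKBIT_ON(ShareLock) | LOCKBIT_ON(ShareRowExclusiveLock) |
	LOCKBIT_ON(ExclusiveLock) | LOCKBIT_ON(AccessExclusiveLock)
};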
Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+-- This test performs segment reconfiguration when "alter resource group" is executed in the two phase commit.
+-- The steps are, when run "alter resource group", before QD broadcasts commit prepared command to QEs(the
+-- second phase of 2PC), we trigger an error and cause one primary segment down.
+-- The expectation is "alter resource group" can run successfully since the mirror segment is UP.
+-- After recover the segment, there is no error or blocking.
+
+-- set these values purely to cut down test time, as default fts trigger is
+-- every min and 5 retries
+alter system set gp_fts_probe_interval to 10;
+ALTER
+alter system set gp_fts_probe_retries to 0;
+ALTER
+select pg_reload_conf();
+pg_reload_conf
+----------------
+t
+(1 row)
+
+1:create resource group rgroup_seg_down with (CPU_RATE_LIMIT=35, MEMORY_LIMIT=35, CONCURRENCY=10);
+CREATE
+
+-- inject an error in function dtm_broadcast_commit_prepared, that is before QD broadcasts commit prepared command to QEs
+2:select gp_inject_fault_infinite('dtm_broadcast_commit_prepared', 'suspend', dbid) from gp_segment_configuration where role='p' and content=-1;
+gp_inject_fault_infinite
+--------------------------
+Success:
+(1 row)
+-- this session will pend here since the above injected fault
+1&:alter resource group rgroup_seg_down set CONCURRENCY 20; <waiting ...>
+-- this injected fault can make dispatcher think the primary is down
+2:select gp_inject_fault('fts_conn_startup_packet', 'error', dbid) from gp_segment_configuration where role='p' and content=0;
+gp_inject_fault
+-----------------
+Success:
+(1 row)
+2:select gp_request_fts_probe_scan();
+gp_request_fts_probe_scan
+---------------------------
+t
+(1 row)
+-- make sure one primary segment is down.
+2:select status = 'd' from gp_segment_configuration where content = 0 and role = 'm';
+?column?
+----------
+t
+(1 row)
+-- reset the injected fault on QD and the "alter resource group" in session1 can continue
+2:select gp_inject_fault('dtm_broadcast_commit_prepared', 'reset', dbid) from gp_segment_configuration where role='p' and content=-1;
+gp_inject_fault
+-----------------
+Success:
+(1 row)
+-- reset the injected fault on primary segment
+2:select gp_inject_fault('fts_conn_startup_packet', 'reset', dbid) from gp_segment_configuration where content=0;
+gp_inject_fault
+-----------------
+Success:
+Success:
+(2 rows)
+1<: <... completed>
+ALTER
+-- make sure "alter resource group" has taken effect.
+1:select concurrency from gp_toolkit.gp_resgroup_config where groupname = 'rgroup_seg_down';
+concurrency
+-------------
+20
+(1 row)
+2q: ... <quitting>
+
+!\retcode gprecoverseg -aF --no-progress;
+-- start_ignore
+-- end_ignore
+(exited with code 0)
+
+-- loop while segments come in sync
+1:select wait_until_all_segments_synchronized();
+wait_until_all_segments_synchronized
+--------------------------------------
+OK
+(1 row)
+
+!\retcode gprecoverseg -ar;
+-- start_ignore
+-- end_ignore
+(exited with code 0)
+
+-- loop while segments come in sync
+1:select wait_until_all_segments_synchronized();
+wait_until_all_segments_synchronized
+--------------------------------------
+OK
+(1 row)
+
+-- verify no segment is down after recovery
+1:select count(*) from gp_segment_configuration where status = 'd';
+count
+-------
+0
+(1 row)
+
+-- verify resource group
+1:select concurrency from gp_toolkit.gp_resgroup_config where groupname = 'rgroup_seg_down';
+concurrency
+-------------
+20
+(1 row)
+1:drop resource group rgroup_seg_down;
+DROP
+
+1:alter system reset gp_fts_probe_interval;
+ALTER
+1:alter system reset gp_fts_probe_retries;
+ALTER
+1:select pg_reload_conf();
+pg_reload_conf
+----------------
+t
+(1 row)
+1q: ... <quitting>
+
src/test/isolation2/isolation2_resgroup_schedule

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ test: resgroup/resgroup_name_convention
 # fault injection tests
 test: resgroup/resgroup_assign_slot_fail
 test: resgroup/resgroup_unassign_entrydb
+test: resgroup/resgroup_seg_down_2pc
 
 # functions
 test: resgroup/resgroup_concurrency
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+-- This test performs segment reconfiguration when "alter resource group" is executed in the two phase commit.
+-- The steps are, when run "alter resource group", before QD broadcasts commit prepared command to QEs(the
+-- second phase of 2PC), we trigger an error and cause one primary segment down.
+-- The expectation is "alter resource group" can run successfully since the mirror segment is UP.
+-- After recover the segment, there is no error or blocking.
+
+-- set these values purely to cut down test time, as default fts trigger is
+-- every min and 5 retries
+alter system set gp_fts_probe_interval to 10;
+alter system set gp_fts_probe_retries to 0;
+select pg_reload_conf();
+
+1:create resource group rgroup_seg_down with (CPU_RATE_LIMIT=35, MEMORY_LIMIT=35, CONCURRENCY=10);
+
+-- inject an error in function dtm_broadcast_commit_prepared, that is before QD broadcasts commit prepared command to QEs
+2:select gp_inject_fault_infinite('dtm_broadcast_commit_prepared', 'suspend', dbid) from gp_segment_configuration where role='p' and content=-1;
+-- this session will pend here since the above injected fault
+1&:alter resource group rgroup_seg_down set CONCURRENCY 20;
+-- this injected fault can make dispatcher think the primary is down
+2:select gp_inject_fault('fts_conn_startup_packet', 'error', dbid) from gp_segment_configuration where role='p' and content=0;
+2:select gp_request_fts_probe_scan();
+-- make sure one primary segment is down.
+2:select status = 'd' from gp_segment_configuration where content = 0 and role = 'm';
+-- reset the injected fault on QD and the "alter resource group" in session1 can continue
+2:select gp_inject_fault('dtm_broadcast_commit_prepared', 'reset', dbid) from gp_segment_configuration where role='p' and content=-1;
+-- reset the injected fault on primary segment
+2:select gp_inject_fault('fts_conn_startup_packet', 'reset', dbid) from gp_segment_configuration where content=0;
+1<:
+-- make sure "alter resource group" has taken effect.
+1:select concurrency from gp_toolkit.gp_resgroup_config where groupname = 'rgroup_seg_down';
+2q:
+
+!\retcode gprecoverseg -aF --no-progress;
+
+-- loop while segments come in sync
+1:select wait_until_all_segments_synchronized();
+
+!\retcode gprecoverseg -ar;
+
+-- loop while segments come in sync
+1:select wait_until_all_segments_synchronized();
+
+-- verify no segment is down after recovery
+1:select count(*) from gp_segment_configuration where status = 'd';
+
+-- verify resource group
+1:select concurrency from gp_toolkit.gp_resgroup_config where groupname = 'rgroup_seg_down';
+1:drop resource group rgroup_seg_down;
+
+1:alter system reset gp_fts_probe_interval;
+1:alter system reset gp_fts_probe_retries;
+1:select pg_reload_conf();
+1q:
+