DISTMYSQL-243 - Orchestrator uses inefficient subquery in REPLACE to update cluster aliases

satya-bodapati · satya-bodapati · commit 58b0bf9fbd9a · 2023-01-10T14:43:31.000+05:30
Problem:
--------
Orchestrator has a backend table, cluster_alias, to store aliases.
At certain intervals, orchestrator updates the existing aliases or inserts new host aliases.
To do this, it uses the query below in UpdateClusterAliases():
```
			replace into
					cluster_alias (alias, cluster_name, last_registered)
				select
				    suggested_cluster_alias,
						cluster_name,
						now()
					from
				    database_instance
				    left join database_instance_downtime using (hostname, port)
				  where
				    suggested_cluster_alias!=''
						/* exclude newly demoted, downtimed masters */
						and ifnull(
								database_instance_downtime.downtime_active = 1
								and database_instance_downtime.end_timestamp > now()
								and database_instance_downtime.reason = ?
							, 0) = 0
					order by
						ifnull(last_checked <= last_seen, 0) asc,
						read_only desc,
						num_slave_hosts asc

```

The problem with this SELECT is that it generates the same (alias, cluster_name) pair multiple times.
REPLACE is executed as DELETE+INSERT, so it repeats the same work for every duplicated record.

This creates unnecessary work for two subsystems in InnoDB:
1. Purge
2. SELECTs (ReadView)

All those delete-marked records put stress on Purge.
When a SELECT has to build an older version of a record, it must walk a chain of old record versions (using the undo log).
The unnecessary REPLACE work makes that chain longer than it needs to be.
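
To make the cost concrete, here is a minimal Go sketch that replays REPLACE row by row over a small, made-up result set and counts the deletes and inserts. It is only an illustration: the aliasRow type, the replayReplace helper and the sample values are hypothetical, and it assumes for simplicity that the whole (alias, cluster_name) pair is the conflicting key.

```go
package main

import "fmt"

// aliasRow is a hypothetical stand-in for one row produced by the SELECT.
type aliasRow struct {
	Alias       string
	ClusterName string
}

// replayReplace simulates REPLACE as DELETE+INSERT, one row at a time.
// Assumption: the full (alias, cluster_name) pair is the conflicting key.
// It returns how many inserts and how many deletes were performed.
func replayReplace(rows []aliasRow) (inserts, deletes int) {
	table := make(map[aliasRow]bool)
	for _, r := range rows {
		if table[r] {
			// The row already exists: REPLACE delete-marks the old copy first.
			deletes++
		}
		table[r] = true
		inserts++
	}
	return inserts, deletes
}

func main() {
	// Made-up sample: the same pair is produced three times.
	rows := []aliasRow{
		{"prod", "db1:3306"},
		{"prod", "db1:3306"},
		{"prod", "db1:3306"},
		{"staging", "db9:3306"},
	}
	inserts, deletes := replayReplace(rows)
	fmt.Printf("inserts=%d deletes=%d\n", inserts, deletes) // inserts=4 deletes=2
}
```

In this sketch, two of the four inserts only rewrite a row that was written a moment earlier, and each of those leaves one more delete-marked version behind.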

Fix:
----
Use GROUP BY to filter duplicate records in the subquery in REPLACE.
Also skip rows with an empty cluster_name and drop the ORDER BY.
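
The effect of the grouping can be sketched in Go as follows. This is an in-memory analogy only, not orchestrator code: the aliasRow type, the groupRows helper and the sample data are hypothetical. It keeps one row per (alias, cluster_name) pair and skips rows where either value is empty, mirroring the WHERE and GROUP BY clauses.

```go
package main

import "fmt"

// aliasRow is a hypothetical stand-in for one row produced by the SELECT.
type aliasRow struct {
	Alias       string
	ClusterName string
}

// groupRows keeps the first occurrence of each (alias, cluster_name) pair,
// like GROUP BY suggested_cluster_alias, cluster_name, and skips rows with
// an empty alias or cluster name, like the two != '' conditions.
func groupRows(rows []aliasRow) []aliasRow {
	seen := make(map[aliasRow]struct{}, len(rows))
	var out []aliasRow
	for _, r := range rows {
		if r.Alias == "" || r.ClusterName == "" {
			continue
		}
		if _, ok := seen[r]; ok {
			continue
		}
		seen[r] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	// Made-up sample: three duplicates, one distinct pair, one empty cluster name.
	rows := []aliasRow{
		{"prod", "db1:3306"},
		{"prod", "db1:3306"},
		{"prod", "db1:3306"},
		{"staging", "db9:3306"},
		{"orphan", ""},
	}
	fmt.Println(len(rows), "->", len(groupRows(rows))) // 5 -> 2
}
```

With one row per pair, REPLACE touches each alias at most once per run.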
diff --git a/go/inst/cluster_alias_dao.go b/go/inst/cluster_alias_dao.go
@@ -126,16 +126,15 @@ func UpdateClusterAliases() error {
 				    left join database_instance_downtime using (hostname, port)
 				  where
 				    suggested_cluster_alias!=''
+						and cluster_name != ''
 						/* exclude newly demoted, downtimed masters */
 						and ifnull(
 								database_instance_downtime.downtime_active = 1
 								and database_instance_downtime.end_timestamp > now()
 								and database_instance_downtime.reason = ?
 							, 0) = 0
-					order by
-						ifnull(last_checked <= last_seen, 0) asc,
-						read_only desc,
-						num_slave_hosts asc
+					group by
+					       suggested_cluster_alias, cluster_name
 			`, DowntimeLostInRecoveryMessage)
 		return log.Errore(err)
 	}