Commit 1fad4e1

[ML] Speed up persistent task rechecks in ML failover tests (#43291)
The ML failover tests sometimes need to wait for jobs to be assigned to new nodes following a node failure. They wait 10 seconds for this to happen. However, if the failed node was the master and a new master was elected, 10 seconds might not be long enough, because a refresh of the memory stats delays job assignment. Once the memory refresh completes, the persistent task is assigned either when the next cluster state update occurs or after the periodic recheck interval, which defaults to 30 seconds. Rather than increasing the wait for assignment to 31 seconds, this change decreases the periodic recheck interval to 1 second.

Fixes #43289
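
The timing argument can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions, not Elasticsearch code: the class `PeriodicRecheckSketch`, its methods, and the node name `node-1` are hypothetical, a scheduled executor stands in for the periodic recheck, and a simple polling loop stands in for the bounded wait in the test. Millisecond values are used so that it runs quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// Hypothetical model: a task is only assigned when a periodic recheck fires,
// and a caller waits a bounded time for that assignment to appear.
final class PeriodicRecheckSketch {

    // Polls the condition until it holds or the timeout elapses
    // (a rough stand-in for a bounded busy-wait in a test).
    static boolean awaitTrue(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean();
    }

    // Returns true if the simulated job is assigned within waitMillis when the
    // recheck runs every recheckIntervalMillis. The first recheck fires one full
    // interval after scheduling, modelling the case where no other event
    // triggers an earlier assignment.
    static boolean assignedWithin(long recheckIntervalMillis, long waitMillis) {
        AtomicReference<String> jobNode = new AtomicReference<>();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            scheduler.scheduleWithFixedDelay(
                () -> jobNode.compareAndSet(null, "node-1"),
                recheckIntervalMillis, recheckIntervalMillis, TimeUnit.MILLISECONDS);
            return awaitTrue(() -> jobNode.get() != null, waitMillis);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Recheck interval much shorter than the wait.
        System.out.println("short interval assigned: " + assignedWithin(20, 2000));
        // Recheck interval longer than the wait.
        System.out.println("long interval assigned: " + assignedWithin(5000, 100));
    }
}
```

In this model the wait only succeeds reliably when the recheck interval is comfortably shorter than the wait, which is the reason given above for shrinking the interval instead of lengthening the wait.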
1 parent 5d3cae4 commit 1fad4e1

1 file changed: +12 -0 lines changed

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/support/BaseMlIntegTestCase.java

Lines changed: 12 additions & 0 deletions
@@ -23,6 +23,7 @@
 import org.elasticsearch.index.reindex.ReindexPlugin;
 import org.elasticsearch.indices.recovery.RecoveryState;
 import org.elasticsearch.license.LicenseService;
+import org.elasticsearch.persistent.PersistentTasksClusterService;
 import org.elasticsearch.plugins.Plugin;
 import org.elasticsearch.test.ESIntegTestCase;
 import org.elasticsearch.test.discovery.TestZenDiscovery;
@@ -351,6 +352,17 @@ public static void deleteAllJobs(Logger logger, Client client) throws Exception
     }
 
     protected String awaitJobOpenedAndAssigned(String jobId, String queryNode) throws Exception {
+
+        PersistentTasksClusterService persistentTasksClusterService =
+            internalCluster().getInstance(PersistentTasksClusterService.class, internalCluster().getMasterName());
+        // Speed up rechecks to a rate that is quicker than what settings would allow.
+        // The check would work eventually without doing this, but the assertBusy() below
+        // would need to wait 30 seconds, which would make the test run very slowly.
+        // The 1 second refresh puts a greater burden on the master node to recheck
+        // persistent tasks, but it will cope in these tests as it's not doing much
+        // else.
+        persistentTasksClusterService.setRecheckInterval(TimeValue.timeValueSeconds(1));
+
         AtomicReference<String> jobNode = new AtomicReference<>();
         assertBusy(() -> {
             GetJobsStatsAction.Response statsResponse =
