
Commit 88b6343

fix(scheduler): Coordinator Task Throttling Bug (#27146)
## Description

With coordinator task-based throttling (queueing) enabled, we run into an issue where certain resource groups are never updated to be eligible to run. This occurs when a resource group is created during a task throttling period and `canRun` returns false, so the group is never added as an eligible subgroup on creation. When we exit task throttling, an eligibility update is never triggered; if the group does not receive a new query after task throttling ends, its status is never updated.

Changes:
1. Move the `isTaskLimitExceeded` check from `canRunMore` to `internalStartNext`. `canRunMore` now returns true, allowing the group to be marked as eligible, while `internalStartNext` still prevents the group from running more queries.
2. Add a check so that immediate-execution candidates are enqueued while task throttling is active.
3. Remove `experimental` from the session property.
4. Add tests to ensure resource groups properly queue/run queries with task limits (should this be in `resourceGroups` or `testQueryTaskLimit`?).

Meta Internal review by: spershin
Meta Internal Differential Revision: D92632990

## Motivation and Context

Coordinator memory is being overloaded by queries with large task counts. There need to be safeguards for this beyond resource groups alone. The existing coordinator task throttling property has some issues, which are fixed by this PR.

## Impact

Coordinator task throttling no longer causes stuck resource groups.

The config is renamed from `experimental.max-total-running-task-count-to-not-execute-new-query` to `max-total-running-task-count-to-not-execute-new-query`; the old name is kept as a legacy config for backwards compatibility.

Coordinator task throttling, when used in conjunction with query pacing, should keep the number of tasks on the cluster close to the limit.

## Test Plan

Bug reproduction: set the task limit to 1 on a test cluster, then trigger multiple queries that peak at 10-30 tasks and have execution times of 10-30 seconds.

<img width="2732" height="1482" alt="image" src="https://github.com/user-attachments/assets/3bef60b3-ee39-4190-8e0f-b972736876af" />

Repro with a larger query suite:

<img width="2594" height="1254" alt="image" src="https://github.com/user-attachments/assets/dd26ed05-4c33-4c46-ae70-fed41098c389" />

Test: build and push again to the test cluster and repeat the previous repro. The throttling kicks in as expected: the cluster submits a lot of queries as running before TaskLimitExceeded fires, after which it runs 1-3 queries at a time for the remainder of the queue. However, the cluster was still admitting queries slowly even while in a task throttling state.

<img width="870" height="686" alt="image" src="https://github.com/user-attachments/assets/04c14abd-8c74-4c90-9625-2b2119ee2fc6" />

Following the previous fix, it was noticed that `internalStartNext` would not prevent immediate executions, only queued queries. This was then patched to block immediate executions during task throttling periods, so queries cannot start while the coordinator is in a task throttling state.

Test with the second fix:

<img width="2438" height="1150" alt="image" src="https://github.com/user-attachments/assets/3ed7d10f-2d5f-47f3-9969-e343b467ef5f" />

The spikes in this run occur because multiple queries can be admitted with no pacing before re-entering the task throttling state. With query pacing, this should be mitigated.
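For reference, a hedged sketch of the repro configuration described in the test plan above (assuming the property is set in the coordinator's `config.properties`; only the task-limit value comes from the test plan):

```properties
# Sketch of the repro setup above: hold new queries in the queue once more than
# one task is running cluster-wide.
max-total-running-task-count-to-not-execute-new-query=1
```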
## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.

## Summary by Sourcery

Fix coordinator task-based throttling so resource groups correctly queue and start queries when task limits are exceeded and later cleared.

Bug Fixes:
- Ensure resource groups remain eligible and properly queue queries instead of silently starving when the coordinator task limit is exceeded.
- Prevent new queries from starting immediately when the coordinator is overloaded, while still allowing existing running queries to continue.

Enhancements:
- Refine admission control in resource groups to consider coordinator overload separately from eligibility and concurrency checks.
- Promote the task-limit-based throttling session property from experimental by renaming its configuration key.

Tests:
- Add unit tests covering query queuing and execution across task-limit transitions, including subgroup hierarchies and multiple throttle cycles.
- Update configuration and task-limit integration tests to use the non-experimental task throttling property.

## Release Notes

Please follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below.

```
== RELEASE NOTES ==

General Changes
* Fix a bug where queries could get permanently stuck in resource groups when coordinator task-based throttling (``experimental.max-total-running-task-count-to-not-execute-new-query``) is enabled.
* Replace ``experimental.max-total-running-task-count-to-not-execute-new-query`` with ``max-total-running-task-count-to-not-execute-new-query``. The legacy name remains supported for backwards compatibility.
```
1 parent 8236897 commit 88b6343

File tree

6 files changed: +265 -8 lines changed


presto-docs/src/main/sphinx/admin/properties.rst

Lines changed: 34 additions & 0 deletions
```diff
@@ -1368,6 +1368,40 @@ This allows the cluster to quickly ramp up when idle while still providing
 protection against overload when the cluster is busy. Set to ``0`` to always
 apply pacing when ``max-queries-per-second`` is configured.
 
+``max-total-running-task-count-to-not-execute-new-query``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* **Type:** ``integer``
+* **Minimum value:** ``1``
+* **Default value:** ``2147483647`` (unlimited)
+
+Maximum total running task count across all queries on the coordinator. When
+this threshold is exceeded, new queries are held in the queue rather than
+being scheduled for execution. This helps prevent coordinator overload by
+limiting the number of concurrent tasks being managed.
+
+Unlike ``max-total-running-task-count-to-kill-query`` which kills queries when
+the limit is exceeded, this property proactively prevents new queries from
+starting while allowing existing queries to complete normally.
+
+This property works in conjunction with query admission pacing
+(``query-manager.query-pacing.max-queries-per-second``) to provide
+comprehensive coordinator load management. When both are configured:
+
+1. Pacing controls the rate at which queries are admitted
+2. This property provides a hard cap on total concurrent tasks
+
+Without query-pacing, the cluster can admit multiple queries at once, which
+can lead to significantly more concurrent tasks than expected over this limit.
+
+Set to a lower value (e.g., ``50000``) to limit coordinator task management
+overhead. The default value effectively disables this feature.
+
+.. note::
+
+    For backwards compatibility, this property can also be configured using the
+    legacy name ``experimental.max-total-running-task-count-to-not-execute-new-query``.
+
 Query Retry Properties
 ----------------------
 
```
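As a hedged illustration of the pairing described in the documentation above (the property names appear in the doc text; the numeric values are placeholders, not tuned recommendations), a coordinator `config.properties` might combine the task cap with admission pacing like this:

```properties
# Hard cap on total concurrent tasks; 50000 is the example value from the documentation.
max-total-running-task-count-to-not-execute-new-query=50000
# Pace query admission so bursts do not overshoot the cap (placeholder rate).
query-manager.query-pacing.max-queries-per-second=5

# The legacy key is still accepted and maps to the same setting:
# experimental.max-total-running-task-count-to-not-execute-new-query=50000
```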

presto-main-base/src/main/java/com/facebook/presto/execution/QueryManagerConfig.java

Lines changed: 2 additions & 1 deletion
```diff
@@ -321,7 +321,8 @@ public int getMaxQueryRunningTaskCount()
         return maxQueryRunningTaskCount;
     }
 
-    @Config("experimental.max-total-running-task-count-to-not-execute-new-query")
+    @LegacyConfig("experimental.max-total-running-task-count-to-not-execute-new-query")
+    @Config("max-total-running-task-count-to-not-execute-new-query")
     @ConfigDescription("Keep new queries in the queue if total task count exceeds this threshold")
     public QueryManagerConfig setMaxTotalRunningTaskCountToNotExecuteNewQuery(int maxTotalRunningTaskCountToNotExecuteNewQuery)
     {
```

presto-main-base/src/main/java/com/facebook/presto/execution/resourceGroups/InternalResourceGroup.java

Lines changed: 16 additions & 5 deletions
```diff
@@ -740,7 +740,18 @@ public void run(ManagedQueryExecution query)
         else {
             query.setResourceGroupQueryLimits(perQueryLimits);
             boolean immediateStartCandidate = canRun && queuedQueries.isEmpty();
-            if (immediateStartCandidate && queryPacingContext.tryAcquireAdmissionSlot()) {
+            boolean startQuery = immediateStartCandidate;
+            if (immediateStartCandidate) {
+                // Check for coordinator overload (task limit exceeded or denied admission)
+                // isTaskLimitExceeded MUST be checked before tryAcquireAdmissionSlot, or else admission slots will be acquired but not started
+                boolean coordOverloaded = ((RootInternalResourceGroup) root).isTaskLimitExceeded()
+                        || !queryPacingContext.tryAcquireAdmissionSlot();
+                if (coordOverloaded) {
+                    startQuery = false;
+                }
+            }
+
+            if (startQuery) {
                 startInBackground(query);
             }
             else {
@@ -914,6 +925,10 @@ protected boolean internalStartNext()
     {
         checkState(Thread.holdsLock(root), "Must hold lock to find next query");
         synchronized (root) {
+            if (((RootInternalResourceGroup) root).isTaskLimitExceeded()) {
+                return false;
+            }
+
             if (!canRunMore()) {
                 return false;
             }
@@ -1052,10 +1067,6 @@ private boolean canRunMore()
             return false;
         }
 
-        if (((RootInternalResourceGroup) root).isTaskLimitExceeded()) {
-            return false;
-        }
-
         int hardConcurrencyLimit = getHardConcurrencyLimitBasedOnCpuUsage();
 
         int totalRunningQueries = runningQueries.size() + descendantRunningQueries;
```
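To make the intent of this change easier to follow, here is a minimal, self-contained sketch (a toy class, not Presto code; the method names only echo the ones in the diff above): eligibility no longer consults the coordinator task limit, while the start path checks it first.

```java
// Toy illustration of the split introduced above (NOT Presto code).
// Eligibility (analogous to canRunMore) ignores the coordinator task limit, so a
// throttled group stays in the eligible set; the start path (analogous to
// internalStartNext and the immediate-start branch of run()) refuses to start queries.
public class AdmissionSketch
{
    // Coordinator-wide flag, analogous to RootInternalResourceGroup.isTaskLimitExceeded()
    private static boolean taskLimitExceeded;
    // Stand-in for the group's memory/concurrency/CPU checks
    private static boolean underGroupLimits = true;

    // Eligibility: the group can be marked eligible even while the coordinator is throttled
    static boolean canRunMore()
    {
        return underGroupLimits;
    }

    // Start path: bail out first while the coordinator is task-throttled
    static boolean startNextQueuedQuery()
    {
        if (taskLimitExceeded) {
            return false;
        }
        if (!canRunMore()) {
            return false;
        }
        // ... dequeue the highest-priority query and start it ...
        return true;
    }

    public static void main(String[] args)
    {
        taskLimitExceeded = true;
        System.out.println("eligible=" + canRunMore() + " started=" + startNextQueuedQuery()); // eligible=true started=false
        taskLimitExceeded = false;
        System.out.println("eligible=" + canRunMore() + " started=" + startNextQueuedQuery()); // eligible=true started=true
    }
}
```

Because the group still reports itself as eligible, clearing the task limit only requires re-running the start path, which the new tests below exercise via `processQueuedQueries()`.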

presto-main-base/src/test/java/com/facebook/presto/execution/TestQueryManagerConfig.java

Lines changed: 1 addition & 1 deletion
```diff
@@ -102,7 +102,7 @@ public void testExplicitPropertyMappings()
                 .put("query.stage-count-warning-threshold", "12300")
                 .put("max-total-running-task-count-to-kill-query", "60000")
                 .put("max-query-running-task-count", "10000")
-                .put("experimental.max-total-running-task-count-to-not-execute-new-query", "50000")
+                .put("max-total-running-task-count-to-not-execute-new-query", "50000")
                 .put("concurrency-threshold-to-enable-resource-group-refresh", "2")
                 .put("resource-group-runtimeinfo-refresh-interval", "10ms")
                 .put("query.schedule-split-batch-size", "99")
```

presto-main-base/src/test/java/com/facebook/presto/execution/resourceGroups/TestResourceGroups.java

Lines changed: 211 additions & 0 deletions
```diff
@@ -1098,4 +1098,215 @@ public String getName()
 
         return new ClusterResourceChecker(mockPolicy, config, createNodeManager());
     }
+
+    // Tests that when task limit is exceeded, new queries are queued instead of starting immediately
+    @Test(timeOut = 10_000)
+    public void testTaskLimitExceededQueuesQuery()
+    {
+        RootInternalResourceGroup root = new RootInternalResourceGroup(
+                "root",
+                (group, export) -> {},
+                directExecutor(),
+                ignored -> Optional.empty(),
+                rg -> false,
+                createNodeManager(),
+                createClusterResourceChecker(),
+                QueryPacingContext.NOOP);
+        root.setSoftMemoryLimit(new DataSize(1, MEGABYTE));
+        root.setMaxQueuedQueries(10);
+        root.setHardConcurrencyLimit(10);
+
+        // Set task limit exceeded
+        root.setTaskLimitExceeded(true);
+
+        // Submit a query - it should be queued because task limit is exceeded
+        MockManagedQueryExecution query1 = new MockManagedQueryExecution(0);
+        query1.startWaitingForPrerequisites();
+        root.run(query1);
+
+        // Query should be queued, not running
+        assertEquals(query1.getState(), QUEUED);
+        assertEquals(root.getQueuedQueries(), 1);
+        assertEquals(root.getRunningQueries(), 0);
+    }
+
+    // Tests that queued queries start when task limit is no longer exceeded
+    @Test(timeOut = 10_000)
+    public void testQueryStartsWhenTaskLimitClears()
+    {
+        RootInternalResourceGroup root = new RootInternalResourceGroup(
+                "root",
+                (group, export) -> {},
+                directExecutor(),
+                ignored -> Optional.empty(),
+                rg -> false,
+                createNodeManager(),
+                createClusterResourceChecker(),
+                QueryPacingContext.NOOP);
+        root.setSoftMemoryLimit(new DataSize(1, MEGABYTE));
+        root.setMaxQueuedQueries(10);
+        root.setHardConcurrencyLimit(10);
+
+        // Set task limit exceeded
+        root.setTaskLimitExceeded(true);
+
+        // Submit queries - they should be queued
+        MockManagedQueryExecution query1 = new MockManagedQueryExecution(0);
+        query1.startWaitingForPrerequisites();
+        root.run(query1);
+        MockManagedQueryExecution query2 = new MockManagedQueryExecution(0);
+        query2.startWaitingForPrerequisites();
+        root.run(query2);
+
+        assertEquals(query1.getState(), QUEUED);
+        assertEquals(query2.getState(), QUEUED);
+        assertEquals(root.getQueuedQueries(), 2);
+        assertEquals(root.getRunningQueries(), 0);
+
+        // Clear task limit
+        root.setTaskLimitExceeded(false);
+
+        // Process queued queries - they should now start
+        root.processQueuedQueries();
+
+        assertEquals(query1.getState(), RUNNING);
+        assertEquals(query2.getState(), RUNNING);
+        assertEquals(root.getQueuedQueries(), 0);
+        assertEquals(root.getRunningQueries(), 2);
+    }
+
+    // Tests that queries in a subgroup hierarchy are properly queued and started when task limit changes
+    @Test(timeOut = 10_000)
+    public void testTaskLimitExceededWithSubgroups()
+    {
+        RootInternalResourceGroup root = new RootInternalResourceGroup(
+                "root",
+                (group, export) -> {},
+                directExecutor(),
+                ignored -> Optional.empty(),
+                rg -> false,
+                createNodeManager(),
+                createClusterResourceChecker(),
+                QueryPacingContext.NOOP);
+        root.setSoftMemoryLimit(new DataSize(1, MEGABYTE));
+        root.setMaxQueuedQueries(10);
+        root.setHardConcurrencyLimit(10);
+
+        InternalResourceGroup groupA = root.getOrCreateSubGroup("A", true);
+        groupA.setSoftMemoryLimit(new DataSize(1, MEGABYTE));
+        groupA.setMaxQueuedQueries(10);
+        groupA.setHardConcurrencyLimit(10);
+
+        InternalResourceGroup groupG = groupA.getOrCreateSubGroup("G", true);
+        groupG.setSoftMemoryLimit(new DataSize(1, MEGABYTE));
+        groupG.setMaxQueuedQueries(10);
+        groupG.setHardConcurrencyLimit(10);
+
+        // Set task limit exceeded
+        root.setTaskLimitExceeded(true);
+
+        // Submit a query to leaf group G - it should be queued
+        MockManagedQueryExecution query1 = new MockManagedQueryExecution(0);
+        query1.startWaitingForPrerequisites();
+        groupG.run(query1);
+
+        assertEquals(query1.getState(), QUEUED);
+        assertEquals(groupG.getQueuedQueries(), 1);
+        assertEquals(groupG.getRunningQueries(), 0);
+
+        // Clear task limit and process queued queries
+        root.setTaskLimitExceeded(false);
+        root.processQueuedQueries();
+
+        // Query should now be running
+        assertEquals(query1.getState(), RUNNING);
+        assertEquals(groupG.getQueuedQueries(), 0);
+        assertEquals(groupG.getRunningQueries(), 1);
+    }
+
+    // Tests that when task limit is exceeded, queries already running continue, but new ones are queued
+    @Test(timeOut = 10_000)
+    public void testTaskLimitExceededDoesNotAffectRunningQueries()
+    {
+        RootInternalResourceGroup root = new RootInternalResourceGroup(
+                "root",
+                (group, export) -> {},
+                directExecutor(),
+                ignored -> Optional.empty(),
+                rg -> false,
+                createNodeManager(),
+                createClusterResourceChecker(),
+                QueryPacingContext.NOOP);
+        root.setSoftMemoryLimit(new DataSize(1, MEGABYTE));
+        root.setMaxQueuedQueries(10);
+        root.setHardConcurrencyLimit(10);
+
+        // Submit a query before task limit is exceeded - it should run
+        MockManagedQueryExecution query1 = new MockManagedQueryExecution(0);
+        query1.startWaitingForPrerequisites();
+        root.run(query1);
+        assertEquals(query1.getState(), RUNNING);
+
+        // Now set task limit exceeded
+        root.setTaskLimitExceeded(true);
+
+        // Submit another query - it should be queued
+        MockManagedQueryExecution query2 = new MockManagedQueryExecution(0);
+        query2.startWaitingForPrerequisites();
+        root.run(query2);
+        assertEquals(query2.getState(), QUEUED);
+
+        // The first query should still be running
+        assertEquals(query1.getState(), RUNNING);
+        assertEquals(root.getRunningQueries(), 1);
+        assertEquals(root.getQueuedQueries(), 1);
+    }
+
+    // Tests that task limit transitions work correctly with multiple cycles
+    @Test(timeOut = 10_000)
+    public void testTaskLimitExceededMultipleCycles()
+    {
+        RootInternalResourceGroup root = new RootInternalResourceGroup(
+                "root",
+                (group, export) -> {},
+                directExecutor(),
+                ignored -> Optional.empty(),
+                rg -> false,
+                createNodeManager(),
+                createClusterResourceChecker(),
+                QueryPacingContext.NOOP);
+        root.setSoftMemoryLimit(new DataSize(1, MEGABYTE));
+        root.setMaxQueuedQueries(10);
+        root.setHardConcurrencyLimit(10);
+
+        // Cycle 1: Task limit exceeded, query queued
+        root.setTaskLimitExceeded(true);
+        MockManagedQueryExecution query1 = new MockManagedQueryExecution(0);
+        query1.startWaitingForPrerequisites();
+        root.run(query1);
+        assertEquals(query1.getState(), QUEUED);
+
+        // Clear task limit, query starts
+        root.setTaskLimitExceeded(false);
+        root.processQueuedQueries();
+        assertEquals(query1.getState(), RUNNING);
+
+        // Cycle 2: Task limit exceeded again, new query queued
+        root.setTaskLimitExceeded(true);
+        MockManagedQueryExecution query2 = new MockManagedQueryExecution(0);
+        query2.startWaitingForPrerequisites();
+        root.run(query2);
+        assertEquals(query2.getState(), QUEUED);
+        assertEquals(query1.getState(), RUNNING); // query1 still running
+
+        // Complete query1, processQueuedQueries should not start query2 (task limit still exceeded)
+        query1.complete();
+        root.processQueuedQueries();
+        assertEquals(query2.getState(), QUEUED); // Still queued because task limit exceeded
+
+        // Clear task limit, query2 starts
+        root.setTaskLimitExceeded(false);
+        root.processQueuedQueries();
+        assertEquals(query2.getState(), RUNNING);
+    }
 }
```

presto-tests/src/test/java/com/facebook/presto/tests/TestQueryTaskLimit.java

Lines changed: 1 addition & 1 deletion
```diff
@@ -94,7 +94,7 @@ public void testQueuingWhenTaskLimitExceeds()
     {
         ImmutableMap<String, String> extraProperties = ImmutableMap.<String, String>builder()
                 .put("experimental.spill-enabled", "false")
-                .put("experimental.max-total-running-task-count-to-not-execute-new-query", "2")
+                .put("max-total-running-task-count-to-not-execute-new-query", "2")
                 .build();
 
         try (DistributedQueryRunner queryRunner = createQueryRunner(defaultSession, extraProperties)) {
```
