Commit 88b6343
fix(scheduler): Coordinator Task Throttling Bug (#27146)
## Description
With coordinator task-based throttling (queueing) enabled, certain
resource groups are never updated to be eligible to run. This occurs
when a resource group is created during a task throttling period:
canRunMore returns false, so the group is never added as an eligible
subgroup on creation. When we exit task throttling, no eligibility
update is triggered. If the group does not have a new query added after
we exit task throttling, its status is never updated.
Changes:
1. Move the isTaskLimitExceeded check from canRunMore to
internalStartNext. canRunMore can then return true during throttling,
allowing the group to be marked as eligible, while internalStartNext
prevents the group from running more queries.
2. Add a check to enqueue immediate execution candidates while task
throttling is active.
3. Remove the experimental prefix from the session property.
4. Add tests to ensure resource groups properly queue/run queries with
task limits (should this be in resourceGroups or testQueryTaskLimit?)
Meta Internal review by: spershin
Meta Internal Differential Revision: D92632990
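
The effect of changes 1 and 2 can be illustrated with a small standalone model. This is a sketch under simplifying assumptions, not Presto's resource group implementation: `ThrottleSketch`, `Group`, `submit`, `startsAfterThrottle`, and the `checkInCanRunMore` flag are hypothetical names. A single `eligible` flag stands in for the eligible-subgroup bookkeeping and is assumed to be recomputed only when a query is added. Only `canRunMore`, `internalStartNext`, and `isTaskLimitExceeded` are names from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: every name except canRunMore, internalStartNext and
// isTaskLimitExceeded is hypothetical and not taken from Presto.
final class ThrottleSketch {
    // Stand-in for the coordinator-wide "too many running tasks" signal.
    static boolean taskLimitExceeded;

    static boolean isTaskLimitExceeded() {
        return taskLimitExceeded;
    }

    static final class Group {
        private final boolean checkInCanRunMore; // true = old placement of the check
        private final Deque<String> queued = new ArrayDeque<>();
        private int running;
        private boolean eligible; // would the parent consider this group for scheduling?

        Group(boolean checkInCanRunMore) {
            this.checkInCanRunMore = checkInCanRunMore;
        }

        boolean canRunMore() {
            // Old placement: throttling makes the group look unable to run at all.
            // Concurrency and memory limits are left out of this sketch.
            return !(checkInCanRunMore && isTaskLimitExceeded());
        }

        void submit(String query) {
            // Change 2: an immediate-execution candidate is queued while throttled.
            if (queued.isEmpty() && canRunMore() && !isTaskLimitExceeded()) {
                running++;
                return;
            }
            queued.add(query);
            // Assumption: eligibility is only recomputed when a query is added.
            eligible = canRunMore();
        }

        boolean internalStartNext() {
            // Change 1: with the new placement, the throttle check lives here.
            if (!checkInCanRunMore && isTaskLimitExceeded()) {
                return false;
            }
            if (!eligible || queued.isEmpty()) {
                return false;
            }
            queued.poll();
            running++;
            eligible = !queued.isEmpty() && canRunMore();
            return true;
        }

        int runningCount() {
            return running;
        }

        int queuedCount() {
            return queued.size();
        }
    }

    // Queue one query while throttled, lift the throttle, then try to start it.
    static boolean startsAfterThrottle(boolean checkInCanRunMore) {
        taskLimitExceeded = true;
        Group group = new Group(checkInCanRunMore);
        group.submit("q1");
        taskLimitExceeded = false;
        return group.internalStartNext();
    }

    public static void main(String[] args) {
        System.out.println("check in canRunMore: started=" + startsAfterThrottle(true));
        System.out.println("check in internalStartNext: started=" + startsAfterThrottle(false));
    }
}
```

In this model the old placement leaves the queued query stuck after throttling ends, because the group was never marked eligible. The new placement marks the group eligible up front and defers the throttle decision to the moment a query would start.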
## Motivation and Context
Coordinator memory is being overloaded by queries with large task
counts. There need to be safeguards against this beyond resource groups
(RGs) alone. The existing coordinator task throttling property has some
issues, which this PR fixes.
## Impact
Coordinator task throttling no longer causes stuck resource groups.
Config renamed from
experimental.max-total-running-task-count-to-not-execute-new-query to
max-total-running-task-count-to-not-execute-new-query. The old name is
kept as a legacy config for backwards compatibility.
Coordinator task throttling, when used in conjunction with query pacing,
should keep the number of tasks on the cluster close to the limit.
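
As a rough illustration of what backwards compatibility means for the renamed key, the sketch below resolves the limit from a `java.util.Properties` object, reading the new name first and falling back to the legacy name. `LegacyKeySketch` and `resolveTaskLimit` are hypothetical, the value `5000` is purely illustrative, and preferring the new key when both are set is an assumption. Presto's own configuration binding is not shown here and may treat that case differently.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper; it only models the lookup order for the renamed key.
final class LegacyKeySketch {
    static final String KEY = "max-total-running-task-count-to-not-execute-new-query";
    static final String LEGACY_KEY = "experimental." + KEY;

    // Reads the new key first and falls back to the legacy key (assumed order).
    static Optional<Integer> resolveTaskLimit(Properties props) {
        String value = props.getProperty(KEY);
        if (value == null) {
            value = props.getProperty(LEGACY_KEY);
        }
        return Optional.ofNullable(value).map(String::trim).map(Integer::valueOf);
    }

    public static void main(String[] args) {
        Properties legacyOnly = new Properties();
        legacyOnly.setProperty(LEGACY_KEY, "5000"); // illustrative value
        System.out.println(resolveTaskLimit(legacyOnly)); // prints Optional[5000]
    }
}
```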
## Test Plan
Bug Reproduction
Set task limit to 1 on a test cluster.
Trigger multiple queries that peak at 10-30 tasks and have execution
times of 10-30 seconds.
<img width="2732" height="1482" alt="image"
src="https://github.com/user-attachments/assets/3bef60b3-ee39-4190-8e0f-b972736876af"
/>
Repro with a larger query suite:
<img width="2594" height="1254" alt="image"
src="https://github.com/user-attachments/assets/dd26ed05-4c33-4c46-ae70-fed41098c389"
/>
Test:
Build and push again to the test cluster, then rerun the previous repro.
Throttling seems to kick in as expected: the cluster submits a lot of
queries as running before TaskLimitExceeded fires, after which it seems
to run 1-3 queries at a time for the remainder of the queue. However,
the cluster still appeared to admit queries slowly even while in a task
throttling state.
<img width="870" height="686" alt="image"
src="https://github.com/user-attachments/assets/04c14abd-8c74-4c90-9625-2b2119ee2fc6"
/>
Following the previous fix, it was noticed that internalStartNext only
held back queued queries and did not prevent immediate executions. This
was then patched to block immediate executions during task throttling
periods, so that no new queries start while in a task throttling state.
Test with second fix
<img width="2438" height="1150" alt="image"
src="https://github.com/user-attachments/assets/3ed7d10f-2d5f-47f3-9969-e343b467ef5f"
/>
The spikes that remain with this fix occur because, without pacing,
multiple queries can be admitted before the cluster re-enters the task
throttling state. Query pacing should mitigate this.
## Contributor checklist
- [ ] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
## Summary by Sourcery
Fix coordinator task-based throttling so resource groups correctly queue
and start queries when task limits are exceeded and later cleared.
Bug Fixes:
- Ensure resource groups remain eligible and properly queue queries
instead of silently starving when the coordinator task limit is
exceeded.
- Prevent new queries from starting immediately when the coordinator is
overloaded while still allowing existing running queries to continue.
Enhancements:
- Refine admission control in resource groups to consider coordinator
overload separately from eligibility and concurrency checks.
- Promote the task-limit-based throttling session property from
experimental by renaming its configuration key.
Tests:
- Add unit tests covering query queuing and execution across task-limit
transitions, including subgroup hierarchies and multiple throttle
cycles.
- Update configuration and task-limit integration tests to use the
non-experimental task throttling property.
## Release Notes
Please follow [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines)
and fill in the release notes below.
```
== RELEASE NOTES ==
General Changes
* Fix a bug where queries could get permanently stuck in resource groups when coordinator task-based throttling (``experimental.max-total-running-task-count-to-not-execute-new-query``) is enabled.
* Replace ``experimental.max-total-running-task-count-to-not-execute-new-query`` with ``max-total-running-task-count-to-not-execute-new-query``. The old name remains supported for backwards compatibility.
```
1 parent 8236897 commit 88b6343
File tree
6 files changed, +265 -8 lines changed
- presto-docs/src/main/sphinx/admin
- presto-main-base/src
  - main/java/com/facebook/presto/execution
    - resourceGroups
  - test/java/com/facebook/presto/execution
    - resourceGroups
- presto-tests/src/test/java/com/facebook/presto/tests

Lines changed per file, in diff order: +34 -0, +2 -1, +16 -5, +1 -1, +211 -0, +1 -1