Commit 07e63ed
authored
Fix TopK aggregation for UTF-8/Utf8View group keys and add safe fallback for unsupported string aggregates (#19285)
## Which issue does this PR close?
* Closes #19219.
## Rationale for this change
A `GROUP BY ... ORDER BY <aggregate> ... LIMIT` query can trigger
DataFusion’s TopK aggregation optimization. In affected releases,
queries grouping by text columns—especially `Utf8View` produced via SQL
`varchar` mappings / `arrow_cast`—could fail at execution time with an
error such as `Can't group type: Utf8View`.
This happens because the optimizer may select the TopK aggregation path
even when the underlying TopK data structures (heap/hash table) do not
fully support the specific key/value Arrow types involved. Disabling
`datafusion.optimizer.enable_topk_aggregation` is a workaround, but it
forces users to trade correctness for performance.
This PR makes TopK type support explicit and consistent across the
optimizer and execution, adds support for UTF-8 string value heaps, and
ensures unsupported key/value combinations fall back to the standard
aggregation implementation rather than panicking.
## What changes are included in this PR?
* **Centralized TopK type validation**
* Introduced `topk_types_supported(key_type, value_type)` (in
`physical-plan/src/aggregates/mod.rs`) to validate both grouping key and
min/max value types.
* Optimizer now uses this shared check rather than duplicating partial
type logic.
* **Safer AggregateExec cloning for limit pushdown**
* Added `AggregateExec::with_new_limit` to clone an aggregate exec while
overriding only the TopK `limit` hint, avoiding manual reconstruction
and ensuring plan properties/fields remain consistent.
* **TopK hash table improvements + helper functions**
* Added `is_supported_hash_key_type` helper for grouping key
compatibility checks.
* Refactored string key extraction to a single helper function.
* Added `find_or_insert` entry API to avoid double lookups and unify
insertion behavior.
* **TopK heap support for string aggregate values**
* Added `StringHeap` implementation supporting `Utf8`, `LargeUtf8`, and
`Utf8View` aggregate values using lexicographic ordering.
* Added `is_supported_heap_type` helper for aggregate value
compatibility.
* Updated `new_heap` to create `StringHeap` for supported string types
and return a clearer error message for unsupported types.
* **Debug contract in TopK stream**
* Added a debug assertion in `GroupedTopKAggregateStream` documenting
that type validation should have already happened (optimizer +
can_use_topk), without affecting release builds.
## Are these changes tested?
Yes.
* Added a new physical optimizer test covering UTF-8 grouping with:
1. **Supported** numeric `max/min` value (TopK should be used and
results correct)
2. **Unsupported** string `max/min` value (must fall back to standard
aggregation and not use `GroupedTopKAggregateStream`)
* Added unit tests in `PriorityMap` to validate lexicographic `min/max`
tracking for:
* `Utf8`
* `LargeUtf8`
* `Utf8View`
* Added SQLLogicTest coverage (`aggregates_topk.slt`) for:
* `varchar` tables
* `Utf8View` via `arrow_cast`
* `EXPLAIN` verification that TopK limit propagation is applied and
plans remain stable
* Regression case for `max(trace_id)` with `ORDER BY ... LIMIT`
## Are there any user-facing changes?
Yes (bug fix).
* Queries that group by text columns (including `Utf8View`) and use
`ORDER BY <aggregate> ... LIMIT` should no longer error.
* TopK aggregation now supports UTF-8 string aggregate values for
min/max (lexicographic ordering) where applicable.
* For unsupported type combinations, DataFusion will fall back
gracefully to the standard aggregation path instead of panicking.
No breaking public API changes are intended. The only new public helper
APIs are internal to the physical plan modules.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.1 parent b7091c0 commit 07e63ed
File tree
9 files changed
+698
-106
lines changed- datafusion
- core
- benches
- tests/physical_optimizer
- physical-optimizer/src
- physical-plan/src/aggregates
- topk
- sqllogictest/test_files
9 files changed
+698
-106
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
28 | 28 | | |
29 | 29 | | |
30 | 30 | | |
| 31 | + | |
| 32 | + | |
31 | 33 | | |
32 | 34 | | |
33 | 35 | | |
| |||
52 | 54 | | |
53 | 55 | | |
54 | 56 | | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
55 | 62 | | |
56 | 63 | | |
57 | 64 | | |
| |||
72 | 79 | | |
73 | 80 | | |
74 | 81 | | |
75 | | - | |
| 82 | + | |
76 | 83 | | |
77 | 84 | | |
78 | 85 | | |
| |||
99 | 106 | | |
100 | 107 | | |
101 | 108 | | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
102 | 136 | | |
103 | 137 | | |
104 | | - | |
| 138 | + | |
105 | 139 | | |
106 | 140 | | |
107 | 141 | | |
| |||
170 | 204 | | |
171 | 205 | | |
172 | 206 | | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
| 226 | + | |
| 227 | + | |
| 228 | + | |
| 229 | + | |
| 230 | + | |
| 231 | + | |
| 232 | + | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
173 | 256 | | |
174 | 257 | | |
175 | 258 | | |
| |||
Lines changed: 86 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
| 23 | + | |
23 | 24 | | |
24 | 25 | | |
| 26 | + | |
25 | 27 | | |
26 | 28 | | |
| 29 | + | |
27 | 30 | | |
| 31 | + | |
28 | 32 | | |
29 | 33 | | |
30 | 34 | | |
| |||
38 | 42 | | |
39 | 43 | | |
40 | 44 | | |
| 45 | + | |
41 | 46 | | |
42 | 47 | | |
43 | 48 | | |
| |||
316 | 321 | | |
317 | 322 | | |
318 | 323 | | |
| 324 | + | |
| 325 | + | |
| 326 | + | |
| 327 | + | |
| 328 | + | |
| 329 | + | |
| 330 | + | |
| 331 | + | |
| 332 | + | |
| 333 | + | |
| 334 | + | |
| 335 | + | |
| 336 | + | |
| 337 | + | |
| 338 | + | |
| 339 | + | |
| 340 | + | |
| 341 | + | |
| 342 | + | |
| 343 | + | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
| 347 | + | |
| 348 | + | |
| 349 | + | |
| 350 | + | |
| 351 | + | |
| 352 | + | |
| 353 | + | |
| 354 | + | |
| 355 | + | |
| 356 | + | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
| 361 | + | |
| 362 | + | |
| 363 | + | |
| 364 | + | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
| 372 | + | |
| 373 | + | |
| 374 | + | |
| 375 | + | |
| 376 | + | |
| 377 | + | |
| 378 | + | |
| 379 | + | |
| 380 | + | |
| 381 | + | |
| 382 | + | |
| 383 | + | |
| 384 | + | |
| 385 | + | |
| 386 | + | |
| 387 | + | |
| 388 | + | |
| 389 | + | |
| 390 | + | |
| 391 | + | |
| 392 | + | |
| 393 | + | |
| 394 | + | |
| 395 | + | |
| 396 | + | |
| 397 | + | |
| 398 | + | |
| 399 | + | |
| 400 | + | |
| 401 | + | |
| 402 | + | |
| 403 | + | |
| 404 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
23 | | - | |
24 | 23 | | |
25 | 24 | | |
26 | 25 | | |
27 | 26 | | |
28 | 27 | | |
29 | | - | |
| 28 | + | |
30 | 29 | | |
31 | 30 | | |
32 | 31 | | |
| |||
55 | 54 | | |
56 | 55 | | |
57 | 56 | | |
58 | | - | |
59 | | - | |
60 | | - | |
61 | | - | |
62 | | - | |
| 57 | + | |
| 58 | + | |
63 | 59 | | |
64 | 60 | | |
65 | 61 | | |
| |||
72 | 68 | | |
73 | 69 | | |
74 | 70 | | |
75 | | - | |
76 | | - | |
77 | | - | |
78 | | - | |
79 | | - | |
80 | | - | |
81 | | - | |
82 | | - | |
83 | | - | |
84 | | - | |
| 71 | + | |
85 | 72 | | |
86 | 73 | | |
87 | 74 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
41 | 41 | | |
42 | 42 | | |
43 | 43 | | |
44 | | - | |
| 44 | + | |
45 | 45 | | |
46 | 46 | | |
47 | 47 | | |
| |||
64 | 64 | | |
65 | 65 | | |
66 | 66 | | |
| 67 | + | |
| 68 | + | |
67 | 69 | | |
68 | 70 | | |
69 | 71 | | |
| |||
72 | 74 | | |
73 | 75 | | |
74 | 76 | | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
75 | 88 | | |
76 | 89 | | |
77 | 90 | | |
| |||
553 | 566 | | |
554 | 567 | | |
555 | 568 | | |
| 569 | + | |
| 570 | + | |
| 571 | + | |
| 572 | + | |
| 573 | + | |
| 574 | + | |
| 575 | + | |
| 576 | + | |
| 577 | + | |
| 578 | + | |
| 579 | + | |
| 580 | + | |
| 581 | + | |
| 582 | + | |
| 583 | + | |
| 584 | + | |
| 585 | + | |
| 586 | + | |
| 587 | + | |
| 588 | + | |
556 | 589 | | |
557 | 590 | | |
558 | 591 | | |
| |||
0 commit comments