Skip to content

Conversation

@ahmed-mez
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This is another attempt to solve the long polls observed and explained in #18907. Mostly inspired by @alamb's comment #18906 (comment) on the first attempt PR.

What changes are included in this PR?

This PR is an exploratory implementation to gauge the scope and complexity of supporting incremental group emission in GroupedHashAggregateStream. The goal is to understand how large this change would be and identify a path to ship it in smaller, reviewable increments.

Are these changes tested?

  • Test coverage from preexisting aggregate tests
  • Benchmark incremental_emit included to prove the benefits of these changes

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions functions Changes to functions implementation ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate labels Dec 30, 2025
@ahmed-mez ahmed-mez force-pushed the ahmed/long-poll-take-2 branch 2 times, most recently from 9ebbf95 to eacfedb Compare December 30, 2025 13:20
@alamb
Copy link
Contributor

alamb commented Jan 5, 2026

run benchmarks

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing ahmed/long-poll-take-2 (224ce3a) to a51e3a0 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@ahmed-mez
Copy link
Contributor Author

Thank you @alamb for triggering the benchmarks. The results haven't been posted by the bot yet, I wonder if it's expected.

@alamb-ghbot
Copy link

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing ahmed/long-poll-take-2 (224ce3a) to a51e3a0 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jan 6, 2026

Thank you @alamb for triggering the benchmarks. The results haven't been posted by the bot yet, I wonder if it's expected.

No I am not sure what is going on -- my runner script died. I just restarted it and hopefully we'll get some info

@alamb
Copy link
Contributor

alamb commented Jan 6, 2026

The reason the benchmarks are failing I think is that that they have basically slowed down to a crawl.

SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 32 iteration 0 took 147686.4 ms and returned 10 rows

I killed it after this

Perhaps you can run the benchmarks locally

benchmarks/bench.sh data clickbench_partitioned
benchmarks/bench.sh run clickbench_partitioned

@ahmed-mez
Copy link
Contributor Author

The reason the benchmarks are failing I think is that that they have basically slowed down to a crawl.

SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 32 iteration 0 took 147686.4 ms and returned 10 rows

I killed it after this

Perhaps you can run the benchmarks locally

benchmarks/bench.sh data clickbench_partitioned
benchmarks/bench.sh run clickbench_partitioned

Oh that's interesting! That benchmark query runs much more faster locally (same branch and commit)

SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10;

Query 32 iteration 0 took 11717.4 ms and returned 10 rows
Query 32 iteration 1 took 11660.8 ms and returned 10 rows
Query 32 iteration 2 took 11855.8 ms and returned 10 rows
Query 32 iteration 3 took 12496.2 ms and returned 10 rows
Query 32 iteration 4 took 12555.9 ms and returned 10 rows
Query 32 avg time: 12057.21 ms

I'll keep digging and try to compare to base locally.
Apart from that, is the failed benchmark CI job accessible somehow for further debugging?

@ahmed-mez
Copy link
Contributor Author

This is the comparison output and it's not great 😞 At first glance mainly topk queries are impacted by the perf regression

Comparing main and ahmed_long-poll-take-2
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ ahmed_long-poll-take-2 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    0.87 ms │                0.82 ms │ +1.06x faster │
│ QQuery 1     │    8.58 ms │                8.22 ms │     no change │
│ QQuery 2     │   29.57 ms │               28.78 ms │     no change │
│ QQuery 3     │   27.93 ms │               24.25 ms │ +1.15x faster │
│ QQuery 4     │  185.18 ms │              289.26 ms │  1.56x slower │
│ QQuery 5     │  253.48 ms │             2514.65 ms │  9.92x slower │
│ QQuery 6     │    0.90 ms │                0.85 ms │ +1.06x faster │
│ QQuery 7     │    8.87 ms │                9.46 ms │  1.07x slower │
│ QQuery 8     │  215.91 ms │              535.34 ms │  2.48x slower │
│ QQuery 9     │  352.77 ms │              351.77 ms │     no change │
│ QQuery 10    │   59.07 ms │               66.03 ms │  1.12x slower │
│ QQuery 11    │   70.65 ms │               74.87 ms │  1.06x slower │
│ QQuery 12    │  246.81 ms │             2539.77 ms │ 10.29x slower │
│ QQuery 13    │  335.89 ms │              955.45 ms │  2.84x slower │
│ QQuery 14    │  221.80 ms │              332.14 ms │  1.50x slower │
│ QQuery 15    │  207.82 ms │              454.78 ms │  2.19x slower │
│ QQuery 16    │  455.08 ms │             2316.21 ms │  5.09x slower │
│ QQuery 17    │  441.56 ms │             1630.44 ms │  3.69x slower │
│ QQuery 18    │  898.82 ms │             9670.17 ms │ 10.76x slower │
│ QQuery 19    │   20.34 ms │               21.70 ms │  1.07x slower │
│ QQuery 20    │  470.21 ms │              480.47 ms │     no change │
│ QQuery 21    │  507.89 ms │              523.32 ms │     no change │
│ QQuery 22    │  827.85 ms │              813.45 ms │     no change │
│ QQuery 23    │ 2344.75 ms │             2317.04 ms │     no change │
│ QQuery 24    │   52.42 ms │               52.83 ms │     no change │
│ QQuery 25    │  113.44 ms │              109.39 ms │     no change │
│ QQuery 26    │   53.71 ms │               53.89 ms │     no change │
│ QQuery 27    │  581.05 ms │              572.82 ms │     no change │
│ QQuery 28    │ 5174.20 ms │             5402.70 ms │     no change │
│ QQuery 29    │  221.13 ms │              221.62 ms │     no change │
│ QQuery 30    │  216.52 ms │              293.76 ms │  1.36x slower │
│ QQuery 31    │  228.53 ms │              386.75 ms │  1.69x slower │
│ QQuery 32    │ 1214.74 ms │            11660.79 ms │  9.60x slower │
│ QQuery 33    │ 1108.52 ms │                   FAIL │  incomparable │
│ QQuery 34    │ 1164.78 ms │            29807.41 ms │ 25.59x slower │
│ QQuery 35    │  294.71 ms │              837.63 ms │  2.84x slower │
│ QQuery 36    │   31.74 ms │               41.91 ms │  1.32x slower │
│ QQuery 37    │   21.16 ms │               24.12 ms │  1.14x slower │
│ QQuery 38    │   32.40 ms │               33.90 ms │     no change │
│ QQuery 39    │   52.27 ms │               51.59 ms │     no change │
│ QQuery 40    │    9.27 ms │                8.25 ms │ +1.12x faster │
│ QQuery 41    │    8.02 ms │                7.70 ms │     no change │
│ QQuery 42    │    6.86 ms │                6.72 ms │     no change │
└──────────────┴────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                     │ 17669.59ms │
│ Total Time (ahmed_long-poll-take-2)   │ 75533.04ms │
│ Average Time (main)                   │   420.70ms │
│ Average Time (ahmed_long-poll-take-2) │  1798.41ms │
│ Queries Faster                        │          4 │
│ Queries Slower                        │         21 │
│ Queries with No Change                │         17 │
│ Queries with Failure                  │          1 │
└───────────────────────────────────────┴────────────┘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ffi Changes to the ffi crate functions Changes to functions implementation logical-expr Logical plan and expressions physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Long poll (async runtime stall) in HashAggregate when emitting large number of groups

3 participants