TPCH queries hang in benchmarks #228

@gabotechs

While playing with "Add AWS CDK-based benchmarking environment", I'm seeing that queries often hang forever while running the benchmarks, printing "Query still running..." in a loop with nothing happening.

I wonder if there is an issue that the current tests don't capture because they read local files rather than S3 ones.

I attempted printing the StageKeys in the ttl_map that stores the state of each query to see if there's something wrong there, and this is what I saw:

output log
[2025-11-18T12:40:46Z INFO  worker] Executing query...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:40:51Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:40:56Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:01Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:06Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:11Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:16Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:21Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:26Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:31Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:36Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:41Z INFO  worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:46Z INFO  worker] Query still running...
[]
[2025-11-18T12:41:51Z INFO  worker] Query still running...
[]
[2025-11-18T12:41:56Z INFO  worker] Query still running...
[]
[2025-11-18T12:42:01Z INFO  worker] Query still running...
[]
[2025-11-18T12:42:06Z INFO  worker] Query still running...
[]
[2025-11-18T12:42:11Z INFO  worker] Query still running...
...
...
...
It goes on like this forever.
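
For anyone who wants to collect the same kind of output, here is a minimal sketch of how the key dump could be done. It is not the project's actual code: `StageRegistry` is a hypothetical stand-in for the real `ttl_map` (a plain mutex-guarded `HashMap` with no TTL logic), and the `StageKey` fields are only guessed from the log above, with `query_id` as a `Vec<u8>`, so its `Debug` output will look slightly different.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the StageKey seen in the log; field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct StageKey {
    query_id: Vec<u8>,
    stage_id: u64,
    task_number: u64,
}

/// Hypothetical stand-in for the TTL map: a mutex-guarded HashMap without expiry.
struct StageRegistry {
    entries: Mutex<HashMap<StageKey, ()>>,
}

impl StageRegistry {
    fn new() -> Self {
        StageRegistry {
            entries: Mutex::new(HashMap::new()),
        }
    }

    fn insert(&self, key: StageKey) {
        self.entries.lock().unwrap().insert(key, ());
    }

    fn remove(&self, key: &StageKey) {
        self.entries.lock().unwrap().remove(key);
    }

    /// Copies the keys out while holding the lock, then releases it before the
    /// caller formats or logs them, so the debug print holds the lock only briefly.
    fn snapshot_keys(&self) -> Vec<StageKey> {
        let mut keys: Vec<StageKey> = self.entries.lock().unwrap().keys().cloned().collect();
        // Sort so that consecutive dumps are easy to compare by eye.
        keys.sort();
        keys
    }
}
```

Calling `println!("{:?}", registry.snapshot_keys())` next to each "Query still running..." message would print a list in the same shape as above, including the empty `[]` once all entries are gone.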

It looks like there's a deadlock somewhere in the code that completely stalls the query.

It seems to be triggered by some queries in particular, for example TPCH query 7. It can be reproduced ~50% of the time by running:

npm run datafusion-bench  -- --sf 10 --files-per-task 4 --query 7

with the remote benchmarks.
