Skip to content

Performance: Severe Lock Contention (futex) on TPC-H Q21 compared to DuckDB #26846

@fl0jc

Description

@fl0jc

Summary

While benchmarking TPC-H (Scale Factor 1.0) on a 16-core Intel Xeon (2 NUMA nodes), I noticed a significant performance bottleneck specifically on Query 21.

Using strace -c and /usr/bin/time -v (averaged over 100 iterations), I profiled both Polars and DuckDB. The profiling shows that Polars is suffering from severe lock contention during this query, spending ~96% of its syscall time in futex, whereas DuckDB handles the concurrency much more efficiently (~49% futex time) and executes almost 2x faster.

It seems the current execution plan or thread pool distribution for this specific heavy join topology (Q21) causes threads to clash and wait on mutexes extensively. I thought this low-level trace might be helpful for the team to pinpoint scheduling/hash-join bottlenecks!

Environment

  • OS: Linux
  • CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (16 physical cores, 2 NUMA nodes)
  • RAM: 251 GiB
  • Polars API: Python

Reproducible Setup

I am reading from local Parquet files (TPC-H SF 1.0). The benchmark was run using standard python modules for both engines with 100 iterations to stabilize the measure:

# Polars setup
export POLARS_MAX_THREADS=16
/usr/bin/time -v python -m queries.polars.q21 --dataset_base_dir="data/tables/scale-1.0" --iterations 100
strace -c python -m queries.polars.q21 --dataset_base_dir="data/tables/scale-1.0"

# DuckDB setup
/usr/bin/time -v python -m queries.duckdb.q21 --dataset_base_dir="data/tables/scale-1.0" --iterations 100
strace -c python -m queries.duckdb.q21 --dataset_base_dir="data/tables/scale-1.0"

Profiling Evidence

1. strace Output (The futex bottleneck)

Polars (96.40% time spent in futex):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 96.40    0.577537        6346        91         3 futex
  1.37    0.008198           4      1911       234 stat
  0.91    0.005480           8       643        93 openat
  0.37    0.002237           2       927           read
...
------ ----------- ----------- --------- --------- ------------------
100.00    0.599103          78      7664       774 total

DuckDB (Control - 49.39% time spent in futex):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 49.39    0.054039         153       352        91 futex
 23.04    0.025211          20      1234           madvise
 11.79    0.012903         208        62           pread64
...
------ ----------- ----------- --------- --------- ----------------
100.00    0.109420          15      6965       594 total
2. Context Switches & Execution Time (/usr/bin/time)

Polars (High Context Switches):

Code block 'Run polars query 21' took: 0.36937 s
        Voluntary context switches: 4946
        Involuntary context switches: 2749

DuckDB (Low Context Switches):

Code block 'Run duckdb query 21' took: 0.20189 s
        Voluntary context switches: 2588
        Involuntary context switches: 160

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance issues or improvements

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions