-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Description
Summary
While benchmarking TPC-H (Scale Factor 1.0) on a 16-core Intel Xeon (2 NUMA nodes), I noticed a significant performance bottleneck specifically on Query 21.
Using strace -c and /usr/bin/time -v (averaged over 100 iterations), I profiled both Polars and DuckDB. The profiling shows that Polars is suffering from severe lock contention during this query, spending ~96% of its syscall time in futex, whereas DuckDB handles the concurrency much more efficiently (~49% futex time) and executes almost 2x faster.
It seems the current execution plan or thread pool distribution for this specific heavy join topology (Q21) causes threads to clash and wait on mutexes extensively. I thought this low-level trace might be helpful for the team to pinpoint scheduling/hash-join bottlenecks!
Environment
- OS: Linux
- CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (16 physical cores, 2 NUMA nodes)
- RAM: 251 GiB
- Polars API: Python
Reproducible Setup
I am reading from local Parquet files (TPC-H SF 1.0). The benchmark was run using standard python modules for both engines with 100 iterations to stabilize the measure:
# Polars setup
export POLARS_MAX_THREADS=16
/usr/bin/time -v python -m queries.polars.q21 --dataset_base_dir="data/tables/scale-1.0" --iterations 100
strace -c python -m queries.polars.q21 --dataset_base_dir="data/tables/scale-1.0"
# DuckDB setup
/usr/bin/time -v python -m queries.duckdb.q21 --dataset_base_dir="data/tables/scale-1.0" --iterations 100
strace -c python -m queries.duckdb.q21 --dataset_base_dir="data/tables/scale-1.0"Profiling Evidence
1. strace Output (The futex bottleneck)
Polars (96.40% time spent in futex):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
96.40 0.577537 6346 91 3 futex
1.37 0.008198 4 1911 234 stat
0.91 0.005480 8 643 93 openat
0.37 0.002237 2 927 read
...
------ ----------- ----------- --------- --------- ------------------
100.00 0.599103 78 7664 774 total
DuckDB (Control - 49.39% time spent in futex):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
49.39 0.054039 153 352 91 futex
23.04 0.025211 20 1234 madvise
11.79 0.012903 208 62 pread64
...
------ ----------- ----------- --------- --------- ----------------
100.00 0.109420 15 6965 594 total
2. Context Switches & Execution Time (/usr/bin/time)
Polars (High Context Switches):
Code block 'Run polars query 21' took: 0.36937 s
Voluntary context switches: 4946
Involuntary context switches: 2749
DuckDB (Low Context Switches):
Code block 'Run duckdb query 21' took: 0.20189 s
Voluntary context switches: 2588
Involuntary context switches: 160