Support `--threads` and `--workers` on TPCH benchmarks #130

gabotechs · 2025-09-08T13:27:28Z

Before, the distributed benchmarks were spawning a localhost worker in the same process as the benchmark itself, without applying any constraint to the resources used by either the localhost worker or the overall process.

This PR ships the ability to run workers as different processes with constrained resources and benchmark against them.

From the README.md:

Running TPCH benchmarks in distributed mode

Running the benchmarks in distributed mode implies:

running 1 or more workers in separate terminals
running the benchmarks in an additional terminal

The workers can be spawned by passing the --spawn <port> flag, for example, for spawning 3 workers:

cargo run -p datafusion-distributed-benchmarks --release -- tpch --spawn 8000

cargo run -p datafusion-distributed-benchmarks --release -- tpch --spawn 8001

cargo run -p datafusion-distributed-benchmarks --release -- tpch --spawn 8002

With the three workers running in separate terminals, the TPCH benchmarks can be run in distributed mode with:

cargo run -p datafusion-distributed-benchmarks --release -- tpch --workers 8000,8001,8002

A good way of measuring the impact of distribution is to limit the physical threads each worker can use. For example,
it's expected that running 8 workers with 2 physical threads each (8 * 2 = 16 total) is faster than running in
single-node mode with just 2 threads (1 * 2 = 2 total).

cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8000 & 
cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8001 & 
cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8002 & 
cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8003 & 
cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8004 & 
cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8005 & 
cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8006 & 
cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --spawn 8007 &

cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads 2 --workers 8000,8001,8002,8003,8004,8005,8006,8007

The run.sh script already does this for you in a more ergonomic way:

WORKERS=8 ./run.sh -m --threads 2
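
For illustration, the manual sequence above boils down to deriving a port list from a worker count and reusing it for both `--spawn` and `--workers`. The following is a minimal dry-run sketch, not the contents of run.sh (which is not shown in this PR description). The helper name `worker_ports`, the `THREADS` variable, and the base port 8000 are assumptions modeled on the examples above, and the commands are only printed, never executed.

```shell
# Hypothetical helper: print COUNT consecutive ports starting at BASE,
# joined with commas (the format --workers expects in the examples above).
worker_ports() {
  count=$1
  base=${2:-8000}
  ports=""
  i=0
  while [ "$i" -lt "$count" ]; do
    # Prefix a comma only when the list already has an entry.
    ports="${ports:+$ports,}$((base + i))"
    i=$((i + 1))
  done
  echo "$ports"
}

# Assumed defaults, matching the 8 x 2 example.
WORKERS=${WORKERS:-8}
THREADS=${THREADS:-2}
BENCH="cargo run -p datafusion-distributed-benchmarks --release -- tpch -m --threads $THREADS"

# Dry run: print one spawn command per worker, then the benchmark command.
for port in $(worker_ports "$WORKERS" | tr ',' ' '); do
  echo "$BENCH --spawn $port &"
done
echo "$BENCH --workers $(worker_ports "$WORKERS")"
```

Deriving both the spawn ports and the `--workers` value from the same function keeps the two lists from drifting apart when the worker count changes.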

NGA-TRAN

Great additional setting options. I am thinking about these as the follow-up PRs:

Add a (unit/integration) test to verify the number of threads
🤔 : I know DF repartitions data per number of cores. Will the number of threads here represent that number of cores and automatically applied?
🤔 : Should we try to test number of partitions > number of threads and observe functional and performance behavior?

gabotechs · 2025-09-10T08:58:06Z

Add a (unit/integration) test to verify the number of threads

Note that this is just for benchmarks, which are a kind of test in themselves. Users are not really able to use this benchmarking functionality in their programs.

I know DF repartitions data per number of cores. Will the number of threads here represent that number of cores and automatically applied?

I don't think so. DataFusion will still see the original number of cores in the machine, and it will repartition accordingly. The benchmarks allow passing a --partitions flag (inherited from upstream DataFusion) that we can use for specifying the number of partitions manually.

🤔 Now that you mention it, if --partitions is not provided, but --threads is provided, we should probably default the partition count to the --threads variable. I'll make that change.
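
The defaulting rule proposed here can be sketched as a small shell function. This only illustrates the precedence; the actual change lives in the Rust benchmark code, which is not shown in this conversation. The function name `resolve_partitions` and the final fallback to the machine's online CPU count via `getconf` are assumptions.

```shell
# Hypothetical helper illustrating the precedence:
#   explicit --partitions > --threads > cores visible on the machine.
# $1 is the --partitions value (may be empty), $2 the --threads value (may be empty).
resolve_partitions() {
  partitions=$1
  threads=$2
  if [ -n "$partitions" ]; then
    echo "$partitions"
  elif [ -n "$threads" ]; then
    echo "$threads"
  else
    # Assumption: with neither flag set, use the online CPU count.
    getconf _NPROCESSORS_ONLN
  fi
}

resolve_partitions "" 4    # only --threads 4 given: prints 4
resolve_partitions 16 4    # --threads 4 --partitions 16: prints 16
```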

Should we try to test number of partitions > number of threads and observe functional and performance behavior?

Yes, we can do that now! We can do something like --threads 4 and --partitions 16.

NGA-TRAN · 2025-09-10T13:08:00Z

Yes, we can do that now! We can do something like --threads 4 and --partitions 16.

Another question: is it possible to add PARTITIONS to the run.sh script? Something like this:
WORKERS=8 THREADS=2 PARTITIONS=4 ./run.sh

gabotechs · 2025-09-11T05:52:43Z

Another question: is it possible to add PARTITIONS to the run.sh script? Something like this:
WORKERS=8 THREADS=2 PARTITIONS=4 ./run.sh

I added a passthrough of the same arguments the example supports to the script. Now we can do:

WORKERS=8 ./run.sh -m --threads 2 --partitions 4
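
A passthrough like this is typically built on `"$@"`. The sketch below is a guess at the shape, not the script's actual code: the function name `bench_command` is hypothetical, and the command is printed rather than executed.

```shell
# Hypothetical wrapper: the first argument is the worker port list, and
# everything after it is forwarded to the benchmark untouched.
bench_command() {
  workers=$1
  shift
  # "$@" keeps each remaining argument as its own word.
  echo cargo run -p datafusion-distributed-benchmarks --release -- tpch "$@" --workers "$workers"
}

bench_command 8000,8001 -m --threads 2 --partitions 4
```

Because the extra flags are forwarded verbatim, the wrapper does not need to change when the benchmark gains a new option.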


gabotechs added 9 commits September 7, 2025 08:42

Add support for in-memory TPCH tests

b23c0c7

Add --threads and --workers options in tpch benchmarks

aa7fc0d

Remove useless stuff

8a27e9b

Automatically resolve tpch paths

6a17cca

Draft: spawn workers correctly in different tokio runtimes

dab7670

Register tables only in one place

33ba32d

Spawn benchmark workers as a separate command

ded5a8c

Rollback unnecessary changes in localhost.rs

0f83b86

Add run script

76120df

NGA-TRAN approved these changes Sep 9, 2025


gabotechs added 2 commits September 10, 2025 10:58

Default partitions to threads

b974e5d

Pass through arguments

ebb570e

Base automatically changed from gabrielmusat/in-memory-tpch to main September 11, 2025 14:47

Merge branch 'main' into gabrielmusat/tpch-threads-and-workers

68890dd

# Conflicts: # benchmarks/src/tpch/run.rs # benchmarks/src/util/memory.rs

gabotechs merged commit c12c271 into main Sep 11, 2025
3 of 4 checks passed

gabotechs deleted the gabrielmusat/tpch-threads-and-workers branch September 11, 2025 14:55
