Add AWS CDK-based benchmarking environment #227

gabotechs · 2025-11-18T11:34:14Z

Adds AWS CDK code that, given an AWS account, automatically sets up an S3 bucket for storing TPCH parquet datasets and spawns several t3.xlarge EC2 machines running distributed DataFusion for benchmarking purposes.

This creates a reproducible benchmarking environment that anyone with an AWS account can spawn in a few minutes to start issuing queries to a distributed DataFusion cluster. By default, the following infrastructure is automatically spawned:

4 "t3.xlarge" EC2 machines with distributed DataFusion running as a systemd service
a datafusion-distributed-benchmarks S3 bucket that can be populated with TPCH data with just 1 command
Secured networking based on AWS Session Manager that by default does not expose any port to the public, not even SSH

The deployment process is automatic, but several prerequisites need to be met on the local machine performing the deployment.

From the README.md:

Deploy

Prerequisites

cargo-zigbuild needs to be installed on the system for cross-compiling to Linux x86_64, which
is what the benchmarking machines in AWS run on.

cargo install --locked cargo-zigbuild

Make sure to also have the x86_64-unknown-linux-gnu target installed in
your Rust toolchain:

rustup target add x86_64-unknown-linux-gnu

Ensure that you can cross-compile to Linux x86_64 before performing any deployments:

cargo zigbuild -p datafusion-distributed-benchmarks --release --bin worker --target x86_64-unknown-linux-gnu
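
The prerequisites above can be checked before attempting a deployment. The following is a minimal sketch, not part of the repository: the function names `missing_tools` and `has_rust_target` are hypothetical, and the list of tools passed in the usage note is an assumption drawn from the commands used in this README.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helpers, not shipped with the repo).

# Print the name of every argument that cannot be found as a command on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Succeed only when rustup reports the given target as installed.
has_rust_target() {
  rustup target list --installed 2>/dev/null | grep -qx "$1"
}
```

For example, `missing_tools cargo cargo-zigbuild rustup npm aws` prints nothing when every tool is available, and `has_rust_target x86_64-unknown-linux-gnu` succeeds once the target has been added.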

CDK deploy

npm run cdk deploy

Populating the bucket with TPCH data

npm run sync-bucket

Connect to instances

Prerequisites

The session manager plugin for the AWS CLI needs to be installed, as that's what is used for
connecting to the EC2 machines instead of SSH.

These are the docs with installation instructions:

https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html

On Mac with an Apple Silicon processor, it can be installed with:

curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/mac_arm64/session-manager-plugin.pkg" -o "session-manager-plugin.pkg"
sudo installer -pkg session-manager-plugin.pkg -target /
sudo ln -s /usr/local/sessionmanagerplugin/bin/session-manager-plugin /usr/local/bin/session-manager-plugin

Port Forward

After performing a CDK deploy, a CloudFormation (Cfn) output will be printed to stdout with instructions for port-forwarding to the instances, something like this:

export INSTANCE_ID=i-0000000000000000

aws ssm start-session --target $INSTANCE_ID --document-name AWS-StartPortForwardingSession --parameters "portNumber=9000,localPortNumber=9000"

Just port-forwarding the first instance is enough for issuing queries.
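
If local port 9000 is already in use, the remote and local ports can be set independently through the `portNumber` and `localPortNumber` parameters. A small sketch of a helper that builds the command; the function name `port_forward_cmd` is hypothetical, and it only prints the command rather than running it:

```shell
#!/bin/sh
# Hypothetical helper: print the port-forward command for one instance.
#   $1 = EC2 instance ID
#   $2 = remote port on the instance (defaults to 9000)
#   $3 = local port to listen on (defaults to the remote port)
port_forward_cmd() {
  instance_id=$1
  remote_port=${2:-9000}
  local_port=${3:-$remote_port}
  printf 'aws ssm start-session --target %s --document-name AWS-StartPortForwardingSession --parameters "portNumber=%s,localPortNumber=%s"\n' \
    "$instance_id" "$remote_port" "$local_port"
}
```

Calling `port_forward_cmd "$INSTANCE_ID" 9000 19000` prints a command that forwards local port 19000 to port 9000 on the instance. In that case, whatever issues the queries would need to point at the alternative local port.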

Connect

After performing a CDK deploy, a CloudFormation (Cfn) output will be printed to stdout with instructions for connecting
to all the machines, something like this:

export INSTANCE_ID=i-0000000000000000

aws ssm start-session --target $INSTANCE_ID

The logs can be streamed with:

sudo journalctl -u worker.service -f -o cat

Running benchmarks

There's a script that runs the TPCH benchmarks against the remote cluster.

In one terminal, perform a port-forward of one machine in the cluster, something like this:

export INSTANCE_ID=i-0000000000000000
aws ssm start-session --target $INSTANCE_ID --document-name AWS-StartPortForwardingSession --parameters "portNumber=9000,localPortNumber=9000"

In another terminal, navigate to the benchmarks/cdk folder:

cd benchmarks/cdk

And run the benchmarking script:

npm run datafusion-bench

Several arguments can be passed for running the benchmarks against different scale factors and with different configs,
for example:

npm run datafusion-bench -- --sf 10 --files-per-task 4 --query 7
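
To sweep over queries one at a time, the invocation above can be generated in a loop. This is a sketch under assumptions: the helper name `bench_commands` is hypothetical, and it assumes the script accepts every TPC-H query number from 1 to 22 through `--query`, which the example above only shows for query 7. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print one benchmark command per TPC-H query (1..22).
#   $1 = scale factor passed to --sf (defaults to 10)
#   $2 = value passed to --files-per-task (defaults to 4)
bench_commands() {
  sf=${1:-10}
  files_per_task=${2:-4}
  q=1
  while [ "$q" -le 22 ]; do
    printf 'npm run datafusion-bench -- --sf %s --files-per-task %s --query %s\n' \
      "$sf" "$files_per_task" "$q"
    q=$((q + 1))
  done
}
```

With the port-forward running in another terminal, `bench_commands 10 4 | sh` executed from benchmarks/cdk would run the printed commands in order.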

jayshrivastava

🙈 LGTM

gabotechs added 13 commits November 15, 2025 12:02

Add cdk sample app

8666f5b

Add bucket and ec2 instances

3f30a75

Use a static channel resolver in benchmarks

5bdec16

Add worker.rs

6b3db39

Use an ec2 api based channel resolver

eb33c69

Revert "Use a static channel resolver in benchmarks"

81231c4

This reverts commit 5bdec16

Refactor ec2 channel resolver

5b42b3b

Add datafusion-bench script

e7ce92a

Fix worker ec2 resolver

4472eba

Improve benchmarking script and cdk tests

428724c

Show diff with previous run

d3fb319

Better logs for the benchmarks

7539874

Improve readme

c7ad740

gabotechs mentioned this pull request Nov 18, 2025

TPCH queries hang in benchmarks #228

Open

gabotechs added 2 commits November 18, 2025 13:57

Always drop tables

9e04891

Improve Cfn output

a8d2418

jayshrivastava self-requested a review November 19, 2025 15:24

jayshrivastava approved these changes Nov 19, 2025

gabotechs merged commit 6b66c22 into main Nov 19, 2025
4 checks passed

gabotechs deleted the gabrielmusat/cdk-benchmarks branch November 19, 2025 16:17

gabotechs mentioned this pull request Nov 19, 2025

Distributed CLI / Benchmarking Tool #214

Closed
