Collaborator

@gabotechs gabotechs commented Nov 18, 2025

Adds AWS CDK code that, given an AWS account, automatically sets up an S3 bucket for storing TPCH Parquet datasets and spawns several t3.xlarge EC2 machines running distributed DataFusion for benchmarking purposes.

This creates a reproducible benchmarking environment that anyone with an AWS account can spawn in a few minutes and then start issuing queries to a distributed DataFusion cluster. By default, the following infrastructure is spawned automatically:

  • 4 "t3.xlarge" EC2 machines with distributed DataFusion running as a systemd service
  • a datafusion-distributed-benchmarks S3 bucket that can be populated with TPCH data with just 1 command
  • Secured networking based on AWS Session Manager that by default does not expose any port to the public, not even SSH

The deployment process is automatic, but several prerequisites need to be met on the local machine performing the deployment.


From the README.md:

Deploy

Prerequisites

cargo-zigbuild needs to be installed on the system for cross-compiling to Linux x86_64, which
is what the benchmarking machines in AWS run on.

cargo install --locked cargo-zigbuild

Make sure to also have the x86_64-unknown-linux-gnu target installed in
your Rust toolchain:

rustup target add x86_64-unknown-linux-gnu

Ensure that you can cross-compile to Linux x86_64 before performing any deployments:

cargo zigbuild -p datafusion-distributed-benchmarks --release --bin worker --target x86_64-unknown-linux-gnu
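
The prerequisites above can be checked in one go before deploying. The following is only a sketch: `missing_tools` and `has_rust_target` are hypothetical helpers that are not part of the repository, and the list of tools to check is an assumption based on the commands used in this README.

```shell
# Hypothetical preflight helpers (not part of the repository).

# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed only if the given target is listed by `rustup target list --installed`.
has_rust_target() {
  rustup target list --installed 2>/dev/null | grep -qx "$1"
}
```

For example, `missing_tools cargo cargo-zigbuild rustup npm aws` prints the tools that still need to be installed, and `has_rust_target x86_64-unknown-linux-gnu || rustup target add x86_64-unknown-linux-gnu` adds the target only when it is absent.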

CDK deploy

npm run cdk deploy

Populating the bucket with TPCH data

npm run sync-bucket

Connect to instances

Prerequisites

The session manager plugin for the AWS CLI needs to be installed, as that's what is used for
connecting to the EC2 machines instead of SSH.

These are the docs with installation instructions:

https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html

On Mac with an Apple Silicon processor, it can be installed with:

curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/mac_arm64/session-manager-plugin.pkg" -o "session-manager-plugin.pkg"
sudo installer -pkg session-manager-plugin.pkg -target /
sudo ln -s /usr/local/sessionmanagerplugin/bin/session-manager-plugin /usr/local/bin/session-manager-plugin

Port Forward

After performing a CDK deploy, a CloudFormation (CFN) output will be printed to stdout with instructions for port-forwarding to the instances, something like this:

export INSTANCE_ID=i-0000000000000000

aws ssm start-session --target "$INSTANCE_ID" --document-name AWS-StartPortForwardingSession --parameters "portNumber=9000,localPortNumber=9000"

Just port-forwarding the first instance is enough for issuing queries.
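
If port 9000 is already taken locally, the same command can be parameterized. This is a minimal sketch: `port_forward_cmd` is a hypothetical helper (not part of the repository) that only prints the command, and it assumes the worker listens on port 9000 as in the example above.

```shell
# Hypothetical helper: print the port-forward command for an instance.
# Usage: port_forward_cmd INSTANCE_ID [REMOTE_PORT] [LOCAL_PORT]
# REMOTE_PORT defaults to 9000; LOCAL_PORT defaults to REMOTE_PORT.
port_forward_cmd() {
  instance_id=$1
  remote_port=${2:-9000}
  local_port=${3:-$remote_port}
  printf 'aws ssm start-session --target %s --document-name AWS-StartPortForwardingSession --parameters "portNumber=%s,localPortNumber=%s"\n' \
    "$instance_id" "$remote_port" "$local_port"
}
```

Calling `port_forward_cmd "$INSTANCE_ID" 9000 9001` prints a command that forwards local port 9001 to port 9000 on the instance; the printed line can then be pasted into a terminal.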

Connect

After performing a CDK deploy, a CloudFormation (CFN) output will be printed to stdout with instructions for connecting
to all the machines, something like this:

export INSTANCE_ID=i-0000000000000000

aws ssm start-session --target "$INSTANCE_ID"

Once connected, the worker logs can be streamed with:

sudo journalctl -u worker.service -f -o cat

Running benchmarks

There's a script that runs the TPCH benchmarks against the remote cluster.

In one terminal, perform a port-forward of one machine in the cluster, something like this:

export INSTANCE_ID=i-0000000000000000
aws ssm start-session --target "$INSTANCE_ID" --document-name AWS-StartPortForwardingSession --parameters "portNumber=9000,localPortNumber=9000"

In another terminal, navigate to the benchmarks/cdk folder:

cd benchmarks/cdk

Then run the benchmarking script:

npm run datafusion-bench

Several arguments can be passed for running the benchmarks against different scale factors and with different configs,
for example:

npm run datafusion-bench -- --sf 10 --files-per-task 4 --query 7
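
To run a handful of queries back to back, the invocation can be wrapped in a loop. The sketch below makes some assumptions: `run_queries` and the `DRY_RUN` variable are hypothetical and not part of the repository, and only the `--sf` and `--query` flags shown above are used.

```shell
# Hypothetical helper: run the benchmark once per query number.
# Usage: run_queries SCALE_FACTOR QUERY...
# With DRY_RUN=1 the commands are printed instead of executed.
run_queries() {
  sf=$1
  shift
  for q in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "npm run datafusion-bench -- --sf $sf --query $q"
    else
      # Stop at the first failing query so errors are not buried.
      npm run datafusion-bench -- --sf "$sf" --query "$q" || return 1
    fi
  done
}
```

For example, `run_queries 10 1 7` runs queries 1 and 7 at scale factor 10 from within benchmarks/cdk, while `DRY_RUN=1 run_queries 10 1 7` only prints the two commands.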

@jayshrivastava jayshrivastava self-requested a review November 19, 2025 15:24
Collaborator

@jayshrivastava jayshrivastava left a comment

🙈 LGTM

@gabotechs gabotechs merged commit 6b66c22 into main Nov 19, 2025
4 checks passed
@gabotechs gabotechs deleted the gabrielmusat/cdk-benchmarks branch November 19, 2025 16:17