Add AWS CDK-based benchmarking environment #227
Merged
This reverts commit 5bdec16
jayshrivastava (Collaborator) approved these changes on Nov 19, 2025 and left a comment:
🙈 LGTM
Adds AWS CDK code that, given an AWS account, automatically sets up an S3 bucket for storing TPCH Parquet datasets and spawns several `t3.xlarge` EC2 machines running distributed DataFusion for benchmarking purposes.

This creates a reproducible benchmarking environment that anyone with an AWS account can spawn in a few minutes and then query as a distributed DataFusion cluster. By default, the following infrastructure is automatically spawned:

- Several `t3.xlarge` EC2 machines with distributed DataFusion running as a systemd service
- A `datafusion-distributed-benchmarks` S3 bucket that can be populated with TPCH data with a single command

The deployment process is automatic, but several prerequisites need to be met on the local machine performing the deployment.
From the README.md:
Deploy
Prerequisites
Cargo zigbuild needs to be installed in the system for cross-compiling to Linux x86_64, which
is what the benchmarking machines in AWS run on.
Make sure to also have the `x86_64-unknown-linux-gnu` target installed in your Rust toolchain:
Ensure that you can cross-compile to Linux x86_64 before performing any deployments:
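The original commands for these steps are not reproduced here. As a rough sketch of what the checks could look like, the helpers below verify that the required tools are on `PATH`, add the Rust target with the standard `rustup target add` command, and run a cross-compile with `cargo zigbuild`. The function names (`have`, `missing_tools`, `install_target`, `cross_build`) are hypothetical and not part of the repository, and the `--release` flag is an assumption.

```shell
#!/bin/sh
# Sketch of a prerequisite check for cross-compiling to Linux x86_64.
# All function names are hypothetical helpers, not part of the repository.

TARGET="x86_64-unknown-linux-gnu"

# Succeeds if the given command is available on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Prints the name of each tool that is NOT on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        have "$tool" || echo "$tool"
    done
}

# Adds the Linux x86_64 target to the active Rust toolchain.
install_target() {
    rustup target add "$TARGET"
}

# Cross-compiles the current workspace with cargo-zigbuild.
# The --release flag is an assumption; drop it for a debug build.
cross_build() {
    cargo zigbuild --target "$TARGET" --release
}
```

Calling `missing_tools cargo rustup cargo-zigbuild` should print nothing when everything is installed; after that, `install_target` followed by `cross_build` from the repository root exercises the cross-compilation path before any deployment.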
CDK deploy
Populating the bucket with TPCH data
Connect to instances
Prerequisites
The session manager plugin for the AWS CLI needs to be installed, as that's what is used for
connecting to the EC2 machines instead of SSH.
These are the docs with installation instructions:
https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html
On Mac with an Apple Silicon processor, it can be installed with:
Port Forward
After performing a CDK deploy, a CFN (CloudFormation) output will be printed to stdout with instructions for port-forwarding to the instances.
Just port-forwarding the first instance is enough for issuing queries.
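The exact command is the one printed by the deployment. For orientation only, a port-forward through the session manager generally uses the `AWS-StartPortForwardingSession` document. The sketch below builds such a command as a string; `port_forward_cmd` is a hypothetical helper, and the instance ID and port numbers are placeholders, since the port the cluster listens on is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: prints an SSM port-forwarding command for one instance.
# Arguments: <instance-id> <remote-port> [local-port]
# The local port defaults to the remote port when omitted.
port_forward_cmd() {
    instance_id="$1"
    remote_port="$2"
    local_port="${3:-$2}"
    printf "aws ssm start-session --target %s --document-name AWS-StartPortForwardingSession --parameters '{\"portNumber\":[\"%s\"],\"localPortNumber\":[\"%s\"]}'\n" \
        "$instance_id" "$remote_port" "$local_port"
}
```

Printing the command instead of running it makes it easy to inspect before use; substitute the instance ID and port from the deployment output.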
Connect
After performing a CDK deploy, a CFN (CloudFormation) output will be printed to stdout with instructions for connecting
to all the machines, something like this:
The logs can be streamed with:
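The original commands are not reproduced here. As a hedged sketch: an interactive session is typically opened with `aws ssm start-session`, and because DataFusion runs as a systemd service, its logs can usually be followed with `journalctl`. The helpers `connect_cmd` and `logs_cmd` are hypothetical, the instance ID and unit name are placeholders (the actual unit name is not stated here), and the use of `sudo` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prints the command that opens a shell on an instance via SSM.
# Argument: <instance-id>
connect_cmd() {
    printf 'aws ssm start-session --target %s\n' "$1"
}

# Hypothetical helper: prints the command that follows a systemd unit's logs.
# Argument: <unit-name> (placeholder; use the unit the deployment actually installs)
logs_cmd() {
    printf 'sudo journalctl -u %s -f\n' "$1"
}
```

The first command is run on the local machine; the second is run inside the session once connected.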
Running benchmarks
There's a script that will run the TPCH benchmarks against the remote cluster:
In one terminal, perform a port-forward of one machine in the cluster, something like this:
In another terminal, navigate to the benchmarks/cdk folder:
`cd benchmarks/cdk`
And run the benchmarking script:
Several arguments can be passed for running the benchmarks against different scale factors and with different configs,
for example: