Commit 751d2bb

docs: Improve EC2 benchmarking guide (#2474)
1 parent 2512b1a commit 751d2bb

10 files changed (+88, -126 lines)

dev/benchmarks/README.md

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ under the License.
 This directory contains scripts used for generating benchmark results that are published in this repository and in
 the Comet documentation.
 
+For full instructions on running these benchmarks on an EC2 instance, see the [Comet Benchmarking on EC2 Guide].
+
+[Comet Benchmarking on EC2 Guide]: https://datafusion.apache.org/comet/contributor-guide/benchmarking_aws_ec2.html
+
 ## Example usage
 
 Set Spark environment variables:

dev/benchmarks/blaze-tpcds.sh

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
     --conf spark.blaze.enable=true \
     --conf spark.blaze.forceShuffledHashJoin=true \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name blaze \
     --benchmark tpcds \
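The two `fs.s3a` settings added here (and in the other scripts below) make the S3A connector resolve AWS credentials through the SDK's default chain: environment variables, `~/.aws/credentials`, or an attached instance profile. A quick sanity check before launching a run, assuming the AWS CLI is installed and `your-bucket-name` is a placeholder for your bucket:

```shell
# The AWS CLI resolves credentials from the same sources that
# DefaultAWSCredentialsProviderChain walks, so if these commands succeed,
# the Spark jobs should be able to authenticate to S3 as well.
aws sts get-caller-identity
aws s3 ls s3://your-bucket-name/
```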

dev/benchmarks/blaze-tpch.sh

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
     --conf spark.blaze.enable=true \
     --conf spark.blaze.forceShuffledHashJoin=true \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name blaze \
     --benchmark tpch \

dev/benchmarks/comet-tpcds.sh

Lines changed: 2 additions & 1 deletion
@@ -41,7 +41,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.plugins=org.apache.spark.CometPlugin \
     --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
     --conf spark.comet.expression.allowIncompatible=true \
-    --conf spark.comet.scan.impl=native_datafusion \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name comet \
     --benchmark tpcds \
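This hunk also removes the explicit `spark.comet.scan.impl=native_datafusion` setting, so the script now exercises Comet's default scan implementation. To benchmark the native DataFusion scan as before, the removed line can be re-added locally; a sketch of that edit (same position among the `--conf` options):

```shell
# Hypothetical local edit to comet-tpcds.sh / comet-tpch.sh: restore the
# removed configuration to benchmark the native_datafusion scan instead
# of the default scan implementation.
--conf spark.comet.scan.impl=native_datafusion \
```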

dev/benchmarks/comet-tpch.sh

Lines changed: 2 additions & 1 deletion
@@ -42,7 +42,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
     --conf spark.comet.exec.replaceSortMergeJoin=true \
     --conf spark.comet.expression.allowIncompatible=true \
-    --conf spark.comet.scan.impl=native_datafusion \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name comet \
     --benchmark tpch \

dev/benchmarks/gluten-tpcds.sh

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.gluten.sql.columnar.forceShuffledHashJoin=true \
     --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
     --conf spark.sql.session.timeZone=UTC \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name gluten \
     --benchmark tpcds \

dev/benchmarks/gluten-tpch.sh

Lines changed: 2 additions & 0 deletions
@@ -42,6 +42,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.gluten.sql.columnar.forceShuffledHashJoin=true \
     --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
     --conf spark.sql.session.timeZone=UTC \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name gluten \
     --benchmark tpch \

dev/benchmarks/spark-tpcds.sh

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.memory.offHeap.enabled=true \
     --conf spark.memory.offHeap.size=16g \
     --conf spark.eventLog.enabled=true \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name spark \
     --benchmark tpcds \

dev/benchmarks/spark-tpch.sh

Lines changed: 2 additions & 0 deletions
@@ -34,6 +34,8 @@ $SPARK_HOME/bin/spark-submit \
     --conf spark.memory.offHeap.enabled=true \
     --conf spark.memory.offHeap.size=16g \
     --conf spark.eventLog.enabled=true \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
     tpcbench.py \
     --name spark \
     --benchmark tpch \
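All eight scripts follow the same pattern: they read connection and data-location settings from environment variables and pass them to `tpcbench.py`. A minimal end-to-end invocation sketch, using example values taken from the guide below (adjust paths and the master IP for your instance):

```shell
# Example session (values are illustrative, from the EC2 guide below).
export SPARK_HOME=/opt/spark-3.5.6-bin-hadoop3/
export SPARK_MASTER=spark://172.31.34.87:7077
export TPCH_DATA=/home/ec2-user/data
export TPCH_QUERIES=/home/ec2-user/datafusion-benchmarks/tpch/queries/
export COMET_JAR=/home/ec2-user/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.11.0-SNAPSHOT.jar

cd datafusion-comet/dev/benchmarks
./spark-tpch.sh   # baseline Apache Spark run
./comet-tpch.sh   # Comet-accelerated run for comparison
```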

docs/source/contributor-guide/benchmarking_aws_ec2.md

Lines changed: 68 additions & 124 deletions
@@ -17,120 +17,129 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Comet Benchmarking in AWS
+# Comet Benchmarking in EC2
 
-This guide is for setting up benchmarks on AWS EC2 with a single node with Parquet files located in S3.
+This guide is for setting up benchmarks on AWS EC2 with a single node, with Parquet files either located on an
+attached EBS volume or stored in S3.
 
-## Data Generation
+## Create EC2 Instance
 
 - Create an EC2 instance with an EBS volume sized for approximately 2x the size of
 the dataset to be generated (200 GB for scale factor 100, 2 TB for scale factor 1000, and so on)
-- Create an S3 bucket to store the Parquet files
+- Optionally, create an S3 bucket to store the Parquet files
 
-Install prerequisites:
+Recommendation: use the `c7i.4xlarge` instance type with a Provisioned IOPS SSD (`io2`) EBS volume (8000 IOPS).
 
-```shell
-sudo yum install -y docker git python3-pip
+## Install Prerequisites
 
-sudo systemctl start docker
-sudo systemctl enable docker
-sudo usermod -aG docker ec2-user
-newgrp docker
+```shell
+sudo yum update -y
+sudo yum install -y git protobuf-compiler java-17-amazon-corretto-headless java-17-amazon-corretto-devel
+sudo yum groupinstall -y "Development Tools"
+export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
+```
 
-docker pull ghcr.io/scalytics/tpch-docker:main
+## Generate Benchmark Data
 
-pip3 install datafusion
+```shell
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+RUSTFLAGS='-C target-cpu=native' cargo install tpchgen-cli
+tpchgen-cli -s 100 --format parquet --parts 32 --output-dir data
 ```
 
-Run the data generation script:
+Rename the generated directories so that they have a `.parquet` suffix. For example, rename `customer` to
+`customer.parquet`.
+
+Set the `TPCH_DATA` environment variable. This will be referenced by the benchmarking scripts.
 
 ```shell
-git clone https://github.com/apache/datafusion-benchmarks.git
-cd datafusion-benchmarks/tpch
-nohup python3 tpchgen.py generate --scale-factor 100 --partitions 16 &
+export TPCH_DATA=/home/ec2-user/data
 ```
 
-Check on progress with the following commands:
+## Install Apache Spark
 
 ```shell
-docker ps
-du -h -d 1 data
+export SPARK_VERSION=3.5.6
+wget https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
+tar xzf spark-$SPARK_VERSION-bin-hadoop3.tgz
+sudo mv spark-$SPARK_VERSION-bin-hadoop3 /opt
+export SPARK_HOME=/opt/spark-$SPARK_VERSION-bin-hadoop3/
+mkdir /tmp/spark-events
 ```
 
-Fix ownership in the generated files:
+Set the `SPARK_MASTER` env var (the IP address will need to be edited):
 
 ```shell
-sudo chown -R ec2-user:docker data
+export SPARK_MASTER=spark://172.31.34.87:7077
 ```
 
-Convert to Parquet:
+Set `SPARK_LOCAL_DIRS` to point to the EBS volume:
 
 ```shell
-nohup python3 tpchgen.py convert --scale-factor 100 --partitions 16 &
+sudo mkdir /mnt/tmp
+sudo chmod 777 /mnt/tmp
+mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
 ```
 
-Delete the CSV files:
+Add the following entry to `spark-env.sh`:
 
 ```shell
-cd data
-rm *.tbl.*
+SPARK_LOCAL_DIRS=/mnt/tmp
 ```
 
-Copy the Parquet files to S3:
+## Git Clone DataFusion Repositories
 
 ```shell
-aws s3 cp . s3://your-bucket-name/top-level-folder/ --recursive
+git clone https://github.com/apache/datafusion-benchmarks.git
+git clone https://github.com/apache/datafusion-comet.git
 ```
 
-## Install Spark
-
-Install Java
+Build Comet:
 
 ```shell
-sudo yum install -y java-17-amazon-corretto-headless java-17-amazon-corretto-devel
+cd datafusion-comet
+make release
 ```
 
-Set JAVA_HOME
+Set the `COMET_JAR` environment variable:
 
 ```shell
-export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
+export COMET_JAR=/home/ec2-user/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.11.0-SNAPSHOT.jar
 ```
 
-Install Spark
+## Run Benchmarks
+
+Use the scripts in `dev/benchmarks` in the Comet repository.
 
 ```shell
-wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
-tar xzf spark-3.5.4-bin-hadoop3.tgz
-sudo mv spark-3.5.4-bin-hadoop3 /opt
-export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
-mkdir /tmp/spark-events
+cd dev/benchmarks
+export TPCH_QUERIES=/home/ec2-user/datafusion-benchmarks/tpch/queries/
 ```
 
-Set `SPARK_MASTER` env var (IP address will need to be edited):
+Run the Spark benchmark:
 
 ```shell
-export SPARK_MASTER=spark://172.31.34.87:7077
+./spark-tpch.sh
 ```
 
-Set `SPARK_LOCAL_DIRS` to point to EBS volume
+Run the Comet benchmark:
 
 ```shell
-sudo mkdir /mnt/tmp
-sudo chmod 777 /mnt/tmp
-mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
+./comet-tpch.sh
 ```
 
-Add the following entry to `spark-env.sh`:
+## Running Benchmarks with S3
+
+Copy the Parquet data to an S3 bucket:
 
 ```shell
-SPARK_LOCAL_DIRS=/mnt/tmp
+aws s3 cp /home/ec2-user/data s3://your-bucket-name/ --recursive
 ```
 
-Start Spark in standalone mode:
+Update the `TPCH_DATA` environment variable:
 
 ```shell
-$SPARK_HOME/sbin/start-master.sh
-$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
+export TPCH_DATA=s3a://your-bucket-name
 ```
 
 Install Hadoop jar files:
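The rename step introduced in this hunk ("rename `customer` to `customer.parquet`") applies to every table that `tpchgen-cli` generates. A minimal sketch of that step, assuming the output layout of one directory per TPC-H table under `data/`:

```shell
# Rename each generated table directory so it carries a .parquet suffix;
# the list is the standard set of eight TPC-H tables.
cd data
for t in customer lineitem nation orders part partsupp region supplier; do
    mv "$t" "$t.parquet"
done
```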
@@ -148,76 +157,11 @@ aws_access_key_id=your-access-key
 aws_secret_access_key=your-secret-key
 ```
 
-## Run Spark Benchmarks
-
-Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
-
-```shell
-$SPARK_HOME/bin/spark-submit \
-    --master $SPARK_MASTER \
-    --conf spark.driver.memory=4G \
-    --conf spark.executor.instances=1 \
-    --conf spark.executor.cores=8 \
-    --conf spark.cores.max=8 \
-    --conf spark.executor.memory=16g \
-    --conf spark.eventLog.enabled=false \
-    --conf spark.local.dir=/mnt/tmp \
-    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
-    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
-    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
-    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
-    tpcbench.py \
-    --benchmark tpch \
-    --data s3a://your-bucket-name/top-level-folder \
-    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
-    --output . \
-    --iterations 1
-```
-
-## Run Comet Benchmarks
-
-Install Comet JAR from Maven:
-
-```shell
-wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.9.0/comet-spark-spark3.5_2.12-0.9.0.jar -P $SPARK_HOME/jars
-export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.9.0.jar
-```
-
-Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
-
-```shell
-$SPARK_HOME/bin/spark-submit \
-    --master $SPARK_MASTER \
-    --conf spark.driver.memory=4G \
-    --conf spark.executor.instances=1 \
-    --conf spark.executor.cores=8 \
-    --conf spark.cores.max=8 \
-    --conf spark.executor.memory=16g \
-    --conf spark.memory.offHeap.enabled=true \
-    --conf spark.memory.offHeap.size=16g \
-    --conf spark.eventLog.enabled=false \
-    --conf spark.local.dir=/mnt/tmp \
-    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
-    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
-    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
-    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
-    --jars $COMET_JAR \
-    --driver-class-path $COMET_JAR \
-    --conf spark.driver.extraClassPath=$COMET_JAR \
-    --conf spark.executor.extraClassPath=$COMET_JAR \
-    --conf spark.plugins=org.apache.spark.CometPlugin \
-    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
-    --conf spark.comet.enabled=true \
-    --conf spark.comet.expression.allowIncompatible=true \
-    --conf spark.comet.exec.replaceSortMergeJoin=true \
-    --conf spark.comet.exec.shuffle.enabled=true \
-    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
-    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
-    --conf spark.comet.exec.shuffle.compression.level=1 \
-    tpcbench.py \
-    --benchmark tpch \
-    --data s3a://your-bucket-name/top-level-folder \
-    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
-    --output . \
-    --iterations 1
+Modify the scripts to add the following configurations:
+
+```shell
+--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
 ```
+
+Now run the `spark-tpch.sh` and `comet-tpch.sh` scripts.
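For the instance recommendation earlier in this guide (`c7i.4xlarge` with a Provisioned IOPS volume), a hedged AWS CLI sketch may help; the AMI ID, key pair, and volume sizing are placeholders or assumptions, not part of this commit:

```shell
# Hypothetical example: launch the recommended instance type with an
# 8000-IOPS io2 EBS root volume sized for scale factor 100 (~200 GB).
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type c7i.4xlarge \
    --key-name your-key-pair \
    --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":200,"VolumeType":"io2","Iops":8000}}]'
```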
