
Commit e238392

docs: docs for benchmarking in aws ec2 (#1601)
1 parent ba53a7f commit e238392

2 files changed (+227 -0 lines changed)

docs/source/contributor-guide/benchmarking.md

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ under the License.
To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Data generation and benchmarking documentation and scripts are available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.

Available benchmarking guides:

- [Benchmarking on AWS EC2](benchmarking_aws_ec2)

We also have many micro benchmarks, located [here](https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark), that can be run from an IDE.

## Current Benchmark Results
Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,223 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Comet Benchmarking in AWS

This guide covers setting up benchmarks on a single AWS EC2 node, with the Parquet files stored in S3.

## Data Generation
- Create an EC2 instance with an EBS volume sized for approximately 2x the size of the dataset to be generated (200 GB for scale factor 100, 2 TB for scale factor 1000, and so on)
- Create an S3 bucket to store the Parquet files (a minimal CLI sketch follows this list)
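The bucket can be created from the instance with the AWS CLI; a minimal sketch, assuming the CLI is configured with credentials and using placeholder bucket and region names:

```shell
# Create the bucket that will hold the generated Parquet files
# (bucket names are globally unique; replace the placeholder)
aws s3 mb s3://your-bucket-name --region us-east-1
```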
Install prerequisites:

```shell
sudo yum install -y docker git python3-pip

# Start Docker and allow ec2-user to run it without sudo
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user
newgrp docker

# Image used by the data generation script
docker pull ghcr.io/scalytics/tpch-docker:main

pip3 install datafusion
```
Run the data generation script:

```shell
git clone https://github.com/apache/datafusion-benchmarks.git
cd datafusion-benchmarks/tpch
# Runs in the background; output is logged to nohup.out
nohup python3 tpchgen.py generate --scale-factor 100 --partitions 16 &
```
Check on progress with the following commands:

```shell
docker ps
du -h -d 1 data
```
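Because the generator was started with `nohup`, its log can also be followed (by default `nohup` appends output to `nohup.out` in the directory it was launched from):

```shell
# Follow the data generator's log output
tail -f nohup.out
```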
Fix ownership of the generated files:

```shell
sudo chown -R ec2-user:docker data
```
Convert to Parquet:

```shell
nohup python3 tpchgen.py convert --scale-factor 100 --partitions 16 &
```
Delete the CSV files:

```shell
cd data
rm *.tbl.*
```
Copy the Parquet files to S3:

```shell
aws s3 cp . s3://your-bucket-name/top-level-folder/ --recursive
```
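To sanity-check the upload, the bucket contents can be listed (same placeholder bucket name as above):

```shell
# Summarize what landed in the bucket
aws s3 ls s3://your-bucket-name/top-level-folder/ --recursive --human-readable --summarize
```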
## Install Spark

Install Java:

```shell
sudo yum install -y java-17-amazon-corretto-headless java-17-amazon-corretto-devel
```
Set `JAVA_HOME`:

```shell
export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
```
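This export only lasts for the current shell session; if the instance will be reused, the setting can be persisted (assuming bash is the login shell):

```shell
# Persist JAVA_HOME across sessions (optional)
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto' >> ~/.bashrc
```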
Install Spark:

```shell
wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xzf spark-3.5.4-bin-hadoop3.tgz
sudo mv spark-3.5.4-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
# Directory where Spark writes event logs when event logging is enabled
mkdir /tmp/spark-events
```
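As a quick sanity check that Spark and Java are wired up correctly:

```shell
# Print the Spark version banner to confirm the install
$SPARK_HOME/bin/spark-submit --version
```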
Set `SPARK_MASTER` env var (IP address will need to be edited):

```shell
export SPARK_MASTER=spark://172.31.34.87:7077
```
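The address should be the instance's private IPv4 address. One way to look it up from the instance itself (assuming a single network interface):

```shell
# Print the private IP to use in SPARK_MASTER
hostname -I | awk '{print $1}'
```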
Set `SPARK_LOCAL_DIRS` to point to the EBS volume:

```shell
sudo mkdir /mnt/tmp
sudo chmod 777 /mnt/tmp
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
```
Add the following entry to `spark-env.sh`:

```shell
SPARK_LOCAL_DIRS=/mnt/tmp
```
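Equivalently, the entry can be appended from the shell without opening an editor:

```shell
# Append the scratch-directory setting to spark-env.sh
echo 'SPARK_LOCAL_DIRS=/mnt/tmp' >> $SPARK_HOME/conf/spark-env.sh
```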
Start Spark in standalone mode:

```shell
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
```
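To confirm both daemons started, `jps` (bundled with the JDK) should list a `Master` and a `Worker` process; the master's web UI also listens on port 8080 by default:

```shell
# Expect a Master and a Worker entry in the output
jps
```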
Install Hadoop jar files:

```shell
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -P $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar -P $SPARK_HOME/jars
```
Add credentials to `~/.aws/credentials`:

```shell
[default]
aws_access_key_id=your-access-key
aws_secret_access_key=your-secret-key
```
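If the AWS CLI is installed, a quick check that these credentials are valid:

```shell
# Verify the credentials resolve to an AWS identity
aws sts get-caller-identity
```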
## Run Spark Benchmarks

Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=4G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.eventLog.enabled=false \
    --conf spark.local.dir=/mnt/tmp \
    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
    tpcbench.py \
    --benchmark tpch \
    --data s3a://your-bucket-name/top-level-folder \
    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
```
## Run Comet Benchmarks

Install the Comet JAR from Maven:

```shell
wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.7.0/comet-spark-spark3.5_2.12-0.7.0.jar -P $SPARK_HOME/jars
export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.7.0.jar
```
Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=4G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=false \
    --conf spark.local.dir=/mnt/tmp \
    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
    --jars $COMET_JAR \
    --driver-class-path $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
    --conf spark.comet.exec.shuffle.compression.level=1 \
    tpcbench.py \
    --benchmark tpch \
    --data s3a://your-bucket-name/top-level-folder \
    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
```
