specific language governing permissions and limitations
under the License.
-->

# Comet Benchmarking in EC2

This guide is for setting up benchmarks on AWS EC2 with a single node, with Parquet files either located on an attached EBS volume or stored in S3.

## Create EC2 Instance

- Create an EC2 instance with an EBS volume sized for approximately 2x the size of the dataset to be generated (200 GB for scale factor 100, 2 TB for scale factor 1000, and so on)
- Optionally, create an S3 bucket to store the Parquet files

Recommendation: use a `c7i.4xlarge` instance type with a Provisioned IOPS SSD (`io2`) EBS volume (8000 IOPS).

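As a rough sketch, such an instance could be launched from the AWS CLI as shown below. The AMI ID, key pair, and security group are placeholders and need to be replaced with your own values.

```shell
# Launch a c7i.4xlarge with a 200 GB io2 root volume provisioned for 8000 IOPS
# (increase VolumeSize for larger scale factors)
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type c7i.4xlarge \
  --key-name my-key-pair \
  --security-group-ids sg-xxxxxxxxxxxxxxxxx \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":200,"VolumeType":"io2","Iops":8000}}]'
```
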
## Install Prerequisites

```shell
sudo yum update -y
sudo yum install -y git protobuf-compiler java-17-amazon-corretto-headless java-17-amazon-corretto-devel
sudo yum groupinstall -y "Development Tools"
export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
```
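
Optionally, confirm that the tools are on the path before proceeding:

```shell
git --version
protoc --version
java -version
echo $JAVA_HOME
```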

## Generate Benchmark Data

Install Rust and the `tpchgen-cli` tool, then generate the data:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# make cargo available in the current shell
. "$HOME/.cargo/env"
RUSTFLAGS='-C target-cpu=native' cargo install tpchgen-cli
tpchgen-cli -s 100 --format parquet --parts 32 --output-dir data
```
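
Generating larger scale factors takes some time; progress can be checked by watching the size of the output directory:

```shell
du -h -d 1 data
```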

Rename the generated directories so that they have a `.parquet` suffix. For example, rename `customer` to `customer.parquet`.

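A minimal sketch of the rename, assuming the standard eight TPC-H table names:

```shell
cd data
for table in customer lineitem nation orders part partsupp region supplier; do
  mv "$table" "$table.parquet"
done
cd ..
```
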
Set the `TPCH_DATA` environment variable. This will be referenced by the benchmarking scripts.

```shell
export TPCH_DATA=/home/ec2-user/data
```

## Install Apache Spark

```shell
export SPARK_VERSION=3.5.6
wget https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
tar xzf spark-$SPARK_VERSION-bin-hadoop3.tgz
sudo mv spark-$SPARK_VERSION-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-$SPARK_VERSION-bin-hadoop3/
mkdir /tmp/spark-events
```
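
Optionally, verify the installation:

```shell
$SPARK_HOME/bin/spark-submit --version
```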

Set the `SPARK_MASTER` environment variable (the IP address will need to be edited to match your instance):

```shell
export SPARK_MASTER=spark://172.31.34.87:7077
```

Set `SPARK_LOCAL_DIRS` to point to the EBS volume:

```shell
sudo mkdir /mnt/tmp
sudo chmod 777 /mnt/tmp
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
```

Add the following entry to `spark-env.sh`:

```shell
SPARK_LOCAL_DIRS=/mnt/tmp
```

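If a standalone Spark cluster is not already running, it can be started with the scripts that ship with Spark (a sketch, using the `SPARK_MASTER` address set above):

```shell
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
```
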
## Git Clone DataFusion Repositories

```shell
git clone https://github.com/apache/datafusion-benchmarks.git
git clone https://github.com/apache/datafusion-comet.git
```

Build Comet:

```shell
cd datafusion-comet
make release
```

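The name of the built jar depends on the Spark/Scala profile and the Comet version, so confirm it before setting `COMET_JAR` in the next step:

```shell
ls spark/target/comet-spark-spark*.jar
```
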
Set the `COMET_JAR` environment variable:

```shell
export COMET_JAR=/home/ec2-user/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.11.0-SNAPSHOT.jar
```

## Run Benchmarks

Use the scripts in `dev/benchmarks` in the Comet repository.

```shell
cd dev/benchmarks
export TPCH_QUERIES=/home/ec2-user/datafusion-benchmarks/tpch/queries/
```

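Before launching anything, it can help to confirm that the environment variables the scripts rely on are all set (the exact set read by each script may differ slightly):

```shell
echo "SPARK_HOME=$SPARK_HOME"
echo "SPARK_MASTER=$SPARK_MASTER"
echo "TPCH_DATA=$TPCH_DATA"
echo "TPCH_QUERIES=$TPCH_QUERIES"
echo "COMET_JAR=$COMET_JAR"
```
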
Run the Spark benchmark:

```shell
./spark-tpch.sh
```

Run the Comet benchmark:

```shell
./comet-tpch.sh
```

## Running Benchmarks with S3

Copy the Parquet data to an S3 bucket:

```shell
aws s3 cp /home/ec2-user/data s3://your-bucket-name/ --recursive
```

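Optionally, verify that all of the table directories arrived in the bucket:

```shell
aws s3 ls s3://your-bucket-name/
```
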
Update the `TPCH_DATA` environment variable to point to the bucket:

```shell
export TPCH_DATA=s3a://your-bucket-name
```

Install Hadoop jar files, then add AWS credentials of the form:

```shell
aws_access_key_id=your-access-key
aws_secret_access_key=your-secret-key
```

Modify the scripts to add the following configurations:

```shell
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
```
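
In each script, these lines slot into the existing `spark-submit` invocation next to the other `--conf` options; a sketch, with the rest of the command coming from the script itself:

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
    # ... remaining --conf options and the benchmark application follow here
```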

Now run the `spark-tpch.sh` and `comet-tpch.sh` scripts.