<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Comet Benchmarking in AWS
This guide describes how to set up Comet benchmarks on a single AWS EC2 node, reading Parquet files from S3.

## Data Generation

- Create an EC2 instance with an EBS volume sized for approximately 2x the size of
  the dataset to be generated (200 GB for scale factor 100, 2 TB for scale factor 1000, and so on)
- Create an S3 bucket to store the Parquet files (an example command is shown below)
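
The bucket can be created with the AWS CLI. This is a minimal sketch; the bucket name and region are placeholders and must match the values you use in the later steps:

```shell
# Create the bucket that will hold the generated Parquet files
# (bucket name and region are examples - substitute your own)
aws s3 mb s3://your-bucket-name --region us-east-1
```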

Install prerequisites:

```shell
sudo yum install -y docker git python3-pip

sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user
newgrp docker

docker pull ghcr.io/scalytics/tpch-docker:main

pip3 install datafusion
```

Run the data generation script:

```shell
git clone https://github.com/apache/datafusion-benchmarks.git
cd datafusion-benchmarks/tpch
nohup python3 tpchgen.py generate --scale-factor 100 --partitions 16 &
```

Check on progress with the following commands:

```shell
docker ps
du -h -d 1 data
```

Fix ownership of the generated files:

```shell
sudo chown -R ec2-user:docker data
```

Convert to Parquet:

```shell
nohup python3 tpchgen.py convert --scale-factor 100 --partitions 16 &
```

Delete the CSV files:

```shell
cd data
rm *.tbl.*
```

Copy the Parquet files to S3:

```shell
aws s3 cp . s3://your-bucket-name/top-level-folder/ --recursive
```
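
To confirm that the upload is complete, list the bucket contents and check the reported object count and total size (the bucket and folder names below are the same placeholders used above):

```shell
# List the uploaded Parquet files and print a size summary
aws s3 ls s3://your-bucket-name/top-level-folder/ --recursive --human-readable --summarize
```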

## Install Spark

Install Java:

```shell
sudo yum install -y java-17-amazon-corretto-headless java-17-amazon-corretto-devel
```

Set `JAVA_HOME`:

```shell
export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
```
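
Environment variables set in an interactive shell are lost when the SSH session ends. If you expect to reconnect between steps, one option (an assumption about your workflow, not a requirement) is to persist them in `~/.bashrc`:

```shell
# Persist JAVA_HOME across SSH sessions; the same approach works for
# SPARK_HOME and SPARK_MASTER, which are set later in this guide
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto' >> ~/.bashrc
source ~/.bashrc
```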

Install Spark:

```shell
wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xzf spark-3.5.4-bin-hadoop3.tgz
sudo mv spark-3.5.4-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
mkdir /tmp/spark-events
```

Set the `SPARK_MASTER` env var (edit the IP address to match your instance's private IP):

```shell
export SPARK_MASTER=spark://172.31.34.87:7077
```
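
The instance's private IPv4 address (the value to use in the master URL) can usually be found with the following command, assuming a single network interface:

```shell
# Print the private IPv4 address of this instance
hostname -I | awk '{print $1}'
```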

Set `SPARK_LOCAL_DIRS` to point to the EBS volume:

```shell
sudo mkdir /mnt/tmp
sudo chmod 777 /mnt/tmp
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
```

Add the following entry to `spark-env.sh`:

```shell
SPARK_LOCAL_DIRS=/mnt/tmp
```
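
For example, the entry can be appended directly to the file:

```shell
# Append the local-dirs setting shown above to spark-env.sh
echo 'SPARK_LOCAL_DIRS=/mnt/tmp' >> $SPARK_HOME/conf/spark-env.sh
```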

Start Spark in standalone mode:

```shell
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
```
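
To verify that both daemons started, list the running JVMs (a `Master` and a `Worker` process should appear); the standalone master web UI on port 8080 also shows the registered worker:

```shell
# The Master and Worker daemons should both be listed
jps
```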

Install Hadoop JAR files:

```shell
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -P $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar -P $SPARK_HOME/jars
```

Add credentials to `~/.aws/credentials`:

```ini
[default]
aws_access_key_id=your-access-key
aws_secret_access_key=your-secret-key
```
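
Before launching the benchmark, you can confirm that the credentials resolve to a valid identity (this only calls STS; it does not check S3 permissions):

```shell
# Verify that the default AWS credentials are picked up
aws sts get-caller-identity
```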

## Run Spark Benchmarks

Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=4G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.eventLog.enabled=false \
    --conf spark.local.dir=/mnt/tmp \
    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
    tpcbench.py \
    --benchmark tpch \
    --data s3a://your-bucket-name/top-level-folder \
    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
```

## Run Comet Benchmarks

Install the Comet JAR from Maven:

```shell
wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.7.0/comet-spark-spark3.5_2.12-0.7.0.jar -P $SPARK_HOME/jars
export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.7.0.jar
```

Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=4G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=false \
    --conf spark.local.dir=/mnt/tmp \
    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
    --jars $COMET_JAR \
    --driver-class-path $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
    --conf spark.comet.exec.shuffle.compression.level=1 \
    tpcbench.py \
    --benchmark tpch \
    --data s3a://your-bucket-name/top-level-folder \
    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
```