Commit fd09a79

docs: Add instructions on running TPC-H on macOS (#1647)
1 parent 7e2ebab commit fd09a79

2 files changed: +146 -0 lines changed

docs/source/contributor-guide/benchmarking.md

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ benchmarking documentation and scripts are available in the [DataFusion Benchmar

Available benchmarking guides:

+ - [Benchmarking on macOS](benchmarking_macos.md)
- [Benchmarking on AWS EC2](benchmarking_aws_ec2)

We also have many micro benchmarks that can be run from an IDE located [here](https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark).
Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Comet Benchmarking on macOS

This guide describes how to set up and run TPC-H benchmarks locally on macOS using the 100 GB (scale factor 100) dataset.

Note that macOS is not an ideal benchmarking environment: we cannot force Spark or Comet to use performance
cores rather than efficiency cores, background processes share these cores, and power and thermal
management may throttle the CPU.

## Prerequisites

Java and Rust must be installed locally.
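
As a quick sanity check before starting, the prerequisites can be verified from the shell. The sketch below is illustrative only: `require_cmd` is a hypothetical helper, not part of Comet or Spark, and it only checks that the `java` and `cargo` commands are on the `PATH`, not that their versions are suitable.

```shell
# Hypothetical helper: succeed only if the named command is on the PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# The guide needs a JDK (java) and a Rust toolchain (cargo).
for cmd in java cargo; do
  require_cmd "$cmd" || echo "Install $cmd before continuing" >&2
done
```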
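
## Data Generation

Install the TPC-H data generator and generate the data in Parquet format at scale factor 100:

```shell
# The command-line generator from the tpchgen-rs project is published as the tpchgen-cli crate
cargo install tpchgen-cli
tpchgen-cli -s 100 --format=parquet
```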
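
Before running the benchmarks it can be useful to confirm that all eight TPC-H tables were generated. The following is a minimal sketch, not part of `tpchgen-cli`: `check_tpch_tables` is a hypothetical helper, and it assumes each table is stored as `<table>.parquet` (a file or a directory) directly under the data directory. Adjust the path pattern if your generated layout differs.

```shell
# Hypothetical helper: report which of the eight TPC-H tables are missing
# from a data directory. Assumes each table is stored as <table>.parquet
# (a file or a directory); this layout is an assumption.
check_tpch_tables() {
  data_dir="$1"
  missing=0
  for table in customer lineitem nation orders part partsupp region supplier; do
    if [ ! -e "$data_dir/$table.parquet" ]; then
      echo "missing: $table"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_tpch_tables .` checks the current directory and returns a non-zero status if any table is absent.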

## Clone the DataFusion Benchmarks Repository

```shell
git clone https://github.com/apache/datafusion-benchmarks.git
```
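
## Install Spark

Download Spark 3.5.4, install it under `/opt`, and create the event log directory:

```shell
# wget is not installed on macOS by default; `curl -O <url>` is an alternative
wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xzf spark-3.5.4-bin-hadoop3.tgz
sudo mv spark-3.5.4-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3
# Default event log location used when spark.eventLog.enabled=true
mkdir -p /tmp/spark-events
```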

Start the Spark master in standalone mode:

```shell
$SPARK_HOME/sbin/start-master.sh
```

Set the `SPARK_MASTER` environment variable (replace the host name with your own):

```shell
export SPARK_MASTER=spark://Rustys-MacBook-Pro.local:7077
```
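
Rather than hard-coding the host name, the URL can often be derived from the local host name. This is a sketch under the assumption that the master was started with default settings, so that it advertises this machine's host name on the default standalone port 7077. If the worker fails to connect, use the exact URL shown in the master log or in the master web UI.

```shell
# Assumption: the master was started with default settings, so it listens
# on this machine's host name and the default standalone port 7077.
SPARK_MASTER="spark://$(hostname):7077"
export SPARK_MASTER
echo "$SPARK_MASTER"
```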

Start a worker and register it with the master:

```shell
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
```

## Run Spark Benchmarks

Run the following command (the `--data` parameter will need to be updated to point to your TPC-H data):

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    /path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --name spark \
    --benchmark tpch \
    --data /Users/rusty/Data/tpch/sf100 \
    --queries /path/to/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
```

## Run Comet Benchmarks

Build Comet from source with the `mimalloc` feature enabled:

```shell
make release COMET_FEATURES=mimalloc
```

Set `COMET_JAR` to point to the location of the Comet jar file (the version in the file name depends on the Comet version you built):

```shell
export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar
```
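
Because the jar file name changes with the Comet version, it can help to confirm that `COMET_JAR` points at a real file before submitting the job. The following is a minimal sketch; `check_comet_jar` is a hypothetical helper, not part of Comet.

```shell
# Hypothetical helper: fail early if COMET_JAR is unset or does not point
# to an existing file.
check_comet_jar() {
  if [ -z "${COMET_JAR:-}" ] || [ ! -f "$COMET_JAR" ]; then
    echo "Comet jar not found: ${COMET_JAR:-<unset>}" >&2
    return 1
  fi
  echo "Using Comet jar: $COMET_JAR"
}
```

Running `check_comet_jar && echo ok` before `spark-submit` surfaces a wrong path immediately instead of as a class-loading failure later.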

Run the following command (the `--data` parameter will need to be updated to point to your TPC-H data):

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    --jars $COMET_JAR \
    --driver-class-path $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.shuffle.enableFastEncoding=true \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    --conf spark.comet.cast.allowIncompatible=true \
    /path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --name comet \
    --benchmark tpch \
    --data /path/to/tpch-data/ \
    --queries /path/to/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
```
