Commit 3eced67

docs: Fix links and provide complete benchmarking scripts (apache#1284)

* fix links and provide complete scripts
* fix path
* fix incorrect text

1 parent d36e8d7

File tree: 3 files changed (+130, -66 lines)

docs/source/contributor-guide/benchmark-results/tpc-ds.md

Lines changed: 63 additions & 2 deletions
@@ -19,8 +19,8 @@ under the License.
  # Apache DataFusion Comet: Benchmarks Derived From TPC-DS

- The following benchmarks were performed on a two node Kubernetes cluster with
- data stored locally in Parquet format on NVMe storage. Performance characteristics will vary in different environments
+ The following benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and
+ data stored locally in Parquet format on NVMe storage. Performance characteristics will vary in different environments
  and we encourage you to run these benchmarks in your own environments.

  The tracking issue for improving TPC-DS performance is [#858](https://github.com/apache/datafusion-comet/issues/858).
@@ -43,3 +43,64 @@ The raw results of these benchmarks in JSON format is available here:
  - [Spark](0.5.0/spark-tpcds.json)
  - [Comet](0.5.0/comet-tpcds.json)
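The layout of these JSON files is not documented on this page, so the quickest way to see what they contain is to pretty-print them. A minimal, schema-agnostic sketch (assuming `python3` and `jq` are installed and the commands are run from the directory holding the linked files):

```shell
# Peek at the published result files linked above; neither command depends on
# knowing the exact schema produced by tpcbench.py.
python3 -m json.tool 0.5.0/spark-tpcds.json | head -n 40
jq 'keys' 0.5.0/comet-tpcds.json
```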

The rest of the hunk is new content, appending a Scripts section to the page:

# Scripts

Here are the scripts that were used to generate these results.
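These scripts assume that `SPARK_HOME`, `SPARK_MASTER`, and (for the Comet run) `COMET_JAR` are already exported, and the relative `--queries` paths imply they are launched from the `runners/datafusion-comet` directory of the datafusion-benchmarks repository, as the removed text in `benchmarking.md` below also notes. A minimal setup sketch, where the Spark version, master URL, and jar path are purely illustrative:

```shell
# Illustrative setup; substitute the real Spark install, cluster URL, and Comet jar.
git clone https://github.com/apache/datafusion-benchmarks.git
cd datafusion-benchmarks/runners/datafusion-comet

export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3
export SPARK_MASTER=spark://localhost:7077
export COMET_JAR=/path/to/comet-spark.jar
```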
## Apache Spark

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.memory=32G \
    --conf spark.executor.instances=2 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=16 \
    --conf spark.eventLog.enabled=true \
    tpcbench.py \
    --benchmark tpcds \
    --name spark \
    --data /mnt/bigdata/tpcds/sf100/ \
    --queries ../../tpcds/ \
    --output . \
    --iterations 5
```

## Apache Spark + Comet

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=2 \
    --conf spark.executor.memory=16G \
    --conf spark.executor.cores=8 \
    --total-executor-cores=16 \
    --conf spark.eventLog.enabled=true \
    --conf spark.driver.maxResultSize=2G \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=24g \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.enabled=true \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.exec.replaceSortMergeJoin=false \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    tpcbench.py \
    --name comet \
    --benchmark tpcds \
    --data /mnt/bigdata/tpcds/sf100/ \
    --queries ../../tpcds/ \
    --output . \
    --iterations 5
```
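
Before committing to a five-iteration run it can be worth confirming that Comet actually takes over the physical plan. This is a hypothetical smoke test, not something from this commit; it reuses the plugin and classpath settings above and assumes the TPC-DS data directory contains one Parquet directory per table (e.g. `store_sales`):

```shell
#!/bin/bash
# Hypothetical check: EXPLAIN a trivial query and look for Comet operators
# (e.g. "CometScan") in the printed physical plan. If only stock Spark
# operators appear, recheck $COMET_JAR and the plugin configuration.
$SPARK_HOME/bin/spark-sql \
    --master $SPARK_MASTER \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.enabled=true \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    -e "EXPLAIN SELECT ss_item_sk, count(*) FROM parquet.\`/mnt/bigdata/tpcds/sf100/store_sales\` GROUP BY ss_item_sk"
```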

docs/source/contributor-guide/benchmark-results/tpc-h.md

Lines changed: 67 additions & 4 deletions
@@ -25,21 +25,84 @@ and we encourage you to run these benchmarks in your own environments.
  The tracking issue for improving TPC-H performance is [#391](https://github.com/apache/datafusion-comet/issues/391).

- ![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_allqueries.png)
+ ![](../../_static/images/benchmark-results/0.5.0/tpch_allqueries.png)

  Here is a breakdown showing relative performance of Spark and Comet for each query.

- ![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_compare.png)
+ ![](../../_static/images/benchmark-results/0.5.0/tpch_queries_compare.png)

  The following chart shows how much Comet currently accelerates each query from the benchmark in relative terms.

- ![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_speedup_rel.png)
+ ![](../../_static/images/benchmark-results/0.5.0/tpch_queries_speedup_rel.png)

  The following chart shows how much Comet currently accelerates each query from the benchmark in absolute terms.

- ![](../../_static/images/benchmark-results/0.5.0-SNAPSHOT-2025-01-09/tpch_queries_speedup_abs.png)
+ ![](../../_static/images/benchmark-results/0.5.0/tpch_queries_speedup_abs.png)

  The raw results of these benchmarks in JSON format is available here:

  - [Spark](0.5.0/spark-tpch.json)
  - [Comet](0.5.0/comet-tpch.json)

The rest of the hunk appends a Scripts section, mirroring the TPC-DS page:

# Scripts

Here are the scripts that were used to generate these results.
## Apache Spark

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    tpcbench.py \
    --name spark \
    --benchmark tpch \
    --data /mnt/bigdata/tpch/sf100/ \
    --queries ../../tpch/queries \
    --output . \
    --iterations 5
```
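
Both scripts set `spark.eventLog.enabled=true`, so every run leaves an event log behind. A hypothetical way to browse those runs afterwards is the Spark history server; the log directory below is Spark's default `spark.eventLog.dir`, so adjust it if the cluster overrides that setting:

```shell
# Point the history server at the event log directory and open the web UI.
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:///tmp/spark-events" \
    $SPARK_HOME/sbin/start-history-server.sh
# The UI is served on http://localhost:18080 by default.
```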

## Apache Spark + Comet

```shell
#!/bin/bash
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    --conf spark.eventLog.enabled=true \
    --jars $COMET_JAR \
    --driver-class-path $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    tpcbench.py \
    --name comet \
    --benchmark tpch \
    --data /mnt/bigdata/tpch/sf100/ \
    --queries ../../tpch/queries \
    --output . \
    --iterations 5
```
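
With `--output .`, tpcbench.py presumably writes its result files into the working directory; the file naming and JSON schema are not documented on this page. A hypothetical sanity check before charting anything is simply to confirm the output parses:

```shell
# Hypothetical post-run check: every JSON file in the output directory should parse.
for f in ./*.json; do
    python3 -m json.tool "$f" > /dev/null && echo "valid JSON: $f"
done
```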

docs/source/contributor-guide/benchmarking.md

Lines changed: 0 additions & 60 deletions
@@ -24,66 +24,6 @@ benchmarking documentation and scripts are available in the [DataFusion Benchmar
  We also have many micro benchmarks that can be run from an IDE located [here](https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark).

The rest of the hunk deletes the example commands that previously lived on this page; the complete, up-to-date scripts now live on the benchmark results pages above. The removed text follows.

Here are example commands for running the benchmarks against a Spark cluster. This command will need to be
adapted based on the Spark environment and location of data files.

These commands are intended to be run from the `runners/datafusion-comet` directory in the `datafusion-benchmarks`
repository.

## Running Benchmarks Against Apache Spark

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.memory=32G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    tpcbench.py \
    --benchmark tpch \
    --data /mnt/bigdata/tpch/sf100/ \
    --queries ../../tpch/queries \
    --iterations 3
```

## Running Benchmarks Against Apache Spark with Apache DataFusion Comet Enabled

### TPC-H

```shell
$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.memory=16G \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.comet.cast.allowIncompatible=true \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.exec.shuffle.enabled=true \
    --conf spark.comet.exec.shuffle.mode=auto \
    --conf spark.comet.exec.shuffle.enableFastEncoding=true \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
    tpcbench.py \
    --benchmark tpch \
    --data /mnt/bigdata/tpch/sf100/ \
    --queries ../../tpch/queries \
    --iterations 3
```

### TPC-DS

For TPC-DS, use `spark.comet.exec.replaceSortMergeJoin=false`.

The surrounding context is unchanged:

  ## Current Benchmark Results

  - [Benchmarks derived from TPC-H](benchmark-results/tpc-h)
