
Commit 73ecb40

Merge branch 'main' into support_overwrite_comet_io

2 parents: 47bc3bc + 4bd664e

40 files changed: +3718 -813 lines

.github/workflows/pr_build_linux.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -122,6 +122,7 @@ jobs:
             org.apache.comet.exec.CometAsyncShuffleSuite
             org.apache.comet.exec.DisableAQECometShuffleSuite
             org.apache.comet.exec.DisableAQECometAsyncShuffleSuite
+            org.apache.spark.shuffle.sort.SpillSorterSuite
         - name: "parquet"
           value: |
             org.apache.comet.parquet.CometParquetWriterSuite
@@ -160,6 +161,7 @@ jobs:
           value: |
             org.apache.comet.CometExpressionSuite
             org.apache.comet.CometExpressionCoverageSuite
+            org.apache.comet.CometHashExpressionSuite
             org.apache.comet.CometTemporalExpressionSuite
             org.apache.comet.CometArrayExpressionSuite
             org.apache.comet.CometCastSuite
```

.github/workflows/pr_build_macos.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -85,6 +85,7 @@ jobs:
             org.apache.comet.exec.CometAsyncShuffleSuite
             org.apache.comet.exec.DisableAQECometShuffleSuite
             org.apache.comet.exec.DisableAQECometAsyncShuffleSuite
+            org.apache.spark.shuffle.sort.SpillSorterSuite
         - name: "parquet"
           value: |
             org.apache.comet.parquet.CometParquetWriterSuite
@@ -123,6 +124,7 @@ jobs:
           value: |
             org.apache.comet.CometExpressionSuite
             org.apache.comet.CometExpressionCoverageSuite
+            org.apache.comet.CometHashExpressionSuite
             org.apache.comet.CometTemporalExpressionSuite
             org.apache.comet.CometArrayExpressionSuite
             org.apache.comet.CometCastSuite
```

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -18,3 +18,5 @@ apache-rat-*.jar
 venv
 dev/release/comet-rm/workdir
 spark/benchmarks
+.DS_Store
+comet-event-trace.json
```

benchmarks/pyspark/README.md

Lines changed: 97 additions & 0 deletions
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Shuffle Size Comparison Benchmark

Compares shuffle file sizes between Spark, Comet JVM, and Comet Native shuffle implementations.

## Prerequisites

- Apache Spark cluster (standalone, YARN, or Kubernetes)
- PySpark installed
- Comet JAR built

## Build Comet JAR

```bash
cd /path/to/datafusion-comet
make release
```

## Step 1: Generate Test Data

Generate test data with a realistic 50-column schema (nested structs, arrays, maps):

```bash
spark-submit \
  --master spark://master:7077 \
  --executor-memory 16g \
  generate_data.py \
  --output /tmp/shuffle-benchmark-data \
  --rows 10000000 \
  --partitions 200
```

### Data Generation Options

| Option               | Default    | Description                  |
| -------------------- | ---------- | ---------------------------- |
| `--output`, `-o`     | (required) | Output path for Parquet data |
| `--rows`, `-r`       | 10000000   | Number of rows               |
| `--partitions`, `-p` | 200        | Number of output partitions  |

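For context, `generate_data.py` itself is not reproduced in this README, so the following is only a minimal, hypothetical PySpark sketch of the kind of generator the command above drives: it builds a DataFrame with flat, struct, array, and map columns and writes partitioned Parquet. The column names and the (much narrower) schema are assumptions for illustration, not the real 50-column schema.

```python
# Hypothetical sketch only -- not the repository's generate_data.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generate-nested-test-data").getOrCreate()

num_rows = 10_000_000                        # corresponds to --rows
num_partitions = 200                         # corresponds to --partitions
output_path = "/tmp/shuffle-benchmark-data"  # corresponds to --output

df = (
    spark.range(num_rows)
    # Flat columns
    .withColumn("key", F.col("id") % 1000)
    .withColumn("name", F.concat(F.lit("user_"), F.col("id").cast("string")))
    # Nested struct column
    .withColumn(
        "address",
        F.struct(F.lit("street").alias("street"), (F.col("id") % 100).alias("zip")),
    )
    # Array and map columns
    .withColumn("tags", F.array(F.lit("a"), F.lit("b"), F.lit("c")))
    .withColumn("attrs", F.create_map(F.lit("k"), F.col("id").cast("string")))
)

df.repartition(num_partitions).write.mode("overwrite").parquet(output_path)
```
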
## Step 2: Run Benchmark

Run all benchmarks and check the Spark UI for shuffle sizes:

```bash
SPARK_MASTER=spark://master:7077 \
EXECUTOR_MEMORY=16g \
./run_all_benchmarks.sh /tmp/shuffle-benchmark-data
```

Or run individual modes:

```bash
# Spark baseline
spark-submit --master spark://master:7077 \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode spark

# Comet JVM shuffle
spark-submit --master spark://master:7077 \
  --jars /path/to/comet.jar \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.shuffle.mode=jvm \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode jvm

# Comet Native shuffle
spark-submit --master spark://master:7077 \
  --jars /path/to/comet.jar \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.shuffle.mode=native \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode native
```

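`run_benchmark.py` is likewise not reproduced here. As a minimal sketch, assuming the driver only needs to read the generated Parquet and force a full shuffle so that shuffle files get written, it could look roughly like the code below; `--mode` is treated purely as a run label, since Spark vs. Comet behaviour is selected by the `--conf` options on `spark-submit`, and the `key` column is an assumption carried over from the generator sketch above.

```python
# Hypothetical sketch of a shuffle-heavy benchmark driver -- not the repository's run_benchmark.py.
import argparse
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

parser = argparse.ArgumentParser()
parser.add_argument("--data", required=True, help="Path to the generated Parquet data")
parser.add_argument("--mode", default="spark", help="Run label only: spark, jvm, or native")
args = parser.parse_args()

spark = SparkSession.builder.appName(f"shuffle-benchmark-{args.mode}").getOrCreate()

df = spark.read.parquet(args.data)

start = time.time()
# A wide aggregation shuffles the full dataset across the cluster.
num_groups = df.groupBy("key").agg(F.count(F.lit(1)).alias("cnt")).count()
print(f"mode={args.mode} groups={num_groups} elapsed={time.time() - start:.1f}s")

spark.stop()
```
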
## Checking Results

Open the Spark UI (default: http://localhost:4040) during each benchmark run to compare shuffle write sizes in the Stages tab.
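
If you would rather capture these numbers than read them off the UI by hand, the same metrics are exposed through Spark's monitoring REST API on the UI port while the application is running. A small sketch, assuming the driver UI is reachable at `localhost:4040` and the `requests` package is installed:

```python
# Sum shuffle write bytes across completed stages via Spark's monitoring REST API.
# Assumes the driver UI is reachable at localhost:4040 while the benchmark runs.
import requests

base = "http://localhost:4040/api/v1"
app_id = requests.get(f"{base}/applications").json()[0]["id"]

stages = requests.get(
    f"{base}/applications/{app_id}/stages", params={"status": "complete"}
).json()
total_write = sum(stage.get("shuffleWriteBytes", 0) for stage in stages)
print(f"total shuffle write: {total_write / (1024 ** 2):.1f} MiB")
```

Run it while each benchmark application is still alive; after the application exits, the same endpoints are only available through a Spark history server.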
