specific language governing permissions and limitations
under the License.
-->

# PySpark Benchmarks

A suite of PySpark benchmarks for comparing performance between Spark, Comet JVM, and Comet Native implementations.

## Available Benchmarks

Run `python run_benchmark.py --list-benchmarks` to see all available benchmarks (a sketch of the two partitioning strategies follows the list):

- **shuffle-hash** - Shuffle all columns using hash partitioning on `group_key`
- **shuffle-roundrobin** - Shuffle all columns using round-robin partitioning
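Conceptually, the two strategies map onto the two forms of PySpark's `repartition`. A minimal sketch, assuming the data path and partition count used elsewhere in this guide (the benchmarks' actual code may differ):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()
df = spark.read.parquet("/tmp/shuffle-benchmark-data")

# shuffle-hash: rows with the same group_key hash to the same partition
hashed = df.repartition(200, "group_key")

# shuffle-roundrobin: rows are distributed evenly, regardless of content
round_robin = df.repartition(200)

# Write the results so the shuffle actually executes and shows up in the Spark UI
hashed.write.mode("overwrite").parquet("/tmp/out-hash")
round_robin.write.mode("overwrite").parquet("/tmp/out-roundrobin")
```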
## Prerequisites

## Step 1: Generate Data

| Option | Default | Description |
|--------|---------|-------------|
| `--rows`, `-r` | 10000000 | Number of rows |
| `--partitions`, `-p` | 200 | Number of output partitions |
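The generator is submitted like any other Spark job. A hypothetical invocation, with `generate_data.py` standing in as a placeholder for the actual generator script:

```bash
# Placeholder script name; substitute the repository's actual generator script
spark-submit --master spark://master:7077 \
  generate_data.py /tmp/shuffle-benchmark-data --rows 10000000 --partitions 200
```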
## Step 2: Run Benchmarks

### List Available Benchmarks
```bash
python run_benchmark.py --list-benchmarks
```
### Run Individual Benchmarks

You can run specific benchmarks by name:
```bash
# Hash partitioning shuffle - Spark baseline
spark-submit --master spark://master:7077 \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode spark --benchmark shuffle-hash

# Round-robin shuffle - Spark baseline
spark-submit --master spark://master:7077 \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode spark --benchmark shuffle-roundrobin

# Hash partitioning - Comet JVM shuffle
spark-submit --master spark://master:7077 \
  --jars /path/to/comet.jar \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.exec.shuffle.mode=jvm \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode jvm --benchmark shuffle-hash

# Round-robin - Comet Native shuffle
spark-submit --master spark://master:7077 \
  --jars /path/to/comet.jar \
  --conf spark.comet.enabled=true \
  --conf spark.comet.exec.shuffle.enabled=true \
  --conf spark.comet.exec.shuffle.mode=native \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  run_benchmark.py --data /tmp/shuffle-benchmark-data --mode native --benchmark shuffle-roundrobin
```
### Run All Benchmarks

Use the provided script to run all benchmarks across all modes:

```bash
SPARK_MASTER=spark://master:7077 \
EXECUTOR_MEMORY=16g \
./run_all_benchmarks.sh /tmp/shuffle-benchmark-data
```
## Checking Results

Open the Spark UI (default: http://localhost:4040) during each benchmark run to compare shuffle write sizes in the Stages tab.
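If you prefer to collect these numbers programmatically, Spark's monitoring REST API exposes per-stage shuffle metrics. A minimal sketch using only the standard library (this helper is not part of the benchmark suite; it assumes the default UI port and a single running application):

```python
import json
import urllib.request

def print_stage_shuffle_sizes(ui_url: str = "http://localhost:4040") -> None:
    """Print shuffle write bytes per stage for the first running application."""
    with urllib.request.urlopen(f"{ui_url}/api/v1/applications") as resp:
        app_id = json.load(resp)[0]["id"]
    with urllib.request.urlopen(f"{ui_url}/api/v1/applications/{app_id}/stages") as resp:
        stages = json.load(resp)
    for stage in stages:
        print(stage["stageId"], stage["name"], stage["shuffleWriteBytes"])

print_stage_shuffle_sizes()
```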
## Adding New Benchmarks

The benchmark framework makes it easy to add new benchmarks:

1. **Create a benchmark class** in the `benchmarks/` directory (or add it to an existing file):
```python
from typing import Any, Dict

from benchmarks.base import Benchmark

class MyBenchmark(Benchmark):
    @classmethod
    def name(cls) -> str:
        return "my-benchmark"

    @classmethod
    def description(cls) -> str:
        return "Description of what this benchmark does"

    def run(self) -> Dict[str, Any]:
        # Read data
        df = self.spark.read.parquet(self.data_path)

        # Run your benchmark operation
        def benchmark_operation():
            result = df.filter(...).groupBy(...).agg(...)
            result.write.mode("overwrite").parquet("/tmp/output")

        # Time it
        duration_ms = self._time_operation(benchmark_operation)

        return {
            'duration_ms': duration_ms,
            # Add any other metrics you want to track
        }
```
2. **Register the benchmark** in `benchmarks/__init__.py`:

```python
from .my_module import MyBenchmark

_BENCHMARK_REGISTRY = {
    # ... existing benchmarks
    MyBenchmark.name(): MyBenchmark,
}
```
3. **Run your new benchmark**:

```bash
python run_benchmark.py --data /path/to/data --mode spark --benchmark my-benchmark
```
The base `Benchmark` class provides:

- Automatic timing via `_time_operation()` (see the sketch below)
- Standard output formatting via `execute_timed()`
- Access to the SparkSession, data path, and mode
- Spark configuration printing
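For orientation, a timing helper such as `_time_operation()` can be as small as the sketch below. This is a hypothetical illustration, not the actual `benchmarks/base.py` implementation:

```python
import time
from typing import Callable

def _time_operation(operation: Callable[[], None]) -> float:
    """Run the operation once and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    operation()
    return (time.perf_counter() - start) * 1000.0
```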