@@ -23,16 +23,15 @@ This crate contains benchmarks based on popular public data sets and
2323open source benchmark suites, to help with performance and scalability
2424testing of DataFusion.
2525
26-
2726## Other engines
2827
2928The benchmarks measure changes to DataFusion itself, rather than
3029its performance against other engines. For competitive benchmarking,
3130DataFusion is included in the benchmark setups for several popular
3231benchmarks that compare performance with other engines. For example:
3332
34- * [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
35- * [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)
33+ - [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
34+ - [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)
3635
3736[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
3837[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark
@@ -65,39 +64,50 @@ Create / download a specific dataset (TPCH)
6564``` shell
6665./bench.sh data tpch
6766```
67+
6868Data is placed in the ` data ` subdirectory.
6969
7070## Running benchmarks
7171
7272Run benchmark for TPC-H dataset
73+
7374``` shell
7475./bench.sh run tpch
7576```
77+
7678or for TPC-H dataset scale 10
79+
7780``` shell
7881./bench.sh run tpch10
7982```
8083
8184To run a specific query, for example Q21
85+
8286``` shell
8387./bench.sh run tpch10 21
8488```
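
To run several individual queries in a row, the single-query form above can be wrapped in a small loop. The following is a minimal sketch, not part of `bench.sh`: `tpch_query_commands` is a hypothetical helper that only prints the commands (a dry run), and the query numbers 1, 5 and 21 are arbitrary examples.

```shell
# Hypothetical helper (not part of bench.sh): print one bench.sh
# invocation per requested query number for the given benchmark.
tpch_query_commands() {
  benchmark="$1"
  shift
  for query in "$@"; do
    echo "./bench.sh run ${benchmark} ${query}"
  done
}

# Dry run: list the commands for Q1, Q5 and Q21 at scale factor 10.
tpch_query_commands tpch10 1 5 21
```

Printing the commands first makes it easy to review them; piping the output to `sh` from the benchmarks directory would then execute them one after another.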
8589
8690## Benchmark with modified configurations
91+
8792### Select join algorithm
93+
8894The benchmark runs with `prefer_hash_join == true` by default, which enforces the hash join algorithm.
8995To run the TPCH benchmarks with a join algorithm other than hash join:
96+
9097``` shell
9198PREFER_HASH_JOIN=false ./bench.sh run tpch
9299```
93100
94101### Configure with environment variables
95- Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided as environment variables are
102+
103+ Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided as environment variables are
96104also considered by the benchmarks.
97- The following configuration runs the TPCH benchmark with datafusion configured to *not* repartition join keys.
105+ The following configuration runs the TPCH benchmark with datafusion configured to _not_ repartition join keys.
106+
98107``` shell
99108DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false ./bench.sh run tpch
100109```
110+
101111You might want to adjust the results location to avoid overwriting previous results.
102112Environment configuration that was picked up by datafusion is logged at ` info ` level.
103113To verify that datafusion picked up your configuration, run the benchmarks with ` RUST_LOG=info ` or higher.
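
As a sketch of how the pieces above fit together, the snippet below derives the environment variable name from a configuration key and prints a command that also enables `info` logging. It assumes the naming convention implied by the example above (the key is upper-cased and dots become underscores); `config_env_name` is a hypothetical helper, not something `bench.sh` provides.

```shell
# Hypothetical helper: derive the environment variable name for a
# configuration key by upper-casing it and replacing dots with underscores.
config_env_name() {
  printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr '.' '_'
}

name="$(config_env_name datafusion.optimizer.repartition_joins)"

# Print the full command, with info-level logging enabled so that the
# configuration picked up from the environment appears in the log output.
echo "RUST_LOG=info ${name}=false ./bench.sh run tpch"
```

The printed command combines the `DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false` override with `RUST_LOG=info`, so the log output can be used to confirm that the override took effect.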
@@ -419,7 +429,7 @@ logs.
419429
420430Example
421431
422- dfbench parquet-filter --path ./data --scale-factor 1.0
432+ dfbench parquet-filter --path ./data --scale-factor 1.0
423433
424434generates the synthetic dataset at ` ./data/logs.parquet ` . The size
425435of the dataset can be controlled through the ` size_factor `
@@ -451,6 +461,7 @@ Iteration 2 returned 1781686 rows in 1947 ms
451461```
452462
453463## Sort
464+
454465Test performance of sorting large datasets
455466
456467This test sorts a synthetic dataset generated during the
@@ -474,22 +485,27 @@ Additionally, an optional `--limit` flag is available for the sort benchmark. Wh
474485See [`sort_tpch.rs`](src/sort_tpch.rs) for more details.
475486
476487### Sort TPCH Benchmark Example Runs
488+
4774891 . Run all queries with default setting:
490+
478491``` bash
479492 cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
480493```
481494
4824952 . Run a specific query:
496+
483497``` bash
484498 cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json' --query 2
485499```
486500
4875013 . Run all queries as TopK queries on presorted data:
502+
488503``` bash
489504 cargo run --release --bin dfbench -- sort-tpch --sorted --limit 10 -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
490505```
491506
4925074 . Run all queries with ` bench.sh ` script:
508+
493509``` bash
494510./bench.sh run sort_tpch
495511```
@@ -527,73 +543,86 @@ External aggregation benchmarks run several aggregation queries with different m
528544This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.
528544
529545### External Aggregation Example Runs
546+
5305471 . Run all queries with predefined memory limits:
548+
531549``` bash
532550# Under 'benchmarks/' directory
533551cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json'
534552```
535553
5365542 . Run a query with specific memory limit:
555+
537556``` bash
538557cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json' --query 1 --memory-limit 30M
539558```
540559
5415603 . Run all queries with ` bench.sh ` script:
561+
542562``` bash
543563./bench.sh data external_aggr
544564./bench.sh run external_aggr
545565```
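
To compare one query under several memory limits, the single-limit command above can be generated in a loop. This is a rough sketch under stated assumptions: `external_aggr_command` is a hypothetical helper that only prints commands, `TPCH_DATA` is a placeholder for the TPC-H data directory (the full path is not given above, so it must be set by the reader), `60M` is an illustrative value chosen in the same format as `30M`, and the per-limit output file name is a choice made here to avoid overwriting results.

```shell
# Assumption: TPCH_DATA points at the TPC-H scale factor 1 data directory.
# The placeholder default below is not a real path and must be replaced.
TPCH_DATA="${TPCH_DATA:-/path/to/data/tpch_sf1}"

# Hypothetical helper: print the external_aggr command for query 1 with
# the given memory limit, writing results to a per-limit output file.
external_aggr_command() {
  echo "cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '${TPCH_DATA}' -o '/tmp/aggr_$1.json' --query 1 --memory-limit $1"
}

# Dry run: 30M comes from the example above, 60M is illustrative only.
for limit in 30M 60M; do
  external_aggr_command "$limit"
done
```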
546566
547-
548567## h2o.ai benchmarks
568+
549569The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.
550570
551571Reference:
572+
552573- [H2O AI Benchmark](https://duckdb.org/2023/04/14/h2oai.html)
553574- [Extended window benchmark](https://duckdb.org/2024/06/26/benchmarks-over-time.html#window-functions-benchmark)
554575
555576### h2o benchmarks for groupby
556577
557578#### Generate data for h2o benchmarks
579+
558580There are three options for generating data for h2o benchmarks: ` small ` , ` medium ` , and ` big ` . The data is generated in the ` data ` directory.
559581
5605821 . Generate small data (1e7 rows)
583+
561584``` bash
562585./bench.sh data h2o_small
563586```
564587
565-
5665882 . Generate medium data (1e8 rows)
589+
567590``` bash
568591./bench.sh data h2o_medium
569592```
570593
571-
5725943 . Generate large data (1e9 rows)
595+
573596``` bash
574597./bench.sh data h2o_big
575598```
576599
577600#### Run h2o benchmarks
601+
578602There are three options for running h2o benchmarks: ` small ` , ` medium ` , and ` big ` .
603+
5796041 . Run small data benchmark
605+
580606``` bash
581607./bench.sh run h2o_small
582608```
583609
5846102 . Run medium data benchmark
611+
585612``` bash
586613./bench.sh run h2o_medium
587614```
588615
5896163 . Run large data benchmark
617+
590618``` bash
591619./bench.sh run h2o_big
592620```
593621
5946224 . Run a specific query with a specific data path
595623
596624For example, to run query 1 with the small data generated above:
625+
597626``` bash
598627cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
599628```
@@ -602,7 +631,7 @@ cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7
602631
603632There are three options for generating data for h2o benchmarks: ` small ` , ` medium ` , and ` big ` . The data is generated in the ` data ` directory.
604633
605- Here is an example that generates the `small` dataset and runs the benchmark. To run other
634+ Here is an example that generates the `small` dataset and runs the benchmark. To run other
606635dataset size configurations, adapt the command as in the previous example.
607636
608637``` bash
@@ -616,6 +645,7 @@ dataset size configuration, change the command similar to the previous example.
616645To run a specific query with specific join data paths, the data paths must include the 4 table files.
617646
618647For example, to run query 1 with the small data generated above:
648+
619649``` bash
620650cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
621651```
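
Because `--join-paths` takes the four files as a single comma-separated value, it can be less error-prone to assemble that value from the file names. The sketch below does this for the small dataset files named above; the variable names `data_dir` and `join_paths` are illustrative, and the final command is only printed.

```shell
# Directory containing the generated h2o join tables (as used above).
data_dir=./benchmarks/data/h2o

# Join the four table files into one comma-separated list.
join_paths=""
for f in J1_1e7_NA_0.csv J1_1e7_1e1_0.csv J1_1e7_1e4_0.csv J1_1e7_1e7_NA.csv; do
  # ${join_paths:+...} adds the separating comma only after the first entry.
  join_paths="${join_paths:+${join_paths},}${data_dir}/${f}"
done

# Print the resulting command for query 1 of the join benchmark.
echo "cargo run --release --bin dfbench -- h2o --join-paths ${join_paths} --queries-path ./benchmarks/queries/h2o/join.sql --query 1"
```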
@@ -624,7 +654,7 @@ cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1
624654
625655This benchmark extends the h2o benchmark suite to evaluate window function performance. H2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: ` small ` , ` medium ` , and ` big ` .
626656
627- Here is an example that generates the `small` dataset and runs the benchmark. To run other
657+ Here is an example that generates the `small` dataset and runs the benchmark. To run other
628658dataset size configurations, adapt the command as in the previous example.
629659
630660``` bash
@@ -638,6 +668,7 @@ dataset size configuration, change the command similar to the previous example.
638668To run a specific query with specific window data paths, the data paths must include the 4 table files (the same as the h2o-join dataset).
639669
640670For example, to run query 1 with the small data generated above:
671+
641672``` bash
642673cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/window.sql --query 1
643674```