This repository contains the code and experiments for the paper "Understanding the Performance of Spark Native Execution: The Good, the Bad, and How to Fix It".
The experiments include microbenchmarks of performance-critical operators and TPC-DS evaluations of vanilla Spark, Spark with the Velox backend, Spark with the ClickHouse backend, and Spark with the DataFusion backend.
Go to SparkDownload and download Spark 3.3.1:
wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
Go to Gluten and download version v1.1.1:
wget https://github.com/apache/incubator-gluten/releases/download/v1.1.1/gluten-velox-bundle-spark3.3_2.12-1.1.1.jar
cp gluten-velox-bundle-spark3.3_2.12-1.1.1.jar /spark-3.3.1/jars/
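The bundle jar name encodes the Spark minor version, Scala version, and Gluten release. If you need a different version combination, a small helper can assemble the expected file name; the snippet below is an illustrative sketch (the variables are examples, not part of the repo), and the naming scheme matches the v1.1.1 release asset above.

```shell
# Assemble the expected Gluten-Velox bundle jar name for a given
# Spark/Scala/Gluten version combination (illustrative helper).
spark_ver=3.3
scala_ver=2.12
gluten_ver=1.1.1
jar="gluten-velox-bundle-spark${spark_ver}_${scala_ver}-${gluten_ver}.jar"
echo "$jar"
```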
Alternatively, build Gluten on your own server. For the Velox build, refer to VeloxBuild; for the ClickHouse build, refer to ClickHouseBuild.
# Velox Build
./dev/builddeps-veloxbe.sh build_arrow
./dev/builddeps-veloxbe.sh build_velox
./dev/builddeps-veloxbe.sh build_gluten_cpp
mvn clean package -Pbackends-velox -Pceleborn -Puniffle -Pspark-3.3 -DskipTests
# ClickHouse Build
bash ./ep/build-clickhouse/src/build_clickhouse.sh
export MAVEN_OPTS="-Xmx8g -XX:ReservedCodeCacheSize=2g"
mvn clean install -Pbackends-clickhouse -Pspark-3.3 -DskipTests -Dcheckstyle.skip
ls -al backends-clickhouse/target/gluten-XXXXX-spark-3.3-jar-with-dependencies.jar
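The built ClickHouse bundle then needs to be copied into Spark's jars directory, as with the Velox bundle above. Below is a hedged helper sketch (the function name and demo paths are hypothetical, not part of the repo) that refuses to install unless the glob matches exactly one existing jar, so two engine bundles never end up on the classpath by accident:

```shell
# Install exactly one built backend jar into Spark's jars/ directory.
# Usage: install_backend_jar <spark_home> <glob expanding to one jar>
install_backend_jar() {
  spark_home=$1; shift
  # The caller's unquoted glob expands here; insist on one existing match.
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one existing jar, got: $*" >&2
    return 1
  fi
  cp "$1" "$spark_home/jars/"
}

# Demo against a throwaway layout; the real source path is
# backends-clickhouse/target/gluten-*-spark-3.3-jar-with-dependencies.jar.
tmp=$(mktemp -d)
mkdir -p "$tmp/target" "$tmp/spark/jars"
touch "$tmp/target/gluten-demo-spark-3.3-jar-with-dependencies.jar"
install_backend_jar "$tmp/spark" "$tmp"/target/gluten-*-jar-with-dependencies.jar
ls "$tmp/spark/jars"
```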
Refer to Blaze and download version v4.0.0:
git clone [email protected]:kwai/blaze.git
cd blaze
SHIM=spark-3.3 # or spark-3.0/spark-3.1/spark-3.2/spark-3.4/spark-3.5
MODE=release # or pre
mvn clean package -P"${SHIM}" -P"${MODE}"
Copy the resulting jar into Spark's jars directory:
cp gluten-package-1.3.0-SNAPSHOT.jar /spark-3.3.1/jars/
or
cp blaze-engine-spark-3.3-release-4.0.0-SNAPSHOT.jar /spark-3.3.1/jars/
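Whichever engine you install, you generally want only one native-engine bundle (the Gluten bundle or the Blaze jar) in `jars/` for a given run. A hypothetical sanity check (the function and demo directory are illustrative, not part of the repo):

```shell
# Count native-engine bundles in a Spark jars/ directory; the expected
# answer before launching a benchmark run is exactly 1.
count_engine_jars() {
  ls "$1" | grep -cE '^(gluten|blaze)'
}

# Demo against a throwaway jars/ directory (real one: /spark-3.3.1/jars).
tmp=$(mktemp -d)
touch "$tmp/blaze-engine-spark-3.3-release-4.0.0-SNAPSHOT.jar"
count_engine_jars "$tmp"
```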
Please refer to: https://github.com/apache/incubator-gluten/blob/main/tools/workload/tpcds/README.md
or
git clone https://github.com/satyakommula96/spark_benchmark.git
git submodule init
git submodule update
cd spark-sql-perf
cp /tmp/sbt/bin/sbt-launch.jar build/sbt-launch-0.13.18.jar
bin/run
sbt +package
cd ../tpcds-kit/tools
make CC=gcc-9 OS=LINUX
cd ../tpcds
# For generating ~100 GB of Parquet data
./gendata_parquet.sh
# For running all 99 TPC-DS queries
./runtpch_parquet.sh
cd ../tpcds
# For generating ~100 GB of ORC data
./gendata_orc.sh
# For running all 99 TPC-DS queries
./runtpch_orc.sh
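The gendata/run scripts above are provided by the benchmark repo. As a rough sketch of what such a runner does, assuming one `q*.sql` file per query and timings appended to a txt report (the function, layout, and output format here are illustrative assumptions; the real scripts may differ):

```shell
# Illustrative runner skeleton: time each query file and append the result
# to a txt report. $3 is the command that executes one .sql file (e.g.
# spark-sql -f); the demo below substitutes `true` so it runs anywhere.
run_queries() {
  query_dir=$1; out=$2; runner=$3
  for q in "$query_dir"/q*.sql; do
    start=$(date +%s%N)
    "$runner" "$q"
    end=$(date +%s%N)
    echo "$(basename "$q") $(( (end - start) / 1000000 )) ms" >> "$out"
  done
}

# Demo with a stub runner and a single trivial query.
tmp=$(mktemp -d)
mkdir "$tmp/queries"
printf 'select 1;\n' > "$tmp/queries/q1.sql"
run_queries "$tmp/queries" "$tmp/times.txt" true
cat "$tmp/times.txt"
```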
For the queries used in the experiments, refer to https://github.com/apache/incubator-gluten/tree/main/tools/gluten-it/common/src/main/resources/tpcds-queries
Below are the commands used to launch the Spark clusters.
# Vanilla Spark
/spark-3.3.1-bin-hadoop2-ck/bin/spark-shell \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.codegen.wholeStage=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=20g \
--executor-cores 4 \
--conf spark.local.dir=/localssd/hza214 \
--conf spark.driver.memoryOverhead=4g \
--conf spark.executor.memory=16g \
--conf spark.executor.memoryOverhead=4g \
--driver-memory 40g
# Spark with the Velox backend
/spark-3.3.1-bin-hadoop3-velox/bin/spark-shell \
--conf spark.gluten.enabled=true \
--conf spark.local.dir=/localssd/hza214 \
--conf spark.plugins=org.apache.gluten.GlutenPlugin \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.codegen.wholeStage=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=20g \
--executor-cores 4 \
--conf spark.driver.memoryOverhead=4g \
--conf spark.executor.memory=16g \
--conf spark.executor.memoryOverhead=4g \
--driver-memory 40g
# Spark with the ClickHouse backend
/spark-3.3.1-bin-hadoop2-ck/bin/spark-shell \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.codegen.wholeStage=true \
--conf spark.plugins=org.apache.gluten.GlutenPlugin \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=20g \
--conf spark.executorEnv.LD_PRELOAD=/localhdd/hza214/gluten/cpp-ch/build/utils/extern-local-engine/libch.so \
--conf spark.gluten.sql.columnar.libpath=/localhdd/hza214/gluten/cpp-ch/build/utils/extern-local-engine/libch.so \
--conf spark.gluten.sql.columnar.iterator=true \
--conf spark.gluten.sql.columnar.loadarrow=false \
--conf spark.gluten.sql.columnar.hashagg.enablefinal=true \
--conf spark.gluten.sql.enable.native.validation=false \
--conf spark.gluten.sql.columnar.forceShuffledHashJoin=true \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.execution.datasources.v2.clickhouse.ClickHouseSparkCatalog \
--conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
--executor-cores 4 \
--conf spark.local.dir=/localssd/hza214 \
--conf spark.driver.memoryOverhead=4g \
--conf spark.executor.memory=16g \
--conf spark.executor.memoryOverhead=4g \
--driver-memory 40g
# Spark with the DataFusion (Blaze/Auron) backend
./spark-shell \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.auron.smjfallback.enable=false \
--conf spark.driver.memory=20g \
--conf spark.driver.memoryOverhead=4096 \
--conf spark.executor.instances=10000 \
--conf spark.dynamicAllocation.maxExecutors=10000 \
--conf spark.executor.cores=8 \
--conf spark.io.compression.codec=lz4 \
--conf spark.sql.parquet.compression.codec=zstd \
--conf spark.executor.memory=8g \
--conf spark.executor.memoryOverhead=16384 \
--conf spark.shuffle.manager=org.apache.spark.sql.execution.auron.shuffle.AuronShuffleManager \
--conf spark.memory.offHeap.enabled=false \
--conf spark.auron.memoryFraction=0.8 \
--conf spark.auron.process.vmrss.memoryFraction=0.8 \
--conf spark.auron.tokio.worker.threads.per.cpu=1 \
--conf spark.auron.forceShuffledHashJoin=true \
--conf spark.auron.smjfallback.mem.threshold=512000000 \
--conf spark.auron.udafFallback.enable=true \
--conf spark.auron.partialAggSkipping.skipSpill=true
Execute run_tpcds_orc.scala or run_tpcds_parquet.scala in the spark-shell (copy and paste the script); it runs all 99 TPC-DS queries and saves the per-query times to a txt file.
Please go to TPC-DS E2E Profiling.
Launch the spark-shell with each engine's configuration, then execute run_tpcds_orc.scala or run_tpcds_parquet.scala as above.
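The per-query times land in a plain txt file. Assuming one `<query> <millis> ms` line per query (this format is an assumption; adjust the field index to the actual file), a one-liner can total the runtimes across engines:

```shell
# Demo input mimicking an assumed "<query> <millis> ms" report format.
printf 'q1.sql 1200 ms\nq2.sql 800 ms\n' > /tmp/times_demo.txt

# Sum the second column to get total runtime over all queries.
awk '{ total += $2 } END { printf "total %d ms over %d queries\n", total, NR }' /tmp/times_demo.txt
```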
Please go to Microbenchmark of HashJoin
Please go to Microbenchmark of HashAggregation
Please go to Microbenchmark of TableScan
Please go to Cost Model Evaluation
Please go to FPGA Accelerated Evaluation