Commit f9c030a

alamb and martin-g authored
Automatically download tpcds benchmark data to the right place (#19244)
## Which issue does this PR close?

- Closes #19243

## Rationale for this change

I want to be able to run the tpcds benchmarks added by @comphead as part of my benchmark automation scripts. To do so, I need to be able to run `bench.sh data tpcds` and have it automatically generate the data if it is not present.

Right now the data generation step is manual.

```shell
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./benchmarks/bench.sh data tpcds
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpcds
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

For TPC-DS data generation, please clone the datafusion-benchmarks repository:
 git clone https://github.com/apache/datafusion-benchmarks
```

I also think it takes some more post-processing steps (which is what @mbutrovich hit).

## What changes are included in this PR?

1. Update the data setup portion to automatically download the contents from GitHub and extract them to the correct location

## Are these changes tested?

I tested this manually on my Mac laptop by deleting the data directory and running the script again, and by deleting the web_*.parquet files to ensure they are re-downloaded correctly.

```shell
./benchmarks/bench.sh data tpcds
./benchmarks/bench.sh run tpcds
```

I also tested on my benchmark machine (Linux).

## Are there any user-facing changes?

---------

Co-authored-by: Martin Grigorov <[email protected]>
1 parent a3b3eb5 commit f9c030a
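
The behavior described in the commit message amounts to a sentinel-file guard: the data step does work only when `web_site.parquet` is missing, so it is safe to call repeatedly from automation. A minimal sketch of that pattern follows. The names `ensure_data` and `fake_fetch` are hypothetical and are not part of `bench.sh`; `fake_fetch` merely stands in for the real download and extraction so the example runs without network access.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_data and fake_fetch are hypothetical names, not bench.sh functions.

# Stand-in for the real download + unzip step: it just creates the sentinel file.
fake_fetch() {
    local dir="$1"
    touch "${dir}/web_site.parquet"
    echo "fetched"
}

# Populate the directory only when the sentinel file is missing.
ensure_data() {
    local dir="$1"
    if [ ! -f "${dir}/web_site.parquet" ]; then
        mkdir -p "${dir}"
        fake_fetch "${dir}"
    else
        echo "present"
    fi
}
```

Calling `ensure_data` twice on the same directory fetches once and then reports the data as present. Deleting the sentinel file triggers a new fetch, which mirrors the manual test described in the commit message.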

1 file changed: +19 -28 lines changed

benchmarks/bench.sh

Lines changed: 19 additions & 28 deletions
@@ -629,22 +629,24 @@ data_tpch() {
     exit 1
 }
 
-# Points to TPCDS data generation instructions
+# Downloads TPC-DS data
 data_tpcds() {
-    TPCDS_DIR="${DATA_DIR}"
-
-    # Check if TPCDS data directory exists
-    if [ ! -d "${TPCDS_DIR}" ]; then
-        echo ""
-        echo "For TPC-DS data generation, please clone the datafusion-benchmarks repository:"
-        echo " git clone https://github.com/apache/datafusion-benchmarks"
-        echo ""
-        return 1
+    TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
+
+    # Check if `web_site.parquet` exists in the TPCDS data directory to verify data presence
+    echo "Checking TPC-DS data directory: ${TPCDS_DIR}"
+    if [ ! -f "${TPCDS_DIR}/web_site.parquet" ]; then
+        mkdir -p "${TPCDS_DIR}"
+        # Download the DataFusion benchmarks repository zip if it is not already downloaded
+        if [ ! -f "${DATA_DIR}/datafusion-benchmarks.zip" ]; then
+            echo "Downloading DataFusion benchmarks repository zip to: ${DATA_DIR}/datafusion-benchmarks.zip"
+            wget --timeout=30 --tries=3 -O "${DATA_DIR}/datafusion-benchmarks.zip" https://github.com/apache/datafusion-benchmarks/archive/refs/heads/main.zip
+        fi
+        echo "Extracting TPC-DS parquet data to ${TPCDS_DIR}..."
+        unzip -o -j -d "${TPCDS_DIR}" "${DATA_DIR}/datafusion-benchmarks.zip" datafusion-benchmarks-main/tpcds/data/sf1/*
+        echo "TPC-DS data extracted."
     fi
-
-    echo ""
-    echo "TPC-DS data already exists in ${TPCDS_DIR}"
-    echo ""
+    echo "Done."
 }
 
 # Runs the tpch benchmark
@@ -682,21 +684,10 @@ run_tpch_mem() {
 
 # Runs the tpcds benchmark
 run_tpcds() {
-    TPCDS_DIR="${DATA_DIR}"
-
-    # Check if TPCDS data directory exists
-    if [ ! -d "${TPCDS_DIR}" ]; then
-        echo "Error: TPC-DS data directory does not exist: ${TPCDS_DIR}" >&2
-        echo "" >&2
-        echo "Please prepare TPC-DS data first by following instructions:" >&2
-        echo " ./bench.sh data tpcds" >&2
-        echo "" >&2
-        exit 1
-    fi
+    TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
 
-    # Check if directory contains parquet files
-    if ! find "${TPCDS_DIR}" -name "*.parquet" -print -quit | grep -q .; then
-        echo "Error: TPC-DS data directory exists but contains no parquet files: ${TPCDS_DIR}" >&2
+    # Check if TPCDS data directory and representative file exists
+    if [ ! -f "${TPCDS_DIR}/web_site.parquet" ]; then
         echo "" >&2
         echo "Please prepare TPC-DS data first by following instructions:" >&2
         echo " ./bench.sh data tpcds" >&2
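
Both functions now treat `web_site.parquet` as a representative file, so a partially extracted directory could still pass the check. If a stricter check were ever wanted, one possible approach is to test a list of expected files. The sketch below is illustrative only: `missing_parquet_files` is a hypothetical helper that does not exist in `bench.sh`, and the table names passed to it are an assumed subset, not the full contents of the data directory.

```shell
#!/usr/bin/env bash
# Sketch only: missing_parquet_files is a hypothetical helper, not part of bench.sh.

# Prints the name of each expected parquet file that is absent from the directory.
# Usage: missing_parquet_files DIR TABLE [TABLE...]
missing_parquet_files() {
    local dir="$1"
    shift
    local table
    for table in "$@"; do
        if [ ! -f "${dir}/${table}.parquet" ]; then
            echo "${table}.parquet"
        fi
    done
}
```

An empty result means every listed file is present; a caller could then fail only when the output is non-empty. The single-file check in the commit is simpler and avoids maintaining a table list, which may well be the better trade-off here.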
