Commit 8ee46ab: Update README for running benchmarks in k8s (#39)

1 parent: 74ec45c

File tree: 3 files changed, +120 / -5 lines

tpch/Dockerfile (6 additions, 0 deletions)

```diff
@@ -0,0 +1,6 @@
+FROM apache/datafusion-ray
+
+RUN sudo apt update && \
+    sudo apt install -y git
+
+RUN git clone https://github.com/apache/datafusion-benchmarks.git
```

tpch/README.md (96 additions, 1 deletion)

````diff
@@ -21,8 +21,103 @@
 ## Running Benchmarks
 
+### Standalone Ray Cluster
+
 Data and queries must be available on all nodes of the Ray cluster.
 
 ```shell
-RAY_ADDRESS='http://ray-cluster-ip-address:8265' ray job submit --working-dir `pwd` -- python3 tpcbench.py --benchmark tpch --data /path/to/data --queries /path/to/tpch/queries --concurrency 4
+RAY_ADDRESS='http://ray-cluster-ip-address:8265' ray job submit --working-dir `pwd` -- python3 tpcbench.py --benchmark tpch --data /path/to/data --queries /path/to/tpch/queries
+```
+
+### Kubernetes
+
+Create a Docker image containing the TPC-H queries and push it to a Docker registry that is accessible from the k8s cluster.
+
+```shell
+docker build -t YOURREPO/datafusion-ray-tpch .
+```
+
+If the data files are local to the k8s nodes, create a persistent volume and a persistent volume claim.
+
+Create a `pv.yaml` with the following content and run `kubectl apply -f pv.yaml`.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: ray-pv
+spec:
+  storageClassName: manual
+  capacity:
+    storage: 10Gi
+  accessModes:
+    - ReadWriteOnce
+  hostPath:
+    path: "/mnt/bigdata" # Adjust the path as needed
+```
+
+Create a `pvc.yaml` with the following content and run `kubectl apply -f pvc.yaml`.
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: ray-pvc
+spec:
+  storageClassName: manual # Should match the PV's storageClassName if static
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
+```
+
+Create the Ray cluster using the custom image.
+
+Create a `ray-cluster.yaml` with the following content and run `kubectl apply -f ray-cluster.yaml`.
+
+```yaml
+apiVersion: ray.io/v1alpha1
+kind: RayCluster
+metadata:
+  name: datafusion-ray-cluster
+spec:
+  headGroupSpec:
+    rayStartParams:
+      num-cpus: "1"
+    template:
+      spec:
+        containers:
+          - name: ray-head
+            image: YOURREPO/datafusion-ray-tpch:latest
+            volumeMounts:
+              - mountPath: /mnt/bigdata # Mount path inside the container
+                name: ray-storage
+        volumes:
+          - name: ray-storage
+            persistentVolumeClaim:
+              claimName: ray-pvc # Reference the PVC name here
+  workerGroupSpecs:
+    - replicas: 2
+      groupName: "datafusion-ray"
+      rayStartParams:
+        num-cpus: "4"
+      template:
+        spec:
+          containers:
+            - name: ray-worker
+              image: YOURREPO/datafusion-ray-tpch:latest
+              volumeMounts:
+                - mountPath: /mnt/bigdata
+                  name: ray-storage
+          volumes:
+            - name: ray-storage
+              persistentVolumeClaim:
+                claimName: ray-pvc
+```
+
+Run the benchmarks:
+
+```shell
+ray job submit --working-dir `pwd` -- python3 tpcbench.py --benchmark tpch --queries /home/ray/datafusion-benchmarks/tpch/queries/ --data /mnt/bigdata/tpch/sf100
 ```
````
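The standalone and Kubernetes submit commands above differ only in where the data and queries live. A small sketch of how the shared entrypoint can be assembled; the `submit_command` helper and its defaults are illustrative, not part of this repo:

```python
# Sketch: build the `ray job submit` entrypoint used above, so the
# standalone and Kubernetes variants differ only in their paths.
import shlex

def submit_command(data_path, query_path, benchmark="tpch", working_dir="."):
    # Quote each argument defensively, as a shell would expect.
    args = [
        "ray", "job", "submit", "--working-dir", working_dir, "--",
        "python3", "tpcbench.py",
        "--benchmark", benchmark,
        "--data", data_path,
        "--queries", query_path,
    ]
    return " ".join(shlex.quote(a) for a in args)

if __name__ == "__main__":
    # Standalone-cluster form (set RAY_ADDRESS in the environment):
    print(submit_command("/path/to/data", "/path/to/tpch/queries"))
    # Kubernetes form, using the paths baked into the image and the PVC mount:
    print(submit_command("/mnt/bigdata/tpch/sf100",
                         "/home/ray/datafusion-benchmarks/tpch/queries/"))
```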

tpch/tpcbench.py (18 additions, 4 deletions)

```diff
@@ -17,6 +17,7 @@
 
 import argparse
 import ray
+from datafusion import SessionContext, SessionConfig, RuntimeConfig
 from datafusion_ray import DatafusionRayContext
 from datetime import datetime
 import json
@@ -41,22 +42,32 @@ def main(benchmark: str, data_path: str, query_path: str, concurrency: int):
     # use ray job submit
     ray.init()
 
-    ctx = DatafusionRayContext(concurrency)
+    runtime = (
+        RuntimeConfig()
+    )
+    config = (
+        SessionConfig()
+        .with_target_partitions(concurrency)
+        .set("datafusion.execution.parquet.pushdown_filters", "true")
+    )
+    df_ctx = SessionContext(config, runtime)
+
+    ray_ctx = DatafusionRayContext(df_ctx)
 
     for table in table_names:
         path = f"{data_path}/{table}.parquet"
         print(f"Registering table {table} using path {path}")
-        ctx.register_parquet(table, path)
+        df_ctx.register_parquet(table, path)
 
     results = {
         'engine': 'datafusion-python',
         'benchmark': benchmark,
         'data_path': data_path,
         'query_path': query_path,
-        'concurrency': concurrency,
     }
 
     for query in range(1, num_queries + 1):
+
         # read text file
         path = f"{query_path}/q{query}.sql"
         print(f"Reading query {query} using path {path}")
@@ -70,7 +81,7 @@ def main(benchmark: str, data_path: str, query_path: str, concurrency: int):
             sql = sql.strip()
             if len(sql) > 0:
                 print(f"Executing: {sql}")
-                rows = ctx.sql(sql)
+                rows = ray_ctx.sql(sql)
 
         print(f"Query {query} returned {len(rows)} rows")
         end_time = time.time()
@@ -86,6 +97,9 @@ def main(benchmark: str, data_path: str, query_path: str, concurrency: int):
     with open(results_path, "w") as f:
         f.write(str)
 
+    # write results to stdout
+    print(str)
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="DataFusion benchmark derived from TPC-H / TPC-DS")
     parser.add_argument("--benchmark", required=True, help="Benchmark to run (tpch or tpcds)")
```
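The benchmark loop in tpcbench.py boils down to timing each query and serializing the results as JSON. A minimal self-contained sketch of that pattern, with a stub callable standing in for the Ray-backed `ray_ctx.sql`; the `run_benchmark` helper is illustrative, not part of the repo:

```python
# Sketch of the timing/results pattern tpcbench.py follows: run each
# query, record wall-clock time, and serialize everything to JSON.
import json
import time

def run_benchmark(queries, execute):
    # `execute` stands in for ray_ctx.sql(...); any callable that
    # takes a SQL string and returns a list of rows works here.
    results = {
        "engine": "datafusion-python",
        "benchmark": "tpch",
        "queries": {},
    }
    for name, sql in queries.items():
        start = time.time()
        rows = execute(sql)
        end = time.time()
        results["queries"][name] = {"rows": len(rows), "seconds": end - start}
    return json.dumps(results, indent=2)

if __name__ == "__main__":
    fake_engine = lambda sql: [("row",)] * 3  # placeholder result set
    print(run_benchmark({"q1": "SELECT 1"}, fake_engine))
```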
