Commit f271585

committed
draft
1 parent 2fc0694 commit f271585

1 file changed

+111
-0
lines changed

tpch/README.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Benchmarking DataFusion Ray on Kubernetes

This is a rough guide to deploying and benchmarking DataFusion Ray on Kubernetes.

Set up a new Python virtual environment and install maturin and Ray.

```shell
python3 -m venv venv
source venv/bin/activate
pip3 install maturin
# "ray[default]" includes the core ray package; the quotes keep shells such as zsh from expanding the brackets
pip3 install "ray[default]"
```

Build the project. By default, maturin writes the wheel to `target/wheels/`.

```shell
maturin build --strip
```

Save the following RayCluster definition as `datafusion-ray.yaml`. It assumes that a PersistentVolumeClaim named `ray-pvc` already exists and holds the benchmark data.

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: datafusion-ray-cluster
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.42.1-py310-cpu
            imagePullPolicy: Always
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi
            volumeMounts:
              - mountPath: /mnt/bigdata # Mount path inside the container
                name: ray-storage
        volumes:
          - name: ray-storage
            persistentVolumeClaim:
              claimName: ray-pvc
  workerGroupSpecs:
    - replicas: 2
      groupName: "datafusion-ray"
      rayStartParams:
        num-cpus: "4"
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.42.1-py310-cpu
              imagePullPolicy: Always
              resources:
                limits:
                  cpu: 5
                  memory: 64Gi
                requests:
                  cpu: 5
                  memory: 64Gi
              volumeMounts:
                - mountPath: /mnt/bigdata
                  name: ray-storage
          volumes:
            - name: ray-storage
              persistentVolumeClaim:
                claimName: ray-pvc
```
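
As a quick sanity check on the manifest, the CPU capacity that Ray schedules against can be worked out from `rayStartParams`. When `num-cpus` is set explicitly, Ray uses it as the node's logical CPU count rather than the container's `cpu` limit, so the head node (`num-cpus: "0"`) accepts no tasks and each worker advertises 4 CPUs. The sketch below only repeats that arithmetic; the dictionary layout and the `schedulable_cpus` name are illustrative assumptions, not part of DataFusion Ray.

```python
# Values copied by hand from the RayCluster manifest; the structure is illustrative.
head = {"replicas": 1, "num_cpus": 0}
worker_groups = [{"group": "datafusion-ray", "replicas": 2, "num_cpus": 4}]


def schedulable_cpus(head, worker_groups):
    """Total logical CPUs that Ray can schedule tasks and actors on."""
    total = head["replicas"] * head["num_cpus"]
    for group in worker_groups:
        total += group["replicas"] * group["num_cpus"]
    return total


total_cpus = schedulable_cpus(head, worker_groups)
print(f"schedulable CPUs: {total_cpus}")
```

With two workers at 4 CPUs each this gives 8, which happens to match the `--concurrency 8` value passed to `tpcbench.py` in this guide; the guide does not say whether that is intentional.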

Apply the manifest to create the cluster.

```shell
kubectl apply -f datafusion-ray.yaml
```
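
Set up port forwarding to port 8265 on the head node, for example with `kubectl port-forward service/datafusion-ray-cluster-head-svc 8265:8265` (this assumes KubeRay's default `<cluster-name>-head-svc` service naming). Then submit the benchmark job, replacing the wheel path in `py_modules` with the location of your own build.
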
```shell
ray job submit --address='http://localhost:8265' \
  --runtime-env-json='{"pip":["datafusion", "tabulate", "boto3", "duckdb"], "py_modules":["/home/andy/git/apache/datafusion-ray/target/wheels/datafusion_ray-0.1.0-cp38-abi3-manylinux_2_35_x86_64.whl"], "working_dir":"./", "env_vars":{"RAY_DEDUP_LOGS":"0", "RAY_COLOR_PREFIX":"1"}}' -- \
  python tpcbench.py \
  --data /mnt/bigdata/tpch/sf100 \
  --concurrency 8 \
  --partitions-per-worker 4 \
  --worker-pool-min 30 \
  --listing-tables
```
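
Hand-writing the `--runtime-env-json` argument is error-prone, so one option is to generate it. The sketch below builds the same runtime environment as a Python dict, serializes it with `json.dumps`, and assembles the `ray job submit` command line with `shlex.join`. The helper names (`build_runtime_env`, `build_submit_command`) and the wheel glob pattern are assumptions made for illustration. The script only prints the command and does not contact the cluster.

```python
import glob
import json
import shlex


def build_runtime_env(wheel_path):
    """Runtime environment matching the hand-written JSON in this guide."""
    return {
        "pip": ["datafusion", "tabulate", "boto3", "duckdb"],
        "py_modules": [wheel_path],
        "working_dir": "./",
        # RAY_DEDUP_LOGS=0 turns off Ray's log deduplication.
        "env_vars": {"RAY_DEDUP_LOGS": "0", "RAY_COLOR_PREFIX": "1"},
    }


def build_submit_command(wheel_path, address="http://localhost:8265"):
    """Return the `ray job submit` invocation as a list of arguments."""
    runtime_env = json.dumps(build_runtime_env(wheel_path))
    return [
        "ray", "job", "submit",
        f"--address={address}",
        f"--runtime-env-json={runtime_env}",
        "--",
        "python", "tpcbench.py",
        "--data", "/mnt/bigdata/tpch/sf100",
        "--concurrency", "8",
        "--partitions-per-worker", "4",
        "--worker-pool-min", "30",
        "--listing-tables",
    ]


# Assumed location: maturin writes wheels to target/wheels by default.
# Takes the last match in sorted order, or a placeholder if none is found.
wheels = sorted(glob.glob("target/wheels/datafusion_ray-*.whl"))
wheel = wheels[-1] if wheels else "target/wheels/<your-wheel>.whl"
command = build_submit_command(wheel)
print(shlex.join(command))
```

Run from the repository root after `maturin build`, this prints a shell-quoted command that can be pasted into a terminal once port forwarding is active.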
