docs/contributing.md
41 additions & 7 deletions
@@ -21,18 +21,52 @@
 
 ## Building
 
-To build DataFusion Ray, you will need rust installed, as well as [https://github.com/PyO3/maturin](maturin).
+You'll need both Rust and Cargo installed.
 
-Install maturin in your current python environment (a virtual environment is recommended), with
+We will follow the development workflow outlined by [datafusion-python](https://github.com/apache/datafusion-python), [pyo3](https://github.com/PyO3/pyo3) and [maturin](https://github.com/PyO3/maturin).
+
+The maturin tools used in this workflow can be installed via either `uv` or `pip`. Both approaches should offer the same experience. We recommend `uv` since it is significantly faster
+than `pip`.
+
+Bootstrap (`uv`):
+
+By default, `uv` will attempt to build the `datafusion-ray` Python package. For development we prefer to build it manually. This means
+that when creating your virtual environment with `uv sync`, you need to pass the additional `--no-install-package datafusion-ray` option. This tells `uv` to install all of the dependencies found in `pyproject.toml` but skip building `datafusion-ray`, as we'll do that manually.
[new-file lines 35-53 are not shown in this diff view]
+# prepare development environment (used to build wheel / install in development)
+python3 -m venv .venv
+# activate the venv
+source .venv/bin/activate
+# update pip itself if necessary
+python -m pip install -U pip
+# install dependencies
+python -m pip install -r pyproject.toml
 ```
 
-Then build the project with the following command:
+Whenever the Rust code changes (through your own edits or via `git pull`):
 
 ```bash
-maturin develop # --release for a release build
+# make sure you activate the venv using "source .venv/bin/activate" first
+maturin develop --uv
+python -m pytest
 ```
 
 ## Example
@@ -57,14 +91,14 @@ For example, to execute the following query:
 RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpc.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --worker-pool-min=10 --query 'select c.c_name, sum(o.o_totalprice) as total from orders o inner join customer c on o.o_custkey = c.c_custkey group by c_name limit 1'
 ```
 
-To further parallelize execution, you can choose how many partitions will be served by each Stage with `--partitions-per-worker`. If this number is less than `--concurrency` Then multiple Actors will host portions of the stage. For example, if there are 10 stages calculated for a query, `concurrency=16` and `partitions-per-worker=4`, then `40` `RayStage` Actors will be created. If `partitions-per-worker=16` or is absent, then `10` `RayStage` Actors will be created.
+To further parallelize execution, you can choose how many partitions will be served by each Stage with `--partitions-per-processor`. If this number is less than `--concurrency`, then multiple Actors will host portions of the stage. For example, if there are 10 stages calculated for a query, `concurrency=16` and `partitions-per-processor=4`, then `40` `RayStage` Actors will be created. If `partitions-per-processor=16` or is absent, then `10` `RayStage` Actors will be created.
 
 To validate the output against non-Ray, single-node DataFusion, add `--validate`, which ensures that both systems produce the same output.
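
The actor-count arithmetic in the `--partitions-per-processor` paragraph can be sketched in a few lines of Python. This is only an illustration of the worked example in the text, not code from the project: the helper name `estimate_stage_actors` is hypothetical, and it assumes that every stage has exactly `concurrency` partitions and that an absent flag means one actor serves a whole stage.

```python
import math
from typing import Optional


def estimate_stage_actors(
    num_stages: int,
    concurrency: int,
    partitions_per_processor: Optional[int] = None,
) -> int:
    """Hypothetical helper: estimate how many RayStage actors a query needs."""
    if num_stages < 0:
        raise ValueError("num_stages must be non-negative")
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")

    # Assumption: when the flag is absent, one actor hosts all partitions of a stage.
    if partitions_per_processor is None:
        partitions_per_processor = concurrency
    if partitions_per_processor < 1:
        raise ValueError("partitions_per_processor must be at least 1")

    # Assumption: every stage has exactly `concurrency` partitions, so each stage
    # is split across ceil(concurrency / partitions_per_processor) actors.
    actors_per_stage = math.ceil(concurrency / partitions_per_processor)
    return num_stages * actors_per_stage


if __name__ == "__main__":
    # The two cases from the text: 10 stages with concurrency=16.
    print(estimate_stage_actors(10, 16, 4))
    print(estimate_stage_actors(10, 16, 16))
```

Under these assumptions the helper reproduces both figures quoted in the text (40 actors and 10 actors). Real queries may have stages with differing partition counts, so treat the result as a rough estimate.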