Commit 48b4764

Minor fixes to README (#64)
1. Remove a duplicate sentence.
2. Replace the backquotes around the SQL argument with single quotes (backquotes in bash trigger command substitution).
3. Since #62, `--worker-pool-min` is a new argument without a default value, so it needs to be provided in the TPC example commands for them to run.
1 parent a15fdcc commit 48b4764
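
The quoting fix in item 2 can be illustrated with a minimal sketch, assuming bash or another POSIX-style shell. Here `echo hello` is a harmless stand-in for the SQL text, not the README's query, and the variable names are illustrative only.

```bash
# Single quotes: the text reaches the program unchanged.
single=$(printf '%s' 'echo hello')

# Backquotes: the shell runs the text as a command first and
# passes that command's output to the program instead.
backquoted=$(printf '%s' `echo hello`)

printf 'single quotes -> %s\n' "$single"
printf 'backquotes    -> %s\n' "$backquoted"
```

With the old backquoted form, the shell would try to execute the SQL string itself before `tpc.py` ever saw it, which is why the example commands now use single quotes.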

1 file changed: +3 -5 lines changed

README.md

Lines changed: 3 additions & 5 deletions
@@ -64,8 +64,6 @@ Then build the project with the following command:
 maturin develop # --release for a release build
 ```
 
-- In the `examples` directory, run
-
 ## Example
 
 - In the `examples` directory, run
@@ -77,15 +75,15 @@ RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tips.py --data-dir=$(pwd)/../testdata
 - In the `tpch` directory, use `make_data.py` to create a TPCH dataset at a provided scale factor, then
 
 ```bash
-RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpc.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --qnum 2
+RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpc.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --worker-pool-min=10 --qnum 2
 ```
 
 To execute the TPCH query #2. To execute an arbitrary query against the TPCH dataset, provide it with `--query` instead of `--qnum`. This is useful for validating plans that DataFusion Ray will create.
 
 For example, to execute the following query:
 
 ```bash
-RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpc.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --query `select c.c_name, sum(o.o_totalprice) as total from orders o inner join customer c on o.o_custkey = c.c_custkey group by c_name limit 1`
+RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpc.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --worker-pool-min=10 --query 'select c.c_name, sum(o.o_totalprice) as total from orders o inner join customer c on o.o_custkey = c.c_custkey group by c_name limit 1'
 ```
 
 To further parallelize execution, you can choose how many partitions will be served by each Stage with `--partitions-per-worker`. If this number is less than `--concurrency` Then multiple Actors will host portions of the stage. For example, if there are 10 stages calculated for a query, `concurrency=16` and `partitions-per-worker=4`, then `40` `RayStage` Actors will be created. If `partitions-per-worker=16` or is absent, then `10` `RayStage` Actors will be created.
@@ -95,7 +93,7 @@ To validate the output against non-ray single node datafusion, add `--validate`
 To run the entire TPCH benchmark use
 
 ```bash
-RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpcbench.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 [--partitions-per-worker=] [--validate]
+RAY_COLOR_PREFIX=1 RAY_DEDUP_LOGS=0 python tpcbench.py --data=file:///path/to/your/tpch/directory/ --concurrency=2 --batch-size=8182 --worker-pool-min=10 [--partitions-per-worker=] [--validate]
 ```
 
 This will output a json file in the current directory with query timings.
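
The `RayStage` actor counts quoted in the `--partitions-per-worker` paragraph of the second hunk follow from a ceiling division. The sketch below reproduces that arithmetic only. `actor_count` is a hypothetical helper, not part of DataFusion Ray, and it assumes every stage has exactly `concurrency` partitions, which the README's example implies but does not state.

```bash
# Hypothetical helper (not part of DataFusion Ray): estimates the number of
# RayStage actors, assuming every stage has exactly `concurrency` partitions.
actor_count() {
  # The third argument defaults to `concurrency`, mirroring an absent flag.
  local stages=$1 concurrency=$2 per_worker=${3:-$2}
  # Ceiling division: actors needed to host one stage's partitions.
  local per_stage=$(( (concurrency + per_worker - 1) / per_worker ))
  echo $(( stages * per_stage ))
}

actor_count 10 16 4    # prints 40
actor_count 10 16 16   # prints 10
actor_count 10 16      # prints 10 (flag absent)
```

The three calls match the README's own figures: 10 stages with `concurrency=16` give 40 actors at `partitions-per-worker=4` and 10 actors when the value is 16 or the flag is omitted.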
