
Commit 5ec9ae6

andygrove and claude authored
docs: fix outdated content in documentation (#1385)
* docs: fix outdated content in documentation

  - Remove outdated etcd references (etcd backend was removed)
  - Update version numbers from old versions to v51.0.0
  - Fix executor-slots-policy to task-distribution with correct values
  - Remove Sled database references from docker.md
  - Update kubernetes.md docker tags and log output format
  - Fix Python API: Ballista() -> BallistaBuilder()
  - Fix scheduler-policy parameter name and default value

  Co-Authored-By: Claude Opus 4.5 <[email protected]>

* chore: add CLAUDE.md to .gitignore

  Co-Authored-By: Claude Opus 4.5 <[email protected]>

* style: format markdown with prettier

  Co-Authored-By: Claude Opus 4.5 <[email protected]>

* docs: add benchmarking section to contributors guide

  Link to benchmarks/README.md for TPC-H and performance testing instructions.

  Co-Authored-By: Claude Opus 4.5 <[email protected]>

---------

Co-authored-by: Claude Opus 4.5 <[email protected]>
1 parent f30b81b commit 5ec9ae6

11 files changed (+53 -53 lines changed)


.gitignore

Lines changed: 3 additions & 0 deletions
@@ -102,3 +102,6 @@ dev/dist

 # logs
 logs/
+
+# Claude Code guidance file (local only)
+CLAUDE.md

docs/developer/architecture.md

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ The scheduler process implements a gRPC interface (defined in
 | GetJobStatus | Get the status of a submitted query |
 | RegisterExecutor | Executors call this method to register themselves with the scheduler |

-The scheduler can run in standalone mode, or can be run in clustered mode using etcd as backing store for state.
+The scheduler currently uses in-memory state storage.

 ## Executor Process

docs/source/contributors-guide/architecture.md

Lines changed: 0 additions & 3 deletions
@@ -80,9 +80,6 @@ plan or a SQL string. The scheduler then creates an execution graph, which conta
 stages (pipelines) that can be scheduled independently. This process is explained in detail in the Distributed
 Query Scheduling section of this guide.

-It is possible to have multiple schedulers running with shared state in etcd, so that jobs can continue to run
-even if a scheduler process fails.
-
 ### Executor

 The executor processes connect to a scheduler and poll for tasks to perform. These tasks are physical plans in

docs/source/contributors-guide/development.md

Lines changed: 11 additions & 0 deletions
@@ -65,3 +65,14 @@ cargo test
 cd examples
 cargo run --example standalone_sql --features=ballista/standalone
 ```
+
+## Benchmarking
+
+For performance testing and benchmarking with TPC-H and other datasets, see the [benchmarks README](../../../benchmarks/README.md).
+
+This includes instructions for:
+
+- Generating TPC-H test data
+- Running benchmarks against DataFusion and Ballista
+- Comparing performance with Apache Spark
+- Running load tests

docs/source/user-guide/cli.md

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ It is also possible to run the CLI in standalone mode, where it will create a sc
 ```bash
 $ ballista-cli

-Ballista CLI v8.0.0
+Ballista CLI v51.0.0

 > CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
 0 rows in set. Query took 0.001 seconds.

docs/source/user-guide/configs.md

Lines changed: 8 additions & 9 deletions
@@ -96,14 +96,13 @@ manage the whole cluster are also needed to be taken care of.
 _Example: Specifying configuration options when starting the scheduler_

 ```shell
-./ballista-scheduler --scheduler-policy push-staged --event-loop-buffer-size 1000000 --executor-slots-policy
-round-robin-local
+./ballista-scheduler --scheduler-policy push-staged --event-loop-buffer-size 1000000 --task-distribution round-robin
 ```

-| key | type | default | description |
-| -------------------------------------------- | ------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| scheduler-policy | Utf8 | pull-staged | Sets the task scheduling policy for the scheduler, possible values: pull-staged, push-staged. |
-| event-loop-buffer-size | UInt32 | 10000 | Sets the event loop buffer size. for a system of high throughput, a larger value like 1000000 is recommended. |
-| executor-slots-policy | Utf8 | bias | Sets the executor slots policy for the scheduler, possible values: bias, round-robin, round-robin-local. For a cluster with single scheduler, round-robin-local is recommended. |
-| finished-job-data-clean-up-interval-seconds | UInt64 | 300 | Sets the delayed interval for cleaning up finished job data, mainly the shuffle data, 0 means the cleaning up is disabled. |
-| finished-job-state-clean-up-interval-seconds | UInt64 | 3600 | Sets the delayed interval for cleaning up finished job state stored in the backend, 0 means the cleaning up is disabled. |
+| key | type | default | description |
+| -------------------------------------------- | ------ | ----------- | -------------------------------------------------------------------------------------------------------------------------- |
+| scheduler-policy | Utf8 | pull-staged | Sets the task scheduling policy for the scheduler, possible values: pull-staged, push-staged. |
+| event-loop-buffer-size | UInt32 | 10000 | Sets the event loop buffer size. for a system of high throughput, a larger value like 1000000 is recommended. |
+| task-distribution | Utf8 | bias | Sets the task distribution policy for the scheduler, possible values: bias, round-robin, consistent-hash. |
+| finished-job-data-clean-up-interval-seconds | UInt64 | 300 | Sets the delayed interval for cleaning up finished job data, mainly the shuffle data, 0 means the cleaning up is disabled. |
+| finished-job-state-clean-up-interval-seconds | UInt64 | 3600 | Sets the delayed interval for cleaning up finished job state stored in the backend, 0 means the cleaning up is disabled. |
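
Reviewer note: the rename from executor-slots-policy to task-distribution also changes the accepted values (round-robin-local is gone, consistent-hash is new). A minimal Python sketch of how a launch script might guard against stale values is shown here. The helper name `scheduler_args` and the `ALLOWED` table are hypothetical and not part of Ballista; only the flag names and value lists are taken from the updated configs.md table in this commit.

```python
# Allowed values copied from the updated configs.md table in this commit.
# Options not listed here (e.g. event-loop-buffer-size) are passed through unchecked.
ALLOWED = {
    "scheduler-policy": {"pull-staged", "push-staged"},
    "task-distribution": {"bias", "round-robin", "consistent-hash"},
}


def scheduler_args(options):
    """Build an argument list for ./ballista-scheduler (hypothetical helper)."""
    args = ["./ballista-scheduler"]
    for key, value in options.items():
        allowed = ALLOWED.get(key)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key}: {value!r} is not one of {sorted(allowed)}")
        args += [f"--{key}", str(value)]
    return args


example = scheduler_args({
    "scheduler-policy": "push-staged",
    "event-loop-buffer-size": 1000000,
    "task-distribution": "round-robin",
})
print(" ".join(example))
```

With these inputs the sketch reproduces the command line from the updated shell example, while an old value such as round-robin-local raises a `ValueError` instead of reaching the scheduler.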

docs/source/user-guide/deployment/docker-compose.md

Lines changed: 3 additions & 7 deletions
@@ -39,15 +39,11 @@ This should show output similar to the following:
 ```bash
 $ docker-compose up
 Creating network "ballista-benchmarks_default" with the default driver
-Creating ballista-benchmarks_etcd_1 ... done
 Creating ballista-benchmarks_ballista-scheduler_1 ... done
 Creating ballista-benchmarks_ballista-executor_1 ... done
-Attaching to ballista-benchmarks_etcd_1, ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Running with config:
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] work_dir: /tmp/.tmpLVx39c
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] concurrent_tasks: 4
-ballista-scheduler_1 | [2021-08-28T15:55:22Z INFO ballista_scheduler] Ballista v0.12.0 Scheduler listening on 0.0.0.0:50050
-ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Ballista v0.12.0 Rust Executor listening on 0.0.0.0:50051
+Attaching to ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
+ballista-scheduler_1 | INFO ballista_scheduler: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+ballista-executor_1 | INFO ballista_executor: Ballista v51.0.0 Rust Executor listening on 0.0.0.0:50051
 ```

 The scheduler listens on port 50050 and this is the port that clients will need to connect to.

docs/source/user-guide/deployment/docker.md

Lines changed: 9 additions & 12 deletions
@@ -67,13 +67,10 @@ Run `docker logs CONTAINER_ID` to check the output from the process:

 ```
 $ docker logs a756055576f3
-2024-02-03T14:49:47.904571Z INFO main ThreadId(01) ballista_scheduler::cluster: Initializing Sled database in temp directory
-
-2024-02-03T14:49:47.924679Z INFO main ThreadId(01) ballista_scheduler::scheduler_process: Ballista v0.12.0 Scheduler listening on 0.0.0.0:50050
-2024-02-03T14:49:47.924709Z INFO main ThreadId(01) ballista_scheduler::scheduler_process: Starting Scheduler grpc server with task scheduling policy of PullStaged
-2024-02-03T14:49:47.925261Z INFO main ThreadId(01) ballista_scheduler::cluster::kv: Initializing heartbeat listener
-2024-02-03T14:49:47.925476Z INFO main ThreadId(01) ballista_scheduler::scheduler_server::query_stage_scheduler: Starting QueryStageScheduler
-2024-02-03T14:49:47.925587Z INFO tokio-runtime-worker ThreadId(47) ballista_core::event_loop: Starting the event loop query_stage
+INFO ballista_scheduler::scheduler_process: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+INFO ballista_scheduler::scheduler_process: Starting Scheduler grpc server with task scheduling policy of PullStaged
+INFO ballista_scheduler::scheduler_server::query_stage_scheduler: Starting QueryStageScheduler
+INFO ballista_core::event_loop: Starting the event loop query_stage
 ```

 ### Start Executors
@@ -99,11 +96,11 @@ Use `docker logs CONTAINER_ID` to check the output from the executor(s):

 ```
 $ docker logs fb8b530cee6d
-2024-02-03T14:50:24.061607Z INFO main ThreadId(01) ballista_executor::executor_process: Running with config:
-2024-02-03T14:50:24.061649Z INFO main ThreadId(01) ballista_executor::executor_process: work_dir: /tmp/.tmpAkP3pZ
-2024-02-03T14:50:24.061655Z INFO main ThreadId(01) ballista_executor::executor_process: concurrent_tasks: 48
-2024-02-03T14:50:24.063256Z INFO tokio-runtime-worker ThreadId(44) ballista_executor::executor_process: Ballista v0.12.0 Rust Executor Flight Server listening on 0.0.0.0:50051
-2024-02-03T14:50:24.063281Z INFO tokio-runtime-worker ThreadId(47) ballista_executor::execution_loop: Starting poll work loop with scheduler
+INFO ballista_executor::executor_process: Running with config:
+INFO ballista_executor::executor_process: work_dir: /tmp/.tmpAkP3pZ
+INFO ballista_executor::executor_process: concurrent_tasks: 48
+INFO ballista_executor::executor_process: Ballista v51.0.0 Rust Executor Flight Server listening on 0.0.0.0:50051
+INFO ballista_executor::execution_loop: Starting poll work loop with scheduler
 ```

 ## Connect from the CLI
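
Reviewer note: the new sample output drops the timestamp and thread fields, so anything that scrapes these logs would need a different pattern. The Python sketch below pulls the version and listen address from a line in the shortened format. It assumes the `LEVEL target: message` layout shown in this diff; the regex and the function name `parse_listening` are illustrative only, and real log output may differ depending on how the logger is configured.

```python
import re

# Pattern modelled on the sample lines in this diff; actual logs may differ.
LISTENING = re.compile(
    r"^(?P<level>[A-Z]+)\s+(?P<target>[\w:]+):\s+"
    r"Ballista v(?P<version>\d+\.\d+\.\d+)\s+(?P<role>.+?) listening on "
    r"(?P<host>[\d.]+):(?P<port>\d+)$"
)


def parse_listening(line):
    """Return a dict of fields, or None if the line is not a 'listening' line."""
    match = LISTENING.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])
    return fields


sample = (
    "INFO ballista_scheduler::scheduler_process: "
    "Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050"
)
info = parse_listening(sample)
print(info["version"], info["role"], info["port"])
```

Lines that carry the old timestamp prefix, or the `container_name |` prefix that docker-compose adds, do not match this pattern and yield `None`.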

docs/source/user-guide/deployment/kubernetes.md

Lines changed: 11 additions & 11 deletions
@@ -48,10 +48,10 @@ To create the required Docker images please refer to the [docker deployment page
 Once the images have been built, you can retag them and can push them to your favourite Docker registry.

 ```bash
-docker tag apache/datafusion-ballista-scheduler:0.12.0 <your-repo>/datafusion-ballista-scheduler:0.12.0
-docker tag apache/datafusion-ballista-executor:0.12.0 <your-repo>/datafusion-ballista-executor:0.12.0
-docker push <your-repo>/datafusion-ballista-scheduler:0.12.0
-docker push <your-repo>/datafusion-ballista-executor:0.12.0
+docker tag apache/datafusion-ballista-scheduler:latest <your-repo>/datafusion-ballista-scheduler:latest
+docker tag apache/datafusion-ballista-executor:latest <your-repo>/datafusion-ballista-executor:latest
+docker push <your-repo>/datafusion-ballista-scheduler:latest
+docker push <your-repo>/datafusion-ballista-executor:latest
 ```

 ## Create Persistent Volume and Persistent Volume Claim
@@ -139,7 +139,7 @@ spec:
 spec:
 containers:
 - name: ballista-scheduler
-image: <your-repo>/datafusion-ballista-scheduler:0.12.0
+image: <your-repo>/datafusion-ballista-scheduler:latest
 args: ["--bind-port=50050"]
 ports:
 - containerPort: 50050
@@ -169,7 +169,7 @@ spec:
 spec:
 containers:
 - name: ballista-executor
-image: <your-repo>/datafusion-ballista-executor:0.12.0
+image: <your-repo>/datafusion-ballista-executor:latest
 args:
 - "--bind-port=50051"
 - "--scheduler-host=ballista-scheduler"
@@ -208,13 +208,13 @@ ballista-executor-78cc5b6486-7crdm 0/1 Pending 0 42s
 ballista-scheduler-879f874c5-rnbd6 0/1 Pending 0 42s
 ```

-You can view the scheduler logs with `kubectl logs ballista-scheduler-0`:
+You can view the scheduler logs with `kubectl logs ballista-scheduler-<pod-id>`:

 ```
-$ kubectl logs ballista-scheduler-0
-[2021-02-19T00:24:01Z INFO scheduler] Ballista v0.7.0 Scheduler listening on 0.0.0.0:50050
-[2021-02-19T00:24:16Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
-[2021-02-19T00:24:17Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
+$ kubectl logs ballista-scheduler-<pod-id>
+INFO ballista_scheduler::scheduler_process: Ballista v51.0.0 Scheduler listening on 0.0.0.0:50050
+INFO ballista_scheduler::scheduler_server::grpc: Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
+INFO ballista_scheduler::scheduler_server::grpc: Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
 ```

 ## Port Forwarding

docs/source/user-guide/python.md

Lines changed: 4 additions & 7 deletions
@@ -117,12 +117,8 @@ The following example demonstrates creating arrays with PyArrow and then creatin
 from ballista import BallistaBuilder
 import pyarrow

-# an alias
-# TODO implement Functions
-f = ballista.functions
-
 # create a context
-ctx = Ballista().standalone()
+ctx = BallistaBuilder().standalone()

 # create a RecordBatch and a new DataFrame from it
 batch = pyarrow.RecordBatch.from_arrays(
@@ -132,9 +128,10 @@ batch = pyarrow.RecordBatch.from_arrays(
 df = ctx.create_dataframe([[batch]])

 # create a new statement
+from datafusion import col
 df = df.select(
-f.col("a") + f.col("b"),
-f.col("a") - f.col("b"),
+col("a") + col("b"),
+col("a") - col("b"),
 )

 # execute and collect the first (and only) batch
