Commit 68b2277

Update root README.md and other documentation with latest changes (#1113)
- Update root `README.md` adding example how to use ... and cleanup some of the context. Relates to #1105
- add new architectural diagram
- address code review comments
- address review comments
1 parent 8928a70 commit 68b2277

File tree: 11 files changed (+316 −164 lines)

README.md

Lines changed: 50 additions & 35 deletions
@@ -17,29 +17,58 @@
 under the License.
 -->

-# Ballista: Distributed SQL Query Engine, built on Apache Arrow
+# Ballista: Making DataFusion Applications Distributed

-Ballista is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow][arrow] and
-[Apache Arrow DataFusion][datafusion].
+Ballista is a distributed execution engine which makes [Apache DataFusion](https://github.com/apache/datafusion) applications distributed.

-If you are looking for documentation for a released version of Ballista, please refer to the
-[Ballista User Guide][user-guide].
+Existing DataFusion application:

-## Overview
+```rust
+use datafusion::prelude::*;

-Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    let ctx = SessionContext::new();

-- The choice of Rust as the main execution language avoids the overhead of GC pauses and results in deterministic
-  processing times.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
-  processing (SIMD) and efficient compression. Although Spark does have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged efficiently between
-  executors using the [Flight Protocol][flight], and between clients and schedulers/executors using the
-  [Flight SQL Protocol][flight-sql]
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
+
+can be distributed with a few lines of code changed:
+
+> [!IMPORTANT]
+> There is a gap between DataFusion and Ballista, which may bring incompatibilities. The community is working hard to close this gap.
+
+```rust
+use ballista::prelude::*;
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    // create a DataFusion SessionContext with a Ballista standalone cluster started
+    let ctx = SessionContext::standalone().await?;
+
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
+
+If you are looking for documentation or more examples, please refer to the [Ballista User Guide][user-guide].
 ## Architecture
@@ -51,19 +80,10 @@ can be run as native binaries and are also available as Docker Images, which can
 The following diagram shows the interaction between clients and the scheduler for submitting jobs, and the interaction
 between the executor(s) and the scheduler for fetching tasks and reporting task status.

-![Ballista Cluster Diagram](docs/source/contributors-guide/ballista.drawio.png)
+![Ballista Cluster Diagram](docs/source/contributors-guide/ballista_architecture.excalidraw.svg)

 See the [architecture guide](docs/source/contributors-guide/architecture.md) for more details.

-## Features
-
-- Supports cloud object stores. S3 is supported today and GCS and Azure support is planned.
-- DataFrame and SQL APIs available from Python and Rust.
-- Clients can connect to a Ballista cluster using [Flight SQL][flight-sql].
-- JDBC support via Arrow Flight SQL JDBC Driver
-- Scheduler REST UI for monitoring query progress and viewing query plans and metrics.
-- Support for Docker, Docker Compose, and Kubernetes deployment, as well as manual deployment on bare metal.
-
 ## Performance

 We run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations.
@@ -81,19 +101,14 @@ that, refer to the [Getting Started Guide](ballista/client/README.md).
 ## Project Status

-Ballista supports a wide range of SQL, including CTEs, Joins, and Subqueries and can execute complex queries at scale.
+Ballista supports a wide range of SQL, including CTEs, joins, and subqueries, and can execute complex queries at scale,
+but there is still a gap between DataFusion and Ballista which we want to bridge in the near future.

 Refer to the [DataFusion SQL Reference](https://datafusion.apache.org/user-guide/sql/index.html) for more
 information on supported SQL.

-Ballista is maturing quickly and is now working towards being production ready. See the [roadmap](ROADMAP.md) for more details.
-
 ## Contribution Guide

 Please see the [Contribution Guide](CONTRIBUTING.md) for information about contributing to Ballista.

-[arrow]: https://arrow.apache.org/
-[datafusion]: https://github.com/apache/arrow-datafusion
-[flight]: https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
-[flight-sql]: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/
 [user-guide]: https://datafusion.apache.org/ballista/

docs/source/contributors-guide/architecture.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ can be run as native binaries and are also available as Docker Images, which can
 The following diagram shows the interaction between clients and the scheduler for submitting jobs, and the interaction
 between the executor(s) and the scheduler for fetching tasks and reporting task status.

-![Ballista Cluster Diagram](ballista.drawio.png)
+![Ballista Cluster Diagram](ballista_architecture.excalidraw.svg)

 ### Scheduler

docs/source/contributors-guide/ballista_architecture.excalidraw.svg

Lines changed: 11 additions & 0 deletions

docs/source/user-guide/configs.md

Lines changed: 60 additions & 32 deletions
@@ -19,46 +19,74 @@
 # Configuration

-## BallistaContext Configuration Settings
+## Ballista Configuration Settings

-Ballista has a number of configuration settings that can be specified when creating a BallistaContext.
+Configuring Ballista is quite similar to configuring DataFusion. Most settings are identical, with only a few configurations specific to Ballista.

 _Example: Specifying configuration options when creating a context_

 ```rust
-let config = BallistaConfig::builder()
-    .set("ballista.shuffle.partitions", "200")
-    .set("ballista.batch.size", "16384")
-    .build()?;
+use ballista::extension::{SessionConfigExt, SessionContextExt};

-let ctx = BallistaContext::remote("localhost", 50050, &config).await?;
+let session_config = SessionConfig::new_with_ballista()
+    .with_information_schema(true)
+    .with_ballista_job_name("Super Cool Ballista App");
+
+let state = SessionStateBuilder::new()
+    .with_default_features()
+    .with_config(session_config)
+    .build();
+
+let ctx: SessionContext = SessionContext::remote_with_state(&url, state).await?;
+```
+
+`SessionConfig::new_with_ballista()` will set up a `SessionConfig` for use with Ballista. This is not required (`SessionConfig::new` could be used), but it is advised, as it sets some sensible configuration defaults.
+
+`SessionConfigExt` exposes a set of `SessionConfigExt::with_ballista_*` and `SessionConfigExt::ballista_*` methods which can tune or retrieve Ballista-specific options.
+
+Notable `SessionConfigExt` configuration methods include:
+
+```rust
+/// Overrides ballista's [LogicalExtensionCodec]
+fn with_ballista_logical_extension_codec(
+    self,
+    codec: Arc<dyn LogicalExtensionCodec>,
+) -> SessionConfig;
+
+/// Overrides ballista's [PhysicalExtensionCodec]
+fn with_ballista_physical_extension_codec(
+    self,
+    codec: Arc<dyn PhysicalExtensionCodec>,
+) -> SessionConfig;
+
+/// Overrides ballista's [QueryPlanner]
+fn with_ballista_query_planner(
+    self,
+    planner: Arc<dyn QueryPlanner + Send + Sync + 'static>,
+) -> SessionConfig;
 ```

-### Ballista Configuration Settings
-
-| key | type | default | description |
-| --- | --- | --- | --- |
-| ballista.job.name | Utf8 | N/A | Sets the job name that will appear in the web user interface for any submitted jobs. |
-| ballista.shuffle.partitions | UInt16 | 16 | Sets the default number of partitions to create when repartitioning query stages. |
-| ballista.batch.size | UInt16 | 8192 | Sets the default batch size. |
-| ballista.repartition.joins | Boolean | true | When set to true, Ballista will repartition data using the join keys to execute joins in parallel using the provided `ballista.shuffle.partitions` level. |
-| ballista.repartition.aggregations | Boolean | true | When set to true, Ballista will repartition data using the aggregate keys to execute aggregates in parallel using the provided `ballista.shuffle.partitions` level. |
-| ballista.repartition.windows | Boolean | true | When set to true, Ballista will repartition data using the partition keys to execute window functions in parallel using the provided `ballista.shuffle.partitions` level. |
-| ballista.parquet.pruning | Boolean | true | Determines whether Parquet pruning should be enabled or not. |
-| ballista.with_information_schema | Boolean | true | Determines whether the `information_schema` should be created in the context. This is necessary for supporting DDL commands such as `SHOW TABLES`. |
-
-### DataFusion Configuration Settings
-
-In addition to Ballista-specific configuration settings, the following DataFusion settings can also be specified.
-
-| key | type | default | description |
-| --- | --- | --- | --- |
-| datafusion.execution.coalesce_batches | Boolean | true | When set to true, record batches will be examined between each operator and small batches will be coalesced into larger batches. This is helpful when there are highly selective filters or joins that could produce tiny output batches. The target batch size is determined by the configuration setting 'datafusion.execution.coalesce_target_batch_size'. |
-| datafusion.execution.coalesce_target_batch_size | UInt64 | 4096 | Target batch size when coalescing batches. Used in conjunction with the configuration setting 'datafusion.execution.coalesce_batches'. |
-| datafusion.explain.logical_plan_only | Boolean | false | When set to true, the explain statement will only print logical plans. |
-| datafusion.explain.physical_plan_only | Boolean | false | When set to true, the explain statement will only print physical plans. |
-| datafusion.optimizer.filter_null_join_keys | Boolean | false | When set to true, the optimizer will insert filters before a join between a nullable and non-nullable column to filter out nulls on the nullable side. This filter can add additional overhead when the file format does not fully support predicate push down. |
-| datafusion.optimizer.skip_failed_rules | Boolean | true | When set to true, the logical plan optimizer will produce warning messages if any optimization rules produce errors and then proceed to the next rule. When set to false, any rules that produce errors will cause the query to fail. |
+which could be used to change default Ballista behavior.
+
+If the information schema is enabled, all configuration parameters can be retrieved or set using SQL:
+
+```rust
+let ctx: SessionContext = SessionContext::remote_with_state(&url, state).await?;
+
+let result = ctx
+    .sql("select name, value from information_schema.df_settings where name like 'ballista'")
+    .await?
+    .collect()
+    .await?;
+
+let expected = [
+    "+-------------------+-------------------------+",
+    "| name              | value                   |",
+    "+-------------------+-------------------------+",
+    "| ballista.job.name | Super Cool Ballista App |",
+    "+-------------------+-------------------------+",
+];
+```
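As a side note, the `with_…(self, …) -> SessionConfig` signatures shown above follow the consuming-builder pattern used throughout DataFusion's configuration API. A minimal self-contained sketch of that pattern (the `MiniConfig` type and its fields are hypothetical, not part of Ballista):

```rust
// Minimal consuming-builder sketch of the `with_…(self, …) -> Self` style
// used by the SessionConfig extension traits; MiniConfig is hypothetical.
#[derive(Debug, Default)]
struct MiniConfig {
    job_name: Option<String>,
    information_schema: bool,
}

impl MiniConfig {
    // each method takes `self` by value and returns the updated value,
    // so calls chain without intermediate mutable bindings
    fn with_job_name(mut self, name: &str) -> Self {
        self.job_name = Some(name.to_string());
        self
    }

    fn with_information_schema(mut self, enabled: bool) -> Self {
        self.information_schema = enabled;
        self
    }
}

fn main() {
    let cfg = MiniConfig::default()
        .with_information_schema(true)
        .with_job_name("Super Cool Ballista App");
    println!("{cfg:?}");
}
```

Because ownership is threaded through each call, the compiler prevents accidental reuse of a stale configuration value.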
 ## Ballista Scheduler Configuration Settings

docs/source/user-guide/deployment/quick-start.md

Lines changed: 34 additions & 26 deletions
@@ -19,16 +19,14 @@
 # Ballista Quickstart

-A simple way to start a local cluster for testing purposes is to use cargo to build the project and then run the scheduler and executor binaries directly along with the Ballista UI.
+A simple way to start a local cluster for testing purposes is to use cargo to build the project and then run the scheduler and executor binaries directly.

 Project Requirements:

 - [Rust](https://www.rust-lang.org/tools/install)
 - [Protobuf Compiler](https://protobuf.dev/downloads/)
-- [Node.js](https://nodejs.org/en/download)
-- [Yarn](https://classic.yarnpkg.com/lang/en/docs/install)

-### Build the project
+## Build the project

 From the root of the project, build release binaries.

@@ -55,39 +53,47 @@ RUST_LOG=info ./target/release/ballista-executor -c 2 -p 50052
 The examples can be run using the `cargo run --bin` syntax. Open a new terminal session and run the following commands.

-## Running the examples
-
-## Distributed SQL Example
+### Distributed SQL Example

 ```bash
 cd examples
 cargo run --release --example remote-sql
 ```

-### Source code for distributed SQL example
+#### Source code for distributed SQL example

 ```rust
 use ballista::prelude::*;
-use datafusion::prelude::CsvReadOptions;
+use ballista_examples::test_util;
+use datafusion::{
+    execution::SessionStateBuilder,
+    prelude::{CsvReadOptions, SessionConfig, SessionContext},
+};

 /// This example demonstrates executing a simple query against an Arrow data source (CSV) and
 /// fetching results, using SQL
 #[tokio::main]
 async fn main() -> Result<()> {
-    let config = BallistaConfig::builder()
-        .set("ballista.shuffle.partitions", "4")
-        .build()?;
-    let ctx = BallistaContext::remote("localhost", 50050, &config).await?;
+    let config = SessionConfig::new_with_ballista()
+        .with_target_partitions(4)
+        .with_ballista_job_name("Remote SQL Example");
+
+    let state = SessionStateBuilder::new()
+        .with_config(config)
+        .with_default_features()
+        .build();
+
+    let ctx = SessionContext::remote_with_state("df://localhost:50050", state).await?;
+
+    let test_data = test_util::examples_test_data();

-    // register csv file with the execution context
     ctx.register_csv(
         "test",
-        "testdata/aggregate_test_100.csv",
+        &format!("{test_data}/aggregate_test_100.csv"),
         CsvReadOptions::new(),
     )
     .await?;

-    // execute the query
     let df = ctx
         .sql(
             "SELECT c1, MIN(c12), MAX(c12) \
@@ -97,40 +103,42 @@ async fn main() -> Result<()> {
         )
         .await?;

-    // print the results
     df.show().await?;

     Ok(())
 }
 ```

-## Distributed DataFrame Example
+### Distributed DataFrame Example

 ```bash
 cd examples
 cargo run --release --example remote-dataframe
 ```

-### Source code for distributed DataFrame example
+#### Source code for distributed DataFrame example

 ```rust
+use ballista::prelude::*;
+use ballista_examples::test_util;
+use datafusion::prelude::{col, lit, ParquetReadOptions, SessionContext};
+
 #[tokio::main]
 async fn main() -> Result<()> {
-    let config = BallistaConfig::builder()
-        .set("ballista.shuffle.partitions", "4")
-        .build()?;
-    let ctx = BallistaContext::remote("localhost", 50050, &config).await?;
+    // creating a SessionContext with default settings
+    let ctx = SessionContext::remote("df://localhost:50050").await?;

-    let filename = "testdata/alltypes_plain.parquet";
+    let test_data = test_util::examples_test_data();
+    let filename = format!("{test_data}/alltypes_plain.parquet");

-    // define the query using the DataFrame trait
     let df = ctx
         .read_parquet(filename, ParquetReadOptions::default())
         .await?
         .select_columns(&["id", "bool_col", "timestamp_col"])?
         .filter(col("id").gt(lit(1)))?;

-    // print the results
     df.show().await?;

     Ok(())
docs/source/user-guide/faq.md

Lines changed: 1 addition & 1 deletion
@@ -25,4 +25,4 @@ DataFusion is a library for executing queries in-process using the Apache Arrow
 model and computational kernels. It is designed to run within a single process, using threads
 for parallel query execution.

-Ballista is a distributed compute platform built on DataFusion.
+Ballista is a distributed compute platform for DataFusion workloads.

docs/source/user-guide/flightsql.md

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@
 One of the easiest ways to start with Ballista is to plug it into your existing data infrastructure using support for Arrow Flight SQL JDBC.

+> This is an optional scheduler feature which should be enabled with the `flight-sql` feature flag.
+
 Getting started involves these main steps:

 1. [Installing prerequisites](#prereq)

docs/source/user-guide/metrics.md

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@
 ## Prometheus

+> This is an optional scheduler feature which should be enabled with the `prometheus-metrics` feature flag.
+
 Built with default features, the ballista scheduler will automatically collect and expose a standard set of prometheus metrics.
 The metrics currently collected automatically include: