Commit c7bb151

Update root README.md adding an example of how to use ...
and clean up some of the context. Relates to #1105
1 parent a542608 commit c7bb151

2 files changed: +117 -68 lines changed

README.md

Lines changed: 49 additions & 36 deletions
````diff
@@ -17,36 +17,63 @@
   under the License.
 -->
 
-# Ballista: Distributed SQL Query Engine, built on Apache Arrow
+# Ballista: Making DataFusion Applications Distributed
 
-Ballista is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow][arrow] and
-[Apache Arrow DataFusion][datafusion].
+Ballista is a library which makes [Apache DataFusion](https://github.com/apache/datafusion) applications distributed.
 
-If you are looking for documentation for a released version of Ballista, please refer to the
-[Ballista User Guide][user-guide].
+An existing DataFusion application:
 
-## Overview
+```rust
+use datafusion::prelude::*;
 
-Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    let ctx = SessionContext::new();
 
-- The choice of Rust as the main execution language avoids the overhead of GC pauses and results in deterministic
-  processing times.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
-  processing (SIMD) and efficient compression. Although Spark does have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged efficiently between
-  executors using the [Flight Protocol][flight], and between clients and schedulers/executors using the
-  [Flight SQL Protocol][flight-sql]
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
+
+can be distributed with a few lines of code changed:
+
+```rust
+// add ballista imports
+use ballista::prelude::*;
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    // create a DataFusion SessionContext with a Ballista standalone cluster started
+    let ctx = SessionContext::standalone().await?;
+
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
+
+If you are looking for documentation or more examples, please refer to the [Ballista User Guide][user-guide].
 
 ## Architecture
 
 A Ballista cluster consists of one or more scheduler processes and one or more executor processes. These processes
 can be run as native binaries and are also available as Docker Images, which can be easily deployed with
-[Docker Compose](https://datafusion.apache.org/ballista/user-guide/deployment/docker-compose.html) or
-[Kubernetes](https://datafusion.apache.org/ballista/user-guide/deployment/kubernetes.html).
+[Docker Compose](https://datafusion.apache.org/ballista/user-guide/deployment/docker-compose.html).
 
 The following diagram shows the interaction between clients and the scheduler for submitting jobs, and the interaction
 between the executor(s) and the scheduler for fetching tasks and reporting task status.
````
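The same few-lines-changed pattern also extends to an external cluster. Below is a minimal sketch composed from the `SessionConfigExt` and `SessionContext::remote_with_state` calls used in this commit's examples, assuming a scheduler is already listening on `localhost:50050`:

```rust
use ballista::{extension::SessionConfigExt, prelude::*};
use datafusion::{
    execution::SessionStateBuilder,
    prelude::{CsvReadOptions, SessionConfig, SessionContext},
};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // carry Ballista settings inside an ordinary DataFusion session state
    let state = SessionStateBuilder::new()
        .with_config(SessionConfig::new_with_ballista())
        .with_default_features()
        .build();

    // connect to an external scheduler instead of starting one in-process
    // (scheduler URL is an assumption; see the distributed examples below)
    let ctx = SessionContext::remote_with_state("df://localhost:50050", state).await?;

    // the rest of the application is unchanged DataFusion code
    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
    df.show().await?;
    Ok(())
}
```

Because the distribution choice lives entirely in how the `SessionContext` is constructed, switching between local, standalone, and remote execution leaves the query code untouched.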
````diff
@@ -55,15 +82,6 @@ between the executor(s) and the scheduler for fetching tasks and reporting task
 
 See the [architecture guide](docs/source/contributors-guide/architecture.md) for more details.
 
-## Features
-
-- Supports cloud object stores. S3 is supported today and GCS and Azure support is planned.
-- DataFrame and SQL APIs available from Python and Rust.
-- Clients can connect to a Ballista cluster using [Flight SQL][flight-sql].
-- JDBC support via Arrow Flight SQL JDBC Driver
-- Scheduler REST UI for monitoring query progress and viewing query plans and metrics.
-- Support for Docker, Docker Compose, and Kubernetes deployment, as well as manual deployment on bare metal.
-
 ## Performance
 
 We run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations.
````
````diff
@@ -81,19 +99,14 @@ that, refer to the [Getting Started Guide](ballista/client/README.md).
 
 ## Project Status
 
-Ballista supports a wide range of SQL, including CTEs, Joins, and Subqueries and can execute complex queries at scale.
+Ballista supports a wide range of SQL, including CTEs, joins, and subqueries, and can execute complex queries at scale,
+but there is still a gap between DataFusion and Ballista which we want to bridge in the near future.
 
 Refer to the [DataFusion SQL Reference](https://datafusion.apache.org/user-guide/sql/index.html) for more
 information on supported SQL.
 
-Ballista is maturing quickly and is now working towards being production ready. See the [roadmap](ROADMAP.md) for more details.
-
 ## Contribution Guide
 
 Please see the [Contribution Guide](CONTRIBUTING.md) for information about contributing to Ballista.
 
-[arrow]: https://arrow.apache.org/
-[datafusion]: https://github.com/apache/arrow-datafusion
-[flight]: https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
-[flight-sql]: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/
 [user-guide]: https://datafusion.apache.org/ballista/
````

examples/README.md

Lines changed: 68 additions & 32 deletions
````diff
@@ -21,7 +21,7 @@
 
 This directory contains examples for executing distributed queries with Ballista.
 
-# Standalone Examples
+## Standalone Examples
 
 The standalone example is the easiest to get started with. Ballista supports a standalone mode where a scheduler
 and executor are started in-process.
````
````diff
@@ -33,18 +33,35 @@ cargo run --example standalone_sql --features="ballista/standalone"
 ### Source code for standalone SQL example
 
 ```rust
+use ballista::{
+    extension::SessionConfigExt,
+    prelude::*
+};
+use datafusion::{
+    execution::{options::ParquetReadOptions, SessionStateBuilder},
+    prelude::{SessionConfig, SessionContext},
+};
+
 #[tokio::main]
 async fn main() -> Result<()> {
-    let config = BallistaConfig::builder()
-        .set("ballista.shuffle.partitions", "1")
-        .build()?;
+    let config = SessionConfig::new_with_ballista()
+        .with_target_partitions(1)
+        .with_ballista_standalone_parallelism(2);
 
-    let ctx = BallistaContext::standalone(&config, 2).await?;
+    let state = SessionStateBuilder::new()
+        .with_config(config)
+        .with_default_features()
+        .build();
 
-    ctx.register_csv(
+    let ctx = SessionContext::standalone_with_state(state).await?;
+
+    let test_data = test_util::examples_test_data();
+
+    // register parquet file with the execution context
+    ctx.register_parquet(
         "test",
-        "testdata/aggregate_test_100.csv",
-        CsvReadOptions::new(),
+        &format!("{test_data}/alltypes_plain.parquet"),
+        ParquetReadOptions::default(),
     )
     .await?;
 
````
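For readers migrating from the old API, the hunk above shows the mapping: `BallistaConfig::builder().set("ballista.shuffle.partitions", "1")` becomes `SessionConfig::new_with_ballista().with_target_partitions(1)`, the executor-concurrency argument of `BallistaContext::standalone(&config, 2)` moves to `with_ballista_standalone_parallelism(2)`, and the Ballista-specific context is replaced by a plain DataFusion `SessionContext` built from a `SessionState`.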

````diff
@@ -56,12 +73,12 @@ async fn main() -> Result<()> {
 
 ```
 
-# Distributed Examples
+## Distributed Examples
 
 For background information on the Ballista architecture, refer to
 the [Ballista README](../ballista/client/README.md).
 
-## Start a standalone cluster
+### Start a standalone cluster
 
 From the root of the project, build release binaries.
 
````
````diff
@@ -83,40 +100,49 @@ RUST_LOG=info ./target/release/ballista-executor -c 2 -p 50051
 RUST_LOG=info ./target/release/ballista-executor -c 2 -p 50052
 ```
 
-## Running the examples
+### Running the examples
 
 The examples can be run using the `cargo run --example` syntax.
 
-## Distributed SQL Example
+### Distributed SQL Example
 
 ```bash
 cargo run --release --example remote-sql
 ```
 
-### Source code for distributed SQL example
+#### Source code for distributed SQL example
 
 ```rust
-use ballista::prelude::*;
-use datafusion::prelude::CsvReadOptions;
+use ballista::{extension::SessionConfigExt, prelude::*};
+use datafusion::{
+    execution::SessionStateBuilder,
+    prelude::{CsvReadOptions, SessionConfig, SessionContext},
+};
 
 /// This example demonstrates executing a simple query against an Arrow data source (CSV) and
 /// fetching results, using SQL
 #[tokio::main]
 async fn main() -> Result<()> {
-    let config = BallistaConfig::builder()
-        .set("ballista.shuffle.partitions", "4")
-        .build()?;
-    let ctx = BallistaContext::remote("localhost", 50050, &config).await?;
+    let config = SessionConfig::new_with_ballista()
+        .with_target_partitions(4)
+        .with_ballista_job_name("Remote SQL Example");
+
+    let state = SessionStateBuilder::new()
+        .with_config(config)
+        .with_default_features()
+        .build();
+
+    let ctx = SessionContext::remote_with_state("df://localhost:50050", state).await?;
+
+    let test_data = test_util::examples_test_data();
 
-    // register csv file with the execution context
     ctx.register_csv(
         "test",
-        "testdata/aggregate_test_100.csv",
+        &format!("{test_data}/aggregate_test_100.csv"),
         CsvReadOptions::new(),
     )
     .await?;
 
-    // execute the query
     let df = ctx
         .sql(
             "SELECT c1, MIN(c12), MAX(c12) \
@@ -126,39 +152,49 @@ async fn main() -> Result<()> {
         )
         .await?;
 
-    // print the results
     df.show().await?;
 
     Ok(())
 }
 ```
 
-## Distributed DataFrame Example
+### Distributed DataFrame Example
 
 ```bash
 cargo run --release --example remote-dataframe
 ```
 
-### Source code for distributed DataFrame example
+#### Source code for distributed DataFrame example
 
 ```rust
+use ballista::{extension::SessionConfigExt, prelude::*};
+use datafusion::{
+    execution::SessionStateBuilder,
+    prelude::{col, lit, ParquetReadOptions, SessionConfig, SessionContext},
+};
+
+/// This example demonstrates executing a simple query against an Arrow data source (Parquet) and
+/// fetching results, using the DataFrame trait
 #[tokio::main]
 async fn main() -> Result<()> {
-    let config = BallistaConfig::builder()
-        .set("ballista.shuffle.partitions", "4")
-        .build()?;
-    let ctx = BallistaContext::remote("localhost", 50050, &config).await?;
+    let config = SessionConfig::new_with_ballista().with_target_partitions(4);
+
+    let state = SessionStateBuilder::new()
+        .with_config(config)
+        .with_default_features()
+        .build();
+
+    let ctx = SessionContext::remote_with_state("df://localhost:50050", state).await?;
 
-    let filename = "testdata/alltypes_plain.parquet";
+    let test_data = test_util::examples_test_data();
+    let filename = format!("{test_data}/alltypes_plain.parquet");
 
-    // define the query using the DataFrame trait
     let df = ctx
         .read_parquet(filename, ParquetReadOptions::default())
         .await?
         .select_columns(&["id", "bool_col", "timestamp_col"])?
         .filter(col("id").gt(lit(1)))?;
 
-    // print the results
     df.show().await?;
 
     Ok(())
````