Commit 35dcdd0

doc: update ballista client front page
... making it look more like root `README.md`
1 parent e9e8f9a commit 35dcdd0

2 files changed: 100 additions & 55 deletions
README.md

Lines changed: 16 additions & 9 deletions
@@ -60,18 +60,25 @@ use datafusion::prelude::*;
 
 #[tokio::main]
 async fn main() -> datafusion::error::Result<()> {
-    // create DataFusion SessionContext with ballista standalone cluster started
-    let ctx = SessionContext::standalone();
+    // create SessionContext with ballista support
+    // standalone context will start all required
+    // ballista infrastructure in the background as well
+    let ctx = SessionContext::standalone().await?;
 
-    // register the table
-    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+    // everything else remains the same
 
-    // create a plan to run a SQL query
-    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new())
+        .await?;
 
-    // execute and print results
-    df.show().await?;
-    Ok(())
+    // create a plan to run a SQL query
+    let df = ctx
+        .sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100")
+        .await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
 }
 ```
 

ballista/client/README.md

Lines changed: 84 additions & 46 deletions
@@ -19,34 +19,69 @@
 
 # Ballista: Distributed Scheduler for Apache Arrow DataFusion
 
-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
-DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
-Java) to be supported as first-class citizens without paying a penalty for serialization costs.
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a penalty for serialization costs.
 
-The foundational technologies in Ballista are:
+![logo](https://github.com/apache/datafusion-ballista/blob/main/docs/source/_static/images/ballista-logo.png?raw=true)
 
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
-- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
-  data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
+Ballista is a distributed query execution engine that enhances [Apache DataFusion](https://github.com/apache/datafusion) by enabling the parallelized execution of workloads across multiple nodes in a distributed environment.
 
-Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
-case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
-redundancy in the case of a scheduler failing.
+Existing DataFusion application:
 
-## Rust Version Compatibility
+```rust,no_run
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    // datafusion context
+    let ctx = SessionContext::new();
 
-This crate is tested with the latest stable version of Rust. We do not currrently test against other, older versions of the Rust compiler.
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
+
+can be distributed with few lines changed:
+
+```rust,no_run
+use ballista::prelude::*;
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    // create SessionContext with ballista support
+    // standalone context will start all required
+    // ballista infrastructure in the background as well
+    let ctx = SessionContext::standalone().await?;
+
+    // everything else remains the same
+
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new())
+        .await?;
+
+    // create a plan to run a SQL query
+    let df = ctx
+        .sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100")
+        .await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
 
 ## Starting a cluster
 
-There are numerous ways to start a Ballista cluster, including support for Docker and
-Kubernetes. For full documentation, refer to the deployment section of the
-[Ballista User Guide](https://datafusion.apache.org/ballista/user-guide/deployment/)
+![architecture](https://github.com/apache/datafusion-ballista/blob/main/docs/source/contributors-guide/ballista_architecture.excalidraw.svg?raw=true)
 
-A simple way to start a local cluster for testing purposes is to use cargo to install
-the scheduler and executor crates.
+A simple way to start a local cluster for testing purposes is to use cargo to install the scheduler and executor crates.
 
 ```bash
 cargo install --locked ballista-scheduler
@@ -61,35 +96,27 @@ RUST_LOG=info ballista-scheduler
 
 The scheduler will bind to port `50050` by default.
 
-Next, start an executor processes in a new terminal session with the specified concurrency
-level.
+Next, start an executor processes in a new terminal session with the specified concurrency level.
 
 ```bash
 RUST_LOG=info ballista-executor -c 4
 ```
 
-The executor will bind to port `50051` by default. Additional executors can be started by
-manually specifying a bind port. For example:
+The executor will bind to port `50051` by default. Additional executors can be started by manually specifying a bind port.
 
-```bash
-RUST_LOG=info ballista-executor --bind-port 50052 -c 4
-```
+For full documentation, refer to the deployment section of the
+[Ballista User Guide](https://datafusion.apache.org/ballista/user-guide/deployment/)
 
-## Executing a query
+## Executing a Query
 
-Ballista provides a `BallistaContext` as a starting point for creating queries. DataFrames can be created
-by invoking the `read_csv`, `read_parquet`, and `sql` methods.
+Ballista provides a custom `SessionContext` as a starting point for creating queries. DataFrames can be created by invoking the `read_csv`, `read_parquet`, and `sql` methods.
 
 To build a simple ballista example, run the following command to add the dependencies to your `Cargo.toml` file:
 
 ```bash
 cargo add ballista datafusion tokio
 ```
 
-The following example runs a simple aggregate SQL query against a Parquet file (`yellow_tripdata_2022-01.parquet`) from the
-[New York Taxi and Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-data set. Download the file and add it to the `testdata` folder before running the example.
-
 ```rust,no_run
 use ballista::prelude::*;
 use datafusion::common::Result;
@@ -99,7 +126,6 @@ use datafusion::functions_aggregate::{min_max::min, min_max::max, sum::sum, aver
 #[tokio::main]
 async fn main() -> Result<()> {
 
-
     // connect to Ballista scheduler
     let ctx = SessionContext::remote("df://localhost:50050").await?;
 
@@ -121,13 +147,6 @@ async fn main() -> Result<()> {
         )?
         .sort(vec![col("passenger_count").sort(true, true)])?;
 
-    // this is equivalent to the following SQL
-    // SELECT passenger_count, MIN(fare_amount), MAX(fare_amount), AVG(fare_amount), SUM(fare_amount)
-    // FROM tripdata
-    // GROUP BY passenger_count
-    // ORDER BY passenger_count
-
-    // print the results
     df.show().await?;
 
     Ok(())
@@ -146,12 +165,31 @@ The output should look similar to the following table.
 | 2 | -250 | 640.5 | 13.79501011585127 | 4732047.139999998 |
 | 3 | -130 | 480 | 13.473184817311106 | 1139427.2400000002 |
 | 4 | -250 | 464 | 14.232650547832726 | 502711.4499999997 |
-| 5 | -52 | 668 | 12.160378472086954 | 624289.51 |
-| 6 | -52 | 252.5 | 12.576583325529857 | 402916 |
-| 7 | 7 | 79 | 61.77777777777778 | 556 |
-| 8 | 8.3 | 115 | 79.9125 | 639.3 |
-| 9 | 9.3 | 96.5 | 65.26666666666667 | 195.8 |
 +-----------------+--------------------------+--------------------------+--------------------------+--------------------------+
 ```
 
 More [examples](../../examples/examples/) can be found in the arrow-ballista repository.
+
+## Performance
+
+We run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations.
+
+These are benchmarks derived from TPC-H and not official TPC-H benchmarks. These results are from running individual queries at scale factor 100 (100 GB) on a single node with a single executor and 8 concurrent tasks.
+
+### Overall Speedup
+
+The overall speedup is 2.9x
+
+![benchmarks](https://github.com/apache/datafusion-ballista/blob/main/docs/source/_static/images/tpch_allqueries.png?raw=true)
+
+### Per Query Comparison
+
+![benchmarks](https://github.com/apache/datafusion-ballista/blob/main/docs/source/_static/images/tpch_queries_compare.png?raw=true)
+
+### Relative Speedup
+
+![benchmarks](https://github.com/apache/datafusion-ballista/blob/main/docs/source/_static/images/tpch_queries_speedup_rel.png?raw=true)
+
+### Absolute Speedup
+
+![benchmarks](https://github.com/apache/datafusion-ballista/blob/main/docs/source/_static/images/tpch_queries_speedup_abs.png?raw=true)
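
A reader's note on the `Starting a cluster` hunk: the new README text says additional executors need a manually specified bind port, while the removed example showed `--bind-port 50052`. The sketch below generates one launch command per executor, assuming each executor simply takes the next port after the default `50051`. `executor_commands` is a hypothetical helper written for illustration; it is not part of Ballista, and only the `--bind-port` and `-c` flags are taken from the README text in this diff.

```rust
/// Default executor port stated in the README text.
const DEFAULT_EXECUTOR_PORT: u16 = 50051;

/// Hypothetical helper (not a Ballista API): builds one shell command per
/// executor, assigning consecutive bind ports starting at the default port.
/// Intended for small local clusters; very large counts would overflow `u16`.
fn executor_commands(count: u16, concurrency: u32) -> Vec<String> {
    (0..count)
        .map(|i| {
            let port = DEFAULT_EXECUTOR_PORT + i;
            format!("RUST_LOG=info ballista-executor --bind-port {port} -c {concurrency}")
        })
        .collect()
}

fn main() {
    // Print the commands for three local executors with 4 task slots each.
    for cmd in executor_commands(3, 4) {
        println!("{cmd}");
    }
}
```

Each printed line is meant to be run in its own terminal session after the scheduler has started.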
