ballista/client/README.md
84 additions & 46 deletions
@@ -19,34 +19,69 @@
 
 # Ballista: Distributed Scheduler for Apache Arrow DataFusion
 
-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
-DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
-Java) to be supported as first-class citizens without paying a penalty for serialization costs.
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a penalty for serialization costs.
-- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
-- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
-data transfer between processes.
-- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
+Ballista is a distributed query execution engine that enhances [Apache DataFusion](https://github.com/apache/datafusion) by enabling the parallelized execution of workloads across multiple nodes in a distributed environment.
 
-Ballista can be deployed as a standalone cluster and also supports [Kubernetes](https://kubernetes.io/). In either
-case, the scheduler can be configured to use [etcd](https://etcd.io/) as a backing store to (eventually) provide
For full documentation, refer to the deployment section of the
+[Ballista User Guide](https://datafusion.apache.org/ballista/user-guide/deployment/)
 
-## Executing a query
+## Executing a Query
 
-Ballista provides a `BallistaContext` as a starting point for creating queries. DataFrames can be created
-by invoking the `read_csv`, `read_parquet`, and `sql` methods.
+Ballista provides a custom `SessionContext` as a starting point for creating queries. DataFrames can be created by invoking the `read_csv`, `read_parquet`, and `sql` methods.
 
 To build a simple ballista example, run the following command to add the dependencies to your `Cargo.toml` file:
 
 ```bash
 cargo add ballista datafusion tokio
 ```
 
-The following example runs a simple aggregate SQL query against a Parquet file (`yellow_tripdata_2022-01.parquet`) from the
-[New York Taxi and Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-data set. Download the file and add it to the `testdata` folder before running the example.
-
 ```rust,no_run
 use ballista::prelude::*;
 use datafusion::common::Result;
@@ -99,7 +126,6 @@ use datafusion::functions_aggregate::{min_max::min, min_max::max, sum::sum, aver
 #[tokio::main]
 async fn main() -> Result<()> {
 
-
     // connect to Ballista scheduler
     let ctx = SessionContext::remote("df://localhost:50050").await?;
More [examples](../../examples/examples/) can be found in the arrow-ballista repository.
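The `df://localhost:50050` address passed to `SessionContext::remote` in the example above follows a `df://host:port` shape. The std-only Rust sketch below illustrates that shape by splitting such a URL into host and port. `parse_scheduler_url` is a hypothetical helper written for this illustration and is not part of Ballista; Ballista's own URL handling may differ.

```rust
/// Hypothetical helper (not a Ballista API): splits a `df://host:port`
/// scheduler URL into its host and port. Returns `None` when the scheme,
/// host, or port is missing or the port is not a valid `u16`.
fn parse_scheduler_url(url: &str) -> Option<(String, u16)> {
    // Require the `df://` scheme used in the README example.
    let rest = url.strip_prefix("df://")?;
    // Split on the last ':' so the port is always the trailing component.
    let (host, port) = rest.rsplit_once(':')?;
    if host.is_empty() {
        return None;
    }
    let port = port.parse::<u16>().ok()?;
    Some((host.to_string(), port))
}

fn main() {
    match parse_scheduler_url("df://localhost:50050") {
        Some((host, port)) => println!("scheduler host={host} port={port}"),
        None => eprintln!("not a df://host:port URL"),
    }
}
```

A check like this could run before connecting, so that a malformed address is reported locally rather than as a connection error.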
+
+## Performance
+
+We run some simple benchmarks comparing Ballista with Apache Spark to track progress with performance optimizations.
+
+These are benchmarks derived from TPC-H and not official TPC-H benchmarks. These results are from running individual queries at scale factor 100 (100 GB) on a single node with a single executor and 8 concurrent tasks.