
DataFusion Distributed

Library that brings distributed execution capabilities to DataFusion.

Warning

This project is currently under construction and is not yet ready for production use.

What can you do with this crate?

This crate is a toolkit that extends DataFusion with distributed capabilities, providing a developer experience as close as possible to vanilla DataFusion while being unopinionated about the networking stack used for hosting the different workers involved in a query.

Users of this library can expect to take their existing single-node DataFusion-based systems and add distributed capabilities with minimal changes.
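
For reference, a plain single-node DataFusion query looks like the following (the table name and file path are illustrative); the goal is that adding distributed execution leaves this query-facing surface largely intact:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Vanilla single-node DataFusion: register a table and run SQL against it.
    let ctx = SessionContext::new();
    ctx.register_parquet("weather", "testdata/weather.parquet", ParquetReadOptions::default())
        .await?;

    // With this crate, the same SQL entry point is meant to keep working;
    // distribution happens underneath, at the physical plan level.
    let df = ctx.sql("SELECT count(*) FROM weather").await?;
    df.show().await?;
    Ok(())
}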

Core tenets of the project

  • Be as close as possible to vanilla DataFusion, providing a seamless integration with existing DataFusion systems and a familiar API for building applications.
  • Unopinionated about networking. This crate does not take any opinion about the networking stack, and users are expected to leverage their own infrastructure for hosting DataFusion nodes.
  • No coordinator-worker architecture. To keep infrastructure simple, any node can act as a coordinator or a worker.

Architecture

Before diving into the architecture, it's important to clarify some terms and what they mean:

  • worker: a physical machine that listens for serialized execution plans over an Arrow Flight interface.
  • network boundary: a node in the plan that streams data from a network interface rather than directly from its children. Implemented as an ArrowFlightReadExec physical DataFusion node.
  • stage: a portion of the plan separated by a network boundary from other parts of the plan. Implemented as any other physical node in DataFusion.
  • task: a unit of work inside a stage that executes a subset of its partitions in a specific worker.
  • subplan: a slice of the overall plan.
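
To make these terms concrete, here is a purely illustrative data model (not this crate's actual types) showing how stages, tasks, and partitions relate:

use std::sync::Arc;

use datafusion::physical_plan::ExecutionPlan;

// Purely illustrative types, not the crate's real ones: a stage wraps the
// subplan bounded by network boundaries, and each of its tasks runs a subset
// of that subplan's partitions on a specific worker.
struct Stage {
    subplan: Arc<dyn ExecutionPlan>,
    tasks: Vec<Task>,
}

struct Task {
    // Address of the worker's Arrow Flight endpoint, e.g. "http://localhost:8080".
    worker: String,
    // Partitions of the stage's subplan that this task executes.
    partitions: Vec<usize>,
}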

A distributed DataFusion query executes much like a normal DataFusion query, with one key difference:

The physical plan is divided into stages, and each stage is assigned tasks that run in parallel on different workers. All of this happens at the physical plan level and is implemented as a PhysicalOptimizerRule that:

  1. Inspects the non-distributed plan, placing network boundaries (ArrowFlightReadExec nodes) in the appropriate places.
  2. Based on the placed network boundaries, divides the plan into stages and assigns tasks to them.
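
As a rough sketch of what such a rule looks like in DataFusion terms (the rule below is illustrative and does not reproduce this crate's actual implementation):

use std::sync::Arc;

use datafusion::common::config::ConfigOptions;
use datafusion::error::Result;
use datafusion::physical_optimizer::PhysicalOptimizerRule;
use datafusion::physical_plan::ExecutionPlan;

// Illustrative skeleton of a distributed planning rule.
#[derive(Debug)]
struct DistributedPlannerRule;

impl PhysicalOptimizerRule for DistributedPlannerRule {
    fn optimize(
        &self,
        plan: Arc<dyn ExecutionPlan>,
        _config: &ConfigOptions,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // 1. Walk the plan and insert ArrowFlightReadExec nodes wherever a
        //    network boundary is needed.
        // 2. Split the plan at those boundaries into stages and assign tasks
        //    (worker + partition subset) to each stage.
        Ok(plan) // placeholder: this skeleton returns the plan unchanged
    }

    fn name(&self) -> &str {
        "distributed_planner"
    }

    fn schema_check(&self) -> bool {
        true
    }
}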

For example, imagine we have a plan that looks like this:

   ┌──────────────────────┐
   │    ProjectionExec    │
   └──────────────────────┘  
              ▲
   ┌──────────┴───────────┐
   │    AggregateExec     │
   │       (final)        │
   └──────────────────────┘  
              ▲
   ┌──────────┴───────────┐
   │   RepartitionExec    │
   │ (3 input partitions) │
   └──────────────────────┘  
              ▲
   ┌──────────┴───────────┐
   │    AggregateExec     │
   │      (partial)       │
   └──────────────────────┘  
              ▲
   ┌──────────┴───────────┐
   │    DataSourceExec    │
   └──────────────────────┘  

We want to distribute the aggregation to something like this:

                              ┌──────────────────────┐
                              │    ProjectionExec    │
                              └──────────────────────┘  
                                         ▲
                              ┌──────────┴───────────┐
                              │    AggregateExec     │
                              │       (final)        │
                              └──────────────────────┘  
                                       ▲ ▲ ▲
              ┌────────────────────────┘ │ └─────────────────────────┐
   ┌──────────┴───────────┐   ┌──────────┴───────────┐   ┌───────────┴──────────┐
   │    AggregateExec     │   │    AggregateExec     │   │    AggregateExec     │
   │      (partial)       │   │      (partial)       │   │      (partial)       │
   └──────────────────────┘   └──────────────────────┘   └──────────────────────┘
              ▲                          ▲                           ▲
   ┌──────────┴───────────┐   ┌──────────┴───────────┐   ┌───────────┴──────────┐
   │    DataSourceExec    │   │    DataSourceExec    │   │    DataSourceExec    │
   └──────────────────────┘   └──────────────────────┘   └──────────────────────┘

The first step is to place the ArrowFlightReadExec network boundary in the appropriate place (the following drawing shows the partitioning scheme in each node):

   ┌──────────────────────┐
   │    ProjectionExec    │
   └─────────[0]──────────┘  
              ▲
   ┌──────────┴───────────┐
   │    AggregateExec     │
   │       (final)        │
   └─────────[0]──────────┘  
              ▲
   ┌──────────┴───────────┐
   │   ArrowFlightRead    │   <- this node was injected to tell the distributed planner 
   │   (3 input tasks)    │      that there must be a network boundary here.
   └──────[0][1][2]───────┘  
           ▲  ▲  ▲
   ┌───────┴──┴──┴────────┐
   │    AggregateExec     │
   │      (partial)       │
   └──────[0][1][2]───────┘  
           ▲  ▲  ▲
   ┌───────┴──┴──┴────────┐
   │    DataSourceExec    │
   └──────[0][1][2]───────┘  

Based on that boundary, the plan is divided into stages, and tasks are assigned to each stage. Each task is responsible for a subset of the partitions in the original plan.

                              ┌────── (stage 2) ───────┐
                              │┌──────────────────────┐│
                              ││    ProjectionExec    ││
                              │└──────────┬───────────┘│
                              │┌──────────┴───────────┐│
                              ││    AggregateExec     ││
                              ││       (final)        ││
                              │└──────────┬───────────┘│
                              │┌──────────┴───────────┐│
                              ││  ArrowFlightReadExec ││
                              │└──────[0][1][2]───────┘│
                              └─────────▲─▲─▲──────────┘
               ┌────────────────────────┘ │ └─────────────────────────┐            
               │                          │                           │            
   ┌─── task 0 (stage 1) ───┐ ┌── task 1 (stage 1) ────┐ ┌── task 2 (stage 1) ────┐
   │           │            │ │           │            │ │            │           │
   │┌─────────[0]──────────┐│ │┌─────────[1]──────────┐│ │┌──────────[2]─────────┐│
   ││    AggregateExec     ││ ││    AggregateExec     ││ ││    AggregateExec     ││
   ││      (partial)       ││ ││      (partial)       ││ ││      (partial)       ││
   │└──────────┬───────────┘│ │└──────────┬───────────┘│ │└───────────┬──────────┘│
   │┌─────────[0]──────────┐│ │┌─────────[1]──────────┐│ │┌──────────[2]─────────┐│
   ││   DataSourceExec     ││ ││   DataSourceExec     ││ ││   DataSourceExec     ││
   │└──────────────────────┘│ │└──────────────────────┘│ │└──────────────────────┘│
   └────────────────────────┘ └────────────────────────┘ └────────────────────────┘

The plan is immediately executable, and the same process that planned the distributed query can start executing the head stage (stage 2). The ArrowFlightReadExec in that stage knows which stage 1 tasks to gather data from, and issues three concurrent Arrow Flight requests to the appropriate workers.

This means that:

  1. The head stage is executed normally, as if the query were not distributed.
  2. When .execute() is called on ArrowFlightReadExec, instead of propagating the call to its child, the subplan is serialized and sent over the wire to be executed on another worker.
  3. The next worker, which hosts an Arrow Flight Endpoint listening for gRPC requests over an HTTP/2 server, picks up the request containing the serialized chunk of the overall plan and executes it.
  4. This repeats for each stage, and data flows from the bottom of the plan to the top until it reaches the head stage.
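
Step 2 builds on DataFusion's ability to serialize physical plans. As a rough illustration of that building block (not this crate's exact wire protocol; the table and path are illustrative), a subplan can be serialized to bytes and reconstructed on the receiving side using datafusion-proto:

use datafusion::error::Result;
use datafusion::prelude::*;
use datafusion_proto::bytes::{physical_plan_from_bytes, physical_plan_to_bytes};

async fn roundtrip_subplan() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("weather", "testdata/weather.parquet", ParquetReadOptions::default())
        .await?;

    // Sender: build a physical plan and serialize it to bytes.
    let plan = ctx
        .sql("SELECT count(*) FROM weather")
        .await?
        .create_physical_plan()
        .await?;
    let bytes = physical_plan_to_bytes(plan)?;

    // Receiver: a worker decodes the bytes back into an ExecutionPlan,
    // executes it, and streams the resulting record batches over Arrow Flight.
    let _decoded = physical_plan_from_bytes(&bytes, &ctx)?;
    Ok(())
}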

Getting familiar with distributed DataFusion

The examples/ directory contains runnable examples showcasing how to provide a localhost implementation of distributed DataFusion:

  • localhost_worker.rs: code that spawns an Arrow Flight Endpoint listening for physical plans over the network.
  • localhost_run.rs: code that distributes a query across the spawned Arrow Flight Endpoints and executes it.

The integration tests also give an idea of how to use the library and what can be achieved with it:

  • tpch_validation_test.rs: executes all TPC-H queries at a small scale factor and asserts that the distributed plans and results match those produced by running the queries in single-node mode.
  • custom_config_extension.rs: showcases how to propagate custom DataFusion config extensions.
  • custom_extension_codec.rs: showcases how to propagate custom physical extension codecs.
  • distributed_aggregation.rs: showcases how to manually place ArrowFlightReadExec nodes in a plan and build a distributed query out of it.

Development

Prerequisites

  • Rust 1.85.1 or later (specified in rust-toolchain.toml)
  • Git LFS for test data

Setup

  1. Clone the repository:

    git clone git@github.com:datafusion-contrib/datafusion-distributed
    cd datafusion-distributed
  2. Install Git LFS and fetch test data:

    git lfs install
    git lfs checkout

Running Tests

Unit and integration tests:

cargo test --features integration

Running Examples

Start localhost workers:

# Terminal 1
cargo run --example localhost_worker -- 8080 --cluster-ports 8080,8081

# Terminal 2  
cargo run --example localhost_worker -- 8081 --cluster-ports 8080,8081

Execute distributed queries:

cargo run --example localhost_run -- 'SELECT count(*) FROM weather' --cluster-ports 8080,8081

Benchmarks

Generate TPC-H benchmark data:

cd benchmarks
./gen-tpch.sh

Run TPC-H benchmarks:

cargo run -p datafusion-distributed-benchmarks --release -- tpch --path benchmarks/data/tpch_sf1

Project Structure

  • src/ - Core library code
    • flight_service/ - Arrow Flight service implementation
    • plan/ - Physical plan extensions and operators
    • stage/ - Execution stage management
    • common/ - Shared utilities
  • examples/ - Usage examples
  • tests/ - Integration tests
  • benchmarks/ - Performance benchmarks
  • testdata/ - Test datasets
