We welcome and encourage contributions of all kinds, such as:
- Tickets with issue reports or feature requests
- Documentation improvements
- Code (PR or PR Review)
In addition to submitting new PRs, we have a healthy tradition of community members helping review each other's PRs. Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.
You can find a curated good-first-issue list to help you get started.
This section describes how you can get started developing DataFusion.
wget https://az792536.vo.msecnd.net/vms/VMBuild_20190311/VirtualBox/MSEdge/MSEdge.Win10.VirtualBox.zip
choco install -y git rustup.install visualcpp-build-tools
git-bash.exe
cargo build
DataFusion is written in Rust and it uses a standard Rust toolkit:
- cargo build
- cargo fmt to format the code
- cargo test to test
- etc.
Testing setup:
- rustup update stable (DataFusion uses the latest stable release of Rust)
- git submodule init
- git submodule update
Formatting instructions:
or run them all at once:
DataFusion has several levels of tests in its Test Pyramid and tries to follow the Testing Organization guidance in The Book.
This section highlights the most important test modules that exist.
Tests for the code in an individual module are defined in the same source file, in a test module, following Rust convention.
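As a minimal sketch of that convention, the snippet below places a unit test module next to the code it tests. The function `normalize_ident` is a hypothetical helper invented for illustration and is not part of DataFusion; only the `#[cfg(test)] mod tests` layout is the point.

```rust
/// Hypothetical helper used only for illustration; not a DataFusion API.
/// Trims surrounding whitespace and lowercases an identifier.
pub fn normalize_ident(ident: &str) -> String {
    ident.trim().to_lowercase()
}

// The test module lives in the same file and is compiled only for `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lowercases_and_trims() {
        assert_eq!(normalize_ident("  MyTable "), "mytable");
    }
}
```

Because the module is gated by `#[cfg(test)]`, it adds nothing to a normal build, and `use super::*` gives the tests access to private items in the same file.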
There are several tests of the public interface of the DataFusion library in the tests directory.
You can run these tests individually using a command such as
cargo test -p datafusion --tests sql_integration
One very important test is the sql_integration test which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setups.
The integration-tests directory contains a harness that runs certain queries against both Postgres and DataFusion and compares the results.
export POSTGRES_DB=postgres
export POSTGRES_USER=postgres
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
# Install dependencies
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r integration-tests/requirements.txt
# setup environment
POSTGRES_DB=postgres POSTGRES_USER=postgres POSTGRES_HOST=localhost POSTGRES_PORT=5432 python -m pytest -v integration-tests/test_psql_parity.py
# Create
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c 'CREATE TABLE IF NOT EXISTS test (
c1 character varying NOT NULL,
c2 integer NOT NULL,
c3 smallint NOT NULL,
c4 smallint NOT NULL,
c5 integer NOT NULL,
c6 bigint NOT NULL,
c7 smallint NOT NULL,
c8 integer NOT NULL,
c9 bigint NOT NULL,
c10 character varying NOT NULL,
c11 double precision NOT NULL,
c12 double precision NOT NULL,
c13 character varying NOT NULL
);'
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c "\copy test FROM '$(pwd)/testing/data/csv/aggregate_test_100.csv' WITH (FORMAT csv, HEADER true);"
python -m pytest -v integration-tests/test_psql_parity.py
Criterion is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within DataFusion.
Criterion integrates with Cargo's built-in benchmark support and a given benchmark can be run with
cargo bench --bench BENCHMARK_NAME
A full list of benchmarks can be found here.
cargo-criterion may also be used for more advanced reporting.
The parquet SQL benchmarks can be run with
cargo bench --bench parquet_query_sql
These randomly generate a parquet file, and then benchmark queries sourced from parquet_query_sql.sql against it. This can therefore be a quick way to add coverage of particular query and/or data paths.
If the environment variable PARQUET_FILE is set, the benchmark will run queries against this file instead of a randomly generated one. This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset.
The benchmark automatically removes any generated parquet file on exit. If it is interrupted (e.g. by CTRL+C), however, the file is left in place, which can be useful for analysing that file after the fact or for preserving it to use with PARQUET_FILE in subsequent runs.
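The choice between a user-supplied file and a generated one can be pictured with the sketch below. It is not the benchmark's actual code: `ParquetSource`, `choose_source` and `source_from_env` are hypothetical names, and treating an empty `PARQUET_FILE` value as unset is an assumption made here for illustration.

```rust
use std::env;
use std::path::PathBuf;

/// Hypothetical type describing where the benchmark input comes from.
#[derive(Debug, PartialEq)]
pub enum ParquetSource {
    /// A file named by `PARQUET_FILE`; the benchmark should not delete it.
    Existing(PathBuf),
    /// A freshly generated file; removed when the benchmark exits normally.
    Generated(PathBuf),
}

/// Decide the source from an optional `PARQUET_FILE` value.
/// Assumption: an empty value is treated the same as an unset variable.
pub fn choose_source(parquet_file: Option<String>, generated: PathBuf) -> ParquetSource {
    match parquet_file {
        Some(path) if !path.is_empty() => ParquetSource::Existing(PathBuf::from(path)),
        _ => ParquetSource::Generated(generated),
    }
}

/// Read the real environment variable and delegate to `choose_source`.
pub fn source_from_env(generated: PathBuf) -> ParquetSource {
    choose_source(env::var("PARQUET_FILE").ok(), generated)
}
```

Keeping the decision in a pure function (`choose_source`) makes it testable without mutating the process environment.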
Instructions and tooling for running upstream benchmark suites against DataFusion can be found in benchmarks.
These are valuable for comparative evaluation against alternative Arrow implementations and query engines.
Below is a checklist of what you need to do to add a new scalar function to DataFusion:
- Add the actual implementation of the function:
- In core/src/physical_plan, add:
  - a new variant to BuiltinScalarFunction
  - a new entry to FromStr with the name of the function as called by SQL
  - a new line in return_type with the expected return type of the function, given an incoming type
  - a new line in signature with the signature of the function (number and types of its arguments)
  - a new line in create_physical_expr/create_physical_fun mapping the built-in to the implementation
  - tests to the function.
- In core/tests/sql, add a new test where the function is called through SQL against well known data and returns the expected result.
- In expr/src/expr_fn.rs, add:
  - a new entry of the unary_scalar_expr! macro for the new function.
- In core/src/logical_plan/mod, add:
  - a new entry in the pub use expr::{} set.
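The shape of the first few checklist items (enum variant, FromStr entry, return_type line) can be sketched with a simplified, standalone model. Everything below is a stand-in written for illustration: `ScalarFn` and `SimpleType` are hypothetical types, not DataFusion's `BuiltinScalarFunction` or Arrow's `DataType`, and the real signatures differ.

```rust
use std::str::FromStr;

/// Simplified stand-in for a data type enum (not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
pub enum SimpleType {
    Int64,
    Float64,
    Utf8,
}

/// Simplified stand-in for the built-in scalar function enum.
/// Adding a function starts with a new variant here.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ScalarFn {
    Sqrt,
    Upper,
}

impl FromStr for ScalarFn {
    type Err = String;

    /// Maps the name used in SQL to the enum variant.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(ScalarFn::Sqrt),
            "upper" => Ok(ScalarFn::Upper),
            other => Err(format!("unknown scalar function: {other}")),
        }
    }
}

/// Returns the output type of a function for a given input type,
/// or an error if the input type is not supported.
pub fn return_type(fun: ScalarFn, input: &SimpleType) -> Result<SimpleType, String> {
    match (fun, input) {
        (ScalarFn::Sqrt, SimpleType::Int64 | SimpleType::Float64) => Ok(SimpleType::Float64),
        (ScalarFn::Upper, SimpleType::Utf8) => Ok(SimpleType::Utf8),
        (f, t) => Err(format!("{f:?} does not accept {t:?}")),
    }
}
```

In this model, each new function touches every `match` once, which is why the checklist lists one "new line" per location.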
Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
- Add the actual implementation of an Accumulator and AggregateExpr:
- In datafusion/expr/src, add:
  - a new variant to AggregateFunction
  - a new entry to FromStr with the name of the function as called by SQL
  - a new line in return_type with the expected return type of the function, given an incoming type
  - a new line in signature with the signature of the function (number and types of its arguments)
  - a new line in create_aggregate_expr mapping the built-in to the implementation
  - tests to the function.
- In tests/sql, add a new test where the function is called through SQL against well known data and returns the expected result.
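To give a feel for what an accumulator implementation involves, here is a simplified, standalone sketch of the update/merge/evaluate pattern. `SimpleAccumulator` and `MeanAccumulator` are hypothetical names; DataFusion's real `Accumulator` trait operates on Arrow arrays and has different method signatures, so treat this only as an illustration of the idea.

```rust
/// Simplified stand-in for an accumulator interface (an assumption for
/// illustration; not DataFusion's actual `Accumulator` trait).
pub trait SimpleAccumulator {
    /// Fold a batch of input values into the running state.
    fn update(&mut self, values: &[f64]);
    /// Combine the state produced by another partition into this one.
    fn merge(&mut self, other: &Self);
    /// Produce the final result, or `None` if no rows were seen.
    fn evaluate(&self) -> Option<f64>;
}

/// Running state for a mean: the sum and the number of values seen.
#[derive(Debug, Default, Clone)]
pub struct MeanAccumulator {
    sum: f64,
    count: u64,
}

impl SimpleAccumulator for MeanAccumulator {
    fn update(&mut self, values: &[f64]) {
        self.sum += values.iter().sum::<f64>();
        self.count += values.len() as u64;
    }

    fn merge(&mut self, other: &Self) {
        self.sum += other.sum;
        self.count += other.count;
    }

    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

Keeping partial state (sum and count) rather than a partial mean is what lets states from separate partitions be merged without loss.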
The query plans represented by LogicalPlan nodes can be graphically
rendered using Graphviz.
To do so, save the output of the display_graphviz function to a file:
use std::fs::File;
use std::io::Write;
// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;
Then, use the dot command line tool to render it into a file that
can be displayed. For example, the following command creates a
/tmp/plan.pdf file:
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
We formalize DataFusion semantics and behaviors through specification documents. These specifications serve as references to help resolve ambiguities during development or code reviews.
You are also welcome to propose changes to existing specifications or create new specifications as you see fit.
Here is the list of currently active specifications:
All specifications are stored in the docs/source/specification folder.
We are using prettier to format .md files.
You can either use npm i -g prettier to install it globally or use npx to run it as a standalone binary. Using npx requires a working Node environment. Upgrading to the latest prettier is recommended (by adding --upgrade to the npm command).
$ prettier --version
2.3.0
After you've confirmed your prettier version, you can format all the .md files:
prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md