
[DAPHNE-#61] Basic relational query processing based on columnar operators #961

Merged
pdamme merged 10 commits into main from 61-columnar-processing
May 16, 2025


@pdamme pdamme commented May 2, 2025

This PR introduces initial features for relational query processing based on columnar operators in the DAPHNE compiler and runtime. This is a first step towards the goal of more efficient query processing and simpler code.

Motivation: DAPHNE has already been able to process relational queries (with some limitations) over frames, using frame operations whose inputs and outputs are entire frames with potentially multiple columns of heterogeneous value types. While this approach is feasible, it has downsides: the kernels (1) are complicated to implement because they need to interpret column type information at run-time, and (2) are black boxes for the DAPHNE compiler, which prevents straightforward optimizations such as common sub-expression elimination and dead code elimination (e.g., for unused columns). Nevertheless, DAPHNE’s frames do store the data in a columnar layout and the kernels process columns internally.

Contribution: As an alternative, this PR enables more consistent columnar query processing by making the DAPHNE compiler aware of columnar data and ops (a new column data type and a handful of columnar ops in DaphneIR) and by providing simple runtime kernels for the columnar operations. In detail, this PR contributes the following (for more information, see the individual commits):

  • A new column data type in DaphneIR plus a corresponding run-time data structure that is backed by a single contiguous array. Conceptually, a column is a 1-dimensional sequence of values of a common value type.
  • A reasonable set of columnar operations, which consume and produce columns. In particular, the following operations are included: select (with six comparison ops), project, intersect, merge, join, semi-join, grouping (first and next step), elementwise binary ops (sub, mul), full aggregation (sum), and grouped aggregation (sum). These ops suffice to execute the well-known Star Schema Benchmark (SSB). These new DaphneIR ops are complemented by naive reference implementations of runtime kernels.
  • A new DAPHNE compiler pass that rewrites certain matrix/frame ops from linear/relational algebra to the new columnar ops. Furthermore, a range of simplification rewrites was added that simplify the IR after lowering to columnar ops. Many of these simplification rewrites are quite generic and could also be useful in DaphneDSL scripts without relational queries.
  • Extensive unit tests for the new kernels as well as script-level tests for entire simple and slightly more complex relational queries alternatively expressed in SQL (embedding in DaphneDSL) and pure DaphneDSL.
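
To make the idea of ops that consume and produce columns more concrete, here is a minimal sketch in Python. It is not DAPHNE code: the function names (`make_column`, `col_select`, `col_intersect`, `col_project`, `col_sum`) and the sample data are hypothetical, and the sketch assumes that a select returns a sorted position list, as the limitations below suggest for the current design.

```python
from array import array
import operator

# Hypothetical stand-in for a column: a 1-D sequence of values of one value
# type, backed by a single contiguous array ("q" = signed 64-bit integers).
def make_column(values):
    return array("q", values)

# Six comparison operators, mirroring "select (with six comparison ops)".
COMPARISONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}

def col_select(col, cmp_name, constant):
    """Return the sorted positions whose values satisfy the comparison."""
    cmp = COMPARISONS[cmp_name]
    return make_column(i for i, v in enumerate(col) if cmp(v, constant))

def col_intersect(pos_a, pos_b):
    """Intersect two sorted position lists (conjunction of two selections)."""
    out = array("q")
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        if pos_a[i] == pos_b[j]:
            out.append(pos_a[i])
            i += 1
            j += 1
        elif pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return out

def col_project(col, positions):
    """Gather the values of `col` at the given positions."""
    return make_column(col[p] for p in positions)

def col_sum(col):
    """Full aggregation: sum of all values in the column."""
    return sum(col)

# Roughly: SELECT sum(price) WHERE quantity < 25 AND discount >= 2
quantity = make_column([5, 30, 12, 40, 8])
discount = make_column([2, 3, 2, 3, 1])
price = make_column([100, 200, 300, 400, 500])

pos_q = col_select(quantity, "lt", 25)
pos_d = col_select(discount, "ge", 2)
pos = col_intersect(pos_q, pos_d)
selected = col_project(price, pos)
total = col_sum(selected)
```

Each intermediate here is itself a column, which is what lets a compiler see and optimize the individual steps instead of one opaque frame kernel.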

The code contributed by this PR should be a good foundation for (1) connecting relational processing to DAPHNE’s vectorized engine, and (2) integrating third-party libraries for SIMD-based implementations of columnar operators. Ultimately, we hope to achieve more efficient processing of relational queries in DAPHNE that way.

Limitations: However, this PR is just a first step in this direction; it still has some limitations, such as:

  • The introduction of a new column data type to DaphneIR is debatable; one could have reused the existing matrix data type (while restricting it to a single column). Furthermore, some of the columnar ops could have been expressed through existing matrix ops. Nevertheless, this new type and ops could become a separate columnar MLIR dialect inside/next to DaphneIR in the future and serve as a separate level of abstraction.
  • Currently, position lists are materialized at many points in a query plan. In the future we may use bit vectors and pass them to subsequent ops processing the intermediates. That is, the concrete design of the columnar ops may change in the future.
  • The runtime kernels are not tuned for efficiency. We plan to add highly efficient implementations in the future (maybe through DAPHNE’s means for kernel extensibility).

I made @tomschw a co-author of most commits in this PR, because parts of this PR are based on code from his PRs #640/#641 (@tomschw let me know if you have any objections). However, this PR is really a major revision and rewrite of that code. A lot was changed/added/removed compared to those two PRs.

Finally, comments are welcome. If there are no comments, I will merge this PR in two weeks, on May 16, 2025. This PR is structured into meaningful commits that should be "rebased and merged" (not squashed).

Closes #61.

@pdamme pdamme force-pushed the 61-columnar-processing branch from 192efe6 to 9105e9b Compare May 7, 2025 16:25
pdamme and others added 10 commits May 7, 2025 23:00
- To enable the DAPHNE compiler to reason about column-based processing of relational queries, this commit adds a new column data type as well as an initial set of columnar algebra operations to DaphneIR.
- The column data type and the columnar ops
  - are meant to be used internally by the DAPHNE compiler only; they are not meant to be exposed to users in DaphneDSL, because users should work at a matrix/frame abstraction level.
  - could be factored out into a separate columnar MLIR dialect in the future; for now, we add them to DaphneIR for simplicity.
- The column data type has a value type (homogeneous for all elements) and a number of rows; the implementation of the MLIR type is analogous to DAPHNE's matrix type.
- Furthermore, the column data type has a parser, printer, and verifier, and is taken into account as a DAPHNE data type by some general compiler utils.
- The columnar ops comprise select (with six comparison ops), project, intersect, merge, join, semi-join, grouping (first and next step), elementwise binary ops (sub, mul), full aggregation (sum), and grouped aggregation (sum); these ops suffice to execute the well-known Star Schema Benchmark (SSB).
- The code in this commit is partly based on code from PRs #640/#641 by @tomschw.

Co-authored-by: Tom Schwarzburg <tom.schwarzburg@mailbox.tu-dresden.de>
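
The two-step grouping and the grouped aggregation listed above can be illustrated with a small Python sketch. This is one plausible reading, not the semantics of ColGroupFirstOp/ColGroupNextOp as implemented: it assumes that the first step assigns a dense group id per row from one key column, and that each next step refines those ids with a further key column. All function names and the sample data are hypothetical.

```python
def group_first(keys):
    """Group by a single key column.

    Returns (group id per row, position of the first row of each group).
    """
    ids, reps, seen = [], [], {}
    for pos, key in enumerate(keys):
        if key not in seen:
            seen[key] = len(reps)  # next dense group id
            reps.append(pos)
        ids.append(seen[key])
    return ids, reps

def group_next(prev_ids, keys):
    """Refine an existing grouping by one additional key column."""
    ids, reps, seen = [], [], {}
    for pos, combo in enumerate(zip(prev_ids, keys)):
        if combo not in seen:
            seen[combo] = len(reps)
            reps.append(pos)
        ids.append(seen[combo])
    return ids, reps

def grouped_sum(values, group_ids, num_groups):
    """Grouped aggregation: one sum per group id."""
    sums = [0] * num_groups
    for value, gid in zip(values, group_ids):
        sums[gid] += value
    return sums

# Roughly: SELECT sum(revenue) GROUP BY year, brand
year = [1997, 1998, 1997, 1998, 1997]
brand = [1, 1, 2, 1, 1]
revenue = [10, 20, 30, 40, 50]

ids1, reps1 = group_first(year)
ids2, reps2 = group_next(ids1, brand)
sums = grouped_sum(revenue, ids2, len(reps2))
```

Splitting the grouping into a first and a next step means a multi-column GROUP BY can be composed from ops that each look at a single column.
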
- Added type inference traits/interfaces for all of the recently introduced columnar ops.
  - Three columnar ops needed a type inference interface implementation, because they have more than one result (ColJoinOp, ColGroupFirstOp, ColGroupNextOp).
  - The other columnar ops could be handled with the existing type inference traits plus the newly added trait DataTypeCol (specifies that the result's data type is the recently added ColumnType).
- Extended the type inference interface implementation of CastOp to support the recently introduced ColumnType for the argument and/or result (analogous to MatrixType).
- InferencePass and some general inference utils take the recently added ColumnType into account.
- The code in this commit is partly based on code from PRs #640/#641 by @tomschw.

Co-authored-by: Tom Schwarzburg <tom.schwarzburg@mailbox.tu-dresden.de>
- This commit adds a new DAPHNE compiler pass RewriteToColumnarOpsPass, which rewrites certain matrix/frame ops from linear/relational algebra to columnar ops from column algebra.
  - Happens after SQL parsing and type/property inference because we usually want to lower relational queries to columnar ops and need information on types and shapes.
  - The general idea:
    - Identify individual matrix/frame ops that could be expressed by columnar ops.
    - Then, each of these ops is replaced in isolation by creating casts/conversions of its arguments as needed, creating the columnar op(s), and creating casts/conversions of the results as needed.
    - In the end, the results of the rewritten DAG of operations are the same as of the replaced op.
    - After these replacements of individual ops, the IR may contain lots of redundant operations or operations eliminating each other's effects.
    - Such issues are not addressed by this pass, but are subject to simplifications in subsequent passes (concrete simplification rewrites will be added in a follow-up commit).
  - By default, this pass is not executed; it can be turned on by the newly added `--columnar` CLI argument or the `use_columnar` config item.
  - Optionally, the IR after this pass (and a few related ones) can be displayed by the `--explain columnar` CLI arg or the `explain_columnar` config item.
  - The pass is directly followed by an inference pass to infer the result types of the newly created columnar ops and other helper ops.
- Added two new helper ops to DaphneIR: ConvertBitmapToPosListOp and ConvertPosListToBitmapOp.
  - Needed to convert between the bit vectors produced by DaphneIR's elementwise comparison ops on matrices and the recently introduced columnar ops, which typically expect position lists.
  - With the simplification rewrites we will add soon, these conversions will be eliminated afterwards.
- The code in this commit is partly based on code from PRs #640/#641 by @tomschw.

Co-authored-by: Tom Schwarzburg <tom.schwarzburg@mailbox.tu-dresden.de>
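
The relationship between the two representations that ConvertBitmapToPosListOp and ConvertPosListToBitmapOp translate between can be sketched as follows. This is an illustration in Python, not the kernels' implementation; the function names and the explicit `num_rows` parameter are assumptions.

```python
def bitmap_to_pos_list(bitmap):
    """Positions of the set bits, in ascending order."""
    return [i for i, bit in enumerate(bitmap) if bit]

def pos_list_to_bitmap(positions, num_rows):
    """Bit vector of length `num_rows` with a 1 at every listed position."""
    bitmap = [0] * num_rows
    for p in positions:
        bitmap[p] = 1
    return bitmap

# An elementwise comparison on a matrix column yields a bit vector ...
values = [5, 30, 12, 40, 8]
bitmap = [1 if v > 10 else 0 for v in values]

# ... whereas a columnar op would typically expect a position list.
positions = bitmap_to_pos_list(bitmap)

# For a sorted, duplicate-free position list the conversion is lossless.
roundtrip = pos_list_to_bitmap(positions, len(values))
```

Because the two conversions undo each other, a conversion that directly follows its counterpart can be dropped without changing the result.
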
- This commit contributes various simplification rewrites that remove redundancies introduced during lowering to columnar ops.
- These rewrites are implemented as canonicalizations of the respective ops, so there is no dedicated pass for simplifying the columnar IR.
- The reason is that most of these rewrites are not specific to columnar ops at all, but could be applied in many DaphneDSL scripts; thus, they should always be applied, not just when columnar processing is turned on.
- DaphneIR ops that can become unused through these rewrites were assigned the Pure trait, such that CSE/DCE can remove them.
- The code in this commit is partly based on code from PRs #640/#641 by @tomschw.

Co-authored-by: Tom Schwarzburg <tom.schwarzburg@mailbox.tu-dresden.de>
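
As a toy illustration of the kind of redundancy such a rewrite removes, the following Python sketch simplifies an expression tree in which a conversion is applied directly to the result of its inverse. It does not use MLIR's canonicalization API. The tuple-based IR and all op names are hypothetical, and the sketch assumes that position lists are sorted and duplicate-free, so that the two conversions really are inverses of each other.

```python
# Hypothetical op names; each conversion maps to its inverse.
INVERSE = {
    "bitmap_to_pos": "pos_to_bitmap",
    "pos_to_bitmap": "bitmap_to_pos",
}

def simplify(expr):
    """Recursively drop any conversion applied directly to its inverse.

    An expression is either a leaf (any non-tuple value) or a tuple
    (op_name, arg_1, ..., arg_n).
    """
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    if (
        op in INVERSE
        and len(args) == 1
        and isinstance(args[0], tuple)
        and args[0][0] == INVERSE[op]
    ):
        # conv(inverse_conv(x)) simplifies to x
        return args[0][1]
    return (op, *args)

# IR as it might look right after lowering a single op in isolation:
lowered = (
    "col_project",
    "price",
    ("bitmap_to_pos", ("pos_to_bitmap", ("col_select_lt", "quantity", 25))),
)
simplified = simplify(lowered)
```

Once the conversion pair is gone, the inner `pos_to_bitmap` node has no remaining users, which is the situation in which marking ops as Pure lets CSE/DCE remove them.
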
- This commit makes sure that the recently introduced columnar ops and column type can be lowered to CallKernelOp and the LLVM dialect.
- To this end, some small but decisive additions were made to RewriteToCallKernelOpPass, LowerToLLVMPass, KernelCatalogParser, and CompilerUtils.
- The code in this commit is partly based on code from PRs #640/#641 by @tomschw.

Co-authored-by: Tom Schwarzburg <tom.schwarzburg@mailbox.tu-dresden.de>
- This commit adds a simple implementation of a column runtime data structure to be consumed and produced by the upcoming kernels for columnar ops.
- A column has a homogeneous value type and is backed by a dense array of values stored contiguously in memory.
- Slicing and serialization are not supported on columns yet.
- New kernels:
  - This commit adds specializations of the castObj-kernel for casts between Column and DenseMatrix as well as between Column and Frame.
  - Furthermore, a specialization for the castObjSca-kernel for casts from a (1x1) Column to a scalar was added.
- Utilities for unit tests:
  - The checkEq-kernel got a specialization for Column.
  - The genGivenVals() utility supports Column now.
    - So far, the #rows needed to be passed to genGivenVals() and the #columns was deduced.
    - As a Column instance always has a single column, the #rows would be redundant and actually annoying when writing unit tests.
    - Thus, an alternative variant of genGivenVals() was added which does not require the #rows; it generates a data object with a single column; this new variant can be used with matrices, too.
- Added unit test cases for the new kernels.
- The code in this commit is partly based on code from PRs #640/#641 by @tomschw.

Co-authored-by: Tom Schwarzburg <tom.schwarzburg@mailbox.tu-dresden.de>
- Added kernels for all currently supported columnar ops.
- Added unit tests for all these kernels.
- Added kernels for ConvertPosListToBitmapOp and ConvertBitmapToPosListOp.
- Added unit tests for these kernels.
- The code in this commit is partly based on code from PRs #640/#641 by @tomschw.

Co-authored-by: Tom Schwarzburg <tom.schwarzburg@mailbox.tu-dresden.de>
- This commit adds a few script-level test cases that check if the processing of relational queries using columnar ops works end-to-end.
- Each test case consists of a relational query over some schema and data.
- Each query is expressed in two ways: (a) in DaphneDSL and (b) in SQL (embedded into DaphneDSL); that way, we make sure that the slightly different IRs after pure DaphneDSL and DAPHNE's SQL parser both work with columnar lowering, simplifications, and execution.
- The test cases check:
  - That the query result is correct both with columnar processing (using columnar ops) and without columnar processing (using matrix/frame ops).
  - That the IR after lowering to columnar ops:
    - contains certain columnar ops and doesn't contain certain matrix/frame ops, when columnar processing is turned on.
    - does not contain certain columnar ops and contains certain matrix/frame ops, when columnar processing is turned off.
    - does not contain the ops converting between bitmaps and position lists, no matter if columnar processing is turned on/off (because these ops should always be optimized away).
- Little fix: Made shape inference aware of the Column data type.
- Closes #61.
@pdamme pdamme force-pushed the 61-columnar-processing branch from 9105e9b to f86f15c Compare May 7, 2025 21:15
@pdamme pdamme mentioned this pull request May 15, 2025
@pdamme pdamme merged commit 31d2c67 into main May 16, 2025
3 checks passed

Development

Successfully merging this pull request may close these issues.

Lowering from relational/frame operations to columnar operations
