lingo-db
diff --git a/versioned_docs/version-0.0.2/Design/Storage.md b/versioned_docs/version-0.0.2/Design/Storage.md
diff --git a/versioned_docs/version-0.0.2/Design/_index.md b/versioned_docs/version-0.0.2/Design/_index.md
diff --git a/versioned_docs/version-0.0.2/ForDevelopers/Contributing.md b/versioned_docs/version-0.0.2/ForDevelopers/Contributing.md
diff --git a/versioned_docs/version-0.0.2/ForDevelopers/Debugging.md b/versioned_docs/version-0.0.2/ForDevelopers/Debugging.md
diff --git a/versioned_docs/version-0.0.2/ForDevelopers/Dependencies.md b/versioned_docs/version-0.0.2/ForDevelopers/Dependencies.md
diff --git a/versioned_docs/version-0.0.2/ForDevelopers/PythonPackage.md b/versioned_docs/version-0.0.2/ForDevelopers/PythonPackage.md
diff --git a/versioned_docs/version-0.0.2/ForDevelopers/Settings.md b/versioned_docs/version-0.0.2/ForDevelopers/Settings.md
diff --git a/versioned_docs/version-0.0.2/GettingStarted/Benchmarking.md b/versioned_docs/version-0.0.2/GettingStarted/Benchmarking.md
diff --git a/versioned_docs/version-0.0.2/GettingStarted/CommandLineTools.md b/versioned_docs/version-0.0.2/GettingStarted/CommandLineTools.md
diff --git a/versioned_docs/version-0.0.2/GettingStarted/Install.md b/versioned_docs/version-0.0.2/GettingStarted/Install.md
@@ -0,0 +1,22 @@
+---
+title: Storage
+weight: 1
+---
+The research conducted with LingoDB does not focus on storage aspects of database systems.
+Thus, LingoDB does not come with an optimized storage backend and currently does not provide transactional semantics.
+
+## In-Memory Format: Apache Arrow
+The Apache Arrow columnar layout is used for the in-memory representation of tabular data.
+Thus, LingoDB can exchange data with existing libraries and frameworks without any overhead and can directly query Apache Arrow tables.
+
+## Persistent Storage
+For many practical purposes, persistent storage is required.
+We chose a pragmatic approach:
+
+1. Each database is represented by multiple files placed in one *database directory*
+2. In this directory, each table is represented by multiple files, each starting with the name of the table:
+    1. *name*`.metadata.json`: stores metadata relevant to LingoDB. This includes basic information such as column names and internal column types, as well as statistics and available indices.
+    2. *name*`.arrow`: stores the contents of the table using Apache Arrow's IPC format.
+    3. *name*`.arrow.sample`: optionally stores a sample of up to 1024 rows randomly selected from the table.
+
+Given the database directory, LingoDB automatically detects the available tables and loads their metadata, data, and samples.
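The file layout described above can be illustrated with a short Python sketch. This is not LingoDB's loader: the function name `discover_tables` and the example table names are made up for illustration, and the sketch only groups file names by the three suffixes listed above without reading any Arrow data.

```python
import tempfile
from pathlib import Path

# File suffixes named in the list above, longest first so that
# "orders.arrow.sample" is not mistaken for "orders.arrow".
SUFFIXES = [
    (".arrow.sample", "sample"),
    (".metadata.json", "metadata"),
    (".arrow", "data"),
]


def discover_tables(db_dir):
    """Group the files of a database directory by table name (illustrative only)."""
    tables = {}
    for path in sorted(Path(db_dir).iterdir()):
        for suffix, kind in SUFFIXES:
            if path.name.endswith(suffix):
                table = path.name[: -len(suffix)]
                tables.setdefault(table, {})[kind] = path.name
                break
    return tables


# Build a throwaway directory with empty placeholder files to show the grouping.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("lineitem.metadata.json", "lineitem.arrow",
                 "lineitem.arrow.sample", "orders.metadata.json", "orders.arrow"):
        (Path(tmp) / name).touch()
    found = discover_tables(tmp)
    print(found)
```

In this example, `orders` has no sample file, which matches the description of the sample as optional.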
@@ -0,0 +1,7 @@
+---
+title: Design
+type: docs
+weight: 4
+---
+
+This section gives an overview of the overall design of LingoDB.
@@ -0,0 +1,44 @@
+
+LingoDB is an open-source project that welcomes contributions from the community.
+However, it is also a research project that still undergoes major changes (often not in public repositories) that might conflict with your contributions.
+Furthermore, the project is developed by a very small team of researchers and students, which means that we have limited resources to review and merge pull requests.
+Finally, we have to ensure that the codebase stays maintainable and that the project's goals are met.
+Thus, please follow the guidelines below when planning to contribute to LingoDB.
+
+### Micro-Changes such as fixing typos, etc
+If you find a small typo or similar in one of the LingoDB repositories, please open an *Issue* in the respective repository.
+We won't accept pull requests for such small changes, but we will be happy to fix them ourselves as soon as possible.
+
+Examples:
+* Typos
+* Slight rephrasing of existing sentences
+* Updating npm dependencies
+* ...
+
+### Medium-sized Changes: Create a Pull Request
+If you want to contribute a medium-sized change, please create a pull request in the respective repository.
+
+Examples:
+* Any changes to the documentation
+* Bug-Fixes that do not require large changes/redesign (e.g., fixing a segfault)
+* Smallish new features (e.g., adding a new command line option or a new SQL function such as `sin`)
+* Adding new tests
+
+### Large Changes: Discuss first
+If you want to contribute a larger change, please open an issue in the respective repository first.
+This way, we can discuss the change before you start working on it and we can avoid situations like:
+* You working on a feature that is already in development
+* You working on a feature that is not in line with the project's goals and won't be merged
+* You working on a feature that will not be working soon due to other changes in the project
+
+Examples:
+* Add a new compilation backend/target
+* Refactor the SQL parser
+* Refactorings
+* Larger features that touch the code base in many places
+* Anything that is more "researchy"
+
+### Before Creating a Pull Request
+Before creating a pull request, please make sure that:
+* the CI pipeline passes and the coverage does not decrease.
+* the code is formatted according to the `.clang-format` file in the repository.
@@ -0,0 +1,60 @@
+---
+title: Debugging & Profiling
+---
+
+Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
+Especially debugging and profiling can become a challenge, as one not only needs to debug and profile the engine code, but also the generated code.
+Possible solutions to these problems have been discussed before for debugging [HyPer](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321) and for [profiling Umbra](https://dl.acm.org/doi/abs/10.1145/3447786.3456254).
+
+## Guide: Profiling queries
+For profiling queries, LingoDB comes with a *ct* tool that collects several metrics.
+For the following instructions, we assume that LingoDB was built in Release mode with debug information (`build/lingodb-relwithdebinfo/.buildstamp`).
+
+1. Run the `ct.py` script with a query and a dataset: `python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/`. If the build directory is not `build/lingodb-relwithdebinfo`, it can be supplied with the `BIN_DIR` environment variable.
+2. Open the resulting `ct.json` file with the [CT viewer](https://ct.lingo-db.com) and explore it in detail.
+
+## Guide: Debugging
+* If the compilation fails: Use [Snapshotting](#snapshotting) to identify the broken/problematic pass. Then run the pass in isolation with [mlir-db-opt](../GettingStarted/CommandLineTools.md#performing-optimizations-and-lowerings) for detailed debugging (e.g., with gdb).
+* If compilation succeeds but execution fails in or because of generated code: First check whether the error persists when switching to the [C++-Backend](#c-backend), if that is possible (i.e., all MLIR operations are supported).
+  * If yes: debug with this backend.
+  * If not: use the [LLVM Debug Backend](#llvm-debug-backend).
+
+## Components for Debugging and Profiling
+### Location Tracking in MLIR
+In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
+While it is possible to provide an *Unknown Location*, this should be avoided.
+When parsing an MLIR file, MLIR automatically annotates the parsed operations with the corresponding file locations.
+When new operations are created during a pass, they are usually annotated with the location of the operation that is currently being transformed or lowered.
+**All passes in LingoDB ensure that correct locations are set afterwards.**
+
+### Snapshotting
+MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
+This file is then read back in, and the operations are annotated with locations *according to their position inside this newly written file*.
+
+If enabled (cf. [Settings](Settings.md)), LingoDB performs location snapshots after every MLIR pass or after selected (important) MLIR passes.
+
+Using these snapshot files, we can track the origin of any operation by recursively applying the following steps:
+1. get the origin location of the current operation by looking in the appropriate snapshot file
+2. find the origin operation by going to this location
+
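The two steps above can be sketched in Python. This is only an illustration under several assumptions: the snapshot file names, the operation names, and the helper `trace_origin` are made up, snapshots are held in memory instead of being read from disk, and every operation is assumed to carry its location inline as `loc("file":line:col)`. Real snapshot files may instead use location aliases collected at the end of the file, which this sketch does not handle.

```python
import re

# MLIR prints file locations as loc("file":line:col).
LOC_PATTERN = re.compile(r'loc\("([^"]+)":(\d+):(\d+)\)')


def trace_origin(snapshots, file_name, line):
    """Follow location annotations backwards through the snapshots.

    snapshots maps a (hypothetical) snapshot file name to its list of lines.
    Returns the chain of (file, line) pairs, newest first.
    """
    chain = [(file_name, line)]
    while file_name in snapshots:
        text = snapshots[file_name][line - 1]  # line numbers are 1-based
        match = LOC_PATTERN.search(text)
        if match is None:
            break  # no usable location: the trail ends here
        file_name, line = match.group(1), int(match.group(2))
        chain.append((file_name, line))
    return chain


# Two made-up snapshots; operation and file names are placeholders.
snapshots = {
    "snapshot-1.mlir": [
        'module {',
        '  "demo.scan"() : () -> () loc("input.mlir":2:3)',
        '}',
    ],
    "snapshot-2.mlir": [
        'module {',
        '  "demo.loop"() : () -> () loc("snapshot-1.mlir":2:3)',
        '}',
    ],
}

chain = trace_origin(snapshots, "snapshot-2.mlir", 2)
print(chain)
```

The chain ends as soon as a location points to a file that is not one of the snapshots, which in this example is the original input.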
+### Special Compiler Backends
+In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
+
+#### LLVM-Debug Backend
+Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
+This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
+During execution, standard debuggers like `gdb` will then point to the corresponding operation in the last snapshot that was performed.
+This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
+
+#### C++-Backend
+For more advanced debugging, a *C++-Backend* can be used by setting `LINGODB_EXECUTION_MODE=C`.
+This backend directly translates a fixed set of low-level generic MLIR operations to C++ statements and functions that are written to a file called `mlir-c-module.cpp`.
+Next, LingoDB automatically invokes `clang++` (must be installed!) with `-O0` and `-g` to compile this C++ file into a shared library with debug information.
+This shared library is then loaded with `dlopen` and the main function is called.
+Thus, the generated code can be debugged as any usual C++ program.
+To help with tracing an error back to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
+
+
+### Lightweight Tracing
+When compiled as `RelWithDebInfo`, LingoDB will produce a trace file, `trace.json`, with events (type, start timestamp, duration, thread).
+This trace file can then be opened with the [CT Viewer](https://ct.lingo-db.com).
@@ -0,0 +1,9 @@
+* All "non-standard" dependencies are packaged as python programs
+* MLIR/LLVM is also packaged as a python program.
+* ***This will be subject to change in the near future!*** We are working on using system-wide installed MLIR/LLVM packages and on reducing the number of dependencies in general.
+
+
+### Building the custom MLIR/LLVM package
+* in `tools/mlir-package`:  
+    * `docker build -t mlir-package .`
+    *  `docker run -v ".:/built-packages" -v ".:/repo"  --rm -it mlir-package /usr/bin/create_package.sh cp312-cp312`
@@ -0,0 +1,41 @@
+---
+title: Python Package
+---
+
+Currently, LingoDB is distributed as two separate python packages:
+* `lingodb-bridge`: bundles LingoDB as a binary and implements a basic integration using pybind11
+* `lingodb`: a python-only library that wraps `lingodb-bridge` and provides a nice interface (and much more in the future)
+
+## Working on `lingodb`
+If you only plan to adapt/extend the python implementation, you do not have to build the `lingodb-bridge` package yourself.
+First install the current version of the `lingodb-bridge` package.
+```sh
+pip install lingodb-bridge
+```
+Then, install the package in *development mode* so that you can just change the code (`tools/python/lingodb`) and directly test the changes:
+```sh
+cd tools/python
+python -m pip install -e .
+```
+For building a release package:
+```sh
+cd tools/python
+python -m build .
+```
+
+## Building `lingodb-bridge`
+Building a python binary wheel is non-trivial but becomes easy with the docker image we prepared. Just execute the following command at the repository's root:
+```sh
+make build-py-bridge PYVERSION=[VERSION]
+```
+where `[VERSION]` is one of:
+* `310`: for Python 3.10
+* `311`: for Python 3.11
+* `312`: for Python 3.12
+
+This will then create a wheel in the current directory that can be installed, e.g.:
+```sh
+pip install lingodb_bridge-0.0.0-cp310-cp310-manylinux_2_28_x86_64.whl
+```
+
+
@@ -0,0 +1,16 @@
+---
+title: Settings
+---
+| Setting                        | Environment Variable            | Description                                                                           | Values                                                                                                                                                                                                |
+|--------------------------------|---------------------------------|---------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `system.execution_mode`        | `LINGODB_EXECUTION_MODE`        | Choose execution backend                                                              | `DEFAULT`: LLVM O2<br/> `CHEAP`: fast LLVM <br/> `SPEED`: omit checks for speed<br/> `DEBUGGING`: LLVM O0 with debug info<br/> `C`: C Backend<br/> `PERF`: LLVM O2, with debug info, record with perf |
+| `system.subop.opt`             | `LINGODB_SUBOP_OPT`             | Manually select SubOp optimizations                                                   | Comma-separated list of the following pass names: `GlobalOpt`, `ReuseLocal`, `Specialize`, `PullGatherUp`, `Compression`                                                                              |
+| `system.snapshot_passes`       | `LINGODB_SNAPSHOT_PASSES`       | Enables [snapshotting](Debugging.md#snapshotting)                                     | Boolean value: `true` or `false`                                                                                                                                                                      |
+| `system.snapshot_level`        | `LINGODB_SNAPSHOT_LEVEL`        | Sets the level of detail of snapshots                                                 | `full`: Perform a snapshot after every MLIR pass<br/>`important`: only performs snapshots at selected steps in the compilation pipeline                                                               |
+| `system.snapshot_dir`          | `LINGODB_SNAPSHOT_DIR`          | Directory for output of snapshots                                                     | (relative) path to output directory (default: `.`)                                                                                                                                                    |
+| `system.execution.perf_file`   | `LINGODB_EXECUTION_PERF_FILE`   | Sets the output path for the perf record output                                       | (relative) path to output path (default: `perf.data`)                                                                                                                                                 |
+| `system.execution.perf_binary` | `LINGODB_EXECUTION_PERF_BINARY` | Points to the perf binary that should be used for recording                           | path to perf binary (default: `/usr/bin/perf`)                                                                                                                                                        |
+| `system.trace_dir`             | `LINGODB_TRACE_DIR`             | Sets the output directory for [lightweight tracing](Debugging.md#lightweight-tracing) | (relative) path to output directory (default: `.`)                                                                                                                                                    |
+
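The environment variables in the table can be set from a script before launching one of the LingoDB binaries. The following Python sketch only assembles such an environment. The helper name `lingodb_env` and the directory name `snapshots` are hypothetical, and the commented-out `run-sql` call assumes that the binary is on the `PATH` and that `query.sql` and `DBDIR` exist.

```python
import os


def lingodb_env(**settings):
    """Return a copy of the current environment with LINGODB_* variables added.

    Keyword names are upper-cased and prefixed, e.g. execution_mode -> LINGODB_EXECUTION_MODE.
    """
    env = dict(os.environ)
    for name, value in settings.items():
        env["LINGODB_" + name.upper()] = str(value)
    return env


# Debug backend plus snapshots at the important pipeline steps (values taken from the table).
env = lingodb_env(
    execution_mode="DEBUGGING",
    snapshot_passes="true",
    snapshot_level="important",
    snapshot_dir="snapshots",
)
print({key: value for key, value in env.items() if key.startswith("LINGODB_")})

# The environment could then be passed to a tool, for example:
# import subprocess
# subprocess.run(["run-sql", "query.sql", "DBDIR"], env=env, check=True)
```

The helper does not validate names or values, so a misspelled setting would simply be passed on unchanged.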
+
+
@@ -0,0 +1,37 @@
+LingoDB supports common OLAP benchmarks such as TPC-H, TPC-DS, JOB and SSB.
+
+## Please avoid common pitfalls
+* ***Don't use one invocation of the `sql` command to both define the schema and import the data and then run benchmark queries.*** This behavior is expected to be resolved in the future!
+* Use the right LingoDB version. If you want to reproduce LingoDB's performance reported in a paper, please use the corresponding LingoDB version:
+  * [VLDB'22](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2022) 
+  * [VLDB'23](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2023)
+* Also note that the numbers reported as execution time in VLDB'22 and VLDB'23 *exclude compilation times*.
+* Do *not* manually create Apache Arrow files, but instead use the `sql` command to define tables and import data. If relevant metadata (e.g., primary keys) is missing, LingoDB will not be able to apply many optimizations and performance will be suboptimal.
+* Use a release build of LingoDB for benchmarking. Debug builds are significantly slower.
+
+## Data Generation
+For some benchmarks, the LingoDB repository contains scripts to generate data and load them:
+```sh
+# LINGODB_BINARY_DIR is the directory containing at least the `sql` binary
+# OUTPUT_DIR is the directory where the database should be stored
+# SF is the scale factor, e.g., 1 for 1GB, 10 for 10GB, etc.
+
+# Generate TPC-H database
+bash tools/generate/tpch.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
+# Generate TPC-DS database
+bash tools/generate/tpcds.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
+# Generate JOB database
+bash tools/generate/job.sh LINGODB_BINARY_DIR OUTPUT_DIR
+# Generate SSB database
+bash tools/generate/ssb.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
+```
+Afterward, queries can be run, for example, with the `sql` command, which also reports execution times when the `LINGODB_SQL_REPORT_TIMES` environment variable is set:
+```sh
+LINGODB_SQL_REPORT_TIMES=1 sql OUTPUT_DIR
+sql>select count(*) from lineitem;
+|                         count  |
+----------------------------------
+|                       6001215  |
+ compilation: 95.79 [ms] execution: 2.815 [ms]
+```
+
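When collecting numbers from many runs, the timing line printed by the `sql` shell can be parsed by a script. The following Python sketch assumes the line format shown in the sample output above, which may differ between LingoDB versions. The names `TIMING_PATTERN` and `parse_times` are made up for illustration.

```python
import re

# Matches lines such as " compilation: 95.79 [ms] execution: 2.815 [ms]".
TIMING_PATTERN = re.compile(
    r"compilation:\s*([0-9.]+)\s*\[ms\]\s*execution:\s*([0-9.]+)\s*\[ms\]"
)


def parse_times(output):
    """Return (compilation_ms, execution_ms) for every timing line in the output."""
    return [(float(comp), float(execu)) for comp, execu in TIMING_PATTERN.findall(output)]


sample = " compilation: 95.79 [ms] execution: 2.815 [ms]\n"
times = parse_times(sample)
print(times)
```

Keeping compilation and execution times separate makes it easier to compare against the paper numbers, which exclude compilation time.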
@@ -0,0 +1,60 @@
+---
+title: Command Line Tools
+type: docs
+weight: 2
+---
+LingoDB comes with a few command line tools to simplify experimentation, development and debugging.
+
+## Interactive SQL Shell
+```sh
+$ sql DBDIR
+sql> select 1;
+```
+
+Similar to other systems, LingoDB can be used interactively through the `sql` binary, which is pointed to a (possibly empty) directory that holds the database to be queried. Each query must be terminated by a `;`. By default, only a *read-only* session is created. To persist changes, enter `SET persist=1;`.
+
+## Converting SQL to MLIR
+```sh
+$ sql-to-mlir SQL-File DBDIR
+```
+Using the `sql-to-mlir` tool, SQL queries can be converted to a corresponding, unoptimized MLIR module. As this requires the database schema, the database directory must also be provided.
+
+## Performing Optimizations and Lowerings
+```sh
+$ mlir-db-opt [--use-db DBDIR] [Passes] MLIR-File
+```
+The `mlir-db-opt` command can be used to manually apply MLIR passes to an MLIR module provided in a file. For high-level optimizations that require, e.g., database statistics, the database directory should be provided using the `--use-db` argument.
+
+## Running MLIR Modules
+
+```sh
+$ run-mlir MLIR-File [DBDIR]
+```
+MLIR modules can be executed using the `run-mlir` binary. A database directory can be provided as the second argument.
+
+## Running SQL queries
+
+```sh
+$ run-sql SQL-File [DBDIR]
+```
+Single (read-only) SQL queries can be run with the `run-sql` utility. If the query requires a database, the corresponding database directory must be provided as the second argument.
+
+
+## The Trace of a Query
+With the following commands you can explore how a SQL query gets compiled layer by layer by looking at the different files:
+```sh
+# write example query to file
+$ echo "select * from studenten where name='Carnap'" > test.sql
+# translate sql to canonical MLIR module
+$ sql-to-mlir test.sql resources/data/uni/ > canonical.mlir
+# perform query optimization
+$ mlir-db-opt --use-db resources/data/uni/ --relalg-query-opt canonical.mlir > optimized.mlir
+# lower relational operators to sub-operators
+$ mlir-db-opt --lower-relalg-to-subop optimized.mlir > subop.mlir
+# lower sub-operators to imperative code
+$ mlir-db-opt --lower-subop subop.mlir > hl-imperative.mlir
+# lower database-specific scalar operations
+$ mlir-db-opt --lower-db hl-imperative.mlir > ml-imperative.mlir
+# lower mid-level abstraction (such as arrow tables) to low-level imperative code
+$ mlir-db-opt --lower-dsa ml-imperative.mlir > ll-imperative.mlir
+```
@@ -0,0 +1,25 @@
+---
+title: Installation
+type: docs
+weight: 1
+---
+
+## Python Package
+Install via pip, then use as [documented here](./Python.md).
+```sh
+pip install lingodb
+```
+
+## Docker Image
+You can build the docker image yourself using `make build-docker`
+
+## Building from source
+1. Ensure you have a machine with sufficient compute power and space
+1. Make sure that you have the following build dependencies installed:
+    1. Python 3.10 or higher
+    1. standard build tools, including `cmake` and `Ninja`
+1. Build LingoDB
+    * Debug version: `make build-debug` (will create binaries under `build/lingodb-debug`)
+    * Release version: `make build-release` (will create binaries under `build/lingodb-release`)
+1. Run the tests: `make run-test`
+