lingo-db
diff --git a/‎versioned_docs/version-0.0.3/Design/Storage.md‎
Lines changed: 22 additions & 0 deletions b/‎versioned_docs/version-0.0.3/Design/Storage.md‎
diff --git a/‎versioned_docs/version-0.0.3/Design/_index.md‎
Lines changed: 7 additions & 0 deletions b/‎versioned_docs/version-0.0.3/Design/_index.md‎
diff --git a/‎versioned_docs/version-0.0.3/ForDevelopers/Contributing.md‎
Lines changed: 44 additions & 0 deletions b/‎versioned_docs/version-0.0.3/ForDevelopers/Contributing.md‎
diff --git a/‎versioned_docs/version-0.0.3/ForDevelopers/Debugging.md‎
Lines changed: 60 additions & 0 deletions b/‎versioned_docs/version-0.0.3/ForDevelopers/Debugging.md‎
diff --git a/‎versioned_docs/version-0.0.3/ForDevelopers/Dependencies.md‎
Lines changed: 68 additions & 0 deletions b/‎versioned_docs/version-0.0.3/ForDevelopers/Dependencies.md‎
diff --git a/‎versioned_docs/version-0.0.3/ForDevelopers/PythonPackage.md‎
Lines changed: 41 additions & 0 deletions b/‎versioned_docs/version-0.0.3/ForDevelopers/PythonPackage.md‎
diff --git a/‎versioned_docs/version-0.0.3/ForDevelopers/Settings.md‎
Lines changed: 16 additions & 0 deletions b/‎versioned_docs/version-0.0.3/ForDevelopers/Settings.md‎
diff --git a/‎versioned_docs/version-0.0.3/GettingStarted/Benchmarking.md‎
Lines changed: 37 additions & 0 deletions b/‎versioned_docs/version-0.0.3/GettingStarted/Benchmarking.md‎
@@ -0,0 +1,22 @@
+---
+title: Storage
+weight: 1
+---
+The research conducted with LingoDB does not focus on storage aspects of database systems.
+Thus, LingoDB does not come with an optimized storage backend and currently does not provide transactional semantics.
+
+## In-Memory Format: Apache Arrow
+The Apache Arrow columnar layout is used for the in-memory representation of tabular data.
+Thus, LingoDB can exchange data with existing libraries and frameworks without any overhead and can directly query Apache Arrow tables.
+
+## Persistent Storage
+For many practical purposes, persistent storage is required.
+We chose a pragmatic approach:
+
+1. Each database is represented by multiple files placed in one *database directory*.
+2. In this directory, each table is represented by multiple files, each starting with the name of the table:
+    1. *name*`.metadata.json`: stores metadata relevant to LingoDB. This includes basic information like column names and internal column types, but also statistics and available indices.
+    2. *name*`.arrow`: stores the contents of the table using Apache Arrow's IPC format.
+    3. *name*`.arrow.sample`: optionally stores a sample of up to 1024 rows randomly selected from the table.
+
+Given the database directory, LingoDB automatically detects the available tables and loads their metadata, data, and samples.
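
The file layout described above can be inspected with a few lines of Python. This is a minimal sketch, not LingoDB's actual loader: the helper name `list_tables` and the shape of the returned dictionary are assumptions made for illustration; only the three file suffixes are taken from the description above.

```python
from pathlib import Path

METADATA_SUFFIX = ".metadata.json"


def list_tables(database_dir):
    """Map each table name to the paths of its files (hypothetical helper)."""
    tables = {}
    # Every table is identified by its `<name>.metadata.json` file.
    for metadata in sorted(Path(database_dir).glob("*" + METADATA_SUFFIX)):
        name = metadata.name[: -len(METADATA_SUFFIX)]
        data = metadata.with_name(name + ".arrow")
        sample = metadata.with_name(name + ".arrow.sample")
        tables[name] = {
            "metadata": metadata,
            "data": data if data.exists() else None,
            # The sample file is optional, so it may be missing.
            "sample": sample if sample.exists() else None,
        }
    return tables
```

Calling `list_tables("path/to/database-directory")` returns one entry per table; the Arrow files themselves could then be opened with any Arrow IPC reader.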
@@ -0,0 +1,7 @@
+---
+title: Design
+type: docs
+weight: 4
+---
+
+This section gives an overview of the design of LingoDB.
@@ -0,0 +1,44 @@
+
+LingoDB is an open-source project that welcomes contributions from the community.
+However, it is also a research project that still undergoes major changes (often not in public repositories) that might conflict with your contributions.
+Furthermore, the project is developed by a very small team of researchers and students, which means that we have limited resources to review and merge pull requests.
+Finally, we have to ensure that the codebase stays maintainable and that the project's goals are met.
+Thus, please follow the guidelines below when planning to contribute to LingoDB.
+
+### Micro-Changes such as fixing typos, etc
+If you find a small typo or similar in one of the LingoDB repositories, please open an *Issue* in the respective repository.
+We won't accept pull requests for such small changes, but we will be happy to fix them ourselves as soon as possible.
+
+Examples:
+* Typos
+* Slight rephrasing of existing sentences
+* Updating npm dependencies
+* ...
+
+### Medium-sized Changes: Create a Pull Request
+If you want to contribute a medium-sized change, please create a pull request in the respective repository.
+
+Examples:
+* Any changes to the documentation
+* Bug-Fixes that do not require large changes/redesign (e.g., fixing a segfault)
+* Smallish new features (e.g., adding a new command line option, adding a new SQL function (e.g., `sin`))
+* Adding new tests
+
+### Large Changes: Discuss first
+If you want to contribute a larger change, please open an issue in the respective repository first.
+This way, we can discuss the change before you start working on it and we can avoid situations like:
+* You working on a feature that is already in development
+* You working on a feature that is not in line with the project's goals and won't be merged
+* You working on a feature that will not be working soon due to other changes in the project
+
+Examples:
+* Add a new compilation backend/target
+* Refactor the SQL parser
+* Refactorings
+* Larger features that touch the code base in many places
+* Anything that is more "researchy"
+
+### Before Creating a Pull Request
+Before creating a pull request, please make sure that:
+* the CI pipeline passes and the coverage does not decrease
+* the code is formatted according to the `.clang-format` file in the repository
@@ -0,0 +1,60 @@
+---
+title: Debugging & Profiling
+---
+
+Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
+Especially debugging and profiling can become a challenge, as one not only needs to debug and profile the engine code, but also the generated code.
+Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321) and [profiling Umbra](https://dl.acm.org/doi/abs/10.1145/3447786.3456254).
+
+## Guide: Profiling queries
+For profiling queries, LingoDB comes with a *ct* tool that collects several metrics.
+For the following instructions, we assume that LingoDB was built in Release mode with debug information (`build/lingodb-relwithdebinfo/.buildstamp`).
+
+1. Run the `ct.py` script with a query and a dataset: `python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/`. If the build directory is not `build/lingodb-relwithdebinfo`, it can be supplied with the `BIN_DIR` environment variable.
+2. Open the resulting `ct.json` file with the [CT viewer](https://ct.lingo-db.com) and explore it in detail.
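
When profiling several queries, step 1 can be scripted. The following is a minimal sketch under the assumption that it is run from the repository root; the helper names `ct_command`, `ct_environment`, and `run_ct` are hypothetical, while the script path and the `BIN_DIR` variable are taken from the instructions above.

```python
import os
import subprocess


def ct_command(query_file, dataset_dir):
    """Build the command line from step 1 (hypothetical helper)."""
    return ["python3", "tools/ct/ct.py", query_file, dataset_dir]


def ct_environment(bin_dir=None, base_env=None):
    """Copy the environment; set BIN_DIR only for a non-default build directory."""
    env = dict(os.environ if base_env is None else base_env)
    if bin_dir is not None:
        env["BIN_DIR"] = bin_dir
    return env


def run_ct(query_file, dataset_dir, bin_dir=None):
    """Run ct.py; raises subprocess.CalledProcessError if the script fails."""
    return subprocess.run(
        ct_command(query_file, dataset_dir),
        env=ct_environment(bin_dir),
        check=True,
    )
```

For example, `run_ct("resources/sql/tpch/1.sql", "resources/data/tpch-1/")` corresponds to the command shown in step 1.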
+
+## Guide: Debugging
+* If the compilation fails: Use [Snapshotting](#snapshotting) to identify the broken/problematic pass. Then run the pass in isolation with [mlir-db-opt](../GettingStarted/CommandLineTools.md#performing-optimizations-and-lowerings) for detailed debugging (e.g., with gdb).
+* If compilation succeeds but execution fails in or because of the generated code: First check whether the error persists when switching to the [C++-Backend](#c-backend), if possible (i.e., if all MLIR operations are supported).
+  * If yes: debug with this backend.
+  * If not: use the [LLVM Debug Backend](#llvm-debug-backend).
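
The decision procedure above can be summarized as a small function. This is only an illustrative sketch: the function name, its parameters, and the returned dictionary are hypothetical, and it assumes that the LLVM debug backend is also the fallback when the C++ backend cannot be used at all. The execution-mode values `C` and `DEBUGGING` are the ones described for the special compiler backends.

```python
def choose_debug_strategy(compilation_fails,
                          cpp_backend_possible=False,
                          error_persists_with_cpp=False):
    """Return the suggested approach and the LINGODB_EXECUTION_MODE to set, if any."""
    if compilation_fails:
        # Find the broken pass via snapshots, then rerun it in isolation.
        return {"approach": "snapshotting + mlir-db-opt", "execution_mode": None}
    if cpp_backend_possible and error_persists_with_cpp:
        # The generated C++ code can be debugged like a usual C++ program.
        return {"approach": "C++-Backend", "execution_mode": "C"}
    # Assumption: fall back to the LLVM debug backend in all other cases.
    return {"approach": "LLVM-Debug Backend", "execution_mode": "DEBUGGING"}
```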
+
+## Components for Debugging and Profiling
+### Location Tracking in MLIR
+In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
+While it is possible to provide an *Unknown Location*, it should be avoided.
+When parsing an MLIR file, MLIR automatically annotates the parsed operations with the corresponding file locations.
+When new operations are created during a pass, they are usually annotated with the location of the current operation that is transformed or lowered.
+**All passes in LingoDB ensure that correct locations are set afterwards.**
+
+### Snapshotting
+MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
+This file is then read back in, and each operation is annotated with its location *inside this newly written file*.
+
+If enabled (cf. [Settings](Settings.md)), LingoDB performs location snapshots either after every MLIR pass or after selected (important) passes.
+
+Using these snapshot files, we can track the origin of any operation by recursively applying the following steps:
+1. Get the origin location of the current operation by looking in the appropriate snapshot file.
+2. Find the origin operation by going to this location.
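
The two steps can be sketched as a simple loop. The sketch uses a deliberately simplified, hypothetical in-memory model: a dictionary that maps a snapshot file name and a line number to the operation text and its origin location. Real snapshot files are MLIR text files, and parsing them is not shown here; the file and operation names in the usage note are made up.

```python
def trace_origin(snapshots, location):
    """Follow origin locations from one snapshot to the previous one.

    snapshots: {file_name: {line: (operation_text, origin_location_or_None)}}
    location:  (file_name, line) of the operation to start from
    Returns a list of (file_name, line, operation_text), newest first.
    """
    chain = []
    seen = set()
    while location is not None and location not in seen:
        seen.add(location)  # guards against accidental cycles in the input
        file_name, line = location
        entry = snapshots.get(file_name, {}).get(line)
        if entry is None:
            break  # location points outside the known snapshots
        operation_text, origin = entry
        chain.append((file_name, line, operation_text))
        location = origin  # step 2: continue at the origin operation
    return chain
```

For instance, with `snapshots = {"b.mlir": {7: ("llvm.add", ("a.mlir", 3))}, "a.mlir": {3: ("arith.addi", None)}}`, calling `trace_origin(snapshots, ("b.mlir", 7))` yields the low-level operation followed by the operation it was lowered from.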
+
+### Special Compiler Backends
+In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
+
+#### LLVM-Debug Backend
+Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
+This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
+During execution, standard debuggers like `gdb` will then point to the corresponding operation in the last snapshot that was performed.
+This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
+
+#### C++-Backend
+For more advanced debugging, a *C++-Backend* can be used by setting `LINGODB_EXECUTION_MODE=C`.
+This backend directly translates a fixed set of low-level generic MLIR operations to C++ statements and functions that are written to a file called `mlir-c-module.cpp`.
+Next, LingoDB automatically invokes `clang++` (must be installed!) with `-O0` and `-g` to compile this C++ file into a shared library with debug information.
+This shared library is then loaded with `dlopen` and the main function is called.
+Thus, the generated code can be debugged as any usual C++ program.
+To help with tracking an error to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
+
+
+### Lightweight Tracing
+When compiled as `RelWithDebInfo`, LingoDB produces a trace file, `trace.json`, with events (type, start timestamp, duration, thread).
+This trace file can then be opened with the [CT Viewer](https://ct.lingo-db.com).
@@ -0,0 +1,68 @@
+LingoDB relies on three main external dependencies:
+* [LLVM/MLIR 20](https://github.com/llvm/llvm-project)
+* [Apache Arrow 19](https://arrow.apache.org/release/19.0.0.html)
+* [Boost Context 1.83](https://www.boost.org/doc/libs/1_83_0/libs/context/doc/html/index.html)
+
+**Additional tools and libraries required:**
+* C++ compiler supporting C++ 20
+* CMake 3.13.4 or newer
+* Ninja
+* lit (optional, for testing); can be installed, e.g., via `pip install lit`
+
+We also provide a [Dockerfile](https://github.com/lingo-db/lingo-db/pkgs/container/lingodb-dev) that contains all dependencies and tools required to build LingoDB.
+
+When building dependencies from source, make sure that either the CMake config files are installed in a system-wide location or that, for example, `CMAKE_PREFIX_PATH` is set accordingly.
+
+## LLVM/MLIR
+### Ubuntu/Linux
+Follow the instructions on [https://apt.llvm.org/](https://apt.llvm.org/) to install the repository on your system.
+Then install the following packages: `clang-20 llvm-20 libclang-20-dev llvm-20-dev libmlir-20-dev mlir-20-tools clang-tidy-20`
+
+### Binaries
+For other recent Linux distributions, you can also rely on the pre-built binaries provided by the LLVM project on the GitHub release pages.
+
+### Building from Source
+
+```shell
+wget https://github.com/llvm/llvm-project/releases/download/llvmorg-20.1.0-rc1/llvm-project-20.1.0-rc1.src.tar.xz 
+tar -xf llvm-project-20.1.0-rc1.src.tar.xz 
+mkdir llvm-project-20.1.0-rc1.src/build
+cd llvm-project-20.1.0-rc1.src
+env VIRTUAL_ENV=/venv cmake -B build  -DLLVM_ENABLE_PROJECTS="llvm;mlir;clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD="X86" -DLLVM_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja -DLLVM_ENABLE_ASSERTIONS=OFF  -DLLVM_BUILD_TESTS=OFF -DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=OFF -DLLVM_ENABLE_DUMP=ON -DLLVM_ENABLE_FFI=ON -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer" -DLLVM_PARALLEL_LINK_JOBS=1 -DLLVM_PARALLEL_TABLEGEN_JOBS=10 -DBUILD_SHARED_LIBS=OFF -DLLVM_INSTALL_UTILS=ON  -DLLVM_ENABLE_ZLIB=OFF -DCMAKE_INSTALL_PREFIX=[output-dir] llvm/
+cmake --build build --target install -j$(nproc)
+```
+
+
+## Apache Arrow
+### Ubuntu/Linux
+```shell
+wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename  --short).deb
+apt-get update
+apt-get install libarrow-dev=19.*
+```
+### Binaries
+For other recent Linux distributions, you can also rely on the pre-built binaries provided by the Apache Arrow project.
+
+### Building from Source
+
+```shell
+wget https://dlcdn.apache.org/arrow/arrow-19.0.1/apache-arrow-19.0.1.tar.gz
+tar -xf apache-arrow-19.0.1.tar.gz
+cd apache-arrow-19.0.1/cpp
+cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=[output-dir] -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_BUILD_STATIC=ON -DARROW_CSV=ON -DARROW_COMPUTE=ON
+cmake --build build --target install -j$(nproc)
+```
+## Boost Context
+### Ubuntu/Linux
+```shell
+apt-get install libboost-context1.83-dev
+```
+### Build from Source
+```shell
+wget https://archives.boost.io/release/1.83.0/source/boost_1_83_0.tar.gz
+tar -xf boost_1_83_0.tar.gz
+cd boost_1_83_0
+ ./bootstrap.sh --prefix=/usr # or any other directory in the PATH/LD_LIBRARY_PATH
+ ./b2 install --with-context
+```
@@ -0,0 +1,41 @@
+---
+title: Python Package
+---
+
+Currently, LingoDB is distributed as two separate Python packages:
+* `lingodb-bridge`: bundles LingoDB as a binary and implements a basic integration using pybind11
+* `lingodb`: a Python-only library that wraps `lingodb-bridge` and provides a nice interface (and much more in the future)
+
+## Working on `lingodb`
+If you only plan to adapt/extend the Python implementation, you do not have to build the `lingodb-bridge` package yourself.
+First install the current version of the `lingodb-bridge` package.
+```sh
+pip install lingodb-bridge
+```
+Then, install the package in *development mode* so that you can just change the code (`tools/python/lingodb`) and directly test the changes:
+```sh
+cd tools/python
+python -m pip install -e .
+```
+For building a release package:
+```sh
+cd tools/python
+python -m build .
+```
+
+## Building `lingodb-bridge`
+Building a Python binary wheel is non-trivial but becomes easy with the Docker image we prepared. Just execute the following command at the repository's root:
+```sh
+make build-py-bridge PYVERSION=[VERSION]
+```
+where `[VERSION]` is one of:
+* `310`: for Python 3.10
+* `311`: for Python 3.11
+* `312`: for Python 3.12
+
+This will then create a wheel in the current directory that can be installed, e.g.:
+```sh
+pip install lingodb_bridge-0.0.0-cp310-cp310-manylinux_2_28_x86_64.whl
+```
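
The name of the resulting wheel depends on the chosen `PYVERSION`. The sketch below extrapolates the naming pattern from the single example file name above; the helper `bridge_wheel_name` is hypothetical, and the package version (`0.0.0`) and platform tag of an actual build may differ.

```python
SUPPORTED_PYVERSIONS = ("310", "311", "312")


def bridge_wheel_name(pyversion, package_version="0.0.0",
                      platform_tag="manylinux_2_28_x86_64"):
    """Guess the wheel file name for a PYVERSION value (pattern is an assumption)."""
    if pyversion not in SUPPORTED_PYVERSIONS:
        raise ValueError(f"unsupported PYVERSION: {pyversion!r}")
    python_tag = "cp" + pyversion  # e.g. "310" -> "cp310" (CPython 3.10)
    return (f"lingodb_bridge-{package_version}-"
            f"{python_tag}-{python_tag}-{platform_tag}.whl")
```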
+
+
@@ -0,0 +1,16 @@
+---
+title: Settings
+---
+| Setting                        | Environment Variable            | Description                                                                           | Values                                                                                                                                                                                                |
+|--------------------------------|---------------------------------|---------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `system.execution_mode`        | `LINGODB_EXECUTION_MODE`        | Choose execution backend                                                              | `DEFAULT`: LLVM O2<br/> `CHEAP`: fast LLVM <br/> `SPEED`: omit checks for speed<br/> `DEBUGGING`: LLVM O0 with debug info<br/> `C`: C Backend<br/> `PERF`: LLVM O2, with debug info, record with perf |
+| `system.subop.opt`             | `LINGODB_SUBOP_OPT`             | Manually select SubOp optimizations                                                   | Comma-separated list of the following pass names: `GlobalOpt`, `ReuseLocal`, `Specialize`, `PullGatherUp`, `Compression`                                                                              |
+| `system.snapshot_passes`       | `LINGODB_SNAPSHOT_PASSES`       | Enables [snapshotting](Debugging.md#snapshotting)                                     | Boolean value: `true` or `false`                                                                                                                                                                      |
+| `system.snapshot_level`        | `LINGODB_SNAPSHOT_LEVEL`        | Sets the detailedness of snapshotting                                                 | `full`: Perform a snapshot after every MLIR pass<br/>`important`: only performs snapshots at selected steps in the compilation pipeline                                                               |
+| `system.snapshot_dir`          | `LINGODB_SNAPSHOT_DIR`          | Directory for output of snapshots                                                     | (relative) path to output directory (default: `.`)                                                                                                                                                    |
+| `system.execution.perf_file`   | `LINGODB_EXECUTION_PERF_FILE`   | Sets the output path for the perf record output                                       | (relative) path to output path (default: `perf.data`)                                                                                                                                                 |
+| `system.execution.perf_binary` | `LINGODB_EXECUTION_PERF_BINARY` | Points to the perf binary that should be used for recording                           | path to perf binary (default: `/usr/bin/perf`)                                                                                                                                                        |
+| `system.trace_dir`             | `LINGODB_TRACE_DIR`             | Sets the output directory for [lightweight tracing](Debugging.md#lightweight-tracing) | (relative) path to output directory (default: `.`)                                                                                                                                                    |
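
In the table above, every environment variable name follows the same pattern as its setting name. The following sketch captures that pattern; it is an observation derived from the listed rows, not a documented guarantee, and the helper name `env_var_for` is hypothetical.

```python
def env_var_for(setting):
    """Derive the environment variable for a setting, e.g. system.trace_dir."""
    prefix = "system."
    if not setting.startswith(prefix):
        raise ValueError(f"expected a setting starting with {prefix!r}: {setting!r}")
    # Drop the prefix, replace dots with underscores, and upper-case the rest.
    return "LINGODB_" + setting[len(prefix):].replace(".", "_").upper()
```

For example, `env_var_for("system.execution.perf_file")` gives `LINGODB_EXECUTION_PERF_FILE`, which matches the corresponding table row.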
+
+
+
@@ -0,0 +1,37 @@
+LingoDB supports common OLAP benchmarks such as TPC-H, TPC-DS, JOB and SSB.
+
+## Please avoid common pitfalls
+* ***Don't use a single invocation of the `sql` command to define the schema, import the data, and then run benchmark queries.*** This behavior is expected to be resolved in the future!
+* Use the right LingoDB version. If you want to reproduce LingoDB's performance as reported in a paper, please use the corresponding LingoDB version:
+  * [VLDB'22](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2022) 
+  * [VLDB'23](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2023)
+* Also note that the numbers reported as execution time in VLDB'22 and VLDB'23 *exclude compilation times*.
+* Do *not* manually create Apache Arrow files, but instead use the `sql` command to define tables and import data. If relevant metadata (e.g., primary keys) is missing, LingoDB will not be able to apply many optimizations and performance will be suboptimal.
+* Use a release build of LingoDB for benchmarking. Debug builds are significantly slower.
+
+## Data Generation
+For some benchmarks, the LingoDB repository contains scripts to generate the data and load it:
+```sh
+# LINGODB_BINARY_DIR is the directory containing at least the `sql` binary
+# OUTPUT_DIR is the directory where the database should be stored
+# SF is the scale factor, e.g., 1 for 1GB, 10 for 10GB, etc.
+
+# Generate TPC-H database
+bash tools/generate/tpch.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
+# Generate TPC-DS database
+bash tools/generate/tpcds.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
+# Generate JOB database
+bash tools/generate/job.sh LINGODB_BINARY_DIR OUTPUT_DIR
+# Generate SSB database
+bash tools/generate/ssb.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
+```
+Afterward, queries can be run, for example, with the `sql` command, which also reports execution times when the `LINGODB_SQL_REPORT_TIMES` environment variable is set:
+```sh
+LINGODB_SQL_REPORT_TIMES=1 sql OUTPUT_DIR
+sql>select count(*) from lineitem;
+|                         count  |
+----------------------------------
+|                       6001215  |
+ compilation: 95.79 [ms] execution: 2.815 [ms]
+```
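
When collecting many measurements, the reported times can be extracted from the captured output. This is a minimal sketch that assumes the timing line always has the shape shown in the example above; the helper name `parse_times` is hypothetical.

```python
import re

# Matches lines such as " compilation: 95.79 [ms] execution: 2.815 [ms]".
TIMING_LINE = re.compile(
    r"compilation:\s*([0-9.]+)\s*\[ms\]\s*execution:\s*([0-9.]+)\s*\[ms\]"
)


def parse_times(output):
    """Return a list of (compilation_ms, execution_ms) pairs found in the output."""
    return [(float(compilation), float(execution))
            for compilation, execution in TIMING_LINE.findall(output)]
```

Keeping compilation and execution times separate makes it easier to compare against published numbers that exclude compilation times.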
+