Commit 91dca11

committed
version: 0.0.3
1 parent 25f820b commit 91dca11

18 files changed

+690
-0
lines changed

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
---
2+
title: Storage
3+
weight: 1
4+
---
5+
The research conducted with LingoDB does not focus on storage aspects of database systems.
6+
Thus, LingoDB does not come with an optimized storage backend and currently does not provide transactional semantics.
7+
8+
## In-Memory Format: Apache Arrow
9+
The Apache Arrow columnar layout is used for the in-memory representation of tabular data.
10+
Thus, LingoDB can exchange data with existing libraries and frameworks without any overhead and can directly query Apache Arrow tables.
11+
12+
## Persistent Storage
13+
For many practical purposes, persistent storage is required.
14+
We chose a pragmatic approach:
15+
16+
1. Each database is represented by multiple files placed in one *database directory*
17+
2. In this directory, each table is represented by multiple files, each starting with the name of the table:
18+
1. *name*`.metadata.json`: Stores metadata relevant to LingoDB. This includes basic information such as column names and internal column types, as well as statistics and available indices.
19+
2. *name*`.arrow`: Stores the contents of the table using Apache Arrow's IPC-Format
20+
3. *name*`.arrow.sample`: Optionally stores a sample of up to 1024 rows randomly selected from the table.
21+
22+
Given the database directory, LingoDB automatically detects the available tables and loads their metadata, data, and samples.
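The directory layout described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not LingoDB's actual loader: the helper name `discover_tables` is hypothetical, and only the three file suffixes are taken from the description.

```python
from pathlib import Path

# File suffixes taken from the layout described above.
SUFFIXES = {
    ".metadata.json": "metadata",
    ".arrow.sample": "sample",
    ".arrow": "data",
}


def discover_tables(database_dir):
    """Group the files of a database directory by table name (hypothetical helper)."""
    tables = {}
    for path in sorted(Path(database_dir).iterdir()):
        if not path.is_file():
            continue
        for suffix, kind in SUFFIXES.items():
            if path.name.endswith(suffix):
                # Everything before the suffix is the table name.
                table = path.name[: -len(suffix)]
                tables.setdefault(table, {})[kind] = path
                break
    return tables
```

For a directory containing `lineitem.metadata.json`, `lineitem.arrow`, and `lineitem.arrow.sample`, the result has a single entry `lineitem` with the keys `metadata`, `data`, and `sample`. Files with other suffixes are ignored.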
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
---
2+
title: Design
3+
type: docs
4+
weight: 4
5+
---
6+
7+
This section gives an overview of the overall design of LingoDB.
Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
2+
LingoDB is an open-source project that welcomes contributions from the community.
3+
However, it is also a research project that still undergoes major changes (often not in public repositories) that might conflict with your contributions.
4+
Furthermore, the project is developed by a very small team of researchers and students, which means that we have limited resources to review and merge pull requests.
5+
Finally, we have to ensure that the codebase stays maintainable and that the project's goals are met.
6+
Thus, please follow the guidelines below when planning to contribute to LingoDB.
7+
8+
### Micro-Changes such as fixing typos, etc
9+
If you find a small typo or similar in one of the LingoDB repositories, please open an *Issue* in the respective repository.
10+
We won't accept pull requests for such small changes, but we will be happy to fix them ourselves as soon as possible.
11+
12+
Examples:
13+
* Typos
14+
* Slight rephrasing of existing sentences
15+
* Updating npm dependencies
16+
* ...
17+
18+
### Medium-sized Changes: Create a Pull Request
19+
If you want to contribute a medium-sized change, please create a pull request in the respective repository.
20+
21+
Examples:
22+
* Any changes to the documentation
23+
* Bug-Fixes that do not require large changes/redesign (e.g., fixing a segfault)
24+
* Smallish new features (e.g., adding a new command line option, adding a new SQL function (e.g., `sin`))
25+
* Adding new tests
26+
27+
### Large Changes: Discuss first
28+
If you want to contribute a larger change, please open an issue in the respective repository first.
29+
This way, we can discuss the change before you start working on it and we can avoid situations like:
30+
* You working on a feature that is already in development
31+
* You working on a feature that is not in line with the project's goals and won't be merged
32+
* You working on a feature that will not be working soon due to other changes in the project
33+
34+
Examples:
35+
* Add a new compilation backend/target
36+
* Refactor the SQL parser
37+
* Refactorings
38+
* Larger features that touch the code base in many places
39+
* Anything that is more "researchy"
40+
41+
### Before Creating a Pull Request
42+
Before creating a pull request, please make sure that
43+
* the CI pipeline passes and the coverage does not decrease.
44+
* the code is formatted according to the `.clang-format` file in the repository
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
---
2+
title: Debugging & Profiling
3+
---
4+
5+
Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
6+
Especially debugging and profiling can become a challenge, as one not only needs to debug and profile the engine code, but also the generated code.
7+
Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321) and [profiling Umbra](https://dl.acm.org/doi/abs/10.1145/3447786.3456254).
8+
9+
## Guide: Profiling queries
10+
For profiling queries, LingoDB comes with a *ct* tool that collects several metrics.
11+
For the following instructions, we assume that LingoDB was built in release mode with debug information (`build/lingodb-relwithdebinfo/.buildstamp`).
12+
13+
1. Run the `ct.py` script with a query and a dataset: `python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/`. If the build directory is not `build/lingodb-relwithdebinfo`, it can be supplied with the `BIN_DIR` environment variable.
14+
2. Open the resulting `ct.json` file with the [CT viewer](https://ct.lingo-db.com) and explore it in detail
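When profiling many queries, the invocation from step 1 can be scripted. The following is a minimal sketch, assuming it is run from the repository root; the helper name `ct_invocation` is hypothetical, while the script path and the `BIN_DIR` variable are taken from the step above.

```python
import os
import sys


def ct_invocation(query, dataset, bin_dir=None, base_env=None):
    """Return (argv, env) for running tools/ct/ct.py (hypothetical helper)."""
    argv = [sys.executable, "tools/ct/ct.py", query, dataset]
    env = dict(os.environ if base_env is None else base_env)
    if bin_dir is not None:
        # Only needed when the build is not in build/lingodb-relwithdebinfo.
        env["BIN_DIR"] = bin_dir
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)`, for example in a loop over all files in `resources/sql/tpch/`.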
15+
16+
## Guide: Debugging
17+
* If the compilation fails: Use [Snapshotting](#snapshotting) to identify the broken/problematic pass. Then run the pass in isolation with [mlir-db-opt](../GettingStarted/CommandLineTools.md#performing-optimizations-and-lowerings) for detailed debugging (e.g., with gdb).
18+
* If compilation succeeds but execution fails in or because of generated code: First check whether the error persists when switching to the [C++-Backend](#c-backend), if possible (i.e., if all MLIR operations are supported).
19+
* If yes: debug with this backend.
20+
* If not: you should use the [LLVM Debug Backend](#llvm-debug-backend)
21+
22+
## Components for Debugging and Profiling
23+
### Location Tracking in MLIR
24+
In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
25+
While it is possible to provide an *Unknown Location*, it should be avoided.
26+
When parsing an MLIR file, MLIR automatically annotates the parsed operations with the corresponding file locations.
27+
When new operations are created during a pass, they are usually annotated with the location of the operation that is currently being transformed or lowered.
28+
**All passes in LingoDB ensure that correct locations are set afterwards.**
29+
30+
### Snapshotting
31+
MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
32+
This file is then read back in, and the operations are annotated *according to their location inside this newly written file*.
33+
34+
If enabled (cf. [Settings](Settings.md)), LingoDB performs location snapshots either after every MLIR pass or only after selected (important) passes.
35+
36+
Using these snapshot files, we can track the origin of any operation by repeating the following steps:
37+
1. get the origin location of the current operation by looking in the appropriate snapshot file
38+
2. find the origin operation by going to this location
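The two steps above amount to following a chain of locations backwards through the snapshot files. The sketch below illustrates this on a simplified in-memory model, which is an assumption made for illustration only: real snapshot files are textual MLIR with location annotations, and the file names in the usage example are hypothetical.

```python
def trace_origin(snapshots, location):
    """Follow origin locations back through a chain of snapshot files.

    `snapshots` maps a snapshot file name to {line: origin}, where origin is a
    (file, line) pair or None. This model is a simplification for illustration.
    """
    chain = [location]
    seen = {location}
    while True:
        file, line = location
        # Step 1: look up the origin location in the appropriate snapshot file.
        origin = snapshots.get(file, {}).get(line)
        if origin is None or origin in seen:
            return chain
        # Step 2: go to that location and continue from the origin operation.
        chain.append(origin)
        seen.add(origin)
        location = origin
```

For example, with `{"snapshot-2.mlir": {7: ("snapshot-1.mlir", 4)}, "snapshot-1.mlir": {4: ("query.mlir", 2)}}`, starting at `("snapshot-2.mlir", 7)` yields the chain back to line 2 of `query.mlir`.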
39+
40+
### Special Compiler Backends
41+
In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
42+
43+
#### LLVM-Debug Backend
44+
Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
45+
This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
46+
During execution, standard debuggers like `gdb` will then point to the corresponding operation in the last snapshot that was performed.
47+
This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
48+
49+
#### C++-Backend
50+
For more advanced debugging, a *C++-Backend* can be used by setting `LINGODB_EXECUTION_MODE=C`.
51+
This backend directly translates a fixed set of low-level generic MLIR operations to C++ statements and functions that are written to a file called `mlir-c-module.cpp`.
52+
Next, LingoDB automatically invokes `clang++` (must be installed!) with `-O0` and `-g` to compile this C++ file into a shared library with debug information.
53+
This shared library is then loaded with `dlopen` and the main function is called.
54+
Thus, the generated code can be debugged as any usual C++ program.
55+
To help with tracking an error back to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
56+
57+
58+
### Lightweight Tracing
59+
When compiled as `RelWithDebInfo`, LingoDB produces a trace file named `trace.json` that contains events (type, start timestamp, duration, thread).
60+
This trace file can then be opened with the [CT Viewer](https://ct.lingo-db.com).
Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,68 @@
1+
LingoDB relies on three main external dependencies:
2+
* [LLVM/MLIR 20](https://github.com/llvm/llvm-project)
3+
* [Apache Arrow 19](https://arrow.apache.org/release/19.0.0.html)
4+
* [Boost Context 1.83](https://www.boost.org/doc/libs/1_83_0/libs/context/doc/html/index.html)
5+
6+
**Additional tools and libraries required:**
7+
* C++ compiler supporting C++20
8+
* CMake 3.13.4 or newer
9+
* Ninja
10+
* lit (optional, for testing); it can be installed, e.g., via `pip install lit`
11+
12+
We also provide a [Dockerfile](https://github.com/lingo-db/lingo-db/pkgs/container/lingodb-dev) that contains all dependencies and tools required to build LingoDB.
13+
14+
When building dependencies from source, make sure that the CMake config files are either installed in a system-wide location or can be found otherwise, for example by setting `CMAKE_PREFIX_PATH` accordingly.
15+
16+
## LLVM/MLIR
17+
### Ubuntu/Linux
18+
Follow the instructions on [https://apt.llvm.org/](https://apt.llvm.org/) to install the repository on your system.
19+
Then install the following packages: `clang-20 llvm-20 libclang-20-dev llvm-20-dev libmlir-20-dev mlir-20-tools clang-tidy-20`
20+
21+
### Binaries
22+
For other recent Linux distributions, you can also rely on the pre-built binaries provided by the LLVM project on the GitHub release pages.
23+
24+
### Building from Source
25+
26+
```shell
27+
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-20.1.0-rc1/llvm-project-20.1.0-rc1.src.tar.xz
28+
tar -xf llvm-project-20.1.0-rc1.src.tar.xz
29+
mkdir llvm-project-20.1.0-rc1.src/build
30+
cd llvm-project-20.1.0-rc1.src
31+
env VIRTUAL_ENV=/venv cmake -B build -DLLVM_ENABLE_PROJECTS="llvm;mlir;clang;clang-tools-extra" -DLLVM_TARGETS_TO_BUILD="X86" -DLLVM_BUILD_EXAMPLES=OFF -DCMAKE_BUILD_TYPE=Release -G Ninja -DLLVM_ENABLE_ASSERTIONS=OFF -DLLVM_BUILD_TESTS=OFF -DLLVM_BUILD_LLVM_DYLIB=ON -DLLVM_LINK_LLVM_DYLIB=OFF -DLLVM_ENABLE_DUMP=ON -DLLVM_ENABLE_FFI=ON -DCMAKE_CXX_FLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer" -DLLVM_PARALLEL_LINK_JOBS=1 -DLLVM_PARALLEL_TABLEGEN_JOBS=10 -DBUILD_SHARED_LIBS=OFF -DLLVM_INSTALL_UTILS=ON -DLLVM_ENABLE_ZLIB=OFF -DCMAKE_INSTALL_PREFIX=[output-dir] llvm/
32+
cmake --build build --target install -j$(nproc)
33+
```
34+
35+
36+
## Apache Arrow
37+
### Ubuntu/Linux
38+
```shell
39+
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
40+
apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
41+
apt-get update
42+
apt-get install libarrow-dev=19.*
43+
```
44+
### Binaries
45+
For other recent Linux distributions, you can also rely on the pre-built binaries provided by the Apache Arrow project.
46+
47+
### Building from Source
48+
49+
```shell
50+
wget https://dlcdn.apache.org/arrow/arrow-19.0.1/apache-arrow-19.0.1.tar.gz
51+
tar -xf apache-arrow-19.0.1.tar.gz
52+
cd apache-arrow-19.0.1/cpp
53+
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=[output-dir] -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_BUILD_STATIC=ON -DARROW_CSV=ON -DARROW_COMPUTE=ON
54+
cmake --build build --target install -j$(nproc)
55+
```
56+
## Boost Context
57+
### Ubuntu/Linux
58+
```shell
59+
apt-get install libboost-context1.83-dev
60+
```
61+
### Build from Source
62+
```shell
63+
wget https://archives.boost.io/release/1.83.0/source/boost_1_83_0.tar.gz
64+
tar -xf boost_1_83_0.tar.gz
65+
cd boost_1_83_0
66+
./bootstrap.sh --prefix=/usr # or any other directory in the PATH/LD_LIBRARY_PATH
67+
./b2 install --with-context
68+
```
Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
---
2+
title: Python Package
3+
---
4+
5+
Currently, LingoDB is distributed as two separate Python packages:
6+
* `lingodb-bridge`: bundles LingoDB as a binary and implements a basic integration using pybind11
7+
* `lingodb`: a Python-only library that wraps `lingodb-bridge` and provides a convenient interface (and much more in the future)
8+
9+
## Working on `lingodb`
10+
If you only plan to adapt or extend the Python implementation, you do not have to build the `lingodb-bridge` package yourself.
11+
First install the current version of the `lingodb-bridge` package.
12+
```sh
13+
pip install lingodb-bridge
14+
```
15+
Then, install the package in *development mode* so that you can just change the code (`tools/python/lingodb`) and directly test the changes:
16+
```sh
17+
cd tools/python
18+
python -m pip install -e .
19+
```
20+
For building a release package:
21+
```sh
22+
cd tools/python
23+
python -m build .
24+
```
25+
26+
## Building `lingodb-bridge`
27+
Building a Python binary wheel is non-trivial but becomes easy with the Docker image we prepared. Execute the following command at the repository's root:
28+
```sh
29+
make build-py-bridge PYVERSION=[VERSION]
30+
```
31+
where `[VERSION]` is one of:
32+
* `310`: for Python 3.10
33+
* `311`: for Python 3.11
34+
* `312`: for Python 3.12
35+
36+
This will then create a wheel in the current directory that can be installed, e.g.:
37+
```sh
38+
pip install lingodb_bridge-0.0.0-cp310-cp310-manylinux_2_28_x86_64.whl
39+
```
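To check which Python version a built wheel targets, its file name can be split into the components defined by the wheel naming convention (PEP 427). The following is a minimal sketch; the helper names are hypothetical and not part of the LingoDB tooling.

```python
import sys


def wheel_tags(filename):
    """Split a wheel file name into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # an optional build tag adds a sixth component
        raise ValueError("unexpected wheel file name")
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def matches_running_cpython(filename):
    """True if the wheel's Python tag equals the running interpreter's CPython tag."""
    current = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return wheel_tags(filename)["python"] == current
```

For the wheel name shown above, `wheel_tags` reports the Python tag `cp310`, which corresponds to `PYVERSION=310`.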
40+
41+
Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
---
2+
title: Settings
3+
---
4+
| Setting | Environment Variable | Description | Values |
5+
|--------------------------------|---------------------------------|---------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
6+
| `system.execution_mode` | `LINGODB_EXECUTION_MODE` | Choose execution backend | `DEFAULT`: LLVM O2<br/> `CHEAP`: fast LLVM <br/> `SPEED`: omit checks for speed<br/> `DEBUGGING`: LLVM O0 with debug info<br/> `C`: C Backend<br/> `PERF`: LLVM O2, with debug info, record with perf |
7+
| `system.subop.opt`             | `LINGODB_SUBOP_OPT`             | Manually select SubOp optimizations                                                   | Comma-separated list of the following pass names: `GlobalOpt`, `ReuseLocal`, `Specialize`, `PullGatherUp`, `Compression`                                                                              |
8+
| `system.snapshot_passes` | `LINGODB_SNAPSHOT_PASSES` | Enables [snapshotting](Debugging.md#snapshotting) | Boolean value: `true` or `false` |
9+
| `system.snapshot_level` | `LINGODB_SNAPSHOT_LEVEL` | Sets the detailedness of snapshotting | `full`: Perform a snapshot after every MLIR pass<br/>`important`: only performs snapshots at selected steps in the compilation pipeline |
10+
| `system.snapshot_dir` | `LINGODB_SNAPSHOT_DIR` | Directory for output of snapshots | (relative) path to output directory (default: `.`) |
11+
| `system.execution.perf_file` | `LINGODB_EXECUTION_PERF_FILE` | Sets the output path for the perf record output | (relative) path to output path (default: `perf.data`) |
12+
| `system.execution.perf_binary` | `LINGODB_EXECUTION_PERF_BINARY` | Points to the perf binary that should be used for recording | path to perf binary (default: `/usr/bin/perf`) |
13+
| `system.trace_dir` | `LINGODB_TRACE_DIR` | Sets the output directory for [lightweight tracing](Debugging.md#lightweight-tracing) | (relative) path to output directory (default: `.`) |
14+
15+
16+
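All rows in the table above follow the same naming pattern: the `system.` prefix is replaced by `LINGODB_`, dots become underscores, and the name is upper-cased. The sketch below encodes this observed pattern; it is inferred from the table only and is not confirmed as a general rule, and the helper name `env_var_for` is hypothetical.

```python
def env_var_for(setting):
    """Derive the environment variable for a setting name (pattern inferred from the table)."""
    prefix = "system."
    if not setting.startswith(prefix):
        raise ValueError(f"unexpected setting name: {setting!r}")
    return "LINGODB_" + setting[len(prefix):].replace(".", "_").upper()
```

For example, `env_var_for("system.execution.perf_file")` returns `LINGODB_EXECUTION_PERF_FILE`.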
Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
LingoDB supports common OLAP benchmarks such as TPC-H, TPC-DS, JOB and SSB.
2+
3+
## Please avoid common pitfalls
4+
* ***Don't use one invocation of the `sql` command to define the schema, import the data, and then run benchmark queries.*** This limitation is expected to be resolved in the future!
5+
* Use the right LingoDB version. If you want to reproduce LingoDB's performance reported in a paper, please use the corresponding LingoDB version:
6+
* [VLDB'22](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2022)
7+
* [VLDB'23](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2023)
8+
* Also note that the numbers reported as execution time in VLDB'22 and VLDB'23 *exclude compilation times*
9+
* Do *not* manually create Apache Arrow files; instead, use the `sql` command to define tables and import data. If relevant metadata (e.g., primary keys) is missing, LingoDB cannot apply many optimizations and performance will be suboptimal.
10+
* Use a release build of LingoDB for benchmarking. Debug builds are significantly slower.
11+
12+
## Data Generation
13+
For some benchmarks, the LingoDB repository contains scripts to generate data and load them:
14+
```sh
15+
# LINGODB_BINARY_DIR is the directory containing at least the `sql` binary
16+
# OUTPUT_DIR is the directory where the database should be stored
17+
# SF is the scale factor, e.g., 1 for 1GB, 10 for 10GB, etc.
18+
19+
# Generate TPC-H database
20+
bash tools/generate/tpch.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
21+
# Generate TPC-DS database
22+
bash tools/generate/tpcds.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
23+
# Generate JOB database
24+
bash tools/generate/job.sh LINGODB_BINARY_DIR OUTPUT_DIR
25+
# Generate SSB database
26+
bash tools/generate/ssb.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
27+
```
28+
Afterward, queries can be run, for example, with the `sql` command, which also reports execution times when the `LINGODB_SQL_REPORT_TIMES` environment variable is set:
29+
```sh
30+
LINGODB_SQL_REPORT_TIMES=1 sql OUTPUT_DIR
31+
sql>select count(*) from lineitem;
32+
| count |
33+
----------------------------------
34+
| 6001215 |
35+
compilation: 95.79 [ms] execution: 2.815 [ms]
36+
```
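When collecting results for many queries, the timing line can be parsed automatically. The following is a minimal sketch that assumes the output format shown in the example above; the helper name `parse_timings` is hypothetical, and the exact whitespace of the real output may differ.

```python
import re

# Matches pairs such as "compilation: 95.79 [ms]".
TIMING = re.compile(r"(\w+):\s*([0-9.]+)\s*\[ms\]")


def parse_timings(line):
    """Extract {phase: milliseconds} from a timing line as shown above."""
    return {name: float(value) for name, value in TIMING.findall(line)}
```

Keeping compilation and execution time separate makes it easier to compare against published numbers that exclude compilation times.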
37+
