Commit e577322

Version doc for v0.0.2
1 parent 353f79e commit e577322

18 files changed

+627
-0
lines changed

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
---
2+
title: Storage
3+
weight: 1
4+
---
5+
The research conducted with LingoDB does not focus on storage aspects of database systems.
6+
Thus, LingoDB does not come with an optimized storage backend and currently does not provide transactional semantics.
7+
8+
## In-Memory Format: Apache Arrow
9+
The Apache Arrow columnar layout is used for the in-memory representation of tabular data.
10+
Thus, LingoDB can exchange data with existing libraries and frameworks without any overhead and can directly query Apache Arrow tables.
11+
12+
## Persistent Storage
13+
For many practical purposes, persistent storage is required.
14+
We chose a pragmatic approach:
15+
16+
1. Each database is represented by multiple files placed in one *database directory*
17+
2. In this directory, each table is represented by multiple files, each starting with the name of the table:
18+
1. *name*`.metadata.json`: stores metadata relevant to LingoDB. This includes basic information like column names and internal column types, but also statistics and available indices
19+
2. *name*`.arrow`: stores the contents of the table using Apache Arrow's IPC format
20+
3. *name*`.arrow.sample`: optionally stores a sample of up to 1024 rows randomly selected from the table
21+
22+
Given the database directory, LingoDB automatically detects the available tables and loads their metadata, data, and samples.
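
The file layout above can be illustrated with a small sketch that groups the files of a database directory by table name. This is not LingoDB's actual loader: the function names `discover_tables` and `demo` are hypothetical, the files in the demo are empty placeholders, and only the three file suffixes are taken from the description above.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

# The three per-table file suffixes named in the layout description.
SUFFIXES = (".metadata.json", ".arrow.sample", ".arrow")


def discover_tables(db_dir):
    """Group the files in a database directory by table name (hypothetical helper)."""
    tables = defaultdict(dict)
    for path in sorted(Path(db_dir).iterdir()):
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                # Everything before the suffix is treated as the table name.
                table = path.name[: -len(suffix)]
                tables[table][suffix] = path.name
                break
    return dict(tables)


def demo():
    # Build a throwaway directory with empty placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("lineitem.metadata.json", "lineitem.arrow",
                     "lineitem.arrow.sample", "orders.metadata.json",
                     "orders.arrow"):
            (Path(tmp) / name).touch()
        return discover_tables(tmp)
```

In this sketch, a table without a `.arrow.sample` entry simply has no sample, matching the "optionally" in the description.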
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
---
2+
title: Design
3+
type: docs
4+
weight: 4
5+
---
6+
7+
This section gives an overview of the overall design of LingoDB.
Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
2+
LingoDB is an open-source project that welcomes contributions from the community.
3+
However, it is also a research project that still undergoes major changes (often not in public repositories) that might conflict with your contributions.
4+
Furthermore, the project is developed by a very small team of researchers and students, which means that we have limited resources to review and merge pull requests.
5+
Finally, we have to ensure that the codebase stays maintainable and that the project's goals are met.
6+
Thus, please follow the guidelines below when planning to contribute to LingoDB.
7+
8+
### Micro-Changes such as fixing typos, etc
9+
If you find a small typo or similar in one of the LingoDB repositories, please open an *Issue* in the respective repository.
10+
We won't accept pull requests for such small changes, but we will be happy to fix them ourselves as soon as possible.
11+
12+
Examples:
13+
* Typos
14+
* Slight rephrasing of existing sentences
15+
* Updating npm dependencies
16+
* ...
17+
18+
### Medium-sized Changes: Create a Pull Request
19+
If you want to contribute a medium-sized change, please create a pull request in the respective repository.
20+
21+
Examples:
22+
* Any changes to the documentation
23+
* Bug-Fixes that do not require large changes/redesign (e.g., fixing a segfault)
24+
* Smallish new features (e.g., adding a new command line option, adding a new SQL function (e.g., `sin`))
25+
* Adding new tests
26+
27+
### Large Changes: Discuss first
28+
If you want to contribute a larger change, please open an issue in the respective repository first.
29+
This way, we can discuss the change before you start working on it and we can avoid situations like:
30+
* You working on a feature that is already in development
31+
* You working on a feature that is not in line with the project's goals and won't be merged
32+
* You working on a feature that will not be working soon due to other changes in the project
33+
34+
Examples:
35+
* Add a new compilation backend/target
36+
* Refactor the SQL parser
37+
* Refactorings
38+
* Larger features that touch the code base in many places
39+
* Anything that is more "researchy"
40+
41+
### Before Creating a Pull Request
42+
Before creating a pull request, please make sure that
43+
* the CI pipeline passes and the coverage does not decrease.
44+
* the code is formatted according to the `.clang-format` file in the repository
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
---
2+
title: Debugging & Profiling
3+
---
4+
5+
Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
6+
Especially debugging and profiling can become a challenge, as one not only needs to debug and profile the engine code, but also the generated code.
7+
Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321) and [profiling Umbra](https://dl.acm.org/doi/abs/10.1145/3447786.3456254).
8+
9+
## Guide: Profiling queries
10+
For profiling queries, LingoDB comes with a *ct* tool that collects several metrics.
11+
For the following instructions, we assume that LingoDB was built in release mode with debug information (`build/lingodb-relwithdebinfo/.buildstamp`).
12+
13+
1. Run the `ct.py` script with a query and a dataset: `python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/`. If the build directory is not `build/lingodb-relwithdebinfo`, it can be supplied with the `BIN_DIR` environment variable.
14+
2. Open the resulting `ct.json` file with the [CT viewer](https://ct.lingo-db.com) and explore it in detail.
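
The first step can also be scripted. The following is a minimal sketch, assuming it is run from the repository root; the helper name `ct_invocation` is hypothetical, and only the script path and the `BIN_DIR` variable come from the instructions above. It only assembles the command and environment and does not execute anything.

```python
import os


def ct_invocation(query, dataset, bin_dir=None, base_env=None):
    """Return (argv, env) for a ct.py run; nothing is executed here."""
    argv = ["python3", "tools/ct/ct.py", query, dataset]
    # Start from the given environment (or the current one) and copy it,
    # so the caller's environment is never modified.
    env = dict(os.environ if base_env is None else base_env)
    if bin_dir is not None:
        # Only needed when the build directory is not the default one.
        env["BIN_DIR"] = bin_dir
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)` to perform the actual run.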
15+
16+
## Guide: Debugging
17+
* If the compilation fails: Use [Snapshotting](#snapshotting) to identify the broken/problematic pass. Then run the pass in isolation with [mlir-db-opt](../GettingStarted/CommandLineTools.md#performing-optimizations-and-lowerings) for detailed debugging (e.g., with gdb).
18+
* If compilation succeeds but execution fails in or because of generated code: First check whether the error persists when switching to the [C++-Backend](#c-backend), if possible (i.e., if all MLIR operations are supported)
19+
* If yes: debug with this backend.
20+
* If not: you should use the [LLVM Debug Backend](#llvm-debug-backend)
21+
22+
## Components for Debugging and Profiling
23+
### Location Tracking in MLIR
24+
In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
25+
While it is possible to provide an *Unknown Location*, it should be avoided.
26+
When parsing an MLIR file, MLIR automatically annotates the parsed operations with the corresponding file locations.
27+
When new operations are created during a pass, they are usually annotated with the location of the operation that is currently being transformed or lowered.
28+
**All passes in LingoDB ensure that correct locations are set afterwards.**
29+
30+
### Snapshotting
31+
MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
32+
This file is then read back in, and the operations are annotated *according to their location inside this newly written file*.
33+
34+
If enabled (cf. [Settings](Settings.md)), LingoDB performs location snapshots either after every MLIR pass or only after selected (important) passes.
35+
36+
Using these snapshot files, we can track the origin of any operation by repeating the following steps:
37+
1. get the origin location of the current operation by looking in the appropriate snapshot file
38+
2. find the origin operation by going to this location
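
The two steps can be sketched as a loop over a toy model of the snapshot files. This is purely illustrative and does not parse real MLIR: each snapshot is modeled as a dictionary from a line number to the recorded origin `(file, line)`, and all file names and the function `trace_origin` are hypothetical.

```python
def trace_origin(snapshots, file, line):
    """Follow origin locations backwards until no earlier location is recorded."""
    chain = [(file, line)]
    while True:
        # Step 1: look up the origin location in the appropriate snapshot.
        origin = snapshots.get(file, {}).get(line)
        if origin is None:
            return chain
        # Step 2: continue from the operation at that origin location.
        file, line = origin
        chain.append(origin)


# Toy data: line 14 of the second snapshot originates from line 7 of the
# first snapshot, which in turn originates from line 3 of the input file.
SNAPSHOTS = {
    "snapshot-2.mlir": {14: ("snapshot-1.mlir", 7)},
    "snapshot-1.mlir": {7: ("input.mlir", 3)},
}
```

Calling `trace_origin(SNAPSHOTS, "snapshot-2.mlir", 14)` walks the chain from the latest snapshot back to the input file.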
39+
40+
### Special Compiler Backends
41+
In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
42+
43+
#### LLVM-Debug Backend
44+
Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
45+
This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
46+
During execution, standard debuggers like `gdb` will then point to the corresponding operation in the last snapshot that was performed.
47+
This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
48+
49+
#### C++-Backend
50+
For more advanced debugging, a *C++-Backend* can be used by setting `LINGODB_EXECUTION_MODE=C`.
51+
This backend directly translates a fixed set of low-level generic MLIR operations to C++ statements and functions that are written to a file called `mlir-c-module.cpp`.
52+
Next, LingoDB automatically invokes `clang++` (must be installed!) with `-O0` and `-g` to compile this C++ file into a shared library with debug information.
53+
This shared library is then loaded with `dlopen` and the main function is called.
54+
Thus, the generated code can be debugged as any usual C++ program.
55+
To help with tracking an error back to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
56+
57+
58+
### Lightweight Tracing
59+
When compiled as `RelWithDebInfo`, LingoDB will produce a trace file with events (type, start timestamp, duration, thread) called `trace.json`.
60+
This trace file can then be opened with the [CT Viewer](https://ct.lingo-db.com).
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
* All "non-standard" dependencies are packaged as python programs
2+
* Also MLIR/LLVM is packaged as a python program.
3+
* ***This will be subject to change in the near future!*** We are working on using system-wide installed MLIR/LLVM packages and on reducing the number of dependencies in general.
4+
5+
6+
### Building the custom MLIR/LLVM package
7+
* in `tools/mlir-package`:
8+
* `docker build -t mlir-package .`
9+
* `docker run -v ".:/built-packages" -v ".:/repo" --rm -it mlir-package /usr/bin/create_package.sh cp312-cp312`
Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
---
2+
title: Python Package
3+
---
4+
5+
Currently, LingoDB is distributed as two separate Python packages:
6+
* `lingodb-bridge`: bundles LingoDB as a binary and implements a basic integration using pybind11
7+
* `lingodb`: a python-only library that wraps `lingodb-bridge` and provides a nice interface (and much more in the future)
8+
9+
## Working on `lingodb`
10+
If you only plan to adapt/extend the Python implementation, you do not have to build the `lingodb-bridge` package yourself.
11+
First install the current version of the `lingodb-bridge` package.
12+
```sh
pip install lingodb-bridge
```
15+
Then, install the package in *development mode* so that you can just change the code (`tools/python/lingodb`) and directly test the changes:
16+
```sh
cd tools/python
python -m pip install -e .
```
20+
For building a release package:
21+
```sh
cd tools/python
python -m build .
```
25+
26+
## Building `lingodb-bridge`
27+
Building a python binary wheel is non-trivial but becomes easy with the docker image we prepared. Just execute the following commands at the repository's root:
28+
```sh
make build-py-bridge PYVERSION=[VERSION]
```
31+
where `[VERSION]` is one of:
32+
* `310`: for Python 3.10
33+
* `311`: for Python 3.11
34+
* `312`: for Python 3.12
35+
36+
This will then create a wheel in the current directory that can be installed, e.g.:
37+
```sh
pip install lingodb_bridge-0.0.0-cp310-cp310-manylinux_2_28_x86_64.whl
```
40+
41+
Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
---
2+
title: Settings
3+
---
4+
| Setting | Environment Variable | Description | Values |
|---|---|---|---|
| `system.execution_mode` | `LINGODB_EXECUTION_MODE` | Choose execution backend | `DEFAULT`: LLVM O2<br/> `CHEAP`: fast LLVM <br/> `SPEED`: omit checks for speed<br/> `DEBUGGING`: LLVM O0 with debug info<br/> `C`: C Backend<br/> `PERF`: LLVM O2, with debug info, record with perf |
| `system.subop.opt` | `LINGODB_SUBOP_OPT` | Manually select SubOp optimizations | Comma-separated list of the following pass names: `GlobalOpt`, `ReuseLocal`, `Specialize`, `PullGatherUp`, `Compression` |
| `system.snapshot_passes` | `LINGODB_SNAPSHOT_PASSES` | Enables [snapshotting](Debugging.md#snapshotting) | Boolean value: `true` or `false` |
| `system.snapshot_level` | `LINGODB_SNAPSHOT_LEVEL` | Sets the level of detail of snapshotting | `full`: performs a snapshot after every MLIR pass<br/>`important`: only performs snapshots at selected steps in the compilation pipeline |
| `system.snapshot_dir` | `LINGODB_SNAPSHOT_DIR` | Directory for output of snapshots | (relative) path to output directory (default: `.`) |
| `system.execution.perf_file` | `LINGODB_EXECUTION_PERF_FILE` | Sets the output path for the perf record output | (relative) path to output file (default: `perf.data`) |
| `system.execution.perf_binary` | `LINGODB_EXECUTION_PERF_BINARY` | Points to the perf binary that should be used for recording | path to perf binary (default: `/usr/bin/perf`) |
| `system.trace_dir` | `LINGODB_TRACE_DIR` | Sets the output directory for [lightweight tracing](Debugging.md#lightweight-tracing) | (relative) path to output directory (default: `.`) |
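
In the table above, every environment variable name appears to follow from the setting name: drop the `system.` prefix, replace dots with underscores, uppercase the rest, and prepend `LINGODB_`. The sketch below encodes that observed pattern. It is an assumption based on the listed rows only, not a documented rule, and the helper names `env_var_for` and `settings_env` are hypothetical.

```python
def env_var_for(setting):
    """Derive the environment variable name for a setting (pattern observed in the table)."""
    prefix = "system."
    if not setting.startswith(prefix):
        raise ValueError(f"unexpected setting name: {setting}")
    return "LINGODB_" + setting[len(prefix):].replace(".", "_").upper()


def settings_env(settings):
    """Turn a {setting: value} mapping into environment variables."""
    env = {}
    for name, value in settings.items():
        if isinstance(value, bool):
            # The table lists boolean values as `true` or `false`.
            value = "true" if value else "false"
        env[env_var_for(name)] = str(value)
    return env
```

For example, `settings_env({"system.snapshot_passes": True, "system.snapshot_level": "important"})` yields the variables needed to enable snapshotting at selected steps.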
14+
15+
16+
Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
LingoDB supports common OLAP benchmarks such as TPC-H, TPC-DS, JOB and SSB.
2+
3+
## Please avoid common pitfalls
4+
* ***Don't use one invocation of the `sql` command to both define the schema and import the data and then run benchmark queries.*** This behavior is expected to be resolved in the future!
5+
* Use the right LingoDB version. If you want to reproduce LingoDB's performance as reported in a paper, please use the corresponding LingoDB version:
6+
* [VLDB'22](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2022)
7+
* [VLDB'23](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2023)
8+
* Also note that the numbers reported as execution time in VLDB'22 and VLDB'23 *exclude compilation times*
9+
* Do *not* manually create Apache Arrow files, but instead use the `sql` command to define tables and import data. If relevant metadata (e.g., primary keys) is missing, LingoDB will not be able to apply many optimizations and performance will be suboptimal.
10+
* Use a release build of LingoDB for benchmarking. Debug builds are significantly slower.
11+
12+
## Data Generation
13+
For some benchmarks, the LingoDB repository contains scripts to generate data and load them:
14+
```sh
# LINGODB_BINARY_DIR is the directory containing at least the `sql` binary
# OUTPUT_DIR is the directory where the database should be stored
# SF is the scale factor, e.g., 1 for 1GB, 10 for 10GB, etc.

# Generate TPC-H database
bash tools/generate/tpch.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
# Generate TPC-DS database
bash tools/generate/tpcds.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
# Generate JOB database
bash tools/generate/job.sh LINGODB_BINARY_DIR OUTPUT_DIR
# Generate SSB database
bash tools/generate/ssb.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
```
28+
Afterward, queries can be run, for example, with the `sql` command, which also reports execution times when the `LINGODB_SQL_REPORT_TIMES` environment variable is set:
29+
```sh
LINGODB_SQL_REPORT_TIMES=1 sql OUTPUT_DIR
sql>select count(*) from lineitem;
| count |
----------------------------------
| 6001215 |
compilation: 95.79 [ms] execution: 2.815 [ms]
```
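
When collecting many measurements, the timing line can be extracted programmatically. The following is a minimal sketch that assumes the line always has the shape shown in the sample output above; the pattern and the function name `parse_times` are hypothetical and may need adjusting for other LingoDB versions.

```python
import re

# Matches a line such as: "compilation: 95.79 [ms] execution: 2.815 [ms]"
TIMING = re.compile(
    r"compilation:\s*([\d.]+)\s*\[ms\]\s*execution:\s*([\d.]+)\s*\[ms\]"
)


def parse_times(line):
    """Return (compilation_ms, execution_ms), or None if the line has no timings."""
    match = TIMING.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```

Keeping the two values separate makes it easy to report execution time with or without compilation time.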
37+
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
---
2+
title: Command Line Tools
3+
type: docs
4+
weight: 2
5+
---
6+
LingoDB comes with a few command line tools to simplify experimentation, development and debugging.
7+
8+
## Interactive SQL Shell
9+
```sh
$ sql DBDIR
sql> select 1;
```
13+
14+
Similar to other systems, LingoDB can also be used interactively using the `sql` binary that is pointed to a (possibly empty) directory that holds the database to be queried. Each query must be terminated by a `;`. By default, only a *read-only* session is created. For persistent changes enter `SET persist=1;`.
15+
16+
## Converting SQL to MLIR
17+
```sh
$ sql-to-mlir SQL-File DBDIR
```
20+
Using the `sql-to-mlir` tool, SQL queries can be converted to a corresponding, unoptimized MLIR module. As this requires the database schema, the database directory must also be provided.
21+
22+
## Performing Optimizations and Lowerings
23+
```sh
$ mlir-db-opt [--use-db DBDIR] [Passes] MLIR-File
```
26+
The `mlir-db-opt` command can be used to manually apply MLIR passes to an MLIR module provided as a file. For high-level optimizations that require, e.g., database statistics, the database directory should be provided using the `--use-db` argument.
27+
28+
## Running MLIR Modules
29+
30+
```sh
$ run-mlir MLIR-File [DBDIR]
```
33+
MLIR modules can be executed using the `run-mlir` binary. A database directory can be provided as the second argument.
34+
35+
## Running SQL queries
36+
37+
```sh
$ run-sql SQL-File [DBDIR]
```
40+
Single (read-only) SQL queries can be run with the `run-sql` utility. If the query requires a database, the corresponding database directory must be provided as the second argument.
41+
42+
43+
## The Trace of a Query
44+
With the following commands you can explore how a SQL query gets compiled layer by layer by looking at the different files:
45+
```sh
# write example query to file
$ echo "select * from studenten where name='Carnap'" > test.sql
# translate sql to canonical MLIR module
$ sql-to-mlir test.sql resources/data/uni/ > canonical.mlir
# perform query optimization
$ mlir-db-opt --use-db resources/data/uni/ --relalg-query-opt canonical.mlir > optimized.mlir
# lower relational operators to sub-operators
$ mlir-db-opt --lower-relalg-to-subop optimized.mlir > subop.mlir
# lower sub-operators to imperative code
$ mlir-db-opt --lower-subop subop.mlir > hl-imperative.mlir
# lower database-specific scalar operations
$ mlir-db-opt --lower-db hl-imperative.mlir > ml-imperative.mlir
# lower mid-level abstraction (such as arrow tables) to low-level imperative code
$ mlir-db-opt --lower-dsa ml-imperative.mlir > ll-imperative.mlir
```
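
The same sequence can be generated from a script, which is convenient when inspecting many queries. The sketch below only builds the command lines from the commands shown above and does not run them; the names `STEPS` and `pipeline` are hypothetical, and the tools are assumed to be on the `PATH`.

```python
# (pass flag, output file, needs database directory), in the order shown above.
STEPS = [
    ("--relalg-query-opt", "optimized.mlir", True),
    ("--lower-relalg-to-subop", "subop.mlir", False),
    ("--lower-subop", "hl-imperative.mlir", False),
    ("--lower-db", "ml-imperative.mlir", False),
    ("--lower-dsa", "ll-imperative.mlir", False),
]


def pipeline(sql_file, db_dir):
    """Return a list of (argv, output_file) pairs for the lowering steps."""
    commands = [(["sql-to-mlir", sql_file, db_dir], "canonical.mlir")]
    previous = "canonical.mlir"
    for flag, output, needs_db in STEPS:
        argv = ["mlir-db-opt"]
        if needs_db:
            argv += ["--use-db", db_dir]
        # Each step reads the file written by the previous step.
        argv += [flag, previous]
        commands.append((argv, output))
        previous = output
    return commands
```

Each pair can be executed with `subprocess.run(argv, stdout=out, check=True)`, where `out` is the output file opened for writing.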
Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
---
2+
title: Installation
3+
type: docs
4+
weight: 1
5+
---
6+
7+
## Python Package
8+
Install via pip, then use as [documented here](./Python.md):
9+
```sh
pip install lingodb
```
12+
13+
## Docker Image
14+
You can build the Docker image yourself using `make build-docker`.
15+
16+
## Building from source
17+
1. Ensure you have a machine with sufficient compute power and space
18+
1. Make sure that you have the following build dependencies installed:
19+
1. Python3.10 or higher
20+
1. standard build tools, including `cmake` and `Ninja`
21+
1. Build LingoDB
22+
* Debug Version : `make build-debug` (will create binaries under `build/lingodb-debug`)
23+
* Release Version : `make build-release` (will create binaries under `build/lingodb-release`)
24+
1. Run tests: `make run-test`
25+
