LingoDB is an open-source project that welcomes contributions from the community.
However, it is also a research project that still undergoes major changes (often not in public repositories) that might conflict with your contributions.
Furthermore, the project is developed by a very small team of researchers and students, which means that we have limited resources to review and merge pull requests.
Finally, we have to ensure that the codebase stays maintainable and that the project's goals are met.
Thus, please follow the guidelines below when planning to contribute to LingoDB.
### Micro-Changes such as fixing typos, etc.
If you find a small typo or similar in one of the LingoDB repositories, please open an *Issue* in the respective repository.
We won't accept pull requests for such small changes, but we will be happy to fix them ourselves as soon as possible.
Examples:
* Typos
* Slight rephrasing of existing sentences
* Updating npm dependencies
* ...
### Medium-sized Changes: Create a Pull Request
If you want to contribute a medium-sized change, please create a pull request in the respective repository.
Examples:
* Any changes to the documentation
* Bug fixes that do not require large changes or a redesign (e.g., fixing a segfault)
* Smallish new features (e.g., adding a new command-line option or a new SQL function such as `sin`)
* Adding new tests
### Large Changes: Discuss first
If you want to contribute a larger change, please open an issue in the respective repository first.
This way, we can discuss the change before you start working on it and we can avoid situations like:
* You working on a feature that is already in development
* You working on a feature that is not in line with the project's goals and won't be merged
* You working on a feature that will soon stop working due to other changes in the project

Examples:
* Adding a new compilation backend/target
* Refactoring the SQL parser
* Refactorings
* Larger features that touch the code base in many places
* Anything that is more "researchy"
### Before Creating a Pull Request
Before creating a pull request, please make sure that
* the CI pipeline passes and the coverage does not decrease.
* the code is formatted according to the `.clang-format` file in the repository (e.g., by running `clang-format` as sketched below)
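
For the formatting check, running `clang-format` over the touched sources is usually sufficient. The following is a minimal sketch that assumes the C++ sources live under `src/` and `include/` (adjust the paths to the actual repository layout):

```sh
# Re-format C++ sources in place using the repository's .clang-format file.
# The directory names are assumptions, not the verified repository layout.
find src include -name '*.cpp' -o -name '*.h' | xargs clang-format -i
```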
Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
Especially debugging and profiling can become a challenge, as one not only needs to debug and profile the engine code, but also the generated code.
Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321) and [profiling Umbra](https://dl.acm.org/doi/abs/10.1145/3447786.3456254).
## Guide: Profiling queries
For profiling queries, LingoDB comes with a *ct* tool that collects several metrics.
For the following instructions, we assume that LingoDB was built in release mode with debug information (i.e., `build/lingodb-relwithdebinfo/.buildstamp` exists).
1. Run the ct.py script with query and dataset: `python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/`. If the build directory is not `build/lingodb-relwithdebinfo`, it can be supplied with the `BIN_DIR` environment variable (see the example below).
2. Open the resulting `ct.json` file with the [CT viewer](https://ct.lingo-db.com) and explore it in detail
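
For example, an invocation with a non-default build directory could look as follows (the build-directory name is only an illustration):

```sh
# Profile TPC-H query 1 on the SF-1 dataset using a custom build directory.
# The collected metrics end up in ct.json (see step 2).
BIN_DIR=build/custom-relwithdebinfo python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/
```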
## Guide: Debugging
* If the compilation fails: Use [Snapshotting](#snapshotting) to identify the broken/problematic pass. Then run the pass in isolation with [mlir-db-opt](../GettingStarted/CommandLineTools.md#performing-optimizations-and-lowerings) for detailed debugging (e.g., with gdb).
* If compilation succeeds but the execution fails in (or because of) the generated code: First check whether the error persists when switching to the [C++-Backend](#c-backend), if possible (i.e., if all used MLIR operations are supported; see the sketch below).
  * If yes: debug with this backend.
  * If not: you should use the [LLVM Debug Backend](#llvm-debug-backend).
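
As a sketch of the second case: re-run the failing query with the C++-Backend selected via the corresponding setting (binary, database, and query paths are placeholders, and the exact `sql` invocation may differ):

```sh
# Re-run the failing query with the C++ backend to see whether the error persists.
# Placeholder paths; see the command line tools documentation for exact usage.
LINGODB_EXECUTION_MODE=C ./build/lingodb-relwithdebinfo/sql my-database < failing-query.sql
```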
## Components for Debugging and Profiling
### Location Tracking in MLIR
In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
While it is possible to provide an *Unknown Location*, it should be avoided.
When new operations are created during a pass, they are usually annotated with the location of the original operation.

### Snapshotting
MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
This file is then read back in, with each operation now annotated with its location *inside this newly written file*.
If enabled (cf. [Settings](Settings.md)), LingoDB performs location snapshots either after every MLIR pass or only after selected (important) passes.
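
Based on the settings listed in [Settings](Settings.md), enabling snapshots for a single run could look like this (the query invocation at the end is a placeholder):

```sh
# Enable snapshotting, keep only the important snapshots, and write them
# to a dedicated directory. The final command is only a placeholder.
export LINGODB_SNAPSHOT_PASSES=true
export LINGODB_SNAPSHOT_LEVEL=important
export LINGODB_SNAPSHOT_DIR=./snapshots
./build/lingodb-relwithdebinfo/sql my-database < query.sql
```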
Using these snapshot files, we can track the origin of any operation by recursively applying the following steps:
1. get the origin location of the current operation by looking in the appropriate snapshot file
2. find the origin operation by going to this location
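
For example, if an operation's location points to line 42 of an earlier snapshot file, the origin operation can be looked up directly (the file name and line number are purely illustrative):

```sh
# Print line 42 of an earlier snapshot file to inspect the origin operation.
# Snapshot file names depend on the snapshot settings; this one is only an example.
sed -n '42p' ./snapshots/snapshot-1.mlir
```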
### Special Compiler Backends
In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
#### LLVM-Debug Backend
Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
During execution, standard debuggers like `gdb` will then point to the corresponding operation in the last snapshot that was taken.
This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
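
For example, running a query under `gdb` with this backend could look like this (binary and database paths are placeholders):

```sh
# Run with the debug backend under gdb; on a crash, the reported source
# location points into the last MLIR snapshot. Paths are placeholders.
LINGODB_EXECUTION_MODE=DEBUGGING gdb --args ./build/lingodb-relwithdebinfo/sql my-database
```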
#### C++-Backend
This backend instead translates the generated MLIR module into C++ code, which is compiled into a shared library. This shared library is then loaded with `dlopen` and the main function is called.
Thus, the generated code can be debugged as any usual C++ program.
To help with tracking an error to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
### Lightweight Tracing
When compiled as `RelWithDebInfo`, LingoDB will produce a trace file with events (type, start timestamp, duration, thread) as trace.json.
This trace file can then be opened with the [CT Viewer](https://ct.lingo-db.com).
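
For example, the trace output directory can be redirected via the corresponding setting (see [Settings](Settings.md)); the invocation below is only a sketch with placeholder paths:

```sh
# Write trace.json to a dedicated directory instead of the current one.
LINGODB_TRACE_DIR=./traces ./build/lingodb-relwithdebinfo/sql my-database < query.sql
```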
* All "non-standard" dependencies are packaged as python programs
* Also MLIR/LLVM is packaged as a python program.

***This will be subject to change in the near future!*** We are working on using system-wide installed MLIR/LLVM packages and reducing the number of dependencies in general.

| Setting | Environment Variable | Description | Possible Values |
|---|---|---|---|
|`system.execution_mode`|`LINGODB_EXECUTION_MODE`| Choose execution backend |`DEFAULT`: LLVM O2<br/> `CHEAP`: fast LLVM <br/> `SPEED`: omit checks for speed<br/> `DEBUGGING`: LLVM O0 with debug info<br/> `C`: C Backend<br/> `PERF`: LLVM O2, with debug info, record with perf |
|`system.subop.opt`|`LINGODB_SUBOP_OPT`| Manually select SubOp optimizations | Comma-separated list of the following pass names: `GlobalOpt`, `ReuseLocal`, `Specialize`, `PullGatherUp`, `Compression`|
|`system.snapshot_passes`|`LINGODB_SNAPSHOT_PASSES`| Enables [snapshotting](Debugging.md#snapshotting)| Boolean value: `true` or `false`|
|`system.snapshot_level`|`LINGODB_SNAPSHOT_LEVEL`| Sets the level of detail for snapshotting |`full`: perform a snapshot after every MLIR pass<br/>`important`: only perform snapshots at selected steps in the compilation pipeline |
|`system.snapshot_dir`|`LINGODB_SNAPSHOT_DIR`| Output directory for snapshots | (relative) path to output directory (default: `.`) |
|`system.execution.perf_file`|`LINGODB_EXECUTION_PERF_FILE`| Sets the output path for the perf record output | (relative) output path (default: `perf.data`) |
|`system.execution.perf_binary`|`LINGODB_EXECUTION_PERF_BINARY`| Points to the perf binary that should be used for recording | path to perf binary (default: `/usr/bin/perf`) |
|`system.trace_dir`|`LINGODB_TRACE_DIR`| Sets the output directory for [lightweight tracing](Debugging.md#lightweight-tracing)| (relative) path to output directory (default: `.`) |
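
As an illustration of how these settings combine, a `perf`-based profiling run could be configured as follows (the profiled command at the end is a placeholder):

```sh
# Record a query execution with perf, using a custom output file and perf binary.
export LINGODB_EXECUTION_MODE=PERF
export LINGODB_EXECUTION_PERF_FILE=./query1.perf.data
export LINGODB_EXECUTION_PERF_BINARY=/usr/bin/perf
./build/lingodb-relwithdebinfo/sql my-database < query.sql   # placeholder invocation
```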
LingoDB supports common OLAP benchmarks such as TPC-H, TPC-DS, JOB and SSB.
## Please avoid common pitfalls
* ***Don't use one invocation of the `sql` command to both define the schema, import the data, and then run benchmark queries.*** This limitation is expected to be resolved in the future!
* Use the right LingoDB version. If you want to reproduce LingoDB's performance reported in a paper, please use the corresponding LingoDB version:
  * Also note that the numbers reported as execution time in VLDB'22 and VLDB'23 *exclude compilation times*.
* Do *not* manually create Apache Arrow files, but instead use the `sql` command to define tables and import data. If relevant metadata (e.g., primary keys) is missing, LingoDB will not be able to apply many optimizations and performance will be suboptimal.
* Use a release build of LingoDB for benchmarking. Debug builds are significantly slower.
## Data Generation
For some benchmarks, the LingoDB repository contains scripts to generate and load the data:
```sh
# LINGODB_BINARY_DIR is the directory containing at least the `sql` binary
# OUTPUT_DIR is the directory where the database should be stored
# SF is the scale factor, e.g., 1 for 1GB, 10 for 10GB, etc.

# Generate TPC-H database
bash tools/generate/tpch.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
# Generate TPC-DS database
bash tools/generate/tpcds.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
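# Generate SSB database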
bash tools/generate/ssb.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
```
Afterward, queries can, for example, be run with the `sql` command, which also reports execution times when the `LINGODB_SQL_REPORT_TIMES` environment variable is set:
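
A sketch of such an invocation (the exact arguments of the `sql` command and the value format of the environment variable are assumptions; paths are placeholders):

```sh
# Run a query file against the generated database and print execution times.
LINGODB_SQL_REPORT_TIMES=true LINGODB_BINARY_DIR/sql OUTPUT_DIR < resources/sql/tpch/1.sql
```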