Commit 353f79e

committed
work on docs during preparations for v0.0.2
1 parent f72caec commit 353f79e

File tree

6 files changed

+122
-34
lines changed

docs/ForDevelopers/Contributing.md

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
2+
LingoDB is an open-source project that welcomes contributions from the community.
3+
However, it is also a research project that still undergoes major changes (often not in public repositories) that might conflict with your contributions.
4+
Furthermore, the project is developed by a very small team of researchers and students, which means that we have limited resources to review and merge pull requests.
5+
Finally, we have to ensure that the codebase stays maintainable and that the project's goals are met.
6+
Thus, please follow the guidelines below when planning to contribute to LingoDB.
7+
8+
### Micro-Changes such as fixing typos, etc
9+
If you find a small typo or similar in one of the LingoDB repositories, please open an *Issue* in the respective repository.
10+
We won't accept pull requests for such small changes, but we will be happy to fix them ourselves as soon as possible.
11+
12+
Examples:
13+
* Typos
14+
* Slight rephrasing of existing sentences
15+
* Updating npm dependencies
16+
* ...
17+
18+
### Medium-sized Changes: Create a Pull Request
19+
If you want to contribute a medium-sized change, please create a pull request in the respective repository.
20+
21+
Examples:
22+
* Any changes to the documentation
23+
* Bug-Fixes that do not require large changes/redesign (e.g., fixing a segfault)
24+
* Smallish new features (e.g., adding a new command line option, adding a new SQL function (e.g., `sin`))
25+
* Adding new tests
26+
27+
### Large Changes: Discuss first
28+
If you want to contribute a larger change, please open an issue in the respective repository first.
29+
This way, we can discuss the change before you start working on it and we can avoid situations like:
30+
* You working on a feature that is already in development
31+
* You working on a feature that is not in line with the project's goals and won't be merged
32+
* You working on a feature that will not be working soon due to other changes in the project
33+
34+
Examples:
35+
* Add a new compilation backend/target
36+
* Refactor the SQL parser
37+
* Refactorings
38+
* Larger features that touch the code base in many places
39+
* Anything that is more "researchy"
40+
41+
### Before Creating a Pull Request
42+
Before creating a pull request, please make sure that
43+
* the CI pipeline passes and the coverage does not decrease
44+
* the code is formatted according to the `.clang-format` file in the repository

docs/ForDevelopers/Debugging.md

Lines changed: 20 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,25 @@
11
---
2-
title: Debugging
2+
title: Debugging & Profiling
33
---
44

55
Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
6-
Especially debugging can become a challenge, as one not only needs to debug the engine code, but also the generated code.
7-
When debugging generated code typically two main questions arise:
6+
Debugging and profiling in particular can become a challenge, as one needs to debug and profile not only the engine code but also the generated code.
7+
Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321) and [profiling Umbra](https://dl.acm.org/doi/abs/10.1145/3447786.3456254).
88

9-
1. Where exactly is the generated code wrong?
10-
2. Where does this wrong part come from?
9+
## Guide: Profiling queries
10+
For profiling queries, LingoDB comes with a *ct* tool that collects several metrics.
11+
For the following instructions, we assume that LingoDB was built in release mode with debug information (`build/lingodb-relwithdebinfo/.buildstamp`).
1112

12-
Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321).
13+
1. Run the `ct.py` script with a query and a dataset: `python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/`. If the build directory is not `build/lingodb-relwithdebinfo`, it can be supplied with the `BIN_DIR` environment variable.
14+
2. Open the resulting `ct.json` file with the [CT viewer](https://ct.lingo-db.com) and explore it in detail
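
The invocation from step 1 can also be scripted. The following is a minimal Python sketch, assuming only what is stated above: the script path, its two positional arguments, and the `BIN_DIR` environment variable. The helper name `build_ct_invocation` and the custom build directory are hypothetical.

```python
import os


def build_ct_invocation(query, dataset, bin_dir=None, base_env=None):
    """Return (argv, env) for running tools/ct/ct.py on one query.

    bin_dir is only set when the build directory differs from the
    default build/lingodb-relwithdebinfo.
    """
    argv = ["python3", "tools/ct/ct.py", query, dataset]
    env = dict(os.environ if base_env is None else base_env)
    if bin_dir is not None:
        env["BIN_DIR"] = bin_dir
    return argv, env


argv, env = build_ct_invocation(
    "resources/sql/tpch/1.sql",
    "resources/data/tpch-1/",
    bin_dir="build/my-custom-build",  # hypothetical build directory
)
# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

Keeping command construction separate from execution makes it easy to loop over several queries before opening the resulting `ct.json` files.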
1315

14-
## General Approach in LingoDB
15-
To solve these challenges in LingoDB, we use a combination of location tracking, snapshotting, and alternative execution engines.
16+
## Guide: Debugging
17+
* If the compilation fails: Use [Snapshotting](#snapshotting) to identify the broken/problematic pass. Then run the pass in isolation with [mlir-db-opt](../GettingStarted/CommandLineTools.md#performing-optimizations-and-lowerings) for detailed debugging (e.g., with gdb).
18+
* If compilation succeeds but execution fails in/because of generated code: First check if the error persists when switching to the [C++-Backend](#c-backend) if possible (i.e., all MLIR operations are supported)
19+
* If yes: debug with this backend.
20+
* If not: you should use the [LLVM Debug Backend](#llvm-debug-backend)
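
The decision procedure above can be written down as a small function. This is only an illustrative Python sketch of the guide; the function name and the returned strings are hypothetical and not part of LingoDB.

```python
def choose_debug_strategy(compilation_ok,
                          cpp_backend_supported=False,
                          error_persists_with_cpp=False):
    """Map the symptoms described in the guide to a debugging approach."""
    if not compilation_ok:
        # Compilation fails: find the failing pass, then rerun it in isolation.
        return "snapshotting + mlir-db-opt"
    if cpp_backend_supported and error_persists_with_cpp:
        # The error is reproducible in generated C++ code.
        return "C++ backend"
    # The C++ backend is unusable or does not reproduce the error.
    return "LLVM debug backend"


print(choose_debug_strategy(compilation_ok=False))
print(choose_debug_strategy(True, cpp_backend_supported=True,
                            error_persists_with_cpp=True))
print(choose_debug_strategy(True, cpp_backend_supported=False))
```

The sketch falls back to the LLVM debug backend both when the C++ backend cannot translate all operations and when the error does not reproduce with it.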
1621

22+
## Components for Debugging and Profiling
1723
### Location Tracking in MLIR
1824
In MLIR, every operation is associated with a *Location*, that must be provided during operation creation.
1925
While it is possible to provide an *Unknown Location*, it should be avoided.
@@ -25,31 +31,19 @@ When new operations are created during a pass they are usually annotated with th
2531
MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
2632
Then, this file is read back in, and each operation's location is set *according to its position inside this newly written file*.
2733

28-
If enabled, LingoDB performs multiple location snapshots on multiple abstraction levels (in the current working directory):
29-
1. `input.mlir`: initial MLIR module that is e.g., produced from an SQL query
30-
2. `snapshot-0.mlir`: location snapshot after query optimization
31-
3. `snapshot-1.mlir`: location snapshot after lowering high-level operators to sub-operators
32-
4. `snapshot-2.mlir`: location snapshot after lowering sub-operators to imperative operations
33-
5. `snapshot-3.mlir`: location snapshot after lowering high-level imperative operations
34-
6. `snapshot-4.mlir`: final location snapshot of low-level IR (e.g., llvm dialect)
34+
If enabled (cf. [Settings](Settings.md)), LingoDB performs location snapshots either after every MLIR pass or only after selected (important) passes.
3535

3636
Using these snapshot files, we can track the origin of any operation by recursively applying the following steps:
3737
1. get the origin location of the current operation by looking in the appropriate snapshot file
3838
2. find the origin operation by going to this location
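
The two steps above can be automated. The following is a minimal Python sketch, assuming that each operation carries an inline file location of the form `loc("file.mlir":line:col)` at the end of its line; snapshot files that print locations through `#loc` aliases would need an extra lookup step. The function name `trace_origin` is hypothetical.

```python
import re
from pathlib import Path

# Matches a trailing inline file location such as loc("some-file.mlir":42:5).
LOC_RE = re.compile(r'loc\("([^"]+)":(\d+):(\d+)\)\s*$')


def trace_origin(snapshot_dir, filename, line, max_steps=32):
    """Follow locations from (filename, line) back through earlier snapshots.

    Returns a list of (filename, line, operation text) tuples, starting at
    the given position and ending at the last position that could be resolved.
    """
    chain = []
    root = Path(snapshot_dir)
    for _ in range(max_steps):  # bound the walk in case of cyclic locations
        path = root / filename
        if not path.is_file():
            break
        lines = path.read_text().splitlines()
        if not 1 <= line <= len(lines):
            break
        text = lines[line - 1]
        chain.append((filename, line, text.strip()))
        match = LOC_RE.search(text)
        if match is None:
            break
        # Step 1: the origin location; step 2: go there in the next iteration.
        filename, line = match.group(1), int(match.group(2))
    return chain
```

Calling `trace_origin(".", "<last snapshot file>", 1234)` would then list the chain of operations from the low-level IR back to the earliest snapshot that is still on disk.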
3939

40-
For example, if the debugger reports a problem (e.g. SEGFAULT) at `snapshot-4.mlir:1234`,
41-
* We first go to line `1234` of `snapshot-4.mlir` for the problematic operation and look at the corresponding location data (e.g., `snapshot-3.mlir:42`)
42-
* Next, we visit line `42` of `snapshot-3.mlir` to find the corresponding higher-level operation and look at the corresponding location data (e.g., `snapshot-2.mlir:13`)
43-
* Next, we visit line `13` of `snapshot-2.mlir` to find the corresponding higher-level operation and look at the corresponding location data (e.g., `snapshot-1.mlir:5`)
44-
* Finally, we visit line `5` of `snapshot-1.mlir` to find the 'problematic' sub-operator.
45-
46-
### Compiler Backends for Debugging
40+
### Special Compiler Backends
4741
In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
4842

4943
#### LLVM-Debug Backend
5044
Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
5145
This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
52-
During the execution, standard debuggers like `gdb` will then point to the corresponding operation in `snapshot-4.mlir`.
46+
During the execution, standard debuggers like `gdb` will then point to the corresponding operation in the last snapshot that was performed.
5347
This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
5448

5549
#### C++-Backend
@@ -60,9 +54,7 @@ This shared library is then loaded with `dlopen` and the main function is called
6054
Thus, the generated code can be debugged as any usual C++ program.
6155
To help with tracking an error to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
6256

63-
#### When to choose which backend?
64-
In most cases, choosing the C++-Backend is the better option, as it makes debugging much more user-friendly.
65-
However, there are two cases when the LLVM-Debug backend should be used:
66-
1. The C++-Backend may fail if unsupported MLIR operations are used for which no translation to C++ code is defined
67-
2. The behavior of the C++-Backend deviates from the previously expected behavior (e.g., in the case of a bug in the lowering to llvm).
6857

58+
### Lightweight Tracing
59+
When compiled as `RelWithDebInfo`, LingoDB will produce a trace file with events (type, start timestamp, duration, thread) as trace.json.
60+
This trace file can then be opened with the [CT Viewer](https://ct.lingo-db.com).

docs/ForDevelopers/Dependencies.md

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,9 @@
11
* All "non-standard" dependencies are packaged as python programs
2-
* We are building LLVM
2+
* MLIR/LLVM is also packaged as a Python program.
3+
* ***This will be subject to change in the near future!*** We are working on using system-wide installed MLIR/LLVM packages and on reducing the number of dependencies in general.
34

5+
6+
### Building the custom MLIR/LLVM package
47
* in `tools/mlir-package`:
58
* `docker build -t mlir-package .`
69
* `docker run -v ".:/built-packages" -v ".:/repo" --rm -it mlir-package /usr/bin/create_package.sh cp312-cp312`

docs/ForDevelopers/Settings.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
---
2+
title: Settings
3+
---
4+
| Setting | Environment Variable | Description | Values |
5+
|--------------------------------|---------------------------------|---------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
6+
| `system.execution_mode` | `LINGODB_EXECUTION_MODE` | Choose execution backend | `DEFAULT`: LLVM O2<br/> `CHEAP`: fast LLVM <br/> `SPEED`: omit checks for speed<br/> `DEBUGGING`: LLVM O0 with debug info<br/> `C`: C Backend<br/> `PERF`: LLVM O2, with debug info, record with perf |
7+
| `system.subop.opt` | `LINGODB_SUBOP_OPT` | Manually select SubOp optimizations | Comma-separated list of the following pass names: `GlobalOpt`, `ReuseLocal`, `Specialize`, `PullGatherUp`, `Compression` |
8+
| `system.snapshot_passes` | `LINGODB_SNAPSHOT_PASSES` | Enables [snapshotting](Debugging.md#snapshotting) | Boolean value: `true` or `false` |
9+
| `system.snapshot_level` | `LINGODB_SNAPSHOT_LEVEL` | Sets the level of detail of snapshotting | `full`: perform a snapshot after every MLIR pass<br/>`important`: perform snapshots only at selected steps in the compilation pipeline |
10+
| `system.snapshot_dir` | `LINGODB_SNAPSHOT_DIR` | Directory for output of snapshots | (relative) path to output directory (default: `.`) |
11+
| `system.execution.perf_file` | `LINGODB_EXECUTION_PERF_FILE` | Sets the output path for the perf record output | (relative) path to output path (default: `perf.data`) |
12+
| `system.execution.perf_binary` | `LINGODB_EXECUTION_PERF_BINARY` | Points to the perf binary that should be used for recording | path to perf binary (default: `/usr/bin/perf`) |
13+
| `system.trace_dir` | `LINGODB_TRACE_DIR` | Sets the output directory for [lightweight tracing](Debugging.md#lightweight-tracing) | (relative) path to output directory (default: `.`) |
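
In the table above, every environment variable name appears to follow from the setting name: drop the `system.` prefix, replace dots with underscores, upper-case the result, and prepend `LINGODB_`. The following Python sketch encodes this observed pattern; it is inferred from the table and is not a documented guarantee, and both helper names are hypothetical.

```python
def setting_to_env_var(setting):
    """Derive the environment variable name from a `system.*` setting name."""
    prefix = "system."
    if not setting.startswith(prefix):
        raise ValueError(f"unexpected setting name: {setting!r}")
    return "LINGODB_" + setting[len(prefix):].replace(".", "_").upper()


def env_for_settings(settings):
    """Turn {setting name: value} into {environment variable: string}."""
    def fmt(value):
        # Boolean settings are written as `true`/`false` in the table.
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)
    return {setting_to_env_var(k): fmt(v) for k, v in settings.items()}


debug_env = env_for_settings({
    "system.execution_mode": "DEBUGGING",
    "system.snapshot_passes": True,
    "system.snapshot_level": "important",
    "system.snapshot_dir": "snapshots",  # hypothetical output directory
})
print(debug_env)
```

The resulting dictionary can be merged into `os.environ` before launching a LingoDB tool from a script.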
14+
15+
16+
Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
LingoDB supports common OLAP benchmarks such as TPC-H, TPC-DS, JOB and SSB.
2+
3+
## Please avoid common pitfalls
4+
* ***Don't use a single invocation of the `sql` command to define the schema, import the data, and then run benchmark queries.*** This limitation is expected to be resolved in the future!
5+
* Use the right LingoDB version. If you want to reproduce LingoDB's performance reported in a paper, please use the corresponding LingoDB version:
6+
* [VLDB'22](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2022)
7+
* [VLDB'23](https://github.com/lingo-db/lingo-db/releases/tag/paper-vldb-2023)
8+
* Also note that the numbers reported as execution time in VLDB'22 and VLDB'23 *exclude compilation times*
9+
* Do *not* manually create Apache Arrow files, but instead use the `sql` command to define tables and import data. If relevant metadata (e.g., primary keys) is missing, LingoDB will not be able to apply many optimizations and performance will be suboptimal.
10+
* Use a release build of LingoDB for benchmarking. Debug builds are significantly slower.
11+
12+
## Data Generation
13+
For some benchmarks, the LingoDB repository contains scripts to generate data and load them:
14+
```sh
15+
# LINGODB_BINARY_DIR is the directory containing at least the `sql` binary
16+
# OUTPUT_DIR is the directory where the database should be stored
17+
# SF is the scale factor, e.g., 1 for 1GB, 10 for 10GB, etc.
18+
19+
# Generate TPC-H database
20+
bash tools/generate/tpch.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
21+
# Generate TPC-DS database
22+
bash tools/generate/tpcds.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
23+
# Generate JOB database
24+
bash tools/generate/job.sh LINGODB_BINARY_DIR OUTPUT_DIR
25+
# Generate SSB database
26+
bash tools/generate/ssb.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
27+
```
28+
Afterward, queries can be run, for example, with the `sql` command, which also reports execution times when the `LINGODB_SQL_REPORT_TIMES` environment variable is set:
29+
```sh
30+
LINGODB_SQL_REPORT_TIMES=1 sql OUTPUT_DIR
31+
sql>select count(*) from lineitem;
32+
| count |
33+
----------------------------------
34+
| 6001215 |
35+
compilation: 95.79 [ms] execution: 2.815 [ms]
36+
```
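
When collecting benchmark numbers, the timing line can be parsed so that compilation and execution times are recorded separately. This is a minimal Python sketch, assuming the `name: value [ms]` format shown in the output above; the helper name `parse_times` is hypothetical.

```python
import re

# One "<name>: <number> [ms]" entry, e.g. "compilation: 95.79 [ms]".
TIMING_RE = re.compile(r"(\w+):\s*([0-9]*\.?[0-9]+)\s*\[ms\]")


def parse_times(line):
    """Extract {phase name: milliseconds} from a timing line of `sql`."""
    return {name: float(value) for name, value in TIMING_RE.findall(line)}


times = parse_times("compilation: 95.79 [ms] execution: 2.815 [ms]")
print(times)  # {'compilation': 95.79, 'execution': 2.815}
```

Keeping the two phases apart matters when comparing against published numbers that exclude compilation time.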
37+

docs/GettingStarted/Install.md

Lines changed: 1 addition & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -11,11 +11,7 @@ pip install lingodb
1111
```
1212

1313
## Docker Image
14-
Either use the
15-
* [prebuilt docker image](https://github.com/lingo-db/lingo-db/pkgs/container/lingo-db)
16-
* or build the docker image yourself using `make build-docker`
17-
18-
The docker image then contains all the command line tools under `/build/lingodb/`
14+
You can build the Docker image yourself using `make build-docker`.
1915

2016
## Building from source
2117
1. Ensure you have a machine with sufficient compute power and space
