LingoDB is an open-source project that welcomes contributions from the community.
However, it is also a research project that still undergoes major changes (often not in public repositories) that might conflict with your contributions.
Furthermore, the project is developed by a very small team of researchers and students, which means that we have limited resources to review and merge pull requests.
Finally, we have to ensure that the codebase stays maintainable and that the project's goals are met.
Thus, please follow the guidelines below when planning to contribute to LingoDB.
### Micro-Changes such as fixing typos, etc.
If you find a small typo or similar in one of the LingoDB repositories, please open an *Issue* in the respective repository.
We won't accept pull requests for such small changes, but we will be happy to fix them ourselves as soon as possible.
Examples:
* Typos
* Slight rephrasing of existing sentences
* Updating npm dependencies
* ...
### Medium-sized Changes: Create a Pull Request
If you want to contribute a medium-sized change, please create a pull request in the respective repository.
Examples:
* Any changes to the documentation
* Bug fixes that do not require large changes or a redesign (e.g., fixing a segfault)
* Smallish new features (e.g., adding a new command-line option or a new SQL function such as `sin`)
* Adding new tests
### Large Changes: Discuss first
If you want to contribute a larger change, please open an issue in the respective repository first.
This way, we can discuss the change before you start working on it and we can avoid situations like:
* You working on a feature that is already in development
* You working on a feature that is not in line with the project's goals and won't be merged
* You working on a feature that will soon stop working due to other changes in the project

Examples:
* Adding a new compilation backend/target
* Refactoring the SQL parser
* Refactorings
* Larger features that touch the code base in many places
* Anything that is more "researchy"
### Before Creating a Pull Request
Before creating a pull request, please make sure that
* the CI pipeline passes and the coverage does not decrease.
* the code is formatted according to the `.clang-format` file in the repository (e.g., by running `clang-format` as sketched below)
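
For the formatting check, running `clang-format` over the touched sources is usually sufficient. The following is a minimal sketch that assumes the C++ sources live under `src/` and `include/` (adjust the paths to the actual repository layout):

```sh
# Re-format C++ sources in place using the repository's .clang-format file.
# The directory names are assumptions, not the verified repository layout.
find src include -name '*.cpp' -o -name '*.h' | xargs clang-format -i
```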
Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
Especially debugging and profiling can become a challenge, as one not only needs to debug and profile the engine code, but also the generated code.
Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321) and [profiling Umbra](https://dl.acm.org/doi/abs/10.1145/3447786.3456254).
## Guide: Profiling queries
For profiling queries, LingoDB comes with a *ct* tool that collects several metrics.
For the following instructions, we assume that LingoDB was built in release mode with debug information (i.e., `build/lingodb-relwithdebinfo/.buildstamp` exists).
1. Run the ct.py script with query and dataset: `python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/`. If the build directory is not `build/lingodb-relwithdebinfo`, it can be supplied with the `BIN_DIR` environment variable (see the example below).
2. Open the resulting `ct.json` file with the [CT viewer](https://ct.lingo-db.com) and explore it in detail
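
For example, an invocation with a non-default build directory could look as follows (the build-directory name is only an illustration):

```sh
# Profile TPC-H query 1 on the SF-1 dataset using a custom build directory.
# The collected metrics end up in ct.json (see step 2).
BIN_DIR=build/custom-relwithdebinfo python3 tools/ct/ct.py resources/sql/tpch/1.sql resources/data/tpch-1/
```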
## Guide: Debugging
* If the compilation fails: Use [Snapshotting](#snapshotting) to identify the broken/problematic pass. Then run the pass in isolation with [mlir-db-opt](../GettingStarted/CommandLineTools.md#performing-optimizations-and-lowerings) for detailed debugging (e.g., with gdb).
* If compilation succeeds but the execution fails in (or because of) the generated code: First check whether the error persists when switching to the [C++-Backend](#c-backend), if possible (i.e., if all used MLIR operations are supported; see the sketch below).
  * If yes: debug with this backend.
  * If not: you should use the [LLVM Debug Backend](#llvm-debug-backend).
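
As a sketch of the second case: re-run the failing query with the C++-Backend selected via the corresponding setting (binary, database, and query paths are placeholders, and the exact `sql` invocation may differ):

```sh
# Re-run the failing query with the C++ backend to see whether the error persists.
# Placeholder paths; see the command line tools documentation for exact usage.
LINGODB_EXECUTION_MODE=C ./build/lingodb-relwithdebinfo/sql my-database < failing-query.sql
```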
## Components for Debugging and Profiling
### Location Tracking in MLIR
In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
While it is possible to provide an *Unknown Location*, it should be avoided.
When new operations are created during a pass, they are usually annotated with the location of the original operation.

### Snapshotting
MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
This file is then read back in, with each operation now annotated with its location *inside this newly written file*.
If enabled (cf. [Settings](Settings.md)), LingoDB performs location snapshots either after every MLIR pass or only after selected (important) passes.
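
Based on the settings listed in [Settings](Settings.md), enabling snapshots for a single run could look like this (the query invocation at the end is a placeholder):

```sh
# Enable snapshotting, keep only the important snapshots, and write them
# to a dedicated directory. The final command is only a placeholder.
export LINGODB_SNAPSHOT_PASSES=true
export LINGODB_SNAPSHOT_LEVEL=important
export LINGODB_SNAPSHOT_DIR=./snapshots
./build/lingodb-relwithdebinfo/sql my-database < query.sql
```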
Using these snapshot files, we can track the origin of any operation by recursively applying the following steps:
1. get the origin location of the current operation by looking in the appropriate snapshot file
2. find the origin operation by going to this location
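
For example, if an operation's location points to line 42 of an earlier snapshot file, the origin operation can be looked up directly (the file name and line number are purely illustrative):

```sh
# Print line 42 of an earlier snapshot file to inspect the origin operation.
# Snapshot file names depend on the snapshot settings; this one is only an example.
sed -n '42p' ./snapshots/snapshot-1.mlir
```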
### Special Compiler Backends
In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
#### LLVM-Debug Backend
Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
During execution, standard debuggers like `gdb` will then point to the corresponding operation in the last snapshot that was taken.
This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
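
For example, running a query under `gdb` with this backend could look like this (binary and database paths are placeholders):

```sh
# Run with the debug backend under gdb; on a crash, the reported source
# location points into the last MLIR snapshot. Paths are placeholders.
LINGODB_EXECUTION_MODE=DEBUGGING gdb --args ./build/lingodb-relwithdebinfo/sql my-database
```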
#### C++-Backend
This backend instead translates the generated MLIR module into C++ code, which is compiled into a shared library. This shared library is then loaded with `dlopen` and the main function is called.
Thus, the generated code can be debugged as any usual C++ program.
To help with tracking an error to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
### Lightweight Tracing
When compiled as `RelWithDebInfo`, LingoDB will produce a trace file with events (type, start timestamp, duration, thread) as trace.json.
This trace file can then be opened with the [CT Viewer](https://ct.lingo-db.com).
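
For example, the trace output directory can be redirected via the corresponding setting (see [Settings](Settings.md)); the invocation below is only a sketch with placeholder paths:

```sh
# Write trace.json to a dedicated directory instead of the current one.
LINGODB_TRACE_DIR=./traces ./build/lingodb-relwithdebinfo/sql my-database < query.sql
```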
* All "non-standard" dependencies are packaged as python programs
* Also MLIR/LLVM is packaged as a python program.

***This will be subject to change in the near future!*** We are working on using system-wide installed MLIR/LLVM packages and reducing the number of dependencies in general.

| Setting | Environment Variable | Description | Possible Values |
|---|---|---|---|
|`system.execution_mode`|`LINGODB_EXECUTION_MODE`| Choose execution backend |`DEFAULT`: LLVM O2<br/> `CHEAP`: fast LLVM <br/> `SPEED`: omit checks for speed<br/> `DEBUGGING`: LLVM O0 with debug info<br/> `C`: C Backend<br/> `PERF`: LLVM O2, with debug info, record with perf |
|`system.subop.opt`|`LINGODB_SUBOP_OPT`| Manually select SubOp optimizations | Comma-separated list of the following pass names: `GlobalOpt`, `ReuseLocal`, `Specialize`, `PullGatherUp`, `Compression`|
|`system.snapshot_passes`|`LINGODB_SNAPSHOT_PASSES`| Enables [snapshotting](Debugging.md#snapshotting)| Boolean value: `true` or `false`|
|`system.snapshot_level`|`LINGODB_SNAPSHOT_LEVEL`| Sets the level of detail for snapshotting |`full`: perform a snapshot after every MLIR pass<br/>`important`: only perform snapshots at selected steps in the compilation pipeline |
|`system.snapshot_dir`|`LINGODB_SNAPSHOT_DIR`| Output directory for snapshots | (relative) path to output directory (default: `.`) |
|`system.execution.perf_file`|`LINGODB_EXECUTION_PERF_FILE`| Sets the output path for the perf record output | (relative) output path (default: `perf.data`) |
|`system.execution.perf_binary`|`LINGODB_EXECUTION_PERF_BINARY`| Points to the perf binary that should be used for recording | path to perf binary (default: `/usr/bin/perf`) |
|`system.trace_dir`|`LINGODB_TRACE_DIR`| Sets the output directory for [lightweight tracing](Debugging.md#lightweight-tracing)| (relative) path to output directory (default: `.`) |
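
As an illustration of how these settings combine, a `perf`-based profiling run could be configured as follows (the profiled command at the end is a placeholder):

```sh
# Record a query execution with perf, using a custom output file and perf binary.
export LINGODB_EXECUTION_MODE=PERF
export LINGODB_EXECUTION_PERF_FILE=./query1.perf.data
export LINGODB_EXECUTION_PERF_BINARY=/usr/bin/perf
./build/lingodb-relwithdebinfo/sql my-database < query.sql   # placeholder invocation
```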
LingoDB supports common OLAP benchmarks such as TPC-H, TPC-DS, JOB and SSB.
## Please avoid common pitfalls
* ***Don't use one invocation of the `sql` command to both define the schema, import the data, and then run benchmark queries.*** This limitation is expected to be resolved in the future!
* Use the right LingoDB version. If you want to reproduce LingoDB's performance reported in a paper, please use the corresponding LingoDB version:
  * Also note that the numbers reported as execution time in VLDB'22 and VLDB'23 *exclude compilation times*.
* Do *not* manually create Apache Arrow files, but instead use the `sql` command to define tables and import data. If relevant metadata (e.g., primary keys) is missing, LingoDB will not be able to apply many optimizations and performance will be suboptimal.
* Use a release build of LingoDB for benchmarking. Debug builds are significantly slower.
## Data Generation
For some benchmarks, the LingoDB repository contains scripts to generate and load the data:
```sh
# LINGODB_BINARY_DIR is the directory containing at least the `sql` binary
# OUTPUT_DIR is the directory where the database should be stored
# SF is the scale factor, e.g., 1 for 1GB, 10 for 10GB, etc.

# Generate TPC-H database
bash tools/generate/tpch.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
# Generate TPC-DS database
bash tools/generate/tpcds.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
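# Generate SSB database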
bash tools/generate/ssb.sh LINGODB_BINARY_DIR OUTPUT_DIR SF
```
Afterward, queries can, for example, be run with the `sql` command, which also reports execution times when the `LINGODB_SQL_REPORT_TIMES` environment variable is set:
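
A sketch of such an invocation (the exact arguments of the `sql` command and the value format of the environment variable are assumptions; paths are placeholders):

```sh
# Run a query file against the generated database and print execution times.
LINGODB_SQL_REPORT_TIMES=true LINGODB_BINARY_DIR/sql OUTPUT_DIR < resources/sql/tpch/1.sql
```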