Commit f72caec

Tag old documentation as v0.0.1
1 parent e6cb2de commit f72caec

File tree

17 files changed: +546 −0 lines changed


docusaurus.config.js

Lines changed: 5 additions & 0 deletions

```diff
@@ -75,6 +75,11 @@ const config = {
           position: 'left',
           label: 'Docs',
         },
+        {
+          type: 'docsVersionDropdown',
+          position: 'right',
+          dropdownActiveClassDisabled: true,
+        },
         //{to: '/blog', label: 'Blog', position: 'left'},
         {
           href: 'https://github.com/lingo-db/lingo-db',
```
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@

---
title: Storage
weight: 1
---
The research conducted with LingoDB does not focus on the storage aspects of database systems.
Thus, LingoDB does not come with an optimized storage backend and currently does not provide transactional semantics.

## In-Memory Format: Apache Arrow
The Apache Arrow columnar layout is used for the in-memory representation of tabular data.
Thus, LingoDB can exchange data with existing libraries and frameworks without any overhead and can directly query Apache Arrow tables.

## Persistent Storage
For many practical purposes, persistent storage is required.
We chose a pragmatic approach:

1. Each database is represented by multiple files placed in one *database directory*.
2. In this directory, each table is represented by multiple files, each starting with the name of the table:
   1. *name*`.metadata.json`: stores metadata relevant to LingoDB. This includes basic information such as column names and internal column types, but also statistics and available indices.
   2. *name*`.arrow`: stores the contents of the table using Apache Arrow's IPC format.
   3. *name*`.arrow.sample`: optionally stores a sample of up to 1024 rows randomly selected from the table.

Given the database directory, LingoDB automatically detects the available tables and loads the metadata, data, and samples.
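The table-discovery convention above can be sketched in a few lines of Python. This is only an illustration of the directory layout: the metadata fields used here are hypothetical, not LingoDB's actual JSON schema.

```python
import json
import os
import tempfile

def discover_tables(db_dir):
    # A table exists for every "<name>.metadata.json" file in the directory.
    tables = {}
    for entry in os.listdir(db_dir):
        if entry.endswith(".metadata.json"):
            name = entry[: -len(".metadata.json")]
            with open(os.path.join(db_dir, entry)) as f:
                tables[name] = json.load(f)
    return tables

# Build a toy database directory containing one table "t".
with tempfile.TemporaryDirectory() as db_dir:
    meta = {"columns": [{"name": "x", "type": "i64"}]}  # hypothetical fields
    with open(os.path.join(db_dir, "t.metadata.json"), "w") as f:
        json.dump(meta, f)
    open(os.path.join(db_dir, "t.arrow"), "wb").close()  # table contents (IPC)
    print(list(discover_tables(db_dir)))  # ['t']
```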
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@

---
title: Design
type: docs
weight: 4
---

This section gives an overview of the overall design of LingoDB.
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@

---
title: Debugging
---

Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
Debugging in particular can become a challenge, as one not only needs to debug the engine code, but also the generated code.
When debugging generated code, two main questions typically arise:

1. Where exactly is the generated code wrong?
2. Where does this wrong part come from?

Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321).

## General Approach in LingoDB
To solve these challenges in LingoDB, we use a combination of location tracking, snapshotting, and alternative execution engines.

### Location Tracking in MLIR
In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
While it is possible to provide an *Unknown Location*, this should be avoided.
When parsing an MLIR file, MLIR automatically annotates the parsed operations with the corresponding file locations.
When new operations are created during a pass, they are usually annotated with the location of the current operation that is transformed or lowered.
**All passes in LingoDB ensure that correct locations are set afterwards.**

### Snapshotting
MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
This file is then read back in, annotating the operations with locations *according to their position inside this newly written file*.

If enabled, LingoDB performs location snapshots at multiple abstraction levels (in the current working directory):
1. `input.mlir`: initial MLIR module that is, e.g., produced from an SQL query
2. `snapshot-0.mlir`: location snapshot after query optimization
3. `snapshot-1.mlir`: location snapshot after lowering high-level operators to sub-operators
4. `snapshot-2.mlir`: location snapshot after lowering sub-operators to imperative operations
5. `snapshot-3.mlir`: location snapshot after lowering high-level imperative operations
6. `snapshot-4.mlir`: final location snapshot of low-level IR (e.g., the llvm dialect)

Using these snapshot files, we can track the origin of any operation by recursively applying two steps:
1. get the origin location of the current operation by looking in the appropriate snapshot file
2. find the origin operation by going to this location

For example, if the debugger reports a problem (e.g., a SEGFAULT) at `snapshot-4.mlir:1234`:
* We first go to line `1234` of `snapshot-4.mlir` for the problematic operation and look at the corresponding location data (e.g., `snapshot-3.mlir:42`)
* Next, we visit line `42` of `snapshot-3.mlir` to find the corresponding higher-level operation and look at the corresponding location data (e.g., `snapshot-2.mlir:13`)
* Next, we visit line `13` of `snapshot-2.mlir` to find the corresponding higher-level operation and look at the corresponding location data (e.g., `snapshot-1.mlir:5`)
* Finally, we visit line `5` of `snapshot-1.mlir` to find the 'problematic' sub-operator.

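The recursive lookup can even be scripted. The toy tracer below is not a LingoDB tool; it merely assumes that each snapshot line carries MLIR's textual `loc("file":line:col)` annotation and follows the chain until no further location is found.

```python
import os
import re
import tempfile

LOC = re.compile(r'loc\("([^"]+)":(\d+):\d+\)')

def origin_chain(snapshot_dir, file, line):
    # Follow loc(...) annotations from low-level IR back to high-level IR.
    chain = [(file, line)]
    while True:
        path = os.path.join(snapshot_dir, file)
        if not os.path.exists(path):
            break
        lines = open(path).read().splitlines()
        match = LOC.search(lines[line - 1]) if line <= len(lines) else None
        if match is None:
            break
        file, line = match.group(1), int(match.group(2))
        chain.append((file, line))
    return chain

# Toy snapshots: snapshot-4 points into snapshot-3, which points into snapshot-2.
with tempfile.TemporaryDirectory() as d:
    snaps = {
        "snapshot-4.mlir": ['%0 = llvm.load %p loc("snapshot-3.mlir":2:1)'],
        "snapshot-3.mlir": ["module {", '%0 = db.constant(2) loc("snapshot-2.mlir":1:1)'],
        "snapshot-2.mlir": ["%0 = relalg.selection ..."],
    }
    for name, content in snaps.items():
        with open(os.path.join(d, name), "w") as f:
            f.write("\n".join(content))
    print(origin_chain(d, "snapshot-4.mlir", 1))
    # [('snapshot-4.mlir', 1), ('snapshot-3.mlir', 2), ('snapshot-2.mlir', 1)]
```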
### Compiler Backends for Debugging
In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.

#### LLVM-Debug Backend
Instead of the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
During execution, standard debuggers like `gdb` will then point to the corresponding operation in `snapshot-4.mlir`.
This enables basic tracking of problematic operations, but advanced debugging will remain difficult.

#### C++ Backend
For more advanced debugging, a *C++ backend* can be used by setting `LINGODB_EXECUTION_MODE=C`.
This backend directly translates a fixed set of low-level generic MLIR operations to C++ statements and functions that are written to a file called `mlir-c-module.cpp`.
Next, LingoDB automatically invokes `clang++` (which must be installed!) with `-O0` and `-g` to compile this C++ file into a shared library with debug information.
This shared library is then loaded with `dlopen` and the main function is called.
Thus, the generated code can be debugged like any usual C++ program.
To help with tracing an error back to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.

#### When to choose which backend?
In most cases, the C++ backend is the better option, as it makes debugging much more user-friendly.
However, there are two cases in which the LLVM-Debug backend should be used:
1. The C++ backend may fail if unsupported MLIR operations are used for which no translation to C++ code is defined.
2. The behavior of the C++ backend deviates from the previously expected behavior (e.g., in the case of a bug in the lowering to llvm).
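Since the backend switch is just an environment variable, a debugging run can be scripted. The sketch below uses `run-sql` as a stand-in for any LingoDB entry point; the helper only builds the command and environment, and does not assume the tool is installed.

```python
import os
import subprocess

def debug_invocation(sql_file, db_dir, mode="C"):
    # mode: "C" selects the C++ backend, "DEBUGGING" the LLVM-debug backend.
    env = dict(os.environ, LINGODB_EXECUTION_MODE=mode)
    cmd = ["run-sql", sql_file, db_dir]
    return cmd, env

def run_debug(sql_file, db_dir, mode="C"):
    # With mode="C", generated code lands in mlir-c-module.cpp, which can
    # then be inspected or stepped through with a normal C++ debugger.
    cmd, env = debug_invocation(sql_file, db_dir, mode)
    return subprocess.run(cmd, env=env)

cmd, env = debug_invocation("test.sql", "resources/data/uni/", mode="DEBUGGING")
print(env["LINGODB_EXECUTION_MODE"])  # DEBUGGING
```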
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@

* All "non-standard" dependencies are packaged as Python packages
* We are building LLVM

* in `tools/mlir-package`:
  * `docker build -t mlir-package .`
  * `docker run -v ".:/built-packages" -v ".:/repo" --rm -it mlir-package /usr/bin/create_package.sh cp312-cp312`
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@

---
title: Python Package
---

Currently, LingoDB is distributed as two separate Python packages:
* `lingodb-bridge`: bundles LingoDB as a binary and implements a basic integration using pybind11
* `lingodb`: a Python-only library that wraps `lingodb-bridge` and provides a nice interface (and much more in the future)

## Working on `lingo-db`
If you only plan to adapt/extend the Python implementation, you do not have to build the `lingodb-bridge` package yourself.
First, install the current version of the `lingodb-bridge` package:
```sh
pip install lingodb-bridge
```
Then, install the package in *development mode* so that you can just change the code (`tools/python/lingodb`) and directly test the changes:
```sh
cd tools/python
python -m pip install -e .
```
For building a release package:
```sh
cd tools/python
python -m build .
```

## Building `lingodb-bridge`
Building a Python binary wheel is non-trivial but becomes easy with the docker image we prepared. Just execute the following command at the repository's root:
```sh
make build-py-bridge PYVERSION=[VERSION]
```
where `[VERSION]` is one of:
* `310`: for Python 3.10
* `311`: for Python 3.11
* `312`: for Python 3.12

This will then create a wheel in the current directory that can be installed, e.g.:
```
pip install lingodb_bridge-0.0.0-cp310-cp310-manylinux_2_28_x86_64.whl
```
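The `cp310-cp310` part of that wheel filename encodes the interpreter and ABI tags from PEP 427, matching the `PYVERSION` you built for. A small illustrative helper (it ignores the optional build tag) can pull these components apart:

```python
def parse_wheel_name(filename):
    # PEP 427: {name}-{version}-{python tag}-{abi tag}-{platform tag}.whl
    # (the optional build tag is not handled in this sketch)
    name, version, python_tag, abi_tag, platform_tag = \
        filename[: -len(".whl")].split("-")
    return {"name": name, "version": version, "python": python_tag,
            "abi": abi_tag, "platform": platform_tag}

info = parse_wheel_name("lingodb_bridge-0.0.0-cp310-cp310-manylinux_2_28_x86_64.whl")
print(info["python"])  # cp310
```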

versioned_docs/version-0.0.1/ForDevelopers/Settings.md

Whitespace-only changes.
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@

---
title: Command Line Tools
type: docs
weight: 2
---
LingoDB comes with a few command line tools to simplify experimentation, development, and debugging.

## Interactive SQL Shell
```sh
$ sql DBDIR
sql> select 1
```

Similar to other systems, LingoDB can also be used interactively using the `sql` binary, which is pointed to a (possibly empty) directory that holds the database to be queried. Each query must be terminated by a `;`. By default, only a *read-only* session is created. For persistent changes, enter `SET persist=1;`.

## Converting SQL to MLIR
```sh
$ sql-to-mlir SQL-File DBDIR
```
Using the `sql-to-mlir` tool, SQL queries can be converted to a corresponding, unoptimized MLIR module. As this requires the database schema, the database directory must also be provided.

## Performing Optimizations and Lowerings
```sh
$ mlir-db-opt [--use-db DBDIR] [Passes] MLIR-File
```
The `mlir-db-opt` command can be used to manually apply MLIR passes to an MLIR module provided by a file. For high-level optimizations that require, e.g., database statistics, the database directory should be provided using the `--use-db` argument.

## Running MLIR Modules

```sh
$ run-mlir MLIR-File [DBDIR]
```
MLIR modules can be executed using the `run-mlir` binary. A database directory can be provided as the second argument.

## Running SQL queries

```sh
$ run-sql SQL-File [DBDIR]
```
Single (read-only) SQL queries can be run with the `run-sql` utility. If the query requires a database, the corresponding database directory must be provided as the second argument.


## The Trace of a Query
With the following commands, you can explore how an SQL query gets compiled layer by layer by looking at the different files:
```sh
# write example query to file
$ echo "select * from studenten where name='Carnap'" > test.sql
# translate sql to canonical MLIR module
$ sql-to-mlir test.sql resources/data/uni/ > canonical.mlir
# perform query optimization
$ mlir-db-opt --use-db resources/data/uni/ --relalg-query-opt canonical.mlir > optimized.mlir
# lower relational operators to sub-operators
$ mlir-db-opt --lower-relalg-to-subop optimized.mlir > subop.mlir
# lower sub-operators to imperative code
$ mlir-db-opt --lower-subop subop.mlir > hl-imperative.mlir
# lower database-specific scalar operations
$ mlir-db-opt --lower-db hl-imperative.mlir > ml-imperative.mlir
# lower mid-level abstractions (such as arrow tables) to low-level imperative code
$ mlir-db-opt --lower-dsa ml-imperative.mlir > ll-imperative.mlir
```
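The lowering chain above can also be generated programmatically, which makes it easy to stop at, or diff, any intermediate level. The flags and file names below simply mirror the shell walkthrough; nothing here is a LingoDB API.

```python
# Each stage: (mlir-db-opt flags, output file), mirroring the walkthrough above.
STAGES = [
    (["--use-db", "resources/data/uni/", "--relalg-query-opt"], "optimized.mlir"),
    (["--lower-relalg-to-subop"], "subop.mlir"),
    (["--lower-subop"], "hl-imperative.mlir"),
    (["--lower-db"], "ml-imperative.mlir"),
    (["--lower-dsa"], "ll-imperative.mlir"),
]

def pipeline_commands(initial="canonical.mlir"):
    # Chain the stages: each stage reads the previous stage's output file.
    cmds, current = [], initial
    for flags, out in STAGES:
        cmds.append(["mlir-db-opt", *flags, current, ">", out])
        current = out
    return cmds

for cmd in pipeline_commands():
    print(" ".join(cmd))
```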
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@

---
title: Installation
type: docs
weight: 1
---

## Python Package
Install via pip, then use as [documented here](./Python.md):
```
pip install lingodb
```

## Docker Image
Either use the
* [prebuilt docker image](https://github.com/lingo-db/lingo-db/pkgs/container/lingo-db)
* or build the docker image yourself using `make build-docker`

The docker image then contains all the command line tools under `/build/lingodb/`.

## Building from source
1. Ensure you have a machine with sufficient compute power and disk space.
1. Make sure that you have the following build dependencies installed:
   1. Python 3.10 or higher
   1. standard build tools, including `cmake` and `Ninja`
1. Build LingoDB:
   * Debug version: `make build-debug` (will create binaries under `build/lingodb-debug`)
   * Release version: `make build-release` (will create binaries under `build/lingodb-release`)
1. Run tests: `make run-test`
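The dependency list above can be sanity-checked before starting a long build. This is a best-effort sketch: it only inspects the running interpreter and what is visible on `PATH`.

```python
import shutil
import sys

def check_build_prereqs(tools=("cmake", "ninja")):
    # Returns a list of human-readable problems; an empty list means
    # the checked prerequisites look fine.
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or higher is required")
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems

for problem in check_build_prereqs():
    print("missing prerequisite:", problem)
```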
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@

---
title: Python Package
weight: 3
---
## Installation
```
pip install lingodb
```
## Basic API Usage
```py
import lingodb
# create connection
con = lingodb.create_in_memory()
# prints result as Apache Arrow table
print(con.sql("select 42"))
# pyarrow.Table
# : int32
# ----
# : [[42]]

# create table and insert rows
con.sql_stmt("create table t (x bigint, y varchar(30) not null, z bigint not null, primary key (x))")
con.sql_stmt("insert into t(x, y, z) values (1,'foo',42), (2,'bar',7)")

# execute query on table
result = con.sql("select * from t where y='foo'")
# convert to pandas using pyarrow
df = result.to_pandas()
print(df)
#    x    y   z
# 0  1  foo  42

```

### Creating a Connection Object

A connection object can be created in two ways:
* `lingodb.create_in_memory()`: creates a connection to an empty in-memory database
* `lingodb.connect_to_db(DB_DIR)`: creates a connection to a database instance stored in the `DB_DIR` folder

### Executing SQL statements
SQL statements such as `CREATE TABLE`, `INSERT INTO`, `COPY FROM`, and `SET ...` can be executed using the `sql_stmt` method.
For example, if we want the database to be persistent (written back into the database directory), the following statement should be executed:

```py
con = lingodb.connect_to_db("test-dir")

con.sql_stmt("SET persist=1")
```

### Executing SQL queries
Read-only SQL queries can be executed using the `sql` method.
It returns an Arrow table that can be used arbitrarily and, e.g., converted to pandas.
```py
print(con.sql("select * from t").to_pandas())
```

### Executing MLIR modules
Similarly, raw MLIR modules can also be executed using the `mlir` method and produce an Arrow table.
```py

# select count(*) from t where x>2
con.mlir("""module {
    func.func @main() {
        %0 = relalg.basetable {table_identifier = "t"} columns: { x => @t::@x({type = i64})}
        %1 = relalg.selection %0 (%arg0 : !tuples.tuple){
            %4 = db.constant(2) : i64
            %5 = tuples.getcol %arg0 @t::@x : i64
            %6 = db.compare gt %5 : i64, %4 : i64
            tuples.return %6 : i1
        }
        %2 = relalg.aggregation %1 [] computes : [@aggr0::@tmp_attr0({type = i64})] (%arg0: !tuples.tuplestream, %arg1: !tuples.tuple){
            %4 = relalg.count %arg0
            tuples.return %4 : i64
        }
        %3 = relalg.materialize %2 [@aggr0::@tmp_attr0] => ["count"] : !subop.result_table<[count: i64]>
        subop.set_result 0 %3 : !subop.result_table<[count: i64]>
        return
    }
}
""")

```
## Querying PyArrow Tables/Pandas DataFrames
Apache Arrow tables can be imported using the `add_table` method. This also enables querying of pandas dataframes through pyarrow.

```py
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(data={'col1': [1, 2, 3, 4], 'col2': ["foo", "foo", "bar", "bar"]})

con = lingodb.create_in_memory()
# convert data frame to pyarrow table
arrow_table = pa.Table.from_pandas(df)
con.add_table("df", arrow_table)
# query just as any other table
print(con.sql("select * from df").to_pandas())
```
Furthermore, the rows of a pyarrow table can be added to an existing table (import) using the `append_table` method:
```py
con.append_table("df", arrow_table)
```
