lingo-db
diff --git a/docusaurus.config.js b/docusaurus.config.js
Lines changed: 5 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/Design/Storage.md b/versioned_docs/version-0.0.1/Design/Storage.md
Lines changed: 22 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/Design/_index.md b/versioned_docs/version-0.0.1/Design/_index.md
Lines changed: 7 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/ForDevelopers/Debugging.md b/versioned_docs/version-0.0.1/ForDevelopers/Debugging.md
Lines changed: 68 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/ForDevelopers/Dependencies.md b/versioned_docs/version-0.0.1/ForDevelopers/Dependencies.md
Lines changed: 6 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/ForDevelopers/PythonPackage.md b/versioned_docs/version-0.0.1/ForDevelopers/PythonPackage.md
Lines changed: 41 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/ForDevelopers/Settings.md b/versioned_docs/version-0.0.1/ForDevelopers/Settings.md
diff --git a/versioned_docs/version-0.0.1/GettingStarted/CommandLineTools.md b/versioned_docs/version-0.0.1/GettingStarted/CommandLineTools.md
Lines changed: 60 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/GettingStarted/Install.md b/versioned_docs/version-0.0.1/GettingStarted/Install.md
Lines changed: 29 additions & 0 deletions
diff --git a/versioned_docs/version-0.0.1/GettingStarted/Python.md b/versioned_docs/version-0.0.1/GettingStarted/Python.md
Lines changed: 101 additions & 0 deletions
@@ -75,6 +75,11 @@ const config = {
             position: 'left',
             label: 'Docs',
           },
+          {
+            type: 'docsVersionDropdown',
+            position: 'right',
+            dropdownActiveClassDisabled: true,
+          },
           //{to: '/blog', label: 'Blog', position: 'left'},
           {
             href: 'https://github.com/lingo-db/lingo-db',
 
@@ -0,0 +1,22 @@
+---
+title: Storage
+weight: 1
+---
+The research conducted with LingoDB does not focus on storage aspects of database systems.
+Thus, LingoDB does not come with an optimized storage backend and currently does not provide transactional semantics.
+
+## In-Memory Format: Apache Arrow
+The Apache Arrow columnar layout is used for the in-memory representation of tabular data.
+Thus, LingoDB can exchange data with existing libraries and frameworks without any overhead and can directly query Apache Arrow tables.
+
+## Persistent Storage
+For many practical purposes, persistent storage is required.
+We chose a pragmatic approach:
+
+1. Each database is represented by multiple files placed in one *database directory*
+2. In this directory, each table is represented by multiple files, each starting with the name of the table:
+    1. *name*`.metadata.json`: stores metadata relevant to LingoDB. This includes basic information like column names and internal column types, but also statistics and available indices
+    2. *name*`.arrow`: stores the contents of the table using Apache Arrow's IPC format
+    3. *name*`.arrow.sample`: optionally stores a sample of up to 1024 rows randomly selected from the table
+
+Given the database directory, LingoDB automatically detects the available tables and loads their metadata, data, and samples.
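+
+The naming convention above can be illustrated with a short Python sketch that discovers the tables of a database directory purely from the file names. The helper `list_tables` and the shape of the returned dictionary are hypothetical and only mirror the layout described here; this is not LingoDB's actual loader, and the files are not parsed.
+
+```python
+from pathlib import Path
+
+METADATA_SUFFIX = ".metadata.json"
+
+
+def list_tables(db_dir):
+    """Map each table name to the files found for it (hypothetical helper)."""
+    root = Path(db_dir)
+    tables = {}
+    # Every table is identified by its <name>.metadata.json file.
+    for meta in sorted(root.glob("*" + METADATA_SUFFIX)):
+        name = meta.name[: -len(METADATA_SUFFIX)]
+        data = root / (name + ".arrow")
+        sample = root / (name + ".arrow.sample")
+        tables[name] = {
+            "metadata": meta,
+            "data": data if data.exists() else None,
+            # The sample file is optional.
+            "sample": sample if sample.exists() else None,
+        }
+    return tables
+```
+
+Calling `list_tables("resources/data/uni/")` would then return one entry per table found in that directory, with `None` for any file that is missing.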
@@ -0,0 +1,7 @@
+---
+title: Design
+type: docs
+weight: 4
+---
+
+This section gives an overview of the overall design of LingoDB.
@@ -0,0 +1,68 @@
+---
+title: Debugging
+---
+
+Compared to interpreted execution engines, compiling engines come with many advantages but also some challenges.
+Debugging in particular can become a challenge, as one needs to debug not only the engine code but also the generated code.
+When debugging generated code, two main questions typically arise:
+
+1. Where exactly is the generated code wrong?
+2. Where does this wrong part come from?
+
+Possible solutions to these problems have been discussed before for debugging [Hyper](https://ieeexplore.ieee.org/document/8667737) and [Umbra](https://dl.acm.org/doi/abs/10.1145/3395032.3395321).
+
+## General Approach in LingoDB
+To solve these challenges in LingoDB, we use a combination of location tracking, snapshotting, and alternative execution engines.
+
+### Location Tracking in MLIR
+In MLIR, every operation is associated with a *Location* that must be provided during operation creation.
+While it is possible to provide an *Unknown Location*, this should be avoided.
+When parsing an MLIR file, MLIR automatically annotates the parsed operations with the corresponding file locations.
+When new operations are created during a pass, they are usually annotated with the location of the operation that is currently being transformed or lowered.
+**All passes in LingoDB ensure that correct locations are set afterwards.**
+
+### Snapshotting
+MLIR already comes with a `LocationSnapshotPass` that takes an operation (e.g., an MLIR module) and writes it to disk, including the annotated locations.
+This file is then read back in, and the operations are annotated with locations *according to their position inside this newly written file*.
+
+If enabled, LingoDB performs multiple location snapshots on multiple abstraction levels (in the current working directory):
+1. `input.mlir`: initial MLIR module that is e.g., produced from an SQL query
+2. `snapshot-0.mlir`: location snapshot after query optimization
+3. `snapshot-1.mlir`: location snapshot after lowering high-level operators to sub-operators
+4. `snapshot-2.mlir`: location snapshot after lowering sub-operators to imperative operations
+5. `snapshot-3.mlir`: location snapshot after lowering high-level imperative operations
+6. `snapshot-4.mlir`: final location snapshot of low-level IR (e.g., llvm dialect)
+
+Using these snapshot files, we can track the origin of any operation by repeating the following steps:
+1. get the origin location of the current operation by looking in the appropriate snapshot file
+2. find the origin operation by going to this location
+
+For example, if the debugger reports a problem (e.g., a segmentation fault) at `snapshot-4.mlir:1234`:
+* We first go to line `1234` of `snapshot-4.mlir` for the problematic operation and look at the corresponding location data (e.g., `snapshot-3.mlir:42`)
+* Next, we visit line `42` of `snapshot-3.mlir` to find the corresponding higher-level operation and look at the corresponding location data (e.g., `snapshot-2.mlir:13`)
+* Next, we visit line `13` of `snapshot-2.mlir` to find the corresponding higher-level operation and look at the corresponding location data (e.g., `snapshot-1.mlir:5`)
+* Finally, we visit line `5` of `snapshot-1.mlir` to find the 'problematic' sub-operator.
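+
+The manual walk through the snapshot files can also be sketched in Python. This is only an illustration: it assumes that each operation line carries an inline location of the form `loc("file":line:column)`, whereas real snapshots may print locations differently (e.g., via location aliases). The function `trace_origin` and the in-memory `files` mapping are hypothetical, and the snapshot contents in the usage example are made up.
+
+```python
+import re
+
+# Assumption: locations are printed inline as loc("file":line:column).
+LOC_PATTERN = re.compile(r'loc\("([^"]+)":(\d+):(\d+)\)')
+
+
+def trace_origin(files, file_name, line_no):
+    """Follow location annotations from one snapshot to the previous ones.
+
+    `files` maps a file name to its list of lines (a stand-in for reading
+    the snapshot files from disk). Returns the visited (file, line) pairs,
+    starting with the given one.
+    """
+    chain = [(file_name, line_no)]
+    while file_name in files:
+        lines = files[file_name]
+        if not 1 <= line_no <= len(lines):
+            break
+        match = LOC_PATTERN.search(lines[line_no - 1])
+        if match is None:
+            break
+        origin = (match.group(1), int(match.group(2)))
+        if origin in chain:  # guard against cyclic references
+            break
+        chain.append(origin)
+        file_name, line_no = origin
+    return chain
+
+
+# Made-up snapshot contents, one operation per file.
+snapshots = {
+    "snapshot-2.mlir": ['"op.b"() : () -> () loc("snapshot-1.mlir":1:1)'],
+    "snapshot-1.mlir": ['"op.a"() : () -> () loc("input.mlir":1:1)'],
+}
+print(trace_origin(snapshots, "snapshot-2.mlir", 1))
+# [('snapshot-2.mlir', 1), ('snapshot-1.mlir', 1), ('input.mlir', 1)]
+```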
+
+### Compiler Backends for Debugging
+In addition to location tracking and snapshotting, LingoDB implements two special compiler backends for debugging.
+
+#### LLVM-Debug Backend
+Instead of using the standard LLVM backend, another LLVM-based backend can be used that adds debug information and performs no optimizations.
+This backend is selected by setting the environment variable `LINGODB_EXECUTION_MODE=DEBUGGING`.
+During the execution, standard debuggers like `gdb` will then point to the corresponding operation in `snapshot-4.mlir`.
+This enables basic tracking of problematic operations, but advanced debugging will remain difficult.
+
+#### C++-Backend
+For more advanced debugging, a *C++-Backend* can be used by setting `LINGODB_EXECUTION_MODE=C`.
+This backend directly translates a fixed set of low-level generic MLIR operations to C++ statements and functions that are written to a file called `mlir-c-module.cpp`.
+Next, LingoDB automatically invokes `clang++` (must be installed!) with `-O0` and `-g` to compile this C++ file into a shared library with debug information.
+This shared library is then loaded with `dlopen` and the main function is called.
+Thus, the generated code can be debugged as any usual C++ program.
+To help with tracking an error to higher-level MLIR operations, each C++ statement is preceded by a comment containing the original operation and its location.
+
+#### When to choose which backend?
+In most cases, choosing the C++-Backend is the better option, as it makes debugging much more user-friendly.
+However, there are two cases when the LLVM-Debug backend should be used:
+1. The C++-Backend may fail if unsupported MLIR operations are used for which no translation to C++ code is defined.
+2. The behavior of the C++-Backend deviates from the previously expected behavior (e.g., in the case of a bug in the lowering to LLVM).
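+
+As a minimal sketch, the backend can be selected from a Python script by setting `LINGODB_EXECUTION_MODE` before launching one of the command line tools. The helper names and the labels `"llvm-debug"` and `"cpp"` are hypothetical; only the environment variable and its two values come from this page. The sketch also assumes that `run-sql` is available on the `PATH`.
+
+```python
+import os
+import subprocess
+
+# Hypothetical labels mapped to the execution modes named on this page.
+DEBUG_MODES = {"llvm-debug": "DEBUGGING", "cpp": "C"}
+
+
+def debug_environment(backend, base_env=None):
+    """Return a copy of the environment that selects a debugging backend."""
+    env = dict(os.environ if base_env is None else base_env)
+    env["LINGODB_EXECUTION_MODE"] = DEBUG_MODES[backend]
+    return env
+
+
+def run_sql_with_backend(sql_file, db_dir, backend):
+    """Run `run-sql` with the chosen backend (assumes it is on the PATH)."""
+    return subprocess.run(
+        ["run-sql", sql_file, db_dir],
+        env=debug_environment(backend),
+        check=False,
+    )
+```
+
+For example, `run_sql_with_backend("test.sql", "resources/data/uni/", "cpp")` would run the query with the C++-Backend.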
+
@@ -0,0 +1,6 @@
+* All "non-standard" dependencies are packaged as python programs
+* We are building LLVM
+
+* in `tools/mlir-package`:  
+    * `docker build -t mlir-package .`
+    *  `docker run -v ".:/built-packages" -v ".:/repo"  --rm -it mlir-package /usr/bin/create_package.sh cp312-cp312`
@@ -0,0 +1,41 @@
+---
+title: Python Package
+---
+
+Currently, LingoDB is distributed as two separate Python packages:
+* `lingodb-bridge`: bundles LingoDB as a binary and implements a basic integration using pybind11
+* `lingodb`: a Python-only library that wraps `lingodb-bridge` and provides a nice interface (and much more in the future)
+
+## Working on `lingodb`
+If you only plan to adapt/extend the Python implementation, you do not have to build the `lingodb-bridge` package yourself.
+First install the current version of the `lingodb-bridge` package.
+```sh
+pip install lingodb-bridge
+```
+Then, install the package in *development mode* so that you can just change the code (`tools/python/lingodb`) and directly test the changes:
+```sh
+cd tools/python
+python -m pip install -e .
+```
+For building a release package:
+```sh
+cd tools/python
+python -m build .
+```
+
+## Building `lingodb-bridge`
+Building a Python binary wheel is non-trivial but becomes easy with the docker image we prepared. Just execute the following command at the repository's root:
+```sh
+make build-py-bridge PYVERSION=[VERSION]
+```
+where `[VERSION]` is one of:
+* `310`: for Python 3.10
+* `311`: for Python 3.11
+* `312`: for Python 3.12
+
+This will then create a wheel in the current directory that can be installed, e.g.:
+```
+pip install lingodb_bridge-0.0.0-cp310-cp310-manylinux_2_28_x86_64.whl
+```
+
+
@@ -0,0 +1,60 @@
+---
+title: Command Line Tools
+type: docs
+weight: 2
+---
+LingoDB comes with a few command line tools to simplify experimentation, development and debugging.
+
+## Interactive SQL Shell
+```sh
+$ sql DBDIR
+sql> select 1;
+```
+
+Similar to other systems, LingoDB can also be used interactively using the `sql` binary, which is pointed to a (possibly empty) directory that holds the database to be queried. Each query must be terminated by a `;`. By default, only a *read-only* session is created. For persistent changes, enter `SET persist=1;`.
+
+## Converting SQL to MLIR
+```sh
+$ sql-to-mlir SQL-File DBDIR
+```
+Using the `sql-to-mlir` tool, SQL queries can be converted to a corresponding, unoptimized MLIR module. As this requires the database schema, the database directory must also be provided.
+
+## Performing Optimizations and Lowerings
+```sh
+$ mlir-db-opt [--use-db DBDIR] [Passes] MLIR-File
+```
+The `mlir-db-opt` command can be used to manually apply MLIR passes to an MLIR module provided by a file. For high-level optimizations that require, e.g., database statistics, the database directory should be provided using the `--use-db` argument.
+
+## Running MLIR Modules
+
+```sh
+$ run-mlir MLIR-File [DBDIR]
+```
+MLIR modules can be executed using the `run-mlir` binary. A database directory can be provided as the second argument.
+
+## Running SQL queries
+
+```sh
+$ run-sql SQL-File [DBDIR]
+```
+Single (read-only) SQL queries can be run with the `run-sql` utility. If the query requires a database, the corresponding database directory must be provided as the second argument.
+
+
+## The Trace of a Query
+With the following commands you can explore how a SQL query gets compiled layer by layer by looking at the different files:
+```sh
+# write example query to file
+$ echo "select * from studenten where name='Carnap'" > test.sql
+# translate sql to canonical MLIR module
+$ sql-to-mlir test.sql resources/data/uni/ > canonical.mlir
+# perform query optimization
+$ mlir-db-opt --use-db resources/data/uni/ --relalg-query-opt canonical.mlir > optimized.mlir
+# lower relational operators to sub-operators
+$ mlir-db-opt --lower-relalg-to-subop optimized.mlir > subop.mlir
+# lower sub-operators to imperative code
+$ mlir-db-opt --lower-subop subop.mlir > hl-imperative.mlir
+# lower database-specific scalar operations
+$ mlir-db-opt --lower-db hl-imperative.mlir > ml-imperative.mlir
+# lower mid-level abstraction (such as arrow tables) to low-level imperative code
+$ mlir-db-opt --lower-dsa ml-imperative.mlir > ll-imperative.mlir
+```
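+The same sequence of commands can be driven from a small Python script, sketched below. The helper names `lowering_steps` and `run_steps` are hypothetical; the commands, flags, and file names are the ones from the shell example above. The sketch assumes that the tools are on the `PATH` and that `test.sql` already exists.
+
+```python
+import subprocess
+
+
+def lowering_steps(db_dir):
+    """Return (command, output file) pairs mirroring the shell commands above."""
+    opt = "mlir-db-opt"
+    return [
+        (["sql-to-mlir", "test.sql", db_dir], "canonical.mlir"),
+        ([opt, "--use-db", db_dir, "--relalg-query-opt", "canonical.mlir"], "optimized.mlir"),
+        ([opt, "--lower-relalg-to-subop", "optimized.mlir"], "subop.mlir"),
+        ([opt, "--lower-subop", "subop.mlir"], "hl-imperative.mlir"),
+        ([opt, "--lower-db", "hl-imperative.mlir"], "ml-imperative.mlir"),
+        ([opt, "--lower-dsa", "ml-imperative.mlir"], "ll-imperative.mlir"),
+    ]
+
+
+def run_steps(steps):
+    """Run each command and redirect its standard output into the given file."""
+    for command, output in steps:
+        with open(output, "w") as out:
+            subprocess.run(command, stdout=out, check=True)
+```
+
+Calling `run_steps(lowering_steps("resources/data/uni/"))` would then produce the same intermediate files, each step reading the output of the previous one.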
@@ -0,0 +1,29 @@
+---
+title: Installation
+type: docs
+weight: 1
+---
+
+## Python Package
+Install via pip, then use as [documented here](./Python.md)
+```
+pip install lingodb
+```
+
+## Docker Image
+Either use the 
+* [prebuilt docker image](https://github.com/lingo-db/lingo-db/pkgs/container/lingo-db)
+* or build the docker image yourself using `make build-docker`
+
+The docker image then contains all the command line tools under `/build/lingodb/`
+
+## Building from source
+1. Ensure you have a machine with sufficient compute power and space
+1. Make sure that you have the following build dependencies installed:
+    1. Python 3.10 or higher
+    1. standard build tools, including `cmake` and `Ninja`
+1. Build LingoDB
+    * Debug Version : `make build-debug` (will create binaries under `build/lingodb-debug`)
+    * Release Version : `make build-release` (will create binaries under `build/lingodb-release`)
+1. Run the tests: `make run-test`
+
@@ -0,0 +1,101 @@
+---
+title: Python Package
+weight: 3
+---
+## Installation
+```
+pip install lingodb
+```
+## Basic API Usage
+```py
+import lingodb
+# create connection
+con = lingodb.create_in_memory()
+# prints result as Apache Arrow table
+print(con.sql("select 42"))
+# pyarrow.Table
+#: int32
+#----
+#: [[42]]
+
+# create table and insert rows
+con.sql_stmt("create table t (x bigint, y varchar(30) not null, z bigint not null, primary key (x))")
+con.sql_stmt("insert into t(x, y, z) values (1,'foo',42), (2,'bar',7)")
+
+# execute query on table
+result = con.sql("select * from t where y='foo'")
+# convert to pandas using pyarrow
+df = result.to_pandas()
+print(df)
+#   x    y   z
+# 0  1  foo  42
+
+```
+
+### Creating a Connection Object
+
+A connection object can be created in one of two ways:
+* `lingodb.create_in_memory()`: creates a connection to an empty in-memory database
+* `lingodb.connect_to_db(DB_DIR)`: creates a connection to a database instance stored in the `DB_DIR` folder
+
+### Executing SQL statements
+SQL statements such as `CREATE TABLE`, `INSERT INTO`, `COPY FROM` and `SET ...` can be executed using the `sql_stmt` method.
+For example, if we want the database to be persistent (written back into the database directory), the following statement should be executed:
+
+```py
+con = lingodb.connect_to_db("test-dir")
+
+con.sql_stmt("SET persist=1")
+```
+
+### Executing SQL queries
+Read-only SQL queries can be executed using the `sql` method.
+It returns an Apache Arrow table that can be processed further, e.g., converted to pandas.
+```py
+print(con.sql("select * from t").to_pandas())
+```
+
+### Executing MLIR modules
+Similarly, raw MLIR modules can be executed using the `mlir` method, which also produces an Arrow table.
+```py
+
+# select count(*) from t where x>2
+con.mlir("""module {
+  func.func @main() {
+    %0 = relalg.basetable {table_identifier = "t"} columns: { x => @t::@x({type = i64})}
+    %1 = relalg.selection %0 (%arg0 : !tuples.tuple){
+        %4 = db.constant(2) : i64
+        %5 = tuples.getcol %arg0 @t::@x : i64
+        %6 = db.compare gt %5 : i64, %4: i64
+        tuples.return %6 : i1
+    }
+    %2 = relalg.aggregation %1 [] computes : [@aggr0::@tmp_attr0({type = i64})] (%arg0: !tuples.tuplestream,%arg1: !tuples.tuple){
+      %4 = relalg.count %arg0
+      tuples.return %4 : i64
+    }
+    %3 = relalg.materialize %2 [@aggr0::@tmp_attr0] => ["count"] : !subop.result_table<[count: i64]>
+    subop.set_result 0 %3 :  !subop.result_table<[count: i64]>
+    return
+  }
+}
+""")
+
+```
+## Querying PyArrow Tables/Pandas DataFrames
+Apache Arrow tables can be imported using the `add_table` method. This also enables querying of pandas dataframes through pyarrow.
+
+```py
+import lingodb
+import pandas as pd
+import pyarrow as pa
+df = pd.DataFrame(data={'col1': [1, 2, 3, 4], 'col2': ["foo", "foo", "bar", "bar"]})
+
+con = lingodb.create_in_memory()
+#convert data frame to pyarrow table
+arrow_table=pa.Table.from_pandas(df)
+con.add_table("df",arrow_table)
+#query just as any other table
+print(con.sql("select * from df").to_pandas())
+```
+Furthermore, the rows of a pyarrow table can be added to an existing table (import) using the `append_table` method:
+```py
+con.append_table("df",arrow_table)
+```