Commit c55b470

Add papers
1 parent 7cfaa9d commit c55b470

10 files changed, +155 -0 lines changed

_science/2019-06-30-duckdb.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+---
+layout: post
+title: "DuckDB: An Embeddable Analytical Database"
+author: "Hannes Mühleisen, Mark Raasveldt"
+thumb: "/images/science/thumbs/sigmod-2019.svg"
+image: "/images/science/thumbs/sigmod-2019.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://hannes.muehleisen.org/publications/SIGMOD2019-demo-duckdb.pdf)
+
+Venue: SIGMOD 2019
+
+## Abstract
+
+The immense popularity of SQLite shows that there is a need for unobtrusive in-process data management solutions. However, there is no such system yet geared towards analytical workloads. We demonstrate DuckDB, a novel data management system designed to execute analytical SQL queries while embedded in another process. In our demonstration, we pit DuckDB against other data management solutions to showcase its performance in the embedded analytics scenario. DuckDB is available as Open Source software under a permissive license.
Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
+---
+layout: post
+title: "Data Management for Data Science Towards Embedded Analytics"
+author: "Hannes Mühleisen, Mark Raasveldt"
+thumb: "/images/science/thumbs/cidr-2020.svg"
+image: "/images/science/thumbs/cidr-2020.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://hannes.muehleisen.org/publications/CIDR2020-raasveldt-muehleisen-duckdb.pdf)
+
+Venue: CIDR 2020
+
+## Abstract
+
+The rise of Data Science has caused an influx of new users in need of data management solutions. However, instead of utilizing existing RDBMS solutions they are opting to use a stack of independent solutions for data storage and processing glued together by scripting languages. This is not because they do not need the functionality that an integrated RDBMS provides, but rather because existing RDBMS implementations do not cater to their use case. To solve these issues, we propose a new class of data management systems: embedded analytical systems. These systems are tightly integrated with analytical tools, and provide fast and efficient access to the data stored within them. In this work, we describe the unique challenges and opportunities w.r.t. workloads, resilience and cooperation that are faced by this new class of systems and the steps we have taken towards addressing them in the DuckDB system.

_science/2023-04-03-sorting.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
+---
+layout: post
+title: "These Rows Are Made for Sorting and That's Just What We'll Do"
+author: "Laurens Kuiper, Hannes Mühleisen"
+thumb: "/images/science/thumbs/icde-2023.svg"
+image: "/images/science/thumbs/icde-2023.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf)
+
+## Abstract
+
+Sorting is one of the most well-studied problems in computer science and a vital operation for relational database systems. Despite this, little research has been published on implementing an efficient relational sorting operator. In this work, we aim to fill this gap. We use micro-benchmarks to explore how to sort relational data efficiently for analytical database systems, taking into account different query execution engines as well as row and columnar data formats. We show that, regardless of architectural differences between query engines, sorting rows is almost always more efficient than sorting columnar data, even if this requires converting the data from columns to rows and back. Sorting rows efficiently is challenging for systems with an interpreted execution engine, as interpreting rows at runtime causes overhead. We show that this overhead can be overcome with several existing techniques. Based on our findings, we implement a highly optimized row-based sorting approach in the DuckDB open-source in-process analytical database management system, which has a vectorized interpreted query engine. We compare DuckDB with four analytical database systems and find that DuckDB’s sort implementation outperforms query engines that sort using a columnar data format, and matches or outperforms compiled query engines that sort using a row data format.

_science/2024-03-25-fp-compression.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
+---
+layout: post
+title: "How to Make your Duck Fly: Advanced Floating Point Compression to the Rescue"
+author: "Panagiotis Liakos, Katia Papakonstantinopoulou, Thijs Bruineman, Mark Raasveldt, Yannis Kotidis"
+thumb: "/images/science/thumbs/edbt-2024.svg"
+image: "/images/science/thumbs/edbt-2024.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://openproceedings.org/2024/conf/edbt/paper-248.pdf)
+
+## Abstract
+
+The massive volumes of data generated in diverse domains, such as scientific computing, finance and environmental monitoring, hinder our ability to perform multidimensional analysis at high speeds and also yield significant storage and egress costs. Applying compression algorithms to reduce these costs is particularly suitable for column-oriented DBMSs, as the values of individual columns are usually similar and thus, allow for effective compression. However, this has not been the case for binary floating point numbers, as the space savings achieved by respective compression algorithms are usually very modest. We present here two lossless compression algorithms for floating point data, termed Chimp and Patas, that attain impressive compression ratios and greatly outperform state-of-the-art approaches. We focus on how these two algorithms impact the performance of DuckDB, a purpose-built embeddable database for interactive analytics. Our demonstration will showcase how our novel compression approaches _a)_ reduce storage requirements, and _b)_ improve the time needed to load and query data using DuckDB.
+
+## Implementation
+
+The Chimp and Patas compression algorithms are both supported in DuckDB.

_science/2024-06-09-alp.md

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
+---
+layout: post
+title: "ALP: Adaptive Lossless floating-Point Compression"
+author: "Azim Afroozeh, Leonardo X. Kuffo, Peter A. Boncz"
+thumb: "/images/science/thumbs/sigmod-2024.svg"
+image: "/images/science/thumbs/sigmod-2024.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://ir.cwi.nl/pub/33334/33334.pdf)
+
+Venue: SIGMOD 2024
+
+## Abstract
+
+IEEE 754 doubles do not exactly represent most real values, introducing rounding errors in computations and [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (FOR), but recently new schemes were proposed: Gorilla, Chimp128, PseudoDecimals (PDE), Elf and Patas. However, their compression ratios are not better than those of general-purpose compressors such as Zstd; while [de]compression is much slower than Delta and FOR. We propose and evaluate ALP, that significantly improves these previous schemes in both speed and compression ratio. We created ALP after carefully studying the datasets used to evaluate the previous schemes. To obtain speed, ALP is designed to fit vectorized execution. This turned out to be key for also improving the compression ratio, as we found in-vector commonalities to create compression opportunities. ALP is an adaptive scheme that uses a strongly enhanced version of PseudoDecimals [31] to losslessly encode doubles as integers if they originated as decimals, and otherwise uses vectorized compression of the doubles’ front bits. Its high speeds stem from our implementation in scalar code that auto-vectorizes, using building blocks provided by our FastLanes library [6], and an efficient two-stage compression algorithm that first samples row-groups and then vectors.
+
+## Implementation
+
+The ALP algorithm is the default compression method for floating-point numbers in DuckDB.
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
+---
+layout: post
+title: "Runtime-Extensible Parsers"
+author: "Hannes Mühleisen, Mark Raasveldt"
+thumb: "/images/science/thumbs/cidr-2025.svg"
+image: "/images/science/thumbs/cidr-2025.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://vldb.org/cidrdb/papers/2025/p18-muhleisen.pdf)
+
+Venue: CIDR 2025
+
+## Abstract
+
+Despite their central role in processing queries, parsers have not received any noticeable attention in the data systems space. State-of-the-art systems are content with ancient parser generators. These generators create monolithic, inflexible and unforgiving parsers that hinder innovation in query languages and frustrate users. We argue that parsers should be rewritten using modern abstractions like Parsing Expression Grammars (PEG), which allow dynamic changes to the accepted query syntax and better error recovery. In this paper, we discuss how parsers could be re-designed using PEG, and validate our recommendations using experiments for both effectiveness and efficiency.
+
+## Implementation
+
+DuckDB's autocomplete is implemented using a PEG-based parser.
+There is ongoing work to replace DuckDB's current PostgreSQL-based parser with a PEG-based one.

images/science/thumbs/cidr-2020.png

23.1 KB

images/science/thumbs/cidr-2020.svg

Lines changed: 22 additions & 0 deletions

images/science/thumbs/sigmod-2024.png

25 KB

images/science/thumbs/sigmod-2024.svg

Lines changed: 22 additions & 0 deletions
