Add papers

szarnyasg · szarnyasg · commit c55b47088196 · 2025-09-03T15:57:04.000+02:00
diff --git a/_science/2019-06-30-duckdb.md b/_science/2019-06-30-duckdb.md
@@ -0,0 +1,17 @@
+---
+layout: post
+title: "DuckDB: An Embeddable Analytical Database"
+author: "Hannes Mühleisen, Mark Raasveldt"
+thumb: "/images/science/thumbs/sigmod-2019.svg"
+image: "/images/science/thumbs/sigmod-2019.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://hannes.muehleisen.org/publications/SIGMOD2019-demo-duckdb.pdf)
+
+Venue: SIGMOD 2019
+
+## Abstract
+
+The immense popularity of SQLite shows that there is a need for unobtrusive in-process data management solutions. However, there is no such system yet geared towards analytical workloads. We demonstrate DuckDB, a novel data management system designed to execute analytical SQL queries while embedded in another process. In our demonstration, we pit DuckDB against other data management solutions to showcase its performance in the embedded analytics scenario. DuckDB is available as Open Source software under a permissive license.
diff --git a/_science/2020-01-12-embedded-analytics.md b/_science/2020-01-12-embedded-analytics.md
@@ -0,0 +1,17 @@
+---
+layout: post
+title: "Data Management for Data Science - Towards Embedded Analytics"
+author: "Hannes Mühleisen, Mark Raasveldt"
+thumb: "/images/science/thumbs/cidr-2020.svg"
+image: "/images/science/thumbs/cidr-2020.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://hannes.muehleisen.org/publications/CIDR2020-raasveldt-muehleisen-duckdb.pdf)
+
+Venue: CIDR 2020
+
+## Abstract
+
+The rise of Data Science has caused an influx of new users in need of data management solutions. However, instead of utilizing existing RDBMS solutions they are opting to use a stack of independent solutions for data storage and processing glued together by scripting languages. This is not because they do not need the functionality that an integrated RDBMS provides, but rather because existing RDBMS implementations do not cater to their use case. To solve these issues, we propose a new class of data management systems: embedded analytical systems. These systems are tightly integrated with analytical tools, and provide fast and efficient access to the data stored within them. In this work, we describe the unique challenges and opportunities w.r.t. workloads, resilience and cooperation that are faced by this new class of systems and the steps we have taken towards addressing them in the DuckDB system.
diff --git a/_science/2023-04-03-sorting.md b/_science/2023-04-03-sorting.md
@@ -0,0 +1,15 @@
+---
+layout: post
+title: "These Rows Are Made for Sorting and That's Just What We'll Do"
+author: "Laurens Kuiper, Hannes Mühleisen"
+thumb: "/images/science/thumbs/icde-2023.svg"
+image: "/images/science/thumbs/icde-2023.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://duckdb.org/pdf/ICDE2023-kuiper-muehleisen-sorting.pdf)
+
+## Abstract
+
+Sorting is one of the most well-studied problems in computer science and a vital operation for relational database systems. Despite this, little research has been published on implementing an efficient relational sorting operator. In this work, we aim to fill this gap. We use micro-benchmarks to explore how to sort relational data efficiently for analytical database systems, taking into account different query execution engines as well as row and columnar data formats. We show that, regardless of architectural differences between query engines, sorting rows is almost always more efficient than sorting columnar data, even if this requires converting the data from columns to rows and back. Sorting rows efficiently is challenging for systems with an interpreted execution engine, as interpreting rows at runtime causes overhead. We show that this overhead can be overcome with several existing techniques. Based on our findings, we implement a highly optimized row-based sorting approach in the DuckDB open-source in-process analytical database management system, which has a vectorized interpreted query engine. We compare DuckDB with four analytical database systems and find that DuckDB’s sort implementation outperforms query engines that sort using a columnar data format, and matches or outperforms compiled query engines that sort using a row data format.
diff --git a/_science/2024-03-25-fp-compression.md b/_science/2024-03-25-fp-compression.md
@@ -0,0 +1,19 @@
+---
+layout: post
+title: "How to Make your Duck Fly: Advanced Floating Point Compression to the Rescue"
+author: "Panagiotis Liakos, Katia Papakonstantinopoulou, Thijs Bruineman, Mark Raasveldt, Yannis Kotidis"
+thumb: "/images/science/thumbs/edbt-2024.svg"
+image: "/images/science/thumbs/edbt-2024.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://openproceedings.org/2024/conf/edbt/paper-248.pdf)
+
+## Abstract
+
+The massive volumes of data generated in diverse domains, such as scientific computing, finance and environmental monitoring, hinder our ability to perform multidimensional analysis at high speeds and also yield significant storage and egress costs. Applying compression algorithms to reduce these costs is particularly suitable for column-oriented DBMSs, as the values of individual columns are usually similar and thus, allow for effective compression. However, this has not been the case for binary floating point numbers, as the space savings achieved by respective compression algorithms are usually very modest. We present here two lossless compression algorithms for floating point data, termed Chimp and Patas, that attain impressive compression ratios and greatly outperform state-of-the-art approaches. We focus on how these two algorithms impact the performance of DuckDB, a purpose-built embeddable database for interactive analytics. Our demonstration will showcase how our novel compression approaches _a)_ reduce storage requirements, and _b)_ improve the time needed to load and query data using DuckDB.
+
+## Implementation
+
+The Chimp and Patas compression algorithms are both supported in DuckDB.
diff --git a/_science/2024-06-09-alp.md b/_science/2024-06-09-alp.md
@@ -0,0 +1,21 @@
+---
+layout: post
+title: "ALP: Adaptive Lossless floating-Point Compression"
+author: "Azim Afroozeh, Leonardo X. Kuffo, Peter A. Boncz"
+thumb: "/images/science/thumbs/sigmod-2024.svg"
+image: "/images/science/thumbs/sigmod-2024.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://ir.cwi.nl/pub/33334/33334.pdf)
+
+Venue: SIGMOD 2024
+
+## Abstract
+
+IEEE 754 doubles do not exactly represent most real values, introducing rounding errors in computations and [de]serialization to text. These rounding errors inhibit the use of existing lightweight compression schemes such as Delta and Frame Of Reference (FOR), but recently new schemes were proposed: Gorilla, Chimp128, PseudoDecimals (PDE), Elf and Patas. However, their compression ratios are not better than those of general-purpose compressors such as Zstd; while [de]compression is much slower than Delta and FOR. We propose and evaluate ALP, that significantly improves these previous schemes in both speed and compression ratio. We created ALP after carefully studying the datasets used to evaluate the previous schemes. To obtain speed, ALP is designed to fit vectorized execution. This turned out to be key for also improving the compression ratio, as we found in-vector commonalities to create compression opportunities. ALP is an adaptive scheme that uses a strongly enhanced version of PseudoDecimals [31] to losslessly encode doubles as integers if they originated as decimals, and otherwise uses vectorized compression of the doubles’ front bits. Its high speeds stem from our implementation in scalar code that auto-vectorizes, using building blocks provided by our FastLanes library [6], and an efficient two-stage compression algorithm that first samples row-groups and then vectors.
+
+## Implementation
+
+The ALP algorithm is the default compression method for floating-point numbers in DuckDB.
diff --git a/_science/2025-01-19-runtime-extensible-parsers.md b/_science/2025-01-19-runtime-extensible-parsers.md
@@ -0,0 +1,22 @@
+---
+layout: post
+title: "Runtime-Extensible Parsers"
+author: "Hannes Mühleisen, Mark Raasveldt"
+thumb: "/images/science/thumbs/cidr-2025.svg"
+image: "/images/science/thumbs/cidr-2025.png"
+excerpt: ""
+tags: ["Paper"]
+---
+
+[Paper (PDF)](https://vldb.org/cidrdb/papers/2025/p18-muhleisen.pdf)
+
+Venue: CIDR 2025
+
+## Abstract
+
+Despite their central role in processing queries, parsers have not received any noticeable attention in the data systems space. State-of-the-art systems are content with ancient old parser generators. These generators create monolithic, inflexible and unforgiving parsers that hinder innovation in query languages and frustrate users. We argue that parsers should be rewritten using modern abstractions like Parsing Expression Grammars (PEG), which allow dynamic changes to the accepted query syntax and better error recovery. In this paper, we discuss how parsers could be re-designed using PEG, and validate our recommendations using experiments for both effectiveness and efficiency.
+
+## Implementation
+
+DuckDB's autocomplete is implemented using a PEG-based parser.
+There is ongoing work to replace DuckDB's current PostgreSQL-based parser with a PEG-based parser.
diff --git a/images/science/thumbs/cidr-2020.png b/images/science/thumbs/cidr-2020.png
diff --git a/images/science/thumbs/cidr-2020.svg b/images/science/thumbs/cidr-2020.svg
@@ -0,0 +1,22 @@
+<svg width="367" height="367" viewBox="0 0 367 367" fill="none" xmlns="http://www.w3.org/2000/svg">
+<g clip-path="url(#clip0_7501_33965)">
+<rect width="367" height="367" fill="#A899FF"/>
+<g clip-path="url(#clip1_7501_33965)">
+<rect width="367" height="367" fill="#A899FF"/>
+<path d="M73.4002 319.22L252.313 319.22C259.552 319.22 265.421 313.352 265.421 306.113L265.421 255.978C265.421 248.739 271.289 242.871 278.528 242.871L309.657 242.871C316.896 242.871 322.764 237.003 322.764 229.764L322.764 198.307C322.764 191.068 328.632 185.2 335.871 185.2L381.419 185.2C388.658 185.2 394.526 191.068 394.526 198.307L394.526 410.97C394.526 418.209 388.658 424.077 381.419 424.077L73.4002 424.077C66.1614 424.077 60.2931 418.209 60.2931 410.97L60.2931 332.327C60.2931 325.089 66.1613 319.22 73.4002 319.22Z" fill="#9787F2"/>
+<path d="M32.4401 282.786L-49.152 282.786C-56.3908 282.786 -62.2591 276.918 -62.2591 269.679L-62.2591 -42.5984C-62.2591 -49.8373 -56.3909 -55.7056 -49.152 -55.7056L277.871 -55.7056C285.11 -55.7056 290.978 -49.8374 290.978 -42.5985L290.978 26.8695C290.978 34.1084 285.11 39.9766 277.871 39.9766L181.37 39.9767C174.131 39.9767 168.263 45.845 168.263 53.0838L168.263 109.772C168.263 117.011 162.395 122.879 155.156 122.879L58.6544 122.879C51.4155 122.879 45.5473 128.748 45.5473 135.986L45.5473 269.679C45.5473 276.918 39.679 282.786 32.4401 282.786Z" fill="#9787F2"/>
+<path d="M-50.4624 -24.2488L-50.4624 64.552C-50.4624 71.7909 -44.5942 77.6592 -37.3553 77.6592L71.1065 77.6592C78.3453 77.6592 84.2136 71.7909 84.2136 64.552L84.2136 47.5128C84.2136 40.2739 90.0819 34.4057 97.3208 34.4057L131.072 34.4057C138.31 34.4057 144.179 28.5374 144.179 21.2985L144.179 -24.2487C144.179 -31.4876 138.31 -37.3559 131.072 -37.3559L-37.3553 -37.3559C-44.5942 -37.3559 -50.4624 -31.4876 -50.4624 -24.2488Z" fill="#A899FF"/>
+<path d="M322.927 104.857L322.927 -118.619C322.927 -125.858 328.796 -131.727 336.034 -131.727L381.254 -131.727C388.493 -131.727 394.361 -125.858 394.361 -118.619L394.361 104.857C394.361 112.096 388.493 117.964 381.254 117.964L336.034 117.964C328.796 117.964 322.927 112.096 322.927 104.857Z" fill="#9787F2"/>
+</g>
+<rect x="99" y="155" width="168.287" height="56.8021" fill="black"/>
+<path d="M142.472 186.157H145.239C144.853 189.148 142.371 191.265 138.85 191.265C134.456 191.265 131.953 188.274 131.953 183.553C131.953 178.894 134.639 175.984 138.932 175.984C142.391 175.984 144.751 177.958 145.239 181.03H142.472C141.923 178.975 140.539 178.303 138.912 178.303C136.694 178.303 134.964 180.135 134.964 183.553C134.964 187.073 136.694 188.945 138.81 188.945C140.56 188.945 141.984 188.172 142.472 186.157ZM149.996 176.248V191H147.086V176.248H149.996ZM162.213 183.614C162.213 179.199 159.588 178.527 156.902 178.527H155.559V188.721H156.902C159.588 188.721 162.213 188.07 162.213 183.614ZM152.65 176.248H157.024C161.704 176.248 165.224 177.754 165.224 183.614C165.224 189.474 161.704 191 157.024 191H152.65V176.248ZM176.206 191L173.276 185.221H172.585H170.184V191H167.274V176.248H172.117C176.98 176.248 178.831 177.632 178.831 180.725C178.831 182.434 178.017 183.818 176.166 184.611L179.441 191H176.206ZM172.442 178.527H170.184V182.963H172.585C174.863 182.963 175.84 182.21 175.84 180.745C175.84 179.036 174.416 178.527 172.442 178.527ZM190.436 175.984C193.386 175.984 195.278 177.368 195.278 179.952C195.278 182.698 192.593 185.608 188.055 188.64H195.624V191H184.779V188.721C189.52 184.916 192.45 182.434 192.45 180.094C192.45 178.792 191.657 178.059 190.273 178.059C189.032 178.059 187.77 178.812 187.77 181.01H185.044C184.983 177.978 187.18 175.984 190.436 175.984ZM202.85 191.265C198.902 191.265 196.908 188.07 196.908 183.594C196.908 179.117 198.902 175.984 202.85 175.984C206.756 175.984 208.771 179.117 208.771 183.594C208.771 188.07 206.756 191.265 202.85 191.265ZM202.85 189.148C204.864 189.148 205.902 187.154 205.902 183.594C205.902 180.033 204.864 178.1 202.85 178.1C200.835 178.1 199.777 180.033 199.777 183.594C199.777 187.154 200.835 189.148 202.85 189.148ZM215.711 175.984C218.661 175.984 220.553 177.368 220.553 179.952C220.553 182.698 217.867 185.608 213.33 188.64H220.899V191H210.054V188.721C214.795 184.916 217.725 182.434 217.725 180.094C217.725 178.792 216.931 
178.059 215.548 178.059C214.307 178.059 213.045 178.812 213.045 181.01H210.319C210.258 177.978 212.455 175.984 215.711 175.984ZM228.124 191.265C224.177 191.265 222.183 188.07 222.183 183.594C222.183 179.117 224.177 175.984 228.124 175.984C232.031 175.984 234.045 179.117 234.045 183.594C234.045 188.07 232.031 191.265 228.124 191.265ZM228.124 189.148C230.139 189.148 231.176 187.154 231.176 183.594C231.176 180.033 230.139 178.1 228.124 178.1C226.11 178.1 225.052 180.033 225.052 183.594C225.052 187.154 226.11 189.148 228.124 189.148Z" fill="#A899FF"/>
+</g>
+<defs>
+<clipPath id="clip0_7501_33965">
+<rect width="367" height="367" fill="white"/>
+</clipPath>
+<clipPath id="clip1_7501_33965">
+<rect width="367" height="367" fill="white"/>
+</clipPath>
+</defs>
+</svg>
diff --git a/images/science/thumbs/sigmod-2024.png b/images/science/thumbs/sigmod-2024.png
diff --git a/images/science/thumbs/sigmod-2024.svg b/images/science/thumbs/sigmod-2024.svg
@@ -0,0 +1,22 @@
+<svg width="367" height="367" viewBox="0 0 367 367" fill="none" xmlns="http://www.w3.org/2000/svg">
+<g clip-path="url(#clip0_7508_33969)">
+<rect width="367" height="367" fill="#A899FF"/>
+<g clip-path="url(#clip1_7508_33969)">
+<rect width="367" height="367" fill="#A899FF"/>
+<path d="M73.4002 319.22L252.313 319.22C259.552 319.22 265.421 313.352 265.421 306.113L265.421 255.978C265.421 248.739 271.289 242.871 278.528 242.871L309.657 242.871C316.896 242.871 322.764 237.003 322.764 229.764L322.764 198.307C322.764 191.068 328.632 185.2 335.871 185.2L381.419 185.2C388.658 185.2 394.526 191.068 394.526 198.307L394.526 410.97C394.526 418.209 388.658 424.077 381.419 424.077L73.4002 424.077C66.1614 424.077 60.2931 418.209 60.2931 410.97L60.2931 332.327C60.2931 325.089 66.1613 319.22 73.4002 319.22Z" fill="#9787F2"/>
+<path d="M32.4401 282.786L-49.152 282.786C-56.3908 282.786 -62.2591 276.918 -62.2591 269.679L-62.2591 -42.5984C-62.2591 -49.8373 -56.3909 -55.7056 -49.152 -55.7056L277.871 -55.7056C285.11 -55.7056 290.978 -49.8374 290.978 -42.5985L290.978 26.8695C290.978 34.1084 285.11 39.9766 277.871 39.9766L181.37 39.9767C174.131 39.9767 168.263 45.845 168.263 53.0838L168.263 109.772C168.263 117.011 162.395 122.879 155.156 122.879L58.6544 122.879C51.4155 122.879 45.5473 128.748 45.5473 135.986L45.5473 269.679C45.5473 276.918 39.679 282.786 32.4401 282.786Z" fill="#9787F2"/>
+<path d="M-50.4624 -24.2488L-50.4624 64.552C-50.4624 71.7909 -44.5942 77.6592 -37.3553 77.6592L71.1065 77.6592C78.3453 77.6592 84.2136 71.7909 84.2136 64.552L84.2136 47.5128C84.2136 40.2739 90.0819 34.4057 97.3208 34.4057L131.072 34.4057C138.31 34.4057 144.179 28.5374 144.179 21.2985L144.179 -24.2487C144.179 -31.4876 138.31 -37.3559 131.072 -37.3559L-37.3553 -37.3559C-44.5942 -37.3559 -50.4624 -31.4876 -50.4624 -24.2488Z" fill="#A899FF"/>
+<path d="M322.927 104.857L322.927 -118.619C322.927 -125.858 328.796 -131.727 336.034 -131.727L381.254 -131.727C388.493 -131.727 394.361 -125.858 394.361 -118.619L394.361 104.857C394.361 112.096 388.493 117.964 381.254 117.964L336.034 117.964C328.796 117.964 322.927 112.096 322.927 104.857Z" fill="#9787F2"/>
+</g>
+<rect x="99" y="155" width="168.287" height="56.8021" fill="black"/>
+<path d="M121.48 184.876L119.608 184.53C117.228 184.041 115.193 182.821 115.193 180.257C115.193 177.286 118.367 175.984 121.073 175.984C124.125 175.984 126.709 177.774 126.872 180.603H124.064C123.84 178.873 122.335 178.303 120.951 178.303C119.588 178.303 118.103 178.812 118.103 180.135C118.103 181.172 118.957 181.742 120.158 181.966L122.091 182.312C124.491 182.759 127.137 183.675 127.137 186.747C127.137 189.779 124.125 191.265 121.155 191.265C117.533 191.265 115.132 189.311 114.867 186.137H117.675C118.001 188.314 119.506 188.945 121.256 188.945C122.559 188.945 124.207 188.457 124.207 186.89C124.207 185.649 123.047 185.14 121.48 184.876ZM131.844 176.248V191H128.935V176.248H131.844ZM141.213 175.984C144.692 175.984 147.012 177.998 147.5 180.725H144.733C144.265 179.097 142.84 178.303 141.192 178.303C138.73 178.303 137.001 180.033 137.001 183.553C137.001 187.073 138.649 188.945 141.172 188.945C142.739 188.945 144.855 188.151 144.855 185.791V185.547H141.172V183.166H147.541V191H145.241L145.16 189.271C144.326 190.491 142.82 191.265 140.724 191.265C136.574 191.265 133.99 188.274 133.99 183.553C133.99 178.975 136.696 175.984 141.213 175.984ZM159.314 191H156.425L152.783 179.911V191H150.076V176.248H154.309L157.89 187.744L161.369 176.248H165.642V191H162.936V179.911L159.314 191ZM170.699 183.614C170.699 187.093 172.611 188.945 174.911 188.945C177.21 188.945 179.122 187.093 179.122 183.614C179.122 180.155 177.21 178.303 174.911 178.303C172.611 178.303 170.699 180.155 170.699 183.614ZM182.134 183.614C182.134 188.192 179.306 191.265 174.911 191.265C170.516 191.265 167.687 188.192 167.687 183.614C167.687 179.036 170.516 175.984 174.911 175.984C179.306 175.984 182.134 179.036 182.134 183.614ZM193.756 183.614C193.756 179.199 191.132 178.527 188.446 178.527H187.103V188.721H188.446C191.132 188.721 193.756 188.07 193.756 183.614ZM184.193 176.248H188.568C193.248 176.248 196.768 177.754 196.768 183.614C196.768 189.474 193.248 191 188.568 191H184.193V176.248ZM208.21 175.984C211.16 175.984 
213.052 177.368 213.052 179.952C213.052 182.698 210.366 185.608 205.829 188.64H213.398V191H202.553V188.721C207.294 184.916 210.224 182.434 210.224 180.094C210.224 178.792 209.431 178.059 208.047 178.059C206.806 178.059 205.544 178.812 205.544 181.01H202.818C202.757 177.978 204.954 175.984 208.21 175.984ZM220.623 191.265C216.676 191.265 214.682 188.07 214.682 183.594C214.682 179.117 216.676 175.984 220.623 175.984C224.53 175.984 226.544 179.117 226.544 183.594C226.544 188.07 224.53 191.265 220.623 191.265ZM220.623 189.148C222.638 189.148 223.675 187.154 223.675 183.594C223.675 180.033 222.638 178.1 220.623 178.1C218.609 178.1 217.551 180.033 217.551 183.594C217.551 187.154 218.609 189.148 220.623 189.148ZM233.484 175.984C236.435 175.984 238.327 177.368 238.327 179.952C238.327 182.698 235.641 185.608 231.104 188.64H238.673V191H227.828V188.721C232.569 184.916 235.499 182.434 235.499 180.094C235.499 178.792 234.705 178.059 233.322 178.059C232.081 178.059 230.819 178.812 230.819 181.01H228.093C228.031 177.978 230.229 175.984 233.484 175.984ZM246.193 179.809L241.981 185.079H246.193V179.809ZM251.503 185.079V187.46H248.98V191H246.193V187.46H238.908V185.262L246.233 176.248H248.98V185.079H251.503Z" fill="#A899FF"/>
+</g>
+<defs>
+<clipPath id="clip0_7508_33969">
+<rect width="367" height="367" fill="white"/>
+</clipPath>
+<clipPath id="clip1_7508_33969">
+<rect width="367" height="367" fill="white"/>
+</clipPath>
+</defs>
+</svg>