Commit db9379d

Announce SIGMOD 2024 Best Artifact Award in README (#39)
1 parent 3d5437d commit db9379d

7 files changed, with 85 additions and 85 deletions

README.md

Lines changed: 85 additions & 85 deletions
@@ -1,119 +1,119 @@
11
# ALP: Adaptive Lossless Floating-Point Compression
22

3-
Lossless floating-point compression algorithm for `double`/`float` data type. ALP significantly improves over all
4-
previous floating-point encodings in both speed and compression ratio (figure below; each dot represents a dataset).
3+
**Authors**: Azim Afroozeh, Leonardo Kuffó, Peter Boncz
4+
**Conference**: ACM SIGMOD 2024
55

6-
<p align="center">
7-
<img src="/publication/alp_results.png" alt="ALP Benchmarks" height="350">
8-
</p>
6+
---
7+
8+
## <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" alt="GitHub" width="32" style="vertical-align:middle;"> What is this repo?
9+
10+
This repository contains the source code and benchmarks for the paper [_ALP: Adaptive Lossless Floating-Point Compression_](https://dl.acm.org/doi/abs/10.1145/3626717), published at ACM SIGMOD 2024.
11+
12+
**ALP** is a state-of-the-art lossless compression algorithm designed for IEEE 754 floating-point data. It encodes data by exploiting two common patterns found in real-world floating-point values:
13+
14+
- **Decimal Floating-Point Numbers**:
15+
A large portion of floats/doubles in real-world datasets are decimals. ALP maps these values into integers by multiplying the number by a power of 10 and then compressing the result using a FastLanes variant of Frame-of-Reference encoding[^1], which is SIMD-friendly.
16+
_Example_: the number `10.12` becomes `1012` and is then fed to the FastLanes encoder.
17+
18+
- **High-Precision Floating-Point Numbers**:
19+
The remaining values are typically high-precision floats/doubles. ALP targets compression opportunities in only the left part of these values, which it compresses using FastLanes dictionary encoding. The right part is left uncompressed, as it is required to preserve high precision and is often highly random and incompressible.
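
The decimal scheme described above can be sketched in a few lines of Python. This is only an illustration, not the repository's C++ implementation: the function names, the fixed `exponent` argument, and the exception handling are assumptions made for the example, and the bit-packing step that FastLanes performs is reduced here to computing the required bit width.

```python
import math
import struct


def _same_bits(a, b):
    # Bitwise comparison so that -0.0 vs 0.0 counts as a mismatch.
    return struct.pack("<d", a) == struct.pack("<d", b)


def encode_decimals(values, exponent):
    """Map doubles to integers via value * 10**exponent (hypothetical helper).

    Values that do not round-trip exactly are returned as exceptions
    (position, original value) so the scheme stays lossless.
    """
    factor = 10 ** exponent
    ints, exceptions = [], []
    for pos, v in enumerate(values):
        scaled = v * factor if math.isfinite(v) else v
        if math.isfinite(scaled):
            d = round(scaled)
            if _same_bits(d / factor, v):
                ints.append(d)
                continue
        ints.append(None)  # placeholder, filled in below
        exceptions.append((pos, v))
    # Fill exception slots with an existing value so they do not widen the range.
    filler = next((i for i in ints if i is not None), 0)
    return [filler if i is None else i for i in ints], exceptions


def frame_of_reference(ints):
    """Subtract the minimum and report the bits needed per value (non-empty input)."""
    base = min(ints)
    deltas = [i - base for i in ints]
    return base, deltas, max(deltas).bit_length()


def decode_decimals(base, deltas, exponent, exceptions):
    factor = 10 ** exponent
    out = [(base + d) / factor for d in deltas]
    for pos, v in exceptions:
        out[pos] = v  # patch the values that could not be encoded
    return out


values = [10.12, 10.5, 9.99, 0.1 + 0.2]
ints, exceptions = encode_decimals(values, 2)
base, deltas, width = frame_of_reference(ints)
assert decode_decimals(base, deltas, 2, exceptions) == values
```

In this example `10.12` becomes `1012`, the three decimal values fit in 6 bits each after subtracting the base, and `0.1 + 0.2` is stored as an exception because it is not an exact two-digit decimal.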
20+
21+
---
22+
23+
## 📊 How does ALP perform?
924

10-
-**High Speed**: Scans 44x faster than Gorilla, 64x faster than Chimp, 31x faster than Zstd. Compresses 11x faster
11-
than Zstd, 138x faster than PDE and x10 faster than Chimp.
12-
-**High Compression**: 50% more compression than Gorillas. 24% more than Chimp128. On par with Zstd level 3.
13-
-**Adaps to data**: By using a two-stage algorithm that first samples row-groups and then vectors.
14-
-**Scalar code**: Auto-vectorizes thanks to [FastLanes](https://github.com/cwida/FastLanes).
15-
-**Lightweight Encoding**: Compression and decompression occurs in blocks of 1024 values. Ideal for columnar
16-
databases.
17-
-**Proven Effectiveness**: Effectiveness and speed led to deprecating Chimp128 and Patas in DuckDB.
18-
-**Works on difficult floats**: Can losslessly compress even floats present as ML models parameters better than Zstd
19-
and all other encodings.
25+
![ALP Results](alp_results.png)
2026

21-
To *rigorously* benchmark ALP with your own data we provide our [ALP primitives](#alp-primitives) as a single C++ header
22-
file.
27+
These results highlight ALP’s **superior** performance across all three key metrics of a compression algorithm:
28+
**Decoding Speed**, **Compression Ratio**, and **Compression Speed**—outperforming other schemes in every category.
2329

24-
ALP details can be found in the [publication](https://dl.acm.org/doi/pdf/10.1145/3626717).
30+
---
2531

26-
## Availability & Reproducibility Initiative (ARI) Report
32+
## 🧪 How to Reproduce Results
2733

28-
In [this report](availability_reproducibility_initiative_report.md), we explain how to replicate the experiments and
29-
benchmarks according to the format requested
30-
in [SIGMOD ARΙ Package Requirements and Guidelines](https://reproducibility.sigmod.org/2024/).
34+
Just run the following script:
3135

32-
On the benchmarked datasets from our publication:
36+
```bash
37+
./publication/script/master_script.sh
38+
```
3339

34-
- ALP achieves on average **x3 compression ratios** (sometimes much, much higher).
35-
- ALP encodes on average 0.5 doubles per CPU cycle.
36-
- ALP decodes on average 2.6 doubles per CPU cycle.
40+
For more information on reproducing our benchmarks, refer to our guide [here](availability_reproducibility_initiative_report.md),
41+
or read the official ACM reproducibility report:
42+
[https://dl.acm.org/doi/10.1145/3687998.3717057](https://dl.acm.org/doi/10.1145/3687998.3717057)
3743

38-
### Used By
3944

40-
<table>
41-
<tr>
42-
<td>
43-
<p align="left">
44-
<img src="https://raw.githubusercontent.com/duckdb/duckdb/main/logo/DuckDB_Logo-horizontal.png" alt="DuckDB" height="50">
45-
</p>
46-
</td>
47-
<td>
48-
<p align="left">
49-
<a href="https://github.com/cwida/FastLanes">FastLanes</a>
50-
</p>
51-
</td>
52-
</tr>
53-
</table>
45+
---
46+
47+
### 🏅 ACM Artifacts & Awards
48+
49+
We are happy to share that we participated in the [SIGMOD Availability & Reproducibility Initiative](https://reproducibility.sigmod.org/), and our paper earned **all three badges**:
50+
51+
<p align="center">
52+
<img src="assets/artifacts_available_v1_1.png" alt="ACM Artifacts Available" height="100"/>
53+
<img src="assets/artifacts_evaluated_reusable_v1_1.png" alt="ACM Artifacts Evaluated" height="100"/>
54+
<img src="assets/results_reproduced_v1_1.png" alt="ACM Results Reproduced" height="100"/>
55+
</p>
56+
57+
🎉 We're also proud to share that **ALP won the [SIGMOD Best Artifact Award](https://sigmod.org/sigmod-awards/sigmod-best-artifact-award/)!**
58+
59+
<p align="center">
60+
<img src="assets/trophy.png" alt="Trophy" height="100"/>
61+
</p>
5462

55-
### Contents
63+
---
5664

57-
- [ALP in a Nutshell](#alp-in-a-nutshell)
58-
- [Quickstart](#quickstart)
59-
- [Building and Running](#building-and-running)
60-
- [ALP Primitives](#alp-primitives)
61-
- [ALP in DuckDB](#alp-in-duckdb)
62-
- [Benchmarking (Replicating Paper Experiments)](#benchmarking-replicating-paper-experiments)
65+
## ⏱️ Want to Benchmark Your Dataset?
6366

64-
## ALP in a Nutshell
67+
Check out our guide: [How to Benchmark Your Dataset](how_to_benchmark_your_dataset.md)
68+
It explains how to run ALP on your own data.
6569

66-
ALP has two compression schemes: `ALP` for doubles/floats which were once decimals, and `ALP_RD` for true
67-
double/floats (e.g. the ones which stem from many calculations, scientific data, ML weights).
70+
---
6871

69-
`ALP` losslessly transforms doubles/floats to integer values with two multiplications to FOR+BitPack them into only the
70-
necessary bits. This is a strongly enhanced version of [PseudoDecimals](https://dl.acm.org/doi/abs/10.1145/3589263).
72+
## 🗂️ Repository Structure
7173

72-
`ALP_RD` splits the doubles/floats bitwise representations into two parts (left and right). The left part is encoded
73-
with a Dictionary compression and the right part is Bitpacked to just the necessary bits.
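
The left/right split that `ALP_RD` relies on can be illustrated with a short Python sketch. The helper names and the fixed cut position passed as `right_bits` are assumptions for the example; they are not the library's API, and the sketch says nothing about how ALP itself picks the cut.

```python
import struct


def split_double(value, right_bits):
    """Split the 64-bit pattern of a double into (left, right) parts."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    left = bits >> right_bits
    right = bits & ((1 << right_bits) - 1)
    return left, right


def join_double(left, right, right_bits):
    """Inverse of split_double: reassemble the original double."""
    bits = (left << right_bits) | right
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def dictionary_encode(lefts):
    """Replace each left part with a small code into a dictionary."""
    index = {}
    codes = [index.setdefault(left, len(index)) for left in lefts]
    return list(index), codes  # dict preserves insertion order


# Cutting at 52 bits leaves sign + exponent on the left (illustrative choice).
values = [0.7853981633974483, 0.6795704571147613, 1.4142135623730951]
parts = [split_double(v, 52) for v in values]
dictionary, codes = dictionary_encode([left for left, _ in parts])
restored = [join_double(dictionary[c], right, 52)
            for c, (_, right) in zip(codes, parts)]
assert restored == values
```

Here the three values share only two distinct left parts, so each left part shrinks to a one-bit code, while the right parts are kept verbatim.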
74+
- `src/`: Core implementation of ALP and ALP_RD
75+
- `benchmarks/`: Benchmarking tools and datasets
76+
- `include/`: Header files for integration
77+
- `scripts/`: Utility scripts for data processing
78+
- `test/`: Unit tests
79+
- `publication/`: Publications and supplementary materials
7480

75-
Both encodings operate in vectors of 1024 values at a time (fit *vectorized execution*) and leverage in-vector
76-
commonalities to achieve higher compression ratios and be faster (by avoiding per-value adaptivity) than other methods.
81+
---
7782

78-
Both encodings encode outliers as *exceptions* to achieve higher compression ratios.
83+
## 📚 Publications
7984

80-
## Building and Running
85+
- **Conference Paper**:
86+
_ALP: Adaptive Lossless Floating-Point Compression_, ACM SIGMOD 2024
87+
[https://dl.acm.org/doi/10.1145/3626717](https://dl.acm.org/doi/10.1145/3626717)
8188

82-
Requirements:
89+
- **Reproducibility Report**:
90+
_Reproducibility Report for ACM SIGMOD 2024 Paper: 'ALP: Adaptive Lossless Floating-Point Compression'_
91+
[https://dl.acm.org/doi/10.1145/3687998.3717057](https://dl.acm.org/doi/10.1145/3687998.3717057)
8392

84-
1) __Clang++__
85-
2) __CMake__ 3.20 or higher
93+
---
8694

87-
## ALP Primitives
95+
## 📄 License
8896

89-
You can make your own [de]compression API by using ALP primitives. An example of the usage of these can be found in our
90-
simple [compression](/include/alp/compressor.hpp) and [decompression](/include/alp/decompressor.hpp) API. The decoding
91-
primitives of ALP are auto-vectorized thanks to [FastLanes](https://github.com/cwida/FastLanes). For **benchmarking**
92-
purposes, we recommend you use these primitives.
97+
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
9398

94-
You can use these by including our library in your code: `#include "alp.hpp"`.
99+
---
95100

96-
Check the full documentation of these on the [PRIMITIVES.MD](/PRIMITIVES.md) readme.
101+
## 📬 Contact
97102

98-
## ALP in DuckDB
103+
If you have questions, want to contribute, or just want to stay up to date with ALP and related projects, join our community on Discord:
104+
[![Join us on Discord](https://img.shields.io/badge/Join%20Us%20on%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/2ngmRaRW) [![Community Status](https://img.shields.io/discord/1282716959099588651?label=Members%20Online&logo=discord&logoColor=white&color=5865F2&style=for-the-badge)](https://discord.gg/2ngmRaRW)
99105

100-
ALP replaced Chimp128 and Patas in [DuckDB](https://github.com/duckdb/duckdb/pull/9635). In DuckDB, ALP is **x2-4 times
101-
faster** than Patas (at decompression) achieving **twice as high compression ratios** (sometimes even much more). DuckDB
102-
can be used to quickly test ALP on custom data, however, we advise against doing so if your purpose is to rigorously
103-
benchmark ALP against other algorithms.
106+
---
104107

105-
[Here](https://github.com/duckdb/duckdb/blob/main/benchmark/micro/compression/alp/alp_read.benchmark) you can find a
106-
basic example on how to load data in DuckDB forcing ALP to be used as compression method. These statements can be called
107-
using the Python API.
108+
## 🧩 Used By
108109

109-
**Please note**: ALP inside DuckDB: i) Is slower than using our primitives presented here, and ii) compression ratios
110-
can be slightly worse due to the metadata needed to skip vectors and DuckDB storage layout.
110+
ALP has been integrated into the following systems:
111111

112-
## FCBench
112+
- [**DuckDB**](https://duckdb.org/2024/02/13/announcing-duckdb-0100.html)
113+
- [**FastLanes**](https://github.com/cwida/FastLanes)
114+
- [**KuzuDB**](https://github.com/kuzudb/kuzu/pull/3994)
115+
- [**liquid-cache**](https://github.com/XiangpengHao/liquid-cache/pull/133)
113116

114-
We have benchmarked ALP compression ratios on the datasets presented
115-
on [FCBench](https://www.vldb.org/pvldb/vol17/p1418-tao.pdf). ALP comes on top with an average **compression ratio of
116-
2.08** compared to the best compressor in the benchmark (Bitshuffle + Zstd with 1.47). ALP is superior even despite the
117-
benchmark doing horizontal compression instead of columnar compression (i.e. values from multiple columns in a table are
118-
compressed together).
117+
---
119118

119+
[^1]: Learn more about FastLanes here: [https://github.com/cwida/fastlanes](https://github.com/cwida/fastlanes)

alp_results.png

173 KB
24.5 KB
26.8 KB

assets/results_reproduced_v1_1.png

60.6 KB

assets/trophy.png

988 KB

publication/script/install_dependencies.sh

File mode changed: 100644 → 100755
