-> Check out the [Docs](https://spiraldb.github.io/vortex/docs/) or jump straight into the [Getting Started Guide](https://spiraldb.github.io/vortex/docs/quickstart.html)
+> Check out the [Docs](https://docs.vortex.dev/)

-Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache Arrow arrays
+Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache
+Arrow arrays
 in-memory, on-disk, and over-the-wire.

-Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster),
+Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and
+scans (2-10x faster),
 while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
-It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.
+It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device
+decompression on GPUs.

 Vortex is intended to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
 extremely fast, & batteries-included.

 > [!CAUTION]
 > This library is still under rapid development and is a work in progress!
 >
-> Some key features are not yet implemented, both the API and the serialized format are likely to change in breaking ways,
+> Some key features are not yet implemented, both the API and the serialized format are likely to change in breaking
+> ways,
 > and we cannot yet guarantee correctness in all cases.

 The major features of Vortex are:

 * **Logical Types** - a schema definition that makes no assertions about physical layout.
-* **Zero-Copy to Arrow** - "canonicalized" (i.e., fully decompressed) Vortex arrays can be zero-copy converted to/from Apache Arrow arrays.
-* **Extensible Encodings** - a pluggable set of physical layouts. In addition to the builtin set of Arrow-compatible encodings,
-the Vortex repository includes a number of state-of-the-art encodings (e.g., FastLanes, ALP, FSST, etc.) that are implemented
+* **Zero-Copy to Arrow** - "canonicalized" (i.e., fully decompressed) Vortex arrays can be zero-copy converted to/from
+Apache Arrow arrays.
+* **Extensible Encodings** - a pluggable set of physical layouts. In addition to the builtin set of Arrow-compatible
+encodings,
+the Vortex repository includes a number of state-of-the-art encodings (e.g., FastLanes, ALP, FSST, etc.) that are
+implemented
 as extensions. While arbitrary encodings can be implemented as extensions, we have intentionally chosen a small set
-of encodings that are highly data-parallel, which in turn allows for efficient vectorized decoding, random access reads,
+of encodings that are highly data-parallel, which in turn allows for efficient vectorized decoding, random access
+reads,
 and (in the future) decompression on GPUs.
 * **Cascading Compression** - data can be recursively compressed with multiple nested encodings.
-* **Pluggable Compression Strategies** - the built-in Compressor is based on BtrBlocks, but other strategies can trivially be used instead.
+* **Pluggable Compression Strategies** - the built-in Compressor is based on BtrBlocks, but other strategies can
+trivially be used instead.
 * **Compute** - basic compute kernels that can operate over encoded data (e.g., for filter pushdown).
 * **Statistics** - each array carries around lazily computed summary statistics, optionally populated at read-time.
 These are available to compute kernels as well as to the compressor.
 * **Serialization** - Zero-copy serialization of arrays, both for IPC and for file formats.
-* **Columnar File Format (in progress)** - A modern file format that uses the Vortex serde library to store compressed array data.
+* **Columnar File Format (in progress)** - A modern file format that uses the Vortex serde library to store compressed
+array data.
 Optimized for random access reads and extremely fast scans; an aspiring successor to Apache Parquet.

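To illustrate the "Cascading Compression" item in the feature list above, here is a minimal Python sketch of nesting one encoding inside another. The class names (`Plain`, `FrameOfReference`, `RunLength`) and the `cascade` helper are hypothetical and are not part of the Vortex API; the sketch only shows that the children of an encoded array can themselves be encoded.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Plain:
    """Uncompressed leaf encoding (hypothetical, not a Vortex type)."""
    values: List[int]

    def decode(self) -> List[int]:
        return list(self.values)


@dataclass
class FrameOfReference:
    """Stores each value as an offset from a shared base."""
    base: int
    deltas: Any  # any object with a decode() method

    def decode(self) -> List[int]:
        return [self.base + d for d in self.deltas.decode()]


@dataclass
class RunLength:
    """Stores runs as values plus lengths; both children may be encoded."""
    values: Any
    lengths: Any

    def decode(self) -> List[int]:
        out: List[int] = []
        for value, length in zip(self.values.decode(), self.lengths.decode()):
            out.extend([value] * length)
        return out


def cascade(data: List[int]) -> RunLength:
    """Run-length encode `data`, then frame-of-reference encode the run values."""
    run_values: List[int] = []
    run_lengths: List[int] = []
    for item in data:
        if run_values and run_values[-1] == item:
            run_lengths[-1] += 1
        else:
            run_values.append(item)
            run_lengths.append(1)
    base = min(run_values) if run_values else 0
    nested = FrameOfReference(base, Plain([v - base for v in run_values]))
    return RunLength(values=nested, lengths=Plain(run_lengths))


data = [1000, 1000, 1000, 1002, 1002, 1001]
encoded = cascade(data)
assert encoded.decode() == data  # the round trip is lossless
```

Here the outer run-length layer removes repetition and the inner frame-of-reference layer shrinks the remaining values to small offsets. A real compressor would choose which layers to apply based on the data rather than hard-coding them.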
 ## Overview: Logical vs Physical

 One of the core design principles in Vortex is strict separation of logical and physical concerns.

-For example, a Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
+For example, a Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical
+encoding
 (the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.

 The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct
 Vortex arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and
 `chunked`) that are useful building blocks for other encodings. The included extension encodings are mostly designed
 to model compressed in-memory arrays, such as run-length or dictionary encoding.

-Analogously, `vortex-serde` is designed to handle the low-level physical details of reading and writing Vortex arrays. Choices
+Analogously, `vortex-serde` is designed to handle the low-level physical details of reading and writing Vortex arrays.
+Choices
 about which encodings to use or how to logically chunk data are left up to the `Compressor` implementation.

-One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the
+One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data
+within the
 file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
 the file format specification.

 For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
-row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
-to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
+row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it
+can choose
+to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is
+constant
 across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).

-In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
+In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly
+into the files
 themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.

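The per-column chunking choice described in the Overview section can be sketched as follows. This is a minimal illustration under assumed names: `chunk_column`, the `(kind, values, row_count)` tuple shape, and the byte-size estimate are all hypothetical and do not reflect the actual Vortex `Compressor`.

```python
from typing import Any, List, Tuple

# (kind, stored values, number of rows covered); a hypothetical shape
Chunk = Tuple[str, List[Any], int]


def chunk_column(values: List[Any], max_chunk_bytes: int) -> List[Chunk]:
    """Chunk one column independently of every other column."""
    # A column that is constant across all rows collapses to a single chunk.
    if values and all(v == values[0] for v in values):
        return [("constant", [values[0]], len(values))]

    chunks: List[Chunk] = []
    current: List[Any] = []
    size = 0
    for v in values:
        # Rough size estimate: UTF-8 length for strings, 8 bytes otherwise.
        v_size = len(v.encode("utf-8")) if isinstance(v, str) else 8
        if current and size + v_size > max_chunk_bytes:
            chunks.append(("plain", current, len(current)))
            current, size = [], 0
        current.append(v)
        size += v_size
    if current:
        chunks.append(("plain", current, len(current)))
    return chunks


constant_chunks = chunk_column([7] * 1000, max_chunk_bytes=64)
string_chunks = chunk_column(["alice", "bob", "carol", "dave"], max_chunk_bytes=8)
assert len(constant_chunks) == 1
assert sum(rows for _, _, rows in string_chunks) == 4
```

Because each column is chunked on its own, the chunk boundaries of different columns need not line up, which is the contrast with the Parquet-like layout of aligned row groups.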
 ## Components
@@ -239,7 +255,8 @@ Licensed under the Apache License, Version 2.0 (the "License").
 ## Governance

 Vortex is and will remain an open-source project. Our intent is to model its governance structure after the
-[Substrait project](https://substrait.io/governance/), which in turn is based on the model of the Apache Software Foundation.
+[Substrait project](https://substrait.io/governance/), which in turn is based on the model of the Apache Software
+Foundation.
 Expect more details on this in Q4 2024.

 ## Acknowledgments 🏆
@@ -252,7 +269,8 @@ In particular, the following academic papers have strongly influenced developmen
 * Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
 [BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
 Proc. ACM Manag. Data 1, 2, Article 118 (June 2023), 14 pages.
-* Azim Afroozeh and Peter Boncz. [The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar
+* Azim Afroozeh and Peter
+Boncz. [The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar