Commit 07b37e4

Add bibtex to docs (#2094)
And update the landing page
1 parent 22fb4d0 commit 07b37e4

8 files changed: +368 -47 lines changed

README.md

Lines changed: 41 additions & 21 deletions
@@ -6,67 +6,83 @@
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/vortex-array)](https://pypi.org/project/vortex-array/)
 
 > [!TIP]
-> Check out the [Docs](https://spiraldb.github.io/vortex/docs/) or jump straight into the [Getting Started Guide](https://spiraldb.github.io/vortex/docs/quickstart.html)
+> Check out the [Docs](https://docs.vortex.dev/)
 
-Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache Arrow arrays
+Vortex is an extensible, state-of-the-art columnar file format, with associated tools for working with compressed Apache
+Arrow arrays
 in-memory, on-disk, and over-the-wire.
 
-Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and scans (2-10x faster),
+Vortex is an aspiring successor to Apache Parquet, with dramatically faster random access reads (100-200x faster) and
+scans (2-10x faster),
 while preserving approximately the same compression ratio and write throughput as Parquet with zstd.
-It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device decompression on GPUs.
+It is designed to support very wide tables (at least 10s of thousands of columns) and (eventually) on-device
+decompression on GPUs.
 
 Vortex is intended to be to columnar file formats what Apache DataFusion is to query engines: highly extensible,
 extremely fast, & batteries-included.
 
 > [!CAUTION]
 > This library is still under rapid development and is a work in progress!
 >
-> Some key features are not yet implemented, both the API and the serialized format are likely to change in breaking ways,
+> Some key features are not yet implemented, both the API and the serialized format are likely to change in breaking
+> ways,
 > and we cannot yet guarantee correctness in all cases.
 
 The major features of Vortex are:
 
 * **Logical Types** - a schema definition that makes no assertions about physical layout.
-* **Zero-Copy to Arrow** - "canonicalized" (i.e., fully decompressed) Vortex arrays can be zero-copy converted to/from Apache Arrow arrays.
-* **Extensible Encodings** - a pluggable set of physical layouts. In addition to the builtin set of Arrow-compatible encodings,
-the Vortex repository includes a number of state-of-the-art encodings (e.g., FastLanes, ALP, FSST, etc.) that are implemented
+* **Zero-Copy to Arrow** - "canonicalized" (i.e., fully decompressed) Vortex arrays can be zero-copy converted to/from
+Apache Arrow arrays.
+* **Extensible Encodings** - a pluggable set of physical layouts. In addition to the builtin set of Arrow-compatible
+encodings,
+the Vortex repository includes a number of state-of-the-art encodings (e.g., FastLanes, ALP, FSST, etc.) that are
+implemented
 as extensions. While arbitrary encodings can be implemented as extensions, we have intentionally chosen a small set
-of encodings that are highly data-parallel, which in turn allows for efficient vectorized decoding, random access reads,
+of encodings that are highly data-parallel, which in turn allows for efficient vectorized decoding, random access
+reads,
 and (in the future) decompression on GPUs.
 * **Cascading Compression** - data can be recursively compressed with multiple nested encodings.
-* **Pluggable Compression Strategies** - the built-in Compressor is based on BtrBlocks, but other strategies can trivially be used instead.
+* **Pluggable Compression Strategies** - the built-in Compressor is based on BtrBlocks, but other strategies can
+trivially be used instead.
 * **Compute** - basic compute kernels that can operate over encoded data (e.g., for filter pushdown).
 * **Statistics** - each array carries around lazily computed summary statistics, optionally populated at read-time.
 These are available to compute kernels as well as to the compressor.
 * **Serialization** - Zero-copy serialization of arrays, both for IPC and for file formats.
-* **Columnar File Format (in progress)** - A modern file format that uses the Vortex serde library to store compressed array data.
+* **Columnar File Format (in progress)** - A modern file format that uses the Vortex serde library to store compressed
+array data.
 Optimized for random access reads and extremely fast scans; an aspiring successor to Apache Parquet.
 
 ## Overview: Logical vs Physical
 
 One of the core design principles in Vortex is strict separation of logical and physical concerns.
 
-For example, a Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
+For example, a Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical
+encoding
 (the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.
 
 The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct
 Vortex arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and
 `chunked`) that are useful building blocks for other encodings. The included extension encodings are mostly designed
 to model compressed in-memory arrays, such as run-length or dictionary encoding.
 
-Analogously, `vortex-serde` is designed to handle the low-level physical details of reading and writing Vortex arrays. Choices
+Analogously, `vortex-serde` is designed to handle the low-level physical details of reading and writing Vortex arrays.
+Choices
 about which encodings to use or how to logically chunk data are left up to the `Compressor` implementation.
 
-One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the
+One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data
+within the
 file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to
 the file format specification.
 
 For example, the Compressor implementation can choose to chunk data into a Parquet-like layout with
-row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it can choose
-to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is constant
+row groups and aligned pages (ChunkedArray of StructArray of ChunkedArrays with equal chunk sizes). Alternatively, it
+can choose
+to chunk different columns differently based on their compressed size and data distributions (e.g., a column that is
+constant
 across all rows can be a single chunk, whereas a large string column may be split arbitrarily many times).
 
-In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly into the files
+In the same vein, the format is designed to support forward compatibility by optionally embedding WASM decoders directly
+into the files
 themselves. This should help avoid the rapid calcification that has plagued other columnar file formats.
 
 ## Components
@@ -239,7 +255,8 @@ Licensed under the Apache License, Version 2.0 (the "License").
 ## Governance
 
 Vortex is and will remain an open-source project. Our intent is to model its governance structure after the
-[Substrait project](https://substrait.io/governance/), which in turn is based on the model of the Apache Software Foundation.
+[Substrait project](https://substrait.io/governance/), which in turn is based on the model of the Apache Software
+Foundation.
 Expect more details on this in Q4 2024.
 
 ## Acknowledgments 🏆
@@ -252,7 +269,8 @@ In particular, the following academic papers have strongly influenced developmen
 * Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis.
 [BtrBlocks: Efficient Columnar Compression for Data Lakes](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf).
 Proc. ACM Manag. Data 1, 2, Article 118 (June 2023), 14 pages.
-* Azim Afroozeh and Peter Boncz. [The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar
+* Azim Afroozeh and Peter
+Boncz. [The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar
 Code](https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf). PVLDB, 16(9): 2132 - 2144, 2023.
 * Peter Boncz, Thomas Neumann, and Viktor Leis. [FSST: Fast Random Access String
 Compression](https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf).
@@ -270,10 +288,12 @@ Additionally, we benefited greatly from:
 
 * the existence, ideas, & implementations of both [Apache Arrow](https://arrow.apache.org) and
 [Apache DataFusion](https://github.com/apache/datafusion).
-* the [parquet2](https://github.com/jorgecarleitao/parquet2) project by [Jorge Leitao](https://github.com/jorgecarleitao).
+* the [parquet2](https://github.com/jorgecarleitao/parquet2) project
+by [Jorge Leitao](https://github.com/jorgecarleitao).
 * the public discussions around choices of compression codecs, as well as the C++ implementations thereof,
 from [duckdb](https://github.com/duckdb/duckdb).
-* the [Velox](https://github.com/facebookincubator/velox) and [Nimble](https://github.com/facebookincubator/nimble) projects,
+* the [Velox](https://github.com/facebookincubator/velox) and [Nimble](https://github.com/facebookincubator/nimble)
+projects,
 and discussions with their maintainers.
 
 Thanks to all of the aforementioned for sharing their work and knowledge with the world! 🚀
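
Note on the README text in this diff: the "Overview: Logical vs Physical" section says a writer may chunk every column alike (Parquet-style row groups) or chunk each column separately, for example one chunk for a constant column and many for a large string column. The sketch below illustrates only that second idea in plain Python. It is not the Vortex `Compressor`; the function `plan_column_chunks`, its parameters, and the 1024-byte target are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def plan_column_chunks(
    values: Sequence,
    approx_bytes_per_value: int,
    target_chunk_bytes: int = 1024,
) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) row ranges for one column.

    Hypothetical helper for illustration; not part of the Vortex API.
    """
    n = len(values)
    if n == 0:
        return []
    # A column that is constant across all rows fits in a single chunk.
    if all(v == values[0] for v in values):
        return [(0, n)]
    # Otherwise size chunks by bytes, so wide values get fewer rows per chunk.
    rows_per_chunk = max(1, target_chunk_bytes // max(1, approx_bytes_per_value))
    return [(s, min(s + rows_per_chunk, n)) for s in range(0, n, rows_per_chunk)]


table = {
    "country": ["US"] * 10,
    "comment": ["x" * 300] * 5 + ["y" * 300] * 5,
}
layout = {
    "country": plan_column_chunks(table["country"], approx_bytes_per_value=2),
    "comment": plan_column_chunks(table["comment"], approx_bytes_per_value=300),
}
print(layout)
```

Here the two columns end up with different chunk boundaries, which is the property the README contrasts with aligned row groups.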

docs/_static/style.css

Lines changed: 10 additions & 2 deletions
@@ -1,3 +1,11 @@
-html .pst-navbar-icon {
-font-size: 1.5rem;
+h2 {
+font-size: 1.75rem;
 }
+
+h3 {
+font-size: 1.5rem;
+}
+
+h4 {
+font-size: 1.25rem;
+}

docs/conf.py

Lines changed: 5 additions & 0 deletions
@@ -24,6 +24,7 @@
 "sphinx.ext.napoleon",
 "sphinx_copybutton",
 "sphinx_inline_tabs",
+"sphinxcontrib.bibtex",
 "sphinxext.opengraph",
 ]
 
@@ -70,3 +71,7 @@
 
 ogp_site_url = "https://docs.vortex.dev"
 ogp_image = "https://docs.vortex.dev/_static/vortex_spiral_logo.svg"
+
+# -- Options for Sphinx BibTEX -------------------------------------------
+
+bibtex_bibfiles = ["references.bib"]
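
Note on the `bibtex_bibfiles` setting above: it names the BibTeX file that sphinxcontrib-bibtex reads, and the extension's `bibliography` directive and `cite` role then render entries from it. The contents of `references.bib` are not shown in this diff, so the following is only a sketch of what one entry could look like, built from the BtrBlocks citation already present in the README. The helper `bibtex_entry` and the citation key `kuschewski2023btrblocks` are hypothetical.

```python
from typing import Dict


def bibtex_entry(kind: str, key: str, fields: Dict[str, str]) -> str:
    """Render one BibTeX entry (hypothetical helper, for illustration only)."""
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@{kind}{{{key},\n{body}\n}}\n"


# Field values are taken from the README acknowledgments; the key is made up.
entry = bibtex_entry(
    "article",
    "kuschewski2023btrblocks",
    {
        "author": "Maximilian Kuschewski and David Sauerwein and "
                  "Adnan Alhomssi and Viktor Leis",
        "title": "BtrBlocks: Efficient Columnar Compression for Data Lakes",
        "journal": "Proc. ACM Manag. Data",
        "volume": "1",
        "number": "2",
        "year": "2023",
    },
)
print(entry)
```

Text in this shape could be appended to the file listed in `bibtex_bibfiles`, after which the key becomes citable from the docs pages.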

docs/index.md

Lines changed: 38 additions & 24 deletions
@@ -1,34 +1,47 @@
-# Vortex: a State-of-the-Art Columnar File Format
+# Vortex: the columnar data toolkit
 
-Vortex is a fast & extensible columnar file format that is based around the latest research from the
-database community. It is built around cascading compression with lightweight, vectorized encodings
-(i.e., no block compression), allowing for both efficient random access and extremely fast
-decompression.
+Vortex is a general purpose toolkit for working with columnar data built around the latest research from the
+database community.
 
-Vortex includes an accompanying in-memory format for these (recursively) compressed arrays,
-that is zero-copy compatible with Apache Arrow in uncompressed form. Taken together, the Vortex
-library is a useful toolkit with compressed Arrow data in-memory, on-disk, & over-the-wire.
+## In-memory
 
-Vortex consolidates the metadata in a series of flatbuffers in the footer, in order to minimize
-the number of reads (important when reading from object storage) & the deserialization overhead
-(important for wide tables with many columns).
+Vortex in-memory arrays support:
 
-Vortex aspires to succeed Apache Parquet by pushing the Pareto frontier outwards: 1-2x faster
-writes, 2-10x faster scans, and 100-200x faster random access reads, while preserving the same
-approximate compression ratio as Parquet v2 with zstd.
+* Zero-copy interoperability with [Apache Arrow](https://arrow.apache.org).
+* Cascading compression with lightweight, vectorized encodings such as
+[FastLanes](https://github.com/spiraldb/fastlanes),
+[FSST](https://github.com/spiraldb/fsst),
+and [ALP](https://github.com/spiraldb/alp).
+* Fast random access to compressed data.
+* Compute push-down over compressed data.
+* Array statistics for efficient compute.
 
-Its features include:
+## On-disk
 
-- A zero-copy data layout for disk, memory, and the wire.
-- Kernels for computing on, filtering, slicing, indexing, and projecting compressed arrays.
-- Builtin state-of-the-art codecs including FastLanes (integer bit-packing), ALP (floating point),
-and FSST (strings).
-- Support for custom user-implemented codecs.
-- Support for, but no requirement for, row groups.
-- A read sub-system supporting filter and projection pushdown.
+Vortex ships with an extensible file format supporting:
 
-Vortex's flexible layout empowers writers to choose the right layout for their setting: fast writes,
-fast reads, small files, few columns, many columns, over-sized columns, etc.
+* Zero-allocation reads, deferring both deserialization and decompression.
+* Zero-copy reads from memory-mapped files.
+* FlatBuffer metadata to support ultra-wide schemas (>>100k columns).
+* Fully customizable layouts and encodings (row-groups, column-groups, writer decides).
+* Forwards compatibility by optionally embedding [WASM](https://webassembly.org/) decompression kernels.
+
+## Over-the-wire
+
+Vortex defines a work-in-progress IPC format for sending possibly compressed arrays over the wire.
+
+* Zero-copy serialization and deserialization.
+* Support for both compressed and uncompressed data.
+* Enables partial compute push-down to storage servers.
+* Enables client-side browser decompression with Vortex WASM.
+
+## Extensibility
+
+Vortex is designed to be incredibly extensible. Almost all reader and writer logic is extensible at compile-time
+by providing various implementations of Rust traits, and encodings and layouts are extensible at runtime with
+dynamically loaded libraries or WebAssembly kernels.
+
+Please reach out to us if you'd like to extend Vortex with your own encodings, layouts, or other functionality.
 
 ## Concepts
 
@@ -91,6 +104,7 @@ hidden:
 caption: Project Links
 ---
 
+references
 Spiral <https://spiraldb.com>
 GitHub <https://github.com/spiraldb/vortex>
 PyPI <https://pypi.org/project/vortex-array>
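
Note on the new landing page text: "cascading compression" and "fast random access to compressed data" can be made concrete with a toy example. The sketch below nests two simple encodings, dictionary encoding followed by run-length encoding of the dictionary codes, and then reads a single row without decoding the whole array. These two encodings stand in for the FastLanes, FSST, and ALP encodings named in the docs; every function here is a hypothetical plain-Python illustration, not the Vortex API.

```python
from typing import List, Sequence, Tuple


def dict_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with an integer code into a dictionary of distinct values."""
    dictionary: List[str] = []
    index = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def rle_encode(codes: Sequence[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal codes into (code, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs


def decode(dictionary: List[str], runs: List[Tuple[int, int]]) -> List[str]:
    """Undo both layers: expand the runs, then look up each code."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


def take(dictionary: List[str], runs: List[Tuple[int, int]], row: int) -> str:
    """Read one row by walking run lengths, without expanding the array."""
    for code, length in runs:
        if row < length:
            return dictionary[code]
        row -= length
    raise IndexError("row out of range")


values = ["a", "a", "a", "b", "b", "a"]
dictionary, codes = dict_encode(values)   # first encoding layer
runs = rle_encode(codes)                  # second layer, applied to the codes
print(dictionary, runs, take(dictionary, runs, 4))
```

The `take` function is a linear walk over runs, chosen for brevity; it shows the principle of reading from encoded data rather than a performant kernel.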

docs/pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -6,10 +6,12 @@ authors = []
 dependencies = [
 "furo>=2024.8.6",
 "myst-parser>=4.0.0",
+"setuptools>=75.8.0", # Required by sphinxcontrib-bibtex
 "sphinx-autobuild>=2024.10.3",
 "sphinx-copybutton>=0.5.2",
 "sphinx-inline-tabs>=2023.4.21",
 "sphinx>=8.0.2",
+"sphinxcontrib-bibtex>=2.6.3",
 "sphinxext-opengraph>=0.9.1",
 "vortex-array",
 ]
