Commit 6921ebb

Add a history of storage format versions et. al. (#5205)
[SC-50706](https://app.shortcut.com/tiledb-inc/story/50706/include-all-historical-storage-format-version-information-in-the-current-spec)

This PR updates the storage format specification to include information about all past versions. The work is divided into two parts. First, a new file was created that lists what changed in each storage format version, going in more detail than the existing table (which will be removed). After that, the fields in the various data structures will be updated to indicate in which version they were introduced.

I sourced the changes in each storage format by searching for the `version(\(\)|_)? (>=?|<=?|==) \d+` regex in the code, as well as searching for references of [these named constants](https://github.com/TileDB-Inc/TileDB/blob/41eb1cc9df9603fc82a6c9dbaf0ae0c25a8ace8f/tiledb/sm/misc/constants.cc#L706-L731).

I also made other small fixes to the spec as I found them. Information about past format versions of _groups_ will be added in another PR.

---
TYPE: NO_HISTORY
DESC: The storage format specification was updated to include information about all previous versions.

---------

Co-authored-by: KiterLuc <[email protected]>
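The regex search described in the commit message can be tried in isolation. The following is a minimal sketch in Python, not the tooling the author used; the sample lines and the `find_version_checks` helper are invented for illustration, and only the pattern itself comes from the message.

```python
import re

# Pattern quoted in the commit message for locating version comparisons.
VERSION_CHECK = re.compile(r"version(\(\)|_)? (>=?|<=?|==) \d+")

# Hypothetical source lines, made up for this example.
samples = [
    "if (version >= 12) {",
    "if (version_ < 5)",
    "if (schema.version() == 7)",
    "if (format_version > constants::format_version)",
]


def find_version_checks(lines):
    """Return the matched comparison text for each line that contains one."""
    hits = []
    for line in lines:
        match = VERSION_CHECK.search(line)
        if match:
            hits.append(match.group(0))
    return hits


matches = find_version_checks(samples)
```

The last sample compares against a named constant rather than a number, so the pattern skips it. That gap is presumably why the message mentions a separate search for the named constants.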
1 parent 8b15119 commit 6921ebb

5 files changed

+176
-32
lines changed

format_spec/FORMAT_SPEC.md

Lines changed: 2 additions & 28 deletions
@@ -11,36 +11,10 @@ title: Format Specification
 - [Dictionary filter](filters/dictionary_encoding.md)
 - RLE filter

-## History
-
-|Format version|TileDB version|Description|
-|-|-|-|
-|1|1.4|[Decouple format and library version](https://github.com/TileDB-Inc/TileDB/commit/610f087515b6de5c3290b09dab30c6943ec77feb)|
-|2|1.5|[Always split coordinate tiles](https://github.com/TileDB-Inc/TileDB/commit/9394b38bdfbacd606d673896b4ae87e7968b7c2f)|
-|3|1.6|[Parallelize fragment metadata loading](https://github.com/TileDB-Inc/TileDB/commit/a2eb6237e622c3a17691dbe04c9223ba099f7466)|
-|4|1.7|[Remove KV storage](https://github.com/TileDB-Inc/TileDB/commit/e733f7baa85a41e25e5834a220234397d6038401)|
-|5|2.0|[Split coordinates into individual files](https://github.com/TileDB-Inc/TileDB/commit/d3543bdbc4ee7c2ed1f2de8cee42b04c6ec8eafc)|
-|6|2.1|[Implement attribute fill values](https://github.com/TileDB-Inc/TileDB/commit/eaafa47c97af0ee654a0ca2e97da7b8d941e672b)|
-|7|2.2|[Nullable attribute support](https://github.com/TileDB-Inc/TileDB/commit/a7fd8d6dd74bb4fa1ae25a6f995da93812f92c20)|
-|8|2.3|[Percent encode attribute/dimension file names](https://github.com/TileDB-Inc/TileDB/commit/97c5c4b0aa35cfd96197558ffc1189860b4adc6f)|
-|9|2.3|[Name attribute/dimension files by index](https://github.com/TileDB-Inc/TileDB/commit/9a2ed1c22242f097300c2909baf6cb671a7ee33e)|
-|10|2.4|[Added array schema evolution](https://github.com/TileDB-Inc/TileDB/commit/41e5e8f4b185f49777560d637b1d61de498364ce)|
-|11|2.7|[Store integral cells, aka, don't split cells across chunks](https://github.com/TileDB-Inc/TileDB/commit/beab5113526b7156c8c6492542f1681555c8ae87)|
-|12|2.8|[New array directory structure](https://github.com/TileDB-Inc/TileDB/commit/ce204ad1ea5b40f006f4a6ddf240d89c08b3235b)|
-|13|2.9|[Add dictionary filter](https://github.com/TileDB-Inc/TileDB/commit/5637e8c678451c9d2356ccada118b504c8ca85f0)|
-|14|2.10|[Consolidation with timestamps, add has_timestamps to footer](https://github.com/TileDB-Inc/TileDB/commit/31a3dce8db254efc36f6d28249febed41bba3bcd)|
-|15|2.11|[Remove consolidate with timestamps config](https://github.com/TileDB-Inc/TileDB/commit/6b49739e79d804dc56eb0a7e422823ae6f002276)|
-|16|2.12|[Implement delete strategy](https://github.com/TileDB-Inc/TileDB/commit/8d64b1f38177113379fa741016136dbd2b06fcfd)|
-|17|2.14|[Add dimension labels and data order](https://github.com/TileDB-Inc/TileDB/commit/bb433fcf12dc74a38c7e843808ec1e593b16ce71)|
-|18|2.15|[Dimension Labels no longer experimental](https://github.com/TileDB-Inc/TileDB/commit/c3a1bb47e7237f50e8ed9e33abfaa3161e23ff64)|
-|19|2.16|[Vac files now use relative URIs](https://github.com/TileDB-Inc/TileDB/commit/ef3236a526b67c50138436a16f67ad274c2ca037)|
-|20|2.17|[Enumerations](https://github.com/TileDB-Inc/TileDB/commit/c0d7c6a50fdeffbcc7d8c9ba4a29230fe22baed6)|
-|21|2.19|[Tile metadata are now correctly calculated for nullable fixed size strings on dense arrays](https://github.com/TileDB-Inc/TileDB/commit/081bcc5f7ce4bee576f08b97de348236ac88d429)|
-|22|2.25|[Add array current domain](https://github.com/TileDB-Inc/TileDB/commit/9116d3c95a83d72545520acb9a7808fc63478963)|
-
 ## Table of Contents

 * **Array**
+* [Format Version History](./history.md)
 * [File hierarchy](./array_file_hierarchy.md)
 * [Array Schema](./array_schema.md)
 * [Fragment](./fragment.md)
@@ -53,4 +27,4 @@ title: Format Specification
 * [Consolidated Fragment Metadata File](./consolidated_fragment_metadata_file.md)
 * [Filter Pipeline](./filter_pipeline.md)
 * [Timestamped Name](./timestamped_name.md)
-* [Vacuum Pipeline](./vacuum_file.md)
+* [Vacuum File](./vacuum_file.md)

format_spec/history.md

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
+---
+title: Format version history
+---
+
+# Format Version History
+
+## Version 22
+
+Introduced in TileDB 2.25
+
+* The _Current domain_ field was added to [array schemas](./array_schema.md#array-schema-file).
+
+## Version 21
+
+Introduced in TileDB 2.19
+
+* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for nullable fixed-size strings on dense arrays.
+
+> [!NOTE]
+> This version does not contain any changes to the storage format, but was introduced as an indicator for implementations to not rely on tile metadata for nullable fixed-size strings on dense arrays on previous versions.
+
+## Version 20
+
+Introduced in TileDB 2.17
+
+* Arrays can have [enumerations](./enumeration.md).
+* The bit-width reduction and positive delta filters are supported on data of date or time types.
+* The [filter pipeline options](./filter_pipeline.md#filter-options) for the double-delta filter contain the _Reinterpret datatype_ field.
+
+## Version 19
+
+Introduced in TileDB 2.16
+
+* [Vacuum files](./vacuum_file.md) contain relative paths to the location of the array.
+* The [filter pipeline options](./filter_pipeline.md#filter-options) for the delta filter contain the _Reinterpret datatype_ field.
+
+## Version 18
+
+Introduced in TileDB 2.15
+
+* Arrays can have [dimension labels](./array_schema.md#dimension-label).
+
+## Version 17
+
+Introduced in TileDB 2.14
+
+* The _Order_ field was added to [attributes](./array_schema.md#attribute).
+* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary filter exists in the filter pipeline. They are instead encoded as part of the data tile.
+
+## Version 16
+
+Introduced in TileDB 2.12
+
+* Arrays can have [delete commit files](./delete_commit_file.md).
+* Arrays can have [update commit files](./update_commit_file.md).
+* The TileDB implementation currently supports writing update commit files as an experimental feature, but they are not yet considered when performing reads.
+* Fragment metadata contain [tile processed conditions](./fragment.md#tile-processed-conditions).
+
+## Version 15
+
+Introduced in TileDB 2.11
+
+* Consolidated fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).
+
+## Version 14
+
+Introduced in TileDB 2.10
+
+* Consolidated fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).
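The Version 14 and Version 15 entries above each add a one-byte flag to the footer, which implies that a reader has to gate each read on the format version. The sketch below illustrates that pattern in Python under loud assumptions: the two-byte toy buffer and the field order are invented for this example and are not the real footer layout, and `read_footer_flags` is a hypothetical helper, not a TileDB API.

```python
import io
import struct


def read_footer_flags(buf: bytes, version: int) -> dict:
    """Read version-gated one-byte flags from a toy buffer.

    Assumed toy layout (NOT the real footer): [has_timestamps][has_delete_meta],
    where each byte is present only from the version that introduced it.
    """
    stream = io.BytesIO(buf)
    flags = {"has_timestamps": False, "has_delete_meta": False}
    if version >= 14:
        # "Includes timestamps" was added in format version 14.
        (value,) = struct.unpack("<b", stream.read(1))
        flags["has_timestamps"] = bool(value)
    if version >= 15:
        # "Includes delete metadata" was added in format version 15.
        (value,) = struct.unpack("<b", stream.read(1))
        flags["has_delete_meta"] = bool(value)
    return flags


old_fragment = read_footer_flags(b"", 13)
new_fragment = read_footer_flags(b"\x01\x00", 15)
```

Fields that predate the reader's version check simply keep their defaults, so an older fragment can be read without any bytes for the newer flags.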
+
+## Version 13
+
+Introduced in TileDB 2.9
+
+* The [dictionary filter](./filters/dictionary_encoding.md) was added.
+
+## Version 12
+
+Introduced in TileDB 2.8
+
+* The [array file hierarchy](./array_file_hierarchy.md) was updated to store fragments, commits and consolidated fragment metadata in separate subdirectories.
+* The extension of commit files was changed to `.wrt`.
+* Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the RLE filter exists in the filter pipeline. They are instead encoded as part of the data tile.
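The two offset-tile rules in this history (ASCII strings with the RLE filter from version 12, UTF-8 strings with the RLE or dictionary filter from version 17) can be written as one predicate. This is a minimal Python sketch of only those two stated rules; the `Datatype` and `Filter` enums and the `skips_offset_tiles` function are hypothetical names for illustration, not the C++ `FilterPipeline` interface, and any rule the entries do not state (for example the dictionary filter on ASCII strings) is deliberately left out.

```python
from enum import Enum


class Datatype(Enum):
    # Illustrative subset of datatypes, named for this sketch only.
    STRING_ASCII = "ascii"
    STRING_UTF8 = "utf8"
    INT32 = "int32"


class Filter(Enum):
    # Illustrative subset of filters, named for this sketch only.
    RLE = "rle"
    DICTIONARY = "dictionary"
    GZIP = "gzip"


def skips_offset_tiles(datatype, version, filters):
    """Return True when offsets are encoded in the data tile instead."""
    present = set(filters)
    if datatype is Datatype.STRING_ASCII:
        # Version 12 rule: ASCII strings with the RLE filter.
        return version >= 12 and Filter.RLE in present
    if datatype is Datatype.STRING_UTF8:
        # Version 17 rule: UTF-8 strings with the RLE or dictionary filter.
        return version >= 17 and bool(present & {Filter.RLE, Filter.DICTIONARY})
    return False


ascii_v12 = skips_offset_tiles(Datatype.STRING_ASCII, 12, [Filter.RLE])
utf8_v16 = skips_offset_tiles(Datatype.STRING_UTF8, 16, [Filter.RLE])
```

Keeping the version threshold next to each datatype check mirrors how the history entries are phrased, which makes it easy to compare the sketch against the spec line by line.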
+
+## Version 11
+
+Introduced in TileDB 2.7
+
+* Fragment metadata contain [metadata](./fragment.md#tile-mins-maxes) (min/max value, sum, null count) for each tile.
+* The TileDB implementation has been updated to never split cells when storing them in chunks.
+
+## Version 10
+
+Introduced in TileDB 2.4
+
+* Arrays support schema evolution.
+* Array schemas are stored in a `__schema` subdirectory, and have a [timestamped name](./timestamped_name.md).
+* The _Array schema name_ field was added to the [fragment metadata footer](./fragment.md#footer).
+* The _Footer length_ field of the [fragment metadata footer](./fragment.md#footer) is always written.
+
+## Version 9
+
+Introduced in TileDB 2.3
+
+* [Data files](./fragment.md#data-file) are named by the index of their attribute or dimension.
+* The _URI_ fields of [Consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) contain relative paths to the location of fragments in the array.
+
+## Version 8
+
+Introduced in TileDB 2.2.3
+
+* [Data files](./fragment.md#data-file) are named by the name of their attribute or dimension, after percent encoding certain characters. These characters are `!#$%&'()*+,/:;=?@[]`, as specified in [RFC 3986](https://tools.ietf.org/html/rfc3986), as well as `"<>\|`, which are not allowed in Windows file names.
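The Version 8 naming rule can be illustrated with a short Python sketch. It is not the TileDB implementation: `encode_field_name` is a hypothetical helper, and emitting uppercase hex digits while leaving every other character untouched is an assumption of this example. Only the two character sets come from the entry above.

```python
# Characters listed in the Version 8 entry.
RFC3986_RESERVED = "!#$%&'()*+,/:;=?@[]"
WINDOWS_FORBIDDEN = '"<>\\|'
ENCODED_CHARS = set(RFC3986_RESERVED + WINDOWS_FORBIDDEN)


def encode_field_name(name: str) -> str:
    """Percent-encode only the characters listed in the Version 8 entry."""
    pieces = []
    for ch in name:
        if ch in ENCODED_CHARS:
            # Assumption: two uppercase hex digits, as is conventional.
            pieces.append("%{:02X}".format(ord(ch)))
        else:
            pieces.append(ch)
    return "".join(pieces)


encoded = encode_field_name("temp/max")
```

The standard library's `urllib.parse.quote` is not used here because its default safe set differs from the list in the entry, so an explicit character set keeps the sketch aligned with the spec text.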
+
+## Version 7
+
+Introduced in TileDB 2.2
+
+* Attributes can be nullable.
+* The _Nullable_ and _Fill value validity_ fields were added to [attributes](./array_schema.md#attribute).
+* The _Validity filters_ field was added to [array schemas](./array_schema.md#array-schema-file).
+* Fragment metadata contain validity [tile offsets](./fragment.md#tile-offsets).
+
+## Version 6
+
+Introduced in TileDB 2.1
+
+* The _Fill value_ field was added to [attributes](./array_schema.md#attribute).
+
+## Version 5
+
+Introduced in TileDB 2.0
+
+* Dimensions are stored in separate [data files](./fragment.md#data-file).
+* Sparse arrays can have string dimensions and dimensions with different datatypes.
+* The _Dimension datatype_, _Cell val num_ and _Filters_ fields were added to [dimensions](./array_schema.md#dimension).
+* The _Domain size_ field was added to [dimensions](./array_schema.md#dimension). The domain of a dimension can have a variable size.
+* The _Domain datatype_ field was removed from [domains](./array_schema.md#domain).
+* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
+* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
+* The _Allows dups_ field was added to [array schemas](./array_schema.md#array-schema-file).
+* Committed fragments are indicated by the presence of an `.ok` file in the array's directory, with the same [timestamped name](./timestamped_name.md) as the fragment.
+
+## Version 4
+
+Introduced in TileDB 1.7
+
+* Support for the [key-value store](https://tiledb-inc-tiledb.readthedocs-hosted.com/en/1.6.3/tutorials/kv.html) object type was removed. Key-value stores have been superseded by sparse arrays.
+
+## Version 3
+
+Introduced in TileDB 1.6
+
+* The structure of [fragment metadata files](./fragment.md#fragment-metadata-file) was overhauled.
+* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added.
+* The _Bounding coords_ field was removed.
+* The _MBRs_ field was removed. MBRs are now stored in the R-Tree.
+* Structures other than the footer like tile offsets, sizes and metadata are wrapped in their own generic tiles. This allows loading them lazily and in parallel.
+
+## Version 2
+
+Introduced in TileDB 1.5
+
+* Cell coordinate values of each dimension are always stored next to each other, regardless of whether they are filtered with a compression filter or not.
+
+## Version 1
+
+Introduced in TileDB 1.4
+
+* Initial version of the TileDB storage format.

format_spec/vacuum_file.md

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ my_array # array folder
 | ...
 ```

-When located in the commits folder, it will include the URI of fragments (in the `__fragments` folder) that can be vaccumed. When located in the array metadata folder, it will include the URI or array metadata files that can be vaccumed.
+When located in the commits folder, it will include the URI of fragments (in the `__fragments` folder) that can be vacuumed. When located in the array metadata folder, it will include the URI or array metadata files that can be vacuumed.

 The vacuum file is a simple text file where each line contains a URI string:

tiledb/sm/filter/filter_pipeline.h

Lines changed: 3 additions & 2 deletions
@@ -288,8 +288,9 @@ class FilterPipeline {
 FilterPipeline* pipeline, const EncryptionKey& encryption_key);

 /**
-* Checks if an attribute/dimension needs to be filtered in chunks or as a
-* whole
+* Checks if the offsets tiles of an attribute/dimension should be skipped
+* from being written. This happens in filters that encode the offsets
+* alongside the data.
 *
 * @param type Datatype of the input attribute/dimension
 * @param version Array schema version

tiledb/sm/fragment/fragment_metadata.cc

Lines changed: 1 addition & 1 deletion
@@ -2040,7 +2040,7 @@ void FragmentMetadata::load_has_timestamps(Deserializer& deserializer) {
 // ===== FORMAT =====
 // has_delete_meta (char)
 void FragmentMetadata::load_has_delete_meta(Deserializer& deserializer) {
-// Get includes timestamps
+// Get includes delete metadata
 has_delete_meta_ = deserializer.read<char>();

 // Rebuild index map
