Commit 6564bd7

Storage format specification improvements 2/N (#5329)
[SC-54621](https://app.shortcut.com/tiledb-inc/story/54621/explicitly-mention-what-changed-in-each-storage-format-version-throughout-the-specification) Continuing #5205, this PR goes through each format version and mentions the changes to the main specification document. It also documents structures like `__coords.tdb` and legacy fragment metadata, and fixes any defects encountered in the meantime.

---

TYPE: FORMAT
DESC: The storage format specification was updated to document format changes of previous versions throughout the main document.

---------

Co-authored-by: Nick Vigilante
1 parent 55c5d95 commit 6564bd7

12 files changed, +253 -151 lines changed

format_spec/FORMAT_SPEC.md

Lines changed: 4 additions & 3 deletions
@@ -4,17 +4,18 @@ title: Format Specification
44

55
**Notes:**
66

7-
* The current TileDB format version number is **22** (`uint32_t`).
7+
* The current TileDB array format version number is **22** (`uint32_t`).
8+
* Other structures might be versioned separately.
89
* Data written by TileDB and referenced in this document is **little-endian**
910
with the following exceptions:
1011

11-
- [Dictionary filter](filters/dictionary_encoding.md)
12+
- [Dictionary encoding filter](filters/dictionary_encoding.md)
1213
- RLE filter
1314

1415
## Table of Contents
1516

1617
* **Array**
17-
* [Format Version History](./history.md)
18+
* [Format Version History](./array_format_history.md)
1819
* [File hierarchy](./array_file_hierarchy.md)
1920
* [Array Schema](./array_schema.md)
2021
* [Fragment](./fragment.md)

format_spec/array_file_hierarchy.md

Lines changed: 33 additions & 14 deletions
@@ -7,6 +7,9 @@ An array is a folder with the following structure:
77
```
88
my_array # array folder
99
|_ __schema # array schema folder
10+
|_ <timestamp_name> # array schema files
11+
|_ ...
12+
|_ __enumerations # array enumerations folder
1013
|_ __fragments # array fragments folder
1114
|_ <timestamped_name> # fragment folder
1215
|_ ...
@@ -22,23 +25,39 @@ my_array # array folder
2225
|_ <timestamped_name>.con # consolidated commits file
2326
|_ ...
2427
|_ <timestamped_name>.ign # ignore file for consolidated commits file
25-
|_ __fragment_meta
26-
|_ <timestamped_name>.meta # consol. fragment meta file
27-
|_ ...
28+
|_ __fragment_meta # consolidated fragment metadata folder
29+
|_ <timestamped_name>.meta # consolidated fragment meta file
30+
|_ ...
2831
|_ __meta # array metadata folder
2932
|_ __labels # dimension label folder
30-
33+
|_ <timestamped_name> # legacy fragment folder
34+
|_ ...
35+
|_ <timestamped_name>.ok # legacy fragment write file
36+
|_ <timestamped_name>.meta # legacy consolidated fragment meta file
37+
|_ __array_schema.tdb # legacy array schema file
3138
```
3239

3340
Inside the array folder, you can find the following:
3441

35-
* [Array schema](./array_schema.md) folder `__schema`.
36-
* Inside of a fragments folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md).
37-
* Inside of a commit folder, an empty file [`<timestamped_name>`](./timestamped_name.md)`.wrt` associated with every fragment folder [`<timestamped_name>`](./timestamped_name.md), where [`<timestamped_name>`](./timestamped_name.md) is common for the folder and the WRT file. This is used to indicate that fragment [`<timestamped_name>`](./timestamped_name.md) has been *committed* (i.e., its write process finished successfully) and it is ready for use by TileDB. If the WRT file does not exist, the corresponding fragment folder is ignored by TileDB during the reads.
38-
* Inside the same commit folder, any number of [delete commit files](./delete_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.del`.
39-
* Inside the same commit folder, any number of [update commit files](./update_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.upd`.
40-
* Inside the same commit folder, any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`.
41-
* Inside the same commit folder, any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`.
42-
* Inside of a fragment metadata folder, any number of [consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.meta`.
43-
* [Array metadata](./metadata.md) folder `__meta`.
44-
* Inside of a labels folder, additional TileDB arrays storing dimension label data.
42+
* Inside of a `__schema` folder, any number of [array schema files](./array_schema.md) [`<timestamped_name>`](./timestamped_name.md).
43+
* **Note**: the name does _not_ include the format version.
44+
* _New in version 20_ Inside of the schema folder, an enumerations folder `__enumerations`.
45+
* Inside of a `__meta` folder, any number of [array metadata files](./metadata.md) [`<timestamped_name>`](./timestamped_name.md).
46+
* Inside of a `__fragments` folder, any number of [fragment folders](./fragment.md) [`<timestamped_name>`](./timestamped_name.md).
47+
* _New in version 18_ Inside of a `__labels` folder, additional TileDB arrays storing dimension label data.
48+
* _New in version 12_ Inside of a `__commits` folder:
49+
* Any number of empty files [`<timestamped_name>`](./timestamped_name.md)`.wrt`, each associated with fragment folder [`<timestamped_name>`](./timestamped_name.md), indicating that the fragment has been *committed* (i.e., its write process finished successfully). If the WRT file does not exist, the corresponding fragment must be ignored when reading the array.
50+
* Any number of [consolidated commits files](./consolidated_commits_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.con`.
51+
* Any number of [ignore files](./ignore_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.ign`.
52+
* _New in version 16_ Any number of [delete commit files](./delete_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.del`.
53+
* _New in version 16_ Any number of [update commit files](./update_commit_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.upd`.
54+
* _New in version 12_ Inside of a `__fragment_meta` folder, any number of [consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) of the form [`<timestamped_name>`](./timestamped_name.md)`.meta`.
55+
56+
> [!NOTE]
57+
> Prior to version 12, fragments, commit files, and consolidated fragment metadata were stored directly in the array folder and the extension of commit files was `.ok` instead of `.wrt`. Implementations must support arrays that contain data in both the old and the new hierarchy at the same time.
58+
59+
> [!NOTE]
60+
> Prior to version 10, the array schema was stored in a single `__array_schema.tdb` file in the array folder. Implementations must support arrays that contain both `__array_schema.tdb` and schemas in the `__schema` folder at the same time. For the purpose of array schema evolution, the timestamp of `__array_schema.tdb` must be considered to be earlier than any schema in the `__schema` folder.
61+
62+
> [!NOTE]
63+
> Prior to version 5, commit files were not written. Fragments of these versions are considered to be committed if their corresponding fragment metadata file exists.
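
The commit rules in the three notes above can be sketched as a small version-dependent check. This is a minimal illustration, not the TileDB implementation: `is_fragment_committed` and its `format_version` parameter are hypothetical, the fragment metadata file name is an assumption (this section does not name it), the array is assumed to live on a local filesystem, and consolidated commits files (`.con`) and ignore files (`.ign`) are deliberately left out.

```python
from pathlib import Path

# Assumed name of the per-fragment metadata file; this section does not state it.
FRAGMENT_METADATA_FILE = "__fragment_metadata.tdb"


def is_fragment_committed(array_dir: Path, fragment_name: str, format_version: int) -> bool:
    """Return True if the fragment counts as committed under the rules above."""
    if format_version >= 12:
        # New hierarchy: an empty <timestamped_name>.wrt file inside __commits.
        return (array_dir / "__commits" / f"{fragment_name}.wrt").is_file()
    if format_version >= 5:
        # Legacy hierarchy: <timestamped_name>.ok directly in the array folder.
        return (array_dir / f"{fragment_name}.ok").is_file()
    # Before version 5 no commit file was written, so the presence of the
    # fragment metadata file inside the legacy fragment folder decides.
    return (array_dir / fragment_name / FRAGMENT_METADATA_FILE).is_file()
```

Because an array may hold data in both hierarchies at once, a reader would apply a check like this per fragment rather than once per array.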

format_spec/array_format_history.md

Lines changed: 7 additions & 7 deletions
@@ -1,8 +1,8 @@
11
---
2-
title: Format version history
2+
title: Array format version history
33
---
44

5-
# Format Version History
5+
# Array Format Version History
66

77
## Version 22
88

@@ -24,7 +24,7 @@ Introduced in TileDB 2.19
2424
Introduced in TileDB 2.17
2525

2626
* Arrays can have [enumerations](./enumeration.md).
27-
* The bit-width reduction and positive delta filters are supported on data of date or time types.
27+
* The bit-width reduction and positive delta encoding filters are supported on data of date or time types.
2828
* The [filter pipeline options](./filter_pipeline.md#filter-options) for the double-delta filter contain the _Reinterpret datatype_ field.
2929

3030
## Version 19
@@ -45,7 +45,7 @@ Introduced in TileDB 2.15
4545
Introduced in TileDB 2.14
4646

4747
* The _Order_ field was added to [attributes](./array_schema.md#attribute).
48-
* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary filter exists in the filter pipeline. They are instead encoded as part of the data tile.
48+
* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary encoding filter exists in the filter pipeline. They are instead encoded as part of the data tile.
4949

5050
## Version 16
5151

@@ -72,7 +72,7 @@ Introduced in TileDB 2.10
7272

7373
Introduced in TileDB 2.9
7474

75-
* The [dictionary filter](./filters/dictionary_encoding.md) was added.
75+
* Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the dictionary encoding filter exists in the filter pipeline. They are instead encoded as part of the data tile.
7676

7777
## Version 12
7878

@@ -86,7 +86,7 @@ Introduced in TileDB 2.8
8686

8787
Introduced in TileDB 2.7
8888

89-
* Fragment metadata contain [metadata](./fragment.md#tile-mins-maxes) (min/max value, sum, null count) for each tile.
89+
* Fragment metadata contain [metadata](./fragment.md#tile-mins-maxes) (min/max value, sum, null count) for data in the whole fragment and each tile.
9090
* The TileDB implementation has been updated to never split cells when storing them in chunks.
9191

9292
## Version 10
@@ -154,7 +154,7 @@ Introduced in TileDB 1.6
154154
* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added.
155155
* The _Bounding coords_ field was removed.
156156
* The _MBRs_ field was removed. MBRs are now stored in the R-Tree.
157-
* Structures other than the footer like tile offsets, sizes and metadata are wrapped in their own generic tiles. This allows loading them lazily and in parallel.
157+
* Tile offsets and sizes are wrapped in their own generic tiles. This allows loading them lazily and in parallel.
158158

159159
## Version 2
160160

format_spec/array_schema.md

Lines changed: 43 additions & 54 deletions
@@ -2,75 +2,48 @@
22
title: Array Schema
33
---
44

5-
## Current Array Schema Version
6-
7-
The current array schema version(`>=10`) is a folder called `__schema` located here:
8-
9-
```
10-
my_array # array folder
11-
| ...
12-
|_ __schema # array schema folder
13-
|_ <timestamped_name> # array schema file
14-
|_ ...
15-
```
16-
17-
The array schema folder can contain:
18-
19-
* Any number of [array schema files](#array-schema-file) with name [`<timestamped_name>`](./timestamped_name.md).
20-
* Note: the name does _not_ include the format version.
21-
22-
## Previous Array Schema Version
23-
24-
The previous array schema version(`<=9`) has a file named `__array_schema.tdb` and is located here:
25-
26-
```
27-
my_array # array folder
28-
|_ ....
29-
|_ __array_schema.tdb # array schema file
30-
|_ ...
31-
```
32-
335
## Array Schema File
346

357
The array schema file consists of a single [generic tile](./generic_tile.md), with the following data:
368

379
| **Field** | **Type** | **Description** |
3810
| :--- | :--- | :--- |
39-
| Array version | `uint32_t` | Format version number of the array schema |
40-
| Allows dups | `bool` | Whether or not the array allows duplicate cells |
11+
| Array version | `uint32_t` | [Format version](./array_format_history.md) number of the array schema |
12+
| Allows dups | `bool` | _New in version 5_ Whether or not the array allows duplicate cells |
4113
| Array type | `uint8_t` | Dense or sparse |
4214
| Tile order | `uint8_t` | Row or column major |
4315
| Cell order | `uint8_t` | Row or column major |
4416
| Capacity | `uint64_t` | For sparse fragments, the data tile capacity |
4517
| Coords filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used as default for coordinate tiles |
4618
| Offsets filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used for cell var-len offset tiles |
47-
| Validity filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used for cell validity tiles |
19+
| Validity filters | [Filter Pipeline](./filter_pipeline.md) | _New in version 7_ The filter pipeline used for cell validity tiles |
4820
| Domain | [Domain](#domain) | The array domain |
4921
| Num attributes | `uint32_t` | Number of attributes in the array |
5022
| Attribute 1 | [Attribute](#attribute) | First attribute |
5123
||||
5224
| Attribute N | [Attribute](#attribute) | Nth attribute |
53-
| Num labels | `uint32_t` | Number of dimension labels in the array |
54-
| Label 1 | [Dimension Label](#dimension_label) | First dimension label |
25+
| Num labels | `uint32_t` | _New in version 18_ Number of dimension labels in the array |
26+
| Label 1 | [Dimension Label](#dimension_label) | _New in version 18_ First dimension label |
5527
||||
56-
| Label N | [Dimension Label](#dimension_label) | Nth dimension label |
57-
| Num enumerations | `uint32_t` | Number of [enumerations](./enumeration.md) in the array |
58-
| Enumeration name length 1 | `uint32_t` | The number of characters in the enumeration 1 name |
59-
| Enumeration name 1 | `uint8_t[]` | The name of enumeration 1 |
60-
| Enumeration filename length 1 | `uint32_t` | The number of characters in the enumeration 1 file |
61-
| Enumeration filename 1 | `uint8_t[]` | The name of the file in the `__enumerations` subdirectory that conatins enumeration 1's data |
62-
| Enumeration name length N | `uint32_t` | The number of characters in the enumeration N name |
63-
| Enumeration name N | `uint8_t[]` | The name of enumeration N |
64-
| Enumeration filename length N | `uint32_t` | The number of characters in the enumeration N file |
65-
| Enumeration filename N | `uint8_t[]` | The name of the file in the `__enumerations` subdirectory that conatins enumeration N's data |
66-
| CurrentDomain | [CurrentDomain](./current_domain.md) | The array current domain |
28+
| Label N | [Dimension Label](#dimension_label) | _New in version 18_ Nth dimension label |
29+
| Num enumerations | `uint32_t` | _New in version 20_ Number of [enumerations](./enumeration.md) in the array |
30+
| Enumeration name length 1 | `uint32_t` | _New in version 20_ The number of characters in the enumeration 1 name |
31+
| Enumeration name 1 | `uint8_t[]` | _New in version 20_ The name of enumeration 1 |
32+
| Enumeration filename length 1 | `uint32_t` | _New in version 20_ The number of characters in the enumeration 1 file |
33+
| Enumeration filename 1 | `uint8_t[]` | _New in version 20_ The name of the file in the `__enumerations` subdirectory that contains enumeration 1's data |
34+
| Enumeration name length N | `uint32_t` | _New in version 20_ The number of characters in the enumeration N name |
35+
| Enumeration name N | `uint8_t[]` | _New in version 20_ The name of enumeration N |
36+
| Enumeration filename length N | `uint32_t` | _New in version 20_ The number of characters in the enumeration N file |
37+
| Enumeration filename N | `uint8_t[]` | _New in version 20_ The name of the file in the `__enumerations` subdirectory that contains enumeration N's data |
38+
| Current domain | [Current Domain](#current-domain) | _New in version 22_ The array's current domain |
6739

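As a rough illustration of how the version annotations in the table above affect parsing, the following sketch reads only the leading fixed-size fields of an array schema. It assumes the generic tile has already been decoded into a byte string, that values are little-endian, and that `bool` occupies one byte; `parse_schema_prefix` is a hypothetical helper, not part of any TileDB API.

```python
import struct


def parse_schema_prefix(payload: bytes) -> dict:
    """Parse the fields from 'Array version' through 'Capacity'."""
    offset = 0
    (version,) = struct.unpack_from("<I", payload, offset)
    offset += 4

    # 'Allows dups' only exists in schemas written with format version 5 or later.
    allows_dups = None
    if version >= 5:
        (flag,) = struct.unpack_from("<B", payload, offset)
        offset += 1
        allows_dups = bool(flag)

    array_type, tile_order, cell_order = struct.unpack_from("<BBB", payload, offset)
    offset += 3
    (capacity,) = struct.unpack_from("<Q", payload, offset)
    offset += 8

    return {
        "version": version,
        "allows_dups": allows_dups,
        "array_type": array_type,
        "tile_order": tile_order,
        "cell_order": cell_order,
        "capacity": capacity,
        "next_offset": offset,  # where the coords filter pipeline would start
    }
```

The remaining fields (filter pipelines, domain, attributes) are variable-length and would each need their own parser that continues from `next_offset`.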
6840
## Domain
6941

7042
The domain has internal format:
7143

7244
| **Field** | **Type** | **Description** |
7345
| :--- | :--- | :--- |
46+
| Domain datatype | `uint8_t` | _Removed in version 5_ Datatype of all dimensions |
7447
| Num dimensions | `uint32_t` | Dimensionality/rank of the domain |
7548
| Dimension 1 | [Dimension](#dimension) | First dimension |
7649
||||
@@ -84,14 +57,17 @@ The dimension has internal format:
8457
| :--- | :--- | :--- |
8558
| Dimension name length | `uint32_t` | Number of characters in dimension name |
8659
| Dimension name | `uint8_t[]` | Dimension name character array |
87-
| Dimension datatype | `uint8_t` | Datatype of the coordinate values |
88-
| Cell val num | `uint32_t` | Number of coordinate values per cell. For variable-length dimensions, this is `std::numeric_limits<uint32_t>::max()` |
89-
| Filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used on coordinate value tiles |
90-
| Domain size | `uint64_t[]` | The domain size in bytes |
60+
| Dimension datatype | `uint8_t` | _New in version 5_ Datatype of the coordinate values |
61+
| Cell val num | `uint32_t` | _New in version 5_ Number of coordinate values per cell. For variable-length dimensions, this is `std::numeric_limits<uint32_t>::max()` |
62+
| Filters | [Filter Pipeline](./filter_pipeline.md) | _New in version 5_ The filter pipeline used on coordinate value tiles |
63+
| Domain size | `uint64_t` | _New in version 5_ The domain size in bytes |
9164
| Domain | `uint8_t[]` | Byte array of length equal to domain size above, storing the min, max values of the dimension. |
9265
| Null tile extent | `uint8_t` | `1` if the dimension has a null tile extent, else `0`. |
9366
| Tile extent | `uint8_t[]` | Byte array of length equal to the dimension datatype size, storing the space tile extent of this dimension. |
9467

68+
> [!NOTE]
69+
> Prior to version 5, the size of the _Domain_ field was always equal to twice the size of the dimension's data type (which is stored in the [domain](#domain) in these versions).
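
The difference described in the note above, an explicit _Domain size_ field from version 5 onward versus an implied size in earlier versions, might be handled as in this sketch. `read_dimension_domain` is a hypothetical helper; it assumes little-endian data and that the caller already knows the datatype size in bytes (taken from the dimension itself in version 5 and later, or from the domain's datatype before that).

```python
import struct


def read_dimension_domain(buf: bytes, offset: int, version: int, datatype_size: int):
    """Return (domain_bytes, next_offset) for one dimension's Domain field."""
    if version >= 5:
        # The size is stored explicitly as a uint64_t just before the domain bytes.
        (size,) = struct.unpack_from("<Q", buf, offset)
        offset += 8
    else:
        # No size field: the domain always holds exactly one min and one max value.
        size = 2 * datatype_size
    domain = buf[offset:offset + size]
    if len(domain) != size:
        raise ValueError("buffer too short for dimension domain")
    return domain, offset + size
```

For a fixed-size integer dimension the returned bytes can then be unpacked as a `(min, max)` pair, for example with `struct.unpack("<ii", domain)` for a 32-bit signed type.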
70+
9571
## Attribute
9672

9773
The attribute has internal format:
@@ -103,11 +79,11 @@ The attribute has internal format:
10379
| Attribute datatype | `uint8_t` | Datatype of the attribute values |
10480
| Cell val num | `uint32_t` | Number of attribute values per cell. For variable-length attributes, this is `std::numeric_limits<uint32_t>::max()` |
10581
| Filters | [Filter Pipeline](./filter_pipeline.md) | The filter pipeline used on attribute value tiles |
106-
| Fill value size | `uint64_t` | The size in bytes of the fill value |
107-
| Fill value | `uint8_t[]` | The fill value |
108-
| Nullable | `bool` | Whether or not the attribute can be null |
109-
| Fill value validity | `uint8_t` | The validity fill value |
110-
| Order | `uint8_t` | Order of the data stored in the attribute. This may be unordered, increasing or decreasing |
82+
| Fill value size | `uint64_t` | _New in version 6_ The size in bytes of the fill value |
83+
| Fill value | `uint8_t[]` | _New in version 6_ The fill value |
84+
| Nullable | `bool` | _New in version 7_ Whether or not the attribute can be null |
85+
| Fill value validity | `uint8_t` | _New in version 7_ The validity fill value |
86+
| Order | `uint8_t` | _New in version 17_ Order of the data stored in the attribute. This may be unordered, increasing or decreasing |
11187

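The trailing attribute fields in the table above were introduced in versions 6, 7 and 17, so a reader has to gate each of them on the schema's format version. The sketch below shows one way to do that; `parse_attribute_tail` is a hypothetical helper that starts at the byte following the attribute's filter pipeline, and it assumes little-endian data and a one-byte `bool`.

```python
import struct


def parse_attribute_tail(buf: bytes, offset: int, version: int):
    """Parse the attribute fields that follow the filter pipeline."""
    fields = {
        "fill_value": None,
        "nullable": None,
        "fill_value_validity": None,
        "order": None,
    }
    if version >= 6:
        # Fill value size (uint64_t) followed by that many bytes of fill value.
        (size,) = struct.unpack_from("<Q", buf, offset)
        offset += 8
        fields["fill_value"] = buf[offset:offset + size]
        offset += size
    if version >= 7:
        # Nullable flag and the validity fill value, one byte each.
        nullable, validity = struct.unpack_from("<BB", buf, offset)
        offset += 2
        fields["nullable"] = bool(nullable)
        fields["fill_value_validity"] = validity
    if version >= 17:
        # Data order (unordered, increasing or decreasing) as a single byte.
        (fields["order"],) = struct.unpack_from("<B", buf, offset)
        offset += 1
    return fields, offset
```

Fields that do not exist in the given version are returned as `None`, which keeps "absent" distinguishable from a stored zero.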
11288
## Dimension Label
11389

@@ -127,6 +103,19 @@ The dimension label has internal format:
127103
| Label datatype | `uint8_t` | The datatype of the label data |
128104
| Label cell_val_num | `uint32_t` | The number of values per cell of the label data. For variable-length labels, this is `std::numeric_limits<uint32_t>::max()` |
129105
| Label domain size | `uint64_t` | The size of the label domain |
130-
| Label domain start size | `uint64_t` | The size of the first value of the domain for variable-lenght datatypes. For fixed-lenght labels, this is 0|
106+
| Label domain start size | `uint64_t` | The size of the first value of the domain for variable-length datatypes. For fixed-length labels, this is 0|
131107
| Label domain data | `uint8_t[]`| Byte array of length equal to domain size above, storing the min, max values of the dimension |
132108
| Is external | `uint8_t` | If the URI is not stored as part of this array |
109+
110+
## Current Domain
111+
112+
If a current domain is empty, only the version number and the empty flag are serialized to storage.
113+
114+
The current domain format is versioned separately from arrays. The current version is `1`.
115+
116+
| **Field** | **Type** | **Description** |
117+
| :--- | :--- | :--- |
118+
| Version number | `uint32_t` | Current domain version number |
119+
| Empty | `uint8_t` | Whether the current domain has a representation (e.g. NDRectangle) set |
120+
| Type | `uint8_t` | The type of current domain stored in this file |
121+
| NDRectangle | [MBR](./fragment.md#mbr) | A hyperrectangle defined using [1DRange](./fragment.md#mbr) items for each dimension |

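The rule that an empty current domain serializes only the version number and the empty flag can be illustrated with a short header parser. This is a sketch under assumptions: `parse_current_domain_header` is hypothetical, data is little-endian, a nonzero _Empty_ byte is taken to mean "empty", and the NDRectangle payload is not decoded because its layout is defined elsewhere in the specification.

```python
import struct


def parse_current_domain_header(buf: bytes, offset: int = 0):
    """Return (header_dict, next_offset) for a serialized current domain."""
    # Version number (uint32_t) followed by the empty flag (uint8_t).
    version, empty = struct.unpack_from("<IB", buf, offset)
    offset += 5
    if empty:
        # Nothing else is stored for an empty current domain.
        return {"version": version, "empty": True, "type": None}, offset
    # Otherwise the type byte follows, then the NDRectangle (not parsed here).
    (kind,) = struct.unpack_from("<B", buf, offset)
    offset += 1
    return {"version": version, "empty": False, "type": kind}, offset
```

For a non-empty current domain, the returned offset points at the start of the NDRectangle data.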