Commit a9c66d3

Cleaned up documentation for the HDF5 formats.
1 parent 60726c1 commit a9c66d3

2 files changed, +22 -9 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ In contrast, JSON is easier to parse and has less storage overhead per list elem
 Both the HDF5 and JSON specifications have multiple versions.
 Links to the version-specific HDF5 specifications are listed below, along with the minimum version of the C++ library required to parse them:
 
+- [1.4](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.4.md), supported by **uzuki2** version ≥ 1.5.
 - [1.3](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.3.md), supported by **uzuki2** version ≥ 1.3.
 - [1.2](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.2.md), supported by **uzuki2** version ≥ 1.2.
 - [1.1](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.1.md), supported by **uzuki2** version ≥ 1.1.

docs/specifications/hdf5.Rmd

Lines changed: 21 additions & 9 deletions
@@ -126,11 +126,11 @@ The atomic vector's group may also contain `**/names`, a 1-dimensional string da
 This should use a datatype that can be represented by a UTF-8 encoded string.
 If `**/data` is a scalar, `**/names` should have length 1.
 
-### Representing missing values
+#### Representing missing values
 
 ```{r, echo=FALSE, results="asis"}
 if (.version >= package_version("1.1")) {
-cat('Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
+cat('The `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
 If present, this should be a scalar dataset that specifies the placeholder for missing values.
 Any value of `**/data` that is equal to this placeholder should be treated as missing.
 If no such attribute is present, it can be assumed that there are no missing values.')
@@ -171,7 +171,7 @@ If no such attribute is present, it can be assumed that there are no missing val
 
 ```{r, echo=FALSE, results="asis"}
 if (.version >= package_version("1.3")) {
-cat("Check out the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0). for more details.")
+cat("Check out the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.")
 }
 ```
 
@@ -191,20 +191,31 @@ if (.version == package_version("1.0")) {
 This should use a datatype that can be represented by a UTF-8 encoded string.
 
 The group should contain an 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
-This should use a HDF5 integer datatype that can be represented by a 32-bit signed integer.
-(Admittedly, this should have been an unsigned integer, but we started with a signed integer and we'll just keep it so for back-compatibility.)
-Missing values are represented as described above for atomic vectors.
+Vectors of length 1 may also be represented as a scalar dataset.
+(While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.)
+
+The `**/data` dataset should use a HDF5 integer datatype that can be represented by a 32-bit signed integer.
+Admittedly, this should have been an unsigned integer, but we started with a signed integer and we'll just keep it so for back-compatibility.
 
 The group should contain `**/levels`, a 1-dimensional string dataset that contains the levels for the indices in `**/data`.
 This should use a datatype that can be represented by a UTF-8 encoded string.
 Values in `**/levels` should be unique.
 
-Values in `**/data` should be non-negative (missing values excepted) and less than the length of `**/levels`.
+Values in `**/data` should be non-negative (missing value placeholders excepted) and less than the length of `**/levels`.
 Note that the datatype constraints on `**/data` suggest that there should not be more than 2147483647 levels,
 as beyond that, the levels cannot be indexed by elements of `**/data`.
 
+Missing values in the factor are represented by a placeholder, as described [above](representing-missing-values) for atomic integer vectors.
+```{r, echo=FALSE, results="asis"}
+if (.version >= package_version("1.1")) {
+cat('Specifically, the `**/data` dataset may contain an optional `missing-value-placeholder` attribute,
+which contains the placeholder used to represent missing values inside `**/data`.')
+}
+```
+
 The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`.
 This should use a datatype that can be represented by a UTF-8 encoded string.
+If `**/data` is a scalar, `**/names` should have length 1.
 
 ```{r, echo=FALSE, results="asis"}
 if (.version == package_version("1.1")) {
@@ -226,9 +237,10 @@ This is represented as a HDF5 group (`**/`) with the following attributes:
 
 This group should contain the `pointers` and `heap` datasets.
 
-- The `**/data` dataset should be a 1-dimensional or scalar dataset of a compound datatype of 2 members, `"offset"` and `"length"`.
+- The `**/data` dataset should be a 1-dimensional dataset of a compound datatype of 2 members, `"offset"` and `"length"`.
 Each member should be of a datatype that can be represented by an unsigned 64-bit integer.
-If the dataset is scalar, the length of the VLS array is defined as 1.
+Arrays of length 1 may also be represented as a scalar dataset.
+(While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.)
 - The `**/heap` dataset should be a 1-dimensional dataset of unsigned 8-bit integers.
 
 Each entry of `**/data` refers to a slice `[offset, offset + length)` of the `**/heap` dataset.
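
For readers of the last hunk, the half-open slice convention `[offset, offset + length)` can be illustrated with a minimal Python sketch. It is not taken from **uzuki2**: the function name `decode_vls` is hypothetical, plain Python tuples and `bytes` stand in for the `**/data` and `**/heap` HDF5 datasets, and both the bounds check and the UTF-8 decoding are assumptions made for illustration.

```python
def decode_vls(data, heap):
    """Decode (offset, length) pairs into strings.

    `data` stands in for the compound `**/data` dataset and `heap` for the
    unsigned 8-bit `**/heap` dataset; both are plain Python objects here.
    """
    strings = []
    for offset, length in data:
        end = offset + length  # half-open slice [offset, offset + length)
        # Assumed validation: reject slices that fall outside the heap.
        if offset < 0 or length < 0 or end > len(heap):
            raise ValueError(f"slice [{offset}, {end}) is outside the heap")
        # Assumption: each slice holds UTF-8 encoded bytes.
        strings.append(bytes(heap[offset:end]).decode("utf-8"))
    return strings


# Three strings packed back to back into one byte heap.
heap = b"alphabetagamma"
data = [(0, 5), (5, 4), (9, 5)]
strings = decode_vls(data, heap)
```

Under these assumptions, a length-1 array stored as a scalar dataset would correspond to a single `(offset, length)` pair, so a caller could wrap it in a one-element list before decoding.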
