Cleaned up documentation for the HDF5 formats.

LTLA · LTLA · commit a9c66d3caf98 · 2025-03-01T21:07:52.000-08:00
diff --git a/README.md b/README.md
@@ -21,6 +21,7 @@ In contrast, JSON is easier to parse and has less storage overhead per list elem
 Both the HDF5 and JSON specifications have multiple versions. 
 Links to the version-specific HDF5 specifications are listed below, along with the minimum version of the C++ library required to parse them:
 
+- [1.4](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.4.md), supported by **uzuki2** version ≥ 1.5.
 - [1.3](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.3.md), supported by **uzuki2** version ≥ 1.3.
 - [1.2](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.2.md), supported by **uzuki2** version ≥ 1.2.
 - [1.1](https://github.com/ArtifactDB/uzuki2/tree/gh-pages/docs/specifications/hdf5-1.1.md), supported by **uzuki2** version ≥ 1.1.
diff --git a/docs/specifications/hdf5.Rmd b/docs/specifications/hdf5.Rmd
@@ -126,11 +126,11 @@ The atomic vector's group may also contain `**/names`, a 1-dimensional string da
 This should use a datatype that can be represented by a UTF-8 encoded string.
 If `**/data` is a scalar, `**/names` should have length 1.
 
-### Representing missing values
+#### Representing missing values
 
 ```{r, echo=FALSE, results="asis"}
 if (.version >= package_version("1.1")) {
-    cat('Each `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
+    cat('The `**/data` dataset may optionally contain a `missing-value-placeholder` attribute.
 If present, this should be a scalar dataset that specifies the placeholder for missing values.
 Any value of `**/data` that is equal to this placeholder should be treated as missing.
 If no such attribute is present, it can be assumed that there are no missing values.')
@@ -171,7 +171,7 @@ If no such attribute is present, it can be assumed that there are no missing val
 
 ```{r, echo=FALSE, results="asis"}
 if (.version >= package_version("1.3")) {
-    cat("Check out the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0). for more details.")
+    cat("Check out the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.")
 }
 ```
 
@@ -191,20 +191,31 @@ if (.version == package_version("1.0")) {
   This should use a datatype that can be represented by a UTF-8 encoded string.
 
 The group should contain an 1-dimensional dataset at `**/data`, containing 0-based indices into the levels.
-This should use a HDF5 integer datatype that can be represented by a 32-bit signed integer.
-(Admittedly, this should have been an unsigned integer, but we started with a signed integer and we'll just keep it so for back-compatibility.)
-Missing values are represented as described above for atomic vectors.
+Vectors of length 1 may also be represented as a scalar dataset.
+(While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.)
+
+The `**/data` dataset should use a HDF5 integer datatype that can be represented by a 32-bit signed integer.
+Admittedly, this should have been an unsigned integer, but we started with a signed integer and we'll just keep it so for back-compatibility.
 
 The group should contain `**/levels`, a 1-dimensional string dataset that contains the levels for the indices in `**/data`.
 This should use a datatype that can be represented by a UTF-8 encoded string.
 Values in `**/levels` should be unique.
 
-Values in `**/data` should be non-negative (missing values excepted) and less than the length of `**/levels`.
+Values in `**/data` should be non-negative (missing value placeholders excepted) and less than the length of `**/levels`.
 Note that the datatype constraints on `**/data` suggest that there should not be more than 2147483647 levels,
 as beyond that, the levels cannot be indexed by elements of `**/data`.
 
+Missing values in the factor are represented by a placeholder, as described [above](#representing-missing-values) for atomic integer vectors.
+```{r, echo=FALSE, results="asis"}
+if (.version >= package_version("1.1")) {
+    cat('Specifically, the `**/data` dataset may contain an optional `missing-value-placeholder` attribute,
+which contains the placeholder used to represent missing values inside `**/data`.')
+}
+```
+
 The group may also contain `**/names`, a 1-dimensional string dataset of length equal to `data`.
 This should use a datatype that can be represented by a UTF-8 encoded string.
+If `**/data` is a scalar, `**/names` should have length 1.
 
 ```{r, echo=FALSE, results="asis"}
 if (.version == package_version("1.1")) {
@@ -226,9 +237,10 @@ This is represented as a HDF5 group (`**/`) with the following attributes:
 
 This group should contain the `pointers` and `heap` datasets.
 
-- The `**/data` dataset should be a 1-dimensional or scalar dataset of a compound datatype of 2 members, `"offset"` and `"length"`.
+- The `**/data` dataset should be a 1-dimensional dataset of a compound datatype of 2 members, `"offset"` and `"length"`.
   Each member should be of a datatype that can be represented by an unsigned 64-bit integer.
-  If the dataset is scalar, the length of the VLS array is defined as 1.
+  Arrays of length 1 may also be represented as a scalar dataset.
+  (While R makes no distinction between scalars and length-1 vectors, this may be useful for other frameworks where this difference is relevant.)
 - The `**/heap` dataset should be a 1-dimensional dataset of unsigned 8-bit integers.
 
 Each entry of `**/data` refers to a slice `[offset, offset + length)` of the `**/heap` dataset.
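As an illustration of the slice rule in the last hunk, here is a minimal Python sketch that is not part of this commit or of the **uzuki2** library. It resolves `(offset, length)` pairs against a byte heap held in memory rather than reading an actual HDF5 file. The function name `resolve_vls`, the sample data, and the choice to decode each slice as UTF-8 are assumptions made for the example.

```python
def resolve_vls(pointers, heap, encoding="utf-8"):
    """Resolve (offset, length) pairs into strings taken from a byte heap.

    `pointers` stands in for the compound `**/data` dataset and `heap` for
    the `**/heap` dataset of unsigned 8-bit integers. Both are plain Python
    objects here; decoding as UTF-8 is an assumption of this sketch.
    """
    strings = []
    for offset, length in pointers:
        end = offset + length  # half-open slice [offset, offset + length)
        if offset < 0 or length < 0 or end > len(heap):
            raise ValueError(
                f"slice [{offset}, {end}) is out of range for a heap of {len(heap)} bytes"
            )
        strings.append(heap[offset:end].decode(encoding))
    return strings


# Hypothetical data: three strings packed back to back, plus an empty slice.
heap = b"foobarwhee"
pointers = [(0, 3), (3, 3), (6, 4), (6, 0)]
decoded = resolve_vls(pointers, heap)
print(decoded)
```

A scalar `**/data` dataset would correspond to a `pointers` list holding a single pair, which resolves to a single string.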