Commit cc5fae9

Align with the new HDF5 policy for missing values, datatypes. (#13)
In version 1.3, number vectors are now allowed to be represented as HDF5 integer datatypes, provided the integer can be fully represented by a double-precision float (i.e., 32-bit or lower). If a missing value placeholder for a number is NaN, all NaNs are to be considered missing. We no longer consider the NaN payloads. Also updated the various ritsuko calls for better error messages.
1 parent 45dbccc commit cc5fae9

8 files changed: +234 -133 lines changed


CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 cmake_minimum_required(VERSION 3.24)

 project(uzuki2
-    VERSION 1.2.0
+    VERSION 1.3.0
     DESCRIPTION "Storing simple R lists inside HDF5 or JSON"
     LANGUAGES CXX)

README.md

Lines changed: 15 additions & 6 deletions
@@ -26,7 +26,7 @@ All objects should be nested inside an R list.

 The top-level group may have a `uzuki_version` attribute, describing the version of the **uzuki2** specification that it uses.
 This should be a scalar string dataset of the form `X.Y` for non-negative integers `X` and `Y`.
-The latest version of this specification is **1.2**; if not provided, it is assumed to be **1.0**.
+The latest version of this specification is **1.3**; if not provided, it is assumed to be **1.0**.
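
The `uzuki_version` rule above (an `X.Y` string, defaulting to 1.0 when absent) can be sketched roughly as follows. This is only an illustration: `parse_version` and `version_or_default` are hypothetical names and are not taken from the uzuki2 headers, whose actual parsing code is not shown in this diff.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of uzuki2): parse "X.Y" into major/minor.
// Succeeds only if the string is two runs of digits separated by one period.
// Overflow from very long digit runs is not handled in this sketch.
inline bool parse_version(const std::string& s, int& major, int& minor) {
    const std::size_t dot = s.find('.');
    if (dot == std::string::npos || dot == 0 || dot + 1 == s.size()) {
        return false;
    }
    int parts[2] = { 0, 0 };
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i == dot) {
            continue;
        }
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) {
            return false; // also rejects a second period or a sign character
        }
        int& current = parts[i > dot ? 1 : 0];
        current = current * 10 + (s[i] - '0');
    }
    major = parts[0];
    minor = parts[1];
    return true;
}

// A missing attribute (nullptr here) is treated as version 1.0.
inline std::pair<int, int> version_or_default(const std::string* attr) {
    if (attr == nullptr) {
        return std::make_pair(1, 0);
    }
    int major = 0, minor = 0;
    if (!parse_version(*attr, major, minor)) {
        throw std::runtime_error("'uzuki_version' should have the form 'X.Y'");
    }
    return std::make_pair(major, minor);
}
```

Because `std::pair` compares lexicographically, a check such as "version >= 1.3" can then be written as `version_or_default(attr) >= std::make_pair(1, 3)`.
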

 ### Lists

@@ -56,7 +56,10 @@ The allowed HDF5 datatype depends on `uzuki_type`:

 - `"integer"`, `"boolean"`: any type of `H5T_INTEGER` that can be represented by a 32-bit signed integer.
   Note that the converse is not required, i.e., the storage type does not need to be 32-bit if no such values are present in the dataset.
-- `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
+- **(for version < 1.3)** `"number"`: any type of `H5T_FLOAT` that can be represented by a double-precision float.
+- **(for version >= 1.3)** `"number"`: any type of `H5T_FLOAT` or `H5T_INTEGER` that can be represented exactly by a double-precision (64-bit) float.
+  This implies a limit of 32 bits for any integer datatype.
+  See also the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
 - `"string"`: any type of `H5T_STRING` that can be represented by a UTF-8 encoded string.
 - **(for version 1.0)** `"date"`: any type of `H5T_STRING` where the strings are in the `YYYY-MM-DD` format, or are equal to a missing placeholder value.
 - **(for version 1.0)** `"date-time"`: any type of `H5T_STRING` where the strings are in the Internet Date/Time format, or are equal to a missing placeholder value.
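
The 32-bit limit for integer-backed `"number"` vectors follows from the 53-bit significand of an IEEE 754 double: every value of an integer type is stored exactly only if its magnitude bits fit in those 53 bits. The sketch below shows that arithmetic. `fits_in_double` is a hypothetical name, not a function from uzuki2 or ritsuko.

```cpp
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE 754 doubles");

// Hypothetical check (not uzuki2's API): can every value of an integer type
// with the given width be stored exactly in a double?
constexpr bool fits_in_double(int bits, bool is_signed) {
    // A signed type spends one bit on the sign, leaving (bits - 1) magnitude bits.
    // std::numeric_limits<double>::digits is the significand width (53 for IEEE 754).
    return (is_signed ? bits - 1 : bits) <= std::numeric_limits<double>::digits;
}

static_assert(fits_in_double(32, true), "int32 fits");
static_assert(fits_in_double(32, false), "uint32 fits");
static_assert(!fits_in_double(64, true), "int64 does not fit");
static_assert(!fits_in_double(64, false), "uint64 does not fit");
```

Among the usual 8, 16, 32 and 64-bit widths, 32 bits is therefore the largest that passes. The first integer a double cannot hold is 2^53 + 1, which rounds to 2^53 on conversion.
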
@@ -89,13 +92,19 @@ it is expected that any comparison between the placeholder and strings in `**/da
 **(for version == 1.1)**
 The data type of the placeholder attribute should have the same data type class as `**/data`.

-**(for version >= 1.1)**
-Floating point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
+**(for version >= 1.3)**
+Floating-point missingness should be identified using the equality operator when both the placeholder and data values are loaded into memory as IEEE754-compliant `double`s.
+No casting should be performed to a lower-precision type, as this may cause a non-missing value to become equal to the placeholder.
+If the placeholder is NaN, all NaNs in the dataset should be considered missing, regardless of the exact bit representation in the NaN payload.
+See the [HDF5 policy draft (v0.1.0)](https://github.com/ArtifactDB/Bioc-HDF5-policy/tree/v0.1.0) for more details.
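
A minimal sketch of the version >= 1.3 rule, assuming both the data value and the placeholder have already been loaded as `double`. The function name `is_missing` is hypothetical and is not taken from `parse_hdf5.hpp`, whose diff is not rendered on this page.

```cpp
#include <cmath>

// Hypothetical helper illustrating the version >= 1.3 rule; not uzuki2's API.
inline bool is_missing(double value, double placeholder) {
    if (std::isnan(placeholder)) {
        // A NaN placeholder matches every NaN; the payload is ignored.
        return std::isnan(value);
    }
    // Otherwise, plain equality on doubles. NaN values never compare equal here.
    return value == placeholder;
}
```

This matches the behaviour exercised by the updated tests in `tests/src/number.cpp`: with a NaN placeholder, a quiet NaN in the data is also reported as missing, while with an infinite placeholder the NaN is left alone.
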
+
+**(for version >= 1.1, < 1.3)**
+Floating-point missingness may be encoded in the payload of an NaN, which distinguishes it from a non-missing "not-a-number" value.
 Comparisons on NaN placeholders should be performed in a bytewise manner (e.g., with `memcmp`) to ensure that the payload is taken into account.

 **(for version 1.0)**
-Integer or boolean values of -2147483648 were treated as missing.
-Missing floats were represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
+Integer or boolean values of -2147483648 are treated as missing.
+Missing floats are represented by [R's NA representation](https://github.com/wch/r-source/blob/869e0f734dc4971c420cf417f5e0d18c0974a5af/src/main/arithmetic.c#L90-L98).
 For strings, each `**/data` dataset may contain a `missing-value-placeholder` attribute.
 If present, this should be a scalar string dataset that specifies the placeholder for missing values.
 Any value of `**/data` that is equal to this placeholder should be treated as missing.
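
For the version 1.0 float placeholder, R's NA is, to my understanding of the linked R source, a NaN whose upper 32-bit word is `0x7FF00000` and whose lower 32-bit word is 1954. The sketch below rebuilds that bit pattern under this assumption. `make_r_na` and `is_r_na` are hypothetical names; the tests in this commit obtain the value from `ritsuko::r_missing_value()` instead, which is not reproduced here.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical reconstruction of R's NA_real_ bit pattern (an assumption based on
// the linked R source): high word 0x7FF00000, low word 1954.
inline double make_r_na() {
    const std::uint64_t bits =
        (static_cast<std::uint64_t>(0x7FF00000u) << 32) | 1954u;
    double out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Hypothetical check: a NaN whose low 32 bits equal 1954 is treated as R's NA.
inline bool is_r_na(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return std::isnan(x) && (bits & 0xFFFFFFFFu) == 1954u;
}
```

An ordinary NaN produced by arithmetic typically has a different payload, which is how version 1.0 files could tell a missing value apart from a computed "not-a-number".
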

include/uzuki2/parse_hdf5.hpp

Lines changed: 93 additions & 80 deletions
Large diffs are not rendered by default.

tests/src/external.cpp

Lines changed: 2 additions & 2 deletions
@@ -52,7 +52,7 @@ void expect_hdf5_external_error(std::string path, std::string name, std::string
     H5::H5File file(path, H5F_ACC_RDONLY);
     EXPECT_ANY_THROW({
         try {
-            uzuki2::hdf5::validate(file.openGroup(name), name, num_expected);
+            uzuki2::hdf5::validate(file.openGroup(name), num_expected);
         } catch (std::exception& e) {
             EXPECT_THAT(e.what(), ::testing::HasSubstr(msg));
             throw;
@@ -75,7 +75,7 @@ TEST(Hdf5ExternalTest, CheckErrors) {
         auto ghandle = external_opener(handle, "foo");
         write_scalar(ghandle, "index", 0, H5::PredType::NATIVE_DOUBLE);
     }
-    expect_hdf5_external_error(path, "foo", "expected integer", 1);
+    expect_hdf5_external_error(path, "foo", "external index at 'index' cannot be represented", 1);

     {
         H5::H5File handle(path, H5F_ACC_TRUNC);

tests/src/integer.cpp

Lines changed: 34 additions & 3 deletions
@@ -57,7 +57,7 @@ TEST(Hdf5IntegerTest, SimpleLoading) {
     }
 }

-TEST(Hdf5NumberTest, BlockLoading) {
+TEST(Hdf5IntegerTest, BlockLoading) {
     auto path = "TEST-string.h5";

     // Buffer size is 10000, so we make sure we have enough values to go through a few iterations.
@@ -145,6 +145,37 @@ TEST(Hdf5IntegerTest, MissingValues) {
     }
 }

+TEST(Hdf5IntegerTest, ForbiddenType) {
+    auto path = "TEST-forbidden.h5";
+
+    {
+        H5::H5File handle(path, H5F_ACC_TRUNC);
+        auto vhandle = vector_opener(handle, "blub", "integer");
+        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_UINT32);
+    }
+    expect_hdf5_error(path, "blub", "cannot be represented");
+
+    {
+        H5::H5File handle(path, H5F_ACC_TRUNC);
+        auto vhandle = vector_opener(handle, "blub", "integer");
+        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_INT64);
+    }
+    expect_hdf5_error(path, "blub", "cannot be represented by 32-bit");
+
+    {
+        H5::H5File handle(path, H5F_ACC_TRUNC);
+        auto vhandle = vector_opener(handle, "blub", "integer");
+        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_UINT16);
+    }
+    {
+        auto parsed = load_hdf5(path, "blub");
+        EXPECT_EQ(parsed->type(), uzuki2::INTEGER);
+        auto iptr = static_cast<const DefaultIntegerVector*>(parsed.get());
+        EXPECT_EQ(iptr->base.values[0], 1);
+        EXPECT_EQ(iptr->base.values[4], 5);
+    }
+}
+
 TEST(Hdf5IntegerTest, CheckError) {
     auto path = "TEST-integer.h5";

@@ -160,15 +191,15 @@ TEST(Hdf5IntegerTest, CheckError) {
         auto ghandle = vector_opener(handle, "foo", "integer");
         create_dataset<double>(ghandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_DOUBLE);
     }
-    expect_hdf5_error(path, "foo", "expected an integer dataset at 'foo/data'");
+    expect_hdf5_error(path, "foo", "dataset cannot be represented by 32-bit");

     {
         H5::H5File handle(path, H5F_ACC_TRUNC);
         auto vhandle = vector_opener(handle, "blub", "integer");
         create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_INT);
         create_dataset(vhandle, "names", { "A", "B", "C", "D" });
     }
-    expect_hdf5_error(path, "blub", "should be equal to length");
+    expect_hdf5_error(path, "blub", "should be equal to the object length");

     {
         H5::H5File handle(path, H5F_ACC_TRUNC);

tests/src/list.cpp

Lines changed: 3 additions & 3 deletions
@@ -88,23 +88,23 @@ TEST(Hdf5ListTest, CheckError) {
         auto ghandle = list_opener(handle, "foo");
         create_dataset<int>(ghandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_INT);
     }
-    expect_hdf5_error(path, "foo", "expected a group at 'foo/data'");
+    expect_hdf5_error(path, "foo", "expected a group at 'data'");

     {
         H5::H5File handle(path, H5F_ACC_TRUNC);
         auto ghandle = list_opener(handle, "foo");
         auto dhandle = ghandle.createGroup("data");
         nothing_opener(dhandle, "1");
     }
-    expect_hdf5_error(path, "foo", "expected a group at 'foo/data/0'");
+    expect_hdf5_error(path, "foo", "expected a group at 'data/0'");

     {
         H5::H5File handle(path, H5F_ACC_TRUNC);
         auto ghandle = list_opener(handle, "foo");
         auto dhandle = ghandle.createGroup("data");
         create_dataset<int>(dhandle, "0", { 1, 2, 3 }, H5::PredType::NATIVE_INT);
     }
-    expect_hdf5_error(path, "foo", "expected a group at 'foo/data/0'");
+    expect_hdf5_error(path, "foo", "expected a group at 'data/0'");
 }


tests/src/misc.cpp

Lines changed: 0 additions & 29 deletions
@@ -41,35 +41,6 @@ TEST(Hdf5AttributeTest, CheckError) {
     expect_hdf5_error(path, "whee", "unknown vector type");
 }

-TEST(Hdf5IntegerTypeTest, Forbidden) {
-    auto path = "TEST-forbidden.h5";
-
-    {
-        H5::H5File handle(path, H5F_ACC_TRUNC);
-        auto vhandle = vector_opener(handle, "blub", "integer");
-        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_UINT32);
-    }
-    expect_hdf5_error(path, "blub", "exceeds the range");
-
-    {
-        H5::H5File handle(path, H5F_ACC_TRUNC);
-        auto vhandle = vector_opener(handle, "blub", "integer");
-        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_INT64);
-    }
-    expect_hdf5_error(path, "blub", "exceeds the range");
-
-    {
-        H5::H5File handle(path, H5F_ACC_TRUNC);
-        auto vhandle = vector_opener(handle, "blub", "integer");
-        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_UINT16);
-    }
-    auto parsed = load_hdf5(path, "blub");
-    EXPECT_EQ(parsed->type(), uzuki2::INTEGER);
-    auto iptr = static_cast<const DefaultIntegerVector*>(parsed.get());
-    EXPECT_EQ(iptr->base.values[0], 1);
-    EXPECT_EQ(iptr->base.values[4], 5);
-}
-
 class JsonFileTest : public ::testing::TestWithParam<std::tuple<int, bool> > {};

 TEST_P(JsonFileTest, Chunking) {

tests/src/number.cpp

Lines changed: 86 additions & 9 deletions
@@ -47,7 +47,7 @@ TEST(Hdf5NumberTest, SimpleLoading) {
     ********************************************/
 }

-TEST(Hdf5IntegerTest, BlockLoading) {
+TEST(Hdf5NumberTest, BlockLoading) {
     auto path = "TEST-string.h5";

     // Buffer size is 10000, so we make sure we have enough values to go through a few iterations.
@@ -87,54 +87,131 @@ TEST(Hdf5NumberTest, MissingValues) {
     auto path = "TEST-number.h5";

     auto missing = ritsuko::r_missing_value();
+    auto nan = std::numeric_limits<double>::quiet_NaN();
     EXPECT_TRUE(std::isnan(missing));

     // Old version used the missing R value.
     {
         H5::H5File handle(path, H5F_ACC_TRUNC);
         auto vhandle = vector_opener(handle, "blub", "number");
-        create_dataset<double>(vhandle, "data", { 1, 0, missing, 0, 1 }, H5::PredType::NATIVE_DOUBLE);
+        create_dataset<double>(vhandle, "data", { 1, 0, missing, 0, nan, 1 }, H5::PredType::NATIVE_DOUBLE);
     }
     {
         auto parsed = load_hdf5(path, "blub");
         EXPECT_EQ(parsed->type(), uzuki2::NUMBER);
         auto bptr = static_cast<const DefaultNumberVector*>(parsed.get());
-        EXPECT_EQ(bptr->size(), 5);
+        EXPECT_EQ(bptr->size(), 6);
         EXPECT_EQ(bptr->base.values[2], -123456789);
+        EXPECT_TRUE(std::isnan(bptr->base.values[4]));
     }

-    // This is no longer directly supported in the new version.
+    // This is no longer directly supported in the versions >= 1.1.
     {
         H5::H5File handle(path, H5F_ACC_TRUNC);
         auto vhandle = vector_opener(handle, "blub", "number");
         add_version(vhandle, "1.1");
-        create_dataset<double>(vhandle, "data", { 1, 0, missing, 0, 1 }, H5::PredType::NATIVE_DOUBLE);
+        create_dataset<double>(vhandle, "data", { 1, 0, missing, 0, nan, 1 }, H5::PredType::NATIVE_DOUBLE);
     }
     {
         auto parsed = load_hdf5(path, "blub");
         EXPECT_EQ(parsed->type(), uzuki2::NUMBER);
         auto bptr = static_cast<const DefaultNumberVector*>(parsed.get());
-        EXPECT_EQ(bptr->size(), 5);
+        EXPECT_EQ(bptr->size(), 6);
         EXPECT_TRUE(std::isnan(bptr->base.values[2]));
+        EXPECT_TRUE(std::isnan(bptr->base.values[4]));
     }

-    // Unless we specify it.
+    // Unless we specify it in version 1.1-1.2.
     {
         H5::H5File handle(path, H5F_ACC_TRUNC);
         auto vhandle = vector_opener(handle, "blub", "number");
         add_version(vhandle, "1.1");

-        auto dhandle = create_dataset<double>(vhandle, "data", { 1, 0, missing, 0, 1 }, H5::PredType::NATIVE_DOUBLE);
+        auto dhandle = create_dataset<double>(vhandle, "data", { 1, 0, missing, 0, nan, 1 }, H5::PredType::NATIVE_DOUBLE);
         auto ahandle = dhandle.createAttribute("missing-value-placeholder", H5::PredType::NATIVE_DOUBLE, H5S_SCALAR);
         ahandle.write(H5::PredType::NATIVE_DOUBLE, &missing);
     }
     {
         auto parsed = load_hdf5(path, "blub");
         EXPECT_EQ(parsed->type(), uzuki2::NUMBER);
         auto bptr = static_cast<const DefaultNumberVector*>(parsed.get());
-        EXPECT_EQ(bptr->size(), 5);
+        EXPECT_EQ(bptr->size(), 6);
         EXPECT_EQ(bptr->base.values[2], -123456789);
+        EXPECT_TRUE(std::isnan(bptr->base.values[4]));
     }
+
+    // In version 1.3, the NaN payload is now ignored, as it's too fragile.
+    // This means that all NaNs are considered to be missing if the placeholder is an NaN of any kind.
+    {
+        H5::H5File handle(path, H5F_ACC_TRUNC);
+        auto vhandle = vector_opener(handle, "blub", "number");
+        add_version(vhandle, "1.3");
+
+        auto dhandle = create_dataset<double>(vhandle, "data", { 1, 0, missing, 0, nan, 1 }, H5::PredType::NATIVE_DOUBLE);
+        auto ahandle = dhandle.createAttribute("missing-value-placeholder", H5::PredType::NATIVE_DOUBLE, H5S_SCALAR);
+        ahandle.write(H5::PredType::NATIVE_DOUBLE, &missing);
+    }
+    {
+        auto parsed = load_hdf5(path, "blub");
+        EXPECT_EQ(parsed->type(), uzuki2::NUMBER);
+        auto bptr = static_cast<const DefaultNumberVector*>(parsed.get());
+        EXPECT_EQ(bptr->size(), 6);
+        EXPECT_EQ(bptr->base.values[2], -123456789);
+        EXPECT_EQ(bptr->base.values[4], -123456789);
+    }
+
+    // Of course, non-NaN placeholders are still properly respected.
+    auto inf = std::numeric_limits<double>::infinity();
+    {
+        H5::H5File handle(path, H5F_ACC_TRUNC);
+        auto vhandle = vector_opener(handle, "blub", "number");
+        add_version(vhandle, "1.3");
+
+        auto dhandle = create_dataset<double>(vhandle, "data", { 1, 0, inf, 0, nan, 1 }, H5::PredType::NATIVE_DOUBLE);
+        auto ahandle = dhandle.createAttribute("missing-value-placeholder", H5::PredType::NATIVE_DOUBLE, H5S_SCALAR);
+        ahandle.write(H5::PredType::NATIVE_DOUBLE, &inf);
+    }
+    {
+        auto parsed = load_hdf5(path, "blub");
+        EXPECT_EQ(parsed->type(), uzuki2::NUMBER);
+        auto bptr = static_cast<const DefaultNumberVector*>(parsed.get());
+        EXPECT_EQ(bptr->size(), 6);
+        EXPECT_EQ(bptr->base.values[2], -123456789);
+        EXPECT_TRUE(std::isnan(bptr->base.values[4]));
+    }
+}
+
+TEST(Hdf5NumberTest, ForbiddenTypes) {
+    auto path = "TEST-forbidden.h5";
+
+    {
+        H5::H5File handle(path, H5F_ACC_TRUNC);
+        auto vhandle = vector_opener(handle, "blub", "number");
+        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_UINT32);
+    }
+    expect_hdf5_error(path, "blub", "expected a floating-point dataset");
+
+    // Later versions can auto-cast an integer dataset into a float.
+    {
+        H5::H5File handle(path, H5F_ACC_RDWR);
+        add_version(handle.openGroup("blub"), "1.3");
+    }
+    {
+        auto parsed = load_hdf5(path, "blub");
+        EXPECT_EQ(parsed->type(), uzuki2::NUMBER);
+        auto iptr = static_cast<const DefaultNumberVector*>(parsed.get());
+        EXPECT_EQ(iptr->base.values[0], 1);
+        EXPECT_EQ(iptr->base.values[4], 5);
+    }
+
+    // Unless the integer type is too large.
+    {
+        H5::H5File handle(path, H5F_ACC_TRUNC);
+        auto vhandle = vector_opener(handle, "blub", "number");
+        create_dataset<int>(vhandle, "data", { 1, 2, 3, 4, 5 }, H5::PredType::NATIVE_INT64);
+        add_version(handle.openGroup("blub"), "1.3");
+    }
+    expect_hdf5_error(path, "blub", "cannot be represented by 64-bit");
 }

 TEST(Hdf5NumberTest, CheckError) {
