Problem Description
During our meeting today, we discussed some possibilities for supporting user-defined types. We discussed two options: a bytes[n] type, which would allow users to create arrays whose elements are arbitrarily sized collections of bytes, and more robust support for user-defined types that would be backed by a third-party library like HDF5 or cffi. The bytes[n] solution, while simple, puts the onus on the user to handle portability across platforms with different endianness, padding, or alignment. True cross-platform support for user-defined types will likely require users to declare the layouts of their custom data types.
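To make that portability burden concrete, here is a minimal sketch of what a user of a raw byte-array type would have to do by hand for a single uint32 field: pick a byte order and encode and decode it explicitly. The helper names store_u32_le and load_u32_le are hypothetical and not part of any Binsparse API, and little-endian is an arbitrary choice made for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: write v into out[0..3] in little-endian order,
   regardless of the byte order of the host. */
static void store_u32_le(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v & 0xFFu);
    out[1] = (uint8_t)((v >> 8) & 0xFFu);
    out[2] = (uint8_t)((v >> 16) & 0xFFu);
    out[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Hypothetical helper: read a little-endian uint32 from in[0..3]. */
static uint32_t load_u32_le(const uint8_t in[4]) {
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

Every field of every custom element would need this kind of treatment, plus agreed-upon offsets, which is the work a declared layout could take off the user's hands.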
We settled on exploring a declarative JSON description for user-defined data types. This JSON description must have a one-to-one correspondence with user-defined data types as supported by libraries like HDF5 or cffi. The idea is that implementations would then be free to store the user-defined data type using the mechanism supported by the binary container.
Strawman Example
{
  "binsparse": {
    "version": "0.1",
    "format": "COO",
    "shape": [428440, 896308],
    "number_of_stored_values": 3782463,
    "data_types": {
      "values": "my_cool_struct",
      "indices_0": "uint32",
      "indices_1": "uint32"
    },
    "custom_types": {
      "my_cool_struct": [
        "float", "int32", "uint32", "bint8"
      ]
    }
  }
}
Here we have one custom type. This would correspond to a C struct with four members of type float, int32_t, uint32_t, and uint8_t. The struct might look like this:
typedef struct {
    float v1;
    int32_t v2;
    uint32_t v3;
    uint8_t v4;
} my_cool_struct;
On my system, this struct has a size of 16 bytes. Of course, the padding and alignment of this struct are implementation-defined in C. To compile such a struct, it will need to be declared, and each member will need to have a name. I imagine that in HDF5 this would be provided by the user upon registration of the custom data type.
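Since the layout depends on the compiler and ABI, one way to see what a given platform actually chose is to ask it. The following is a minimal sketch using only standard C11 facilities (sizeof, _Alignof, offsetof); the function name print_my_cool_struct_layout is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    float v1;
    int32_t v2;
    uint32_t v3;
    uint8_t v4;
} my_cool_struct;

/* Print the size, alignment, and member offsets that this particular
   compiler and ABI chose for my_cool_struct. */
static void print_my_cool_struct_layout(void) {
    printf("sizeof  = %zu\n", sizeof(my_cool_struct));
    printf("alignof = %zu\n", (size_t)_Alignof(my_cool_struct));
    printf("v1 at offset %zu\n", offsetof(my_cool_struct, v1));
    printf("v2 at offset %zu\n", offsetof(my_cool_struct, v2));
    printf("v3 at offset %zu\n", offsetof(my_cool_struct, v3));
    printf("v4 at offset %zu\n", offsetof(my_cool_struct, v4));
}
```

The member data adds up to 13 bytes, so a reported size of 16 would mean the compiler appended three bytes of trailing padding. Other platforms or packing options may report something different.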
The HDF5 part of the user's code would look something like the following:
#include <hdf5.h>

typedef struct {
    float v1;
    int v2;
    unsigned int v3;
    unsigned char v4;
} my_cool_struct;

hid_t create_sensor_datatype(void) {
    hid_t datatype_id;

    // Create the compound datatype, sized to match the in-memory struct
    datatype_id = H5Tcreate(H5T_COMPOUND, sizeof(my_cool_struct));

    // Describe each member: name, byte offset within the struct, and native type
    H5Tinsert(datatype_id, "v1", HOFFSET(my_cool_struct, v1), H5T_NATIVE_FLOAT);
    H5Tinsert(datatype_id, "v2", HOFFSET(my_cool_struct, v2), H5T_NATIVE_INT);
    H5Tinsert(datatype_id, "v3", HOFFSET(my_cool_struct, v3), H5T_NATIVE_UINT);
    H5Tinsert(datatype_id, "v4", HOFFSET(my_cool_struct, v4), H5T_NATIVE_UCHAR);

    return datatype_id;
}
I imagine the process for the user would basically be this:
- The user registers a custom HDF5 type (with something like bsp_register_hdf5_type(my_struct_hid, "my_cool_struct")). They input an HDF5 custom type (the hid_t) as well as the new type's name.
- Based on the registered type, the backend will create and return to the user a new ID for the bsp_type_t enum, perhaps based on hashing the name.
- The user reads in a file that uses the newly defined custom data type. This will return the standard bsp_matrix_t, which will itself contain a values array whose type is equal to the newly created bsp_type_t. The implementation can perhaps check that the registered type matches the file's HDF5 type by looking at the types of the corresponding elements.
- The user has their data. When they look at the values array, they can see that its type corresponds to the newly created type for my_cool_struct. They can safely cast its data pointer to a pointer to my_cool_struct.
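The registration and ID steps above can be sketched without HDF5 at all. The following is only an illustration of deriving an ID by hashing the name: the registry, the function names register_custom_type and custom_type_size, the reserved ID range CUSTOM_TYPE_BASE, and the choice of the FNV-1a hash are all assumptions made for this example, not part of any existing Binsparse implementation. A real version would store the hid_t rather than just a size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: IDs below this value are reserved for built-in types. */
#define CUSTOM_TYPE_BASE 0x40000000u
#define MAX_CUSTOM_TYPES 16

typedef struct {
    uint32_t id;
    size_t size;
    char name[64];
} custom_type_entry;

static custom_type_entry registry[MAX_CUSTOM_TYPES];
static size_t registry_count = 0;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a_32(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s != '\0'; ++s) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* Register a custom type and return its ID, or 0 on failure.
   Registering the same name and size twice returns the same ID;
   a hash collision or a size mismatch is reported as a failure. */
static uint32_t register_custom_type(const char *name, size_t size) {
    uint32_t id = CUSTOM_TYPE_BASE | (fnv1a_32(name) & 0x3FFFFFFFu);
    for (size_t i = 0; i < registry_count; ++i) {
        if (registry[i].id == id) {
            int same = strcmp(registry[i].name, name) == 0
                    && registry[i].size == size;
            return same ? id : 0;
        }
    }
    if (registry_count == MAX_CUSTOM_TYPES
        || strlen(name) >= sizeof registry[0].name) {
        return 0;
    }
    strcpy(registry[registry_count].name, name);
    registry[registry_count].size = size;
    registry[registry_count].id = id;
    ++registry_count;
    return id;
}

/* Return the registered element size for an ID, or 0 if unknown.
   A reader could compare this with the file's type before casting. */
static size_t custom_type_size(uint32_t id) {
    for (size_t i = 0; i < registry_count; ++i) {
        if (registry[i].id == id) {
            return registry[i].size;
        }
    }
    return 0;
}
```

Hashing keeps the ID stable across runs for a given name, at the cost of having to detect collisions. Handing out sequential IDs would avoid collisions but make the ID depend on registration order.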
Open Questions
I still have a few open questions:
- Should we name custom data type members in the Binsparse JSON description? For example, instead of storing an array of strings containing the data types, store a tuple: [("v1", "float"), ("v2", "int32"), ("v3", "uint32"), ("v4", "bint8")].
- Should we or could we ever attempt to read in a custom data type without a user-registered custom data type? For example, we could have a function that, given a JSON declaration, creates an HDF5 custom data type hid_t. The big challenge here is picking the offsets, since we would need an algorithm for doing so. These offsets would need to be reproducible, reliable, and correspond to the user's offsets for this to work. There's a danger of things not working here.
- I opted for a list of types, which I think is necessary, since a JSON dict is unordered. However, there might be some tweaks we could make to improve the syntax of custom data types.
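As a starting point for the offset question, here is a minimal sketch of one candidate algorithm. It assumes "natural alignment", meaning every member is aligned to its own size and the struct is padded to its largest member alignment. That matches what many common C ABIs do for small scalar members, but the C standard does not guarantee it, so it would only agree with the user's struct on platforms that follow the same rule. The function names align_up and natural_layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (a power of two). */
static size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

/* Compute member offsets under the natural-alignment assumption:
   each member's alignment equals its size, which must be a power of two.
   Writes n offsets into offsets[] and returns the padded total size. */
static size_t natural_layout(const size_t *sizes, size_t n, size_t *offsets) {
    size_t offset = 0;
    size_t max_align = 1;
    for (size_t i = 0; i < n; ++i) {
        size_t align = sizes[i];
        assert(align != 0 && (align & (align - 1)) == 0);
        offset = align_up(offset, align);
        offsets[i] = offset;
        offset += sizes[i];
        if (align > max_align) {
            max_align = align;
        }
    }
    /* Trailing padding so that arrays of the struct stay aligned. */
    return align_up(offset, max_align);
}
```

For the strawman type, with member sizes 4, 4, 4, and 1, this yields offsets 0, 4, 8, and 12 and a total size of 16 bytes, which agrees with the size reported above. Whether that agreement holds on every platform a user might compile on is exactly the open risk.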