Description
To allow implementations to unambiguously and consistently assess whether two mapping records are identical, the spec could/should define a standard hashing function and how the hash should be computed.
Considerations:
A. Choice of the hashing function. Not really important, we just need to pick one. The easiest option would probably be SHA-256, which is available in virtually every programming language. We do not really need its cryptographic properties (collision resistance and preimage resistance), but they don’t hurt.
B. What to hash? Simply put: everything. That is, all the slots that make up a mapping record. This should also probably include any non-standard slot.
C. How to hash? This is the real question. We need to define a serialisation format such that any given mapping record can have one, and only one, possible serialised form. The “canonical SSSOM/TSV” format as currently defined in the spec is not suitable, as it still leaves some room for variations across implementations.
One option would be to serialise the record into a canonical S-expression, e.g.
(7:mapping((10:subject_id44:http://purl.obolibrary.org/obo/FBbt_00001234)(12:predicate_id46:http://www.w3.org/2004/02/skos/core#exactMatch)(9:object_id45:http://purl.obolibrary.org/obo/UBERON_0005678)(21:mapping_justification51:https://w3id.org/semapv/vocab/ManualMappingCuration)(10:creator_id(37:https://orcid.org/0000-0000-1234-567837:https://orcid.org/0000-0000-5678-1234))))
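To make the idea concrete, here is a minimal Python sketch of how such a serialisation and the resulting hash could be computed. It is only an illustration, not a proposal for the final algorithm. The function names (`atom`, `serialise_record`, `record_hash`) are made up for this example, and it assumes that (a) atom lengths are counted in UTF-8 bytes, (b) the caller passes the slots in whatever fixed order the spec would end up mandating, and (c) the values are already expanded and sorted.

```python
import hashlib
from typing import List, Sequence, Tuple, Union

SlotValue = Union[str, List[str]]


def atom(value: str) -> bytes:
    """Encode a string as a length-prefixed atom: <byte length>:<bytes>."""
    raw = value.encode("utf-8")  # assumption: lengths are in UTF-8 bytes
    return str(len(raw)).encode("ascii") + b":" + raw


def serialise_record(slots: Sequence[Tuple[str, SlotValue]]) -> bytes:
    """Serialise (slot name, value) pairs into one canonical S-expression.

    The pairs are written in the order given; fixing that order is left
    to the spec and is NOT decided by this sketch.
    """
    out = [b"(", atom("mapping"), b"("]
    for name, value in slots:
        out.append(b"(" + atom(name))
        if isinstance(value, list):
            # Multi-valued slot: a nested list of atoms
            out.append(b"(" + b"".join(atom(v) for v in value) + b")")
        else:
            out.append(atom(value))
        out.append(b")")
    out.append(b"))")  # close the slot list, then the mapping itself
    return b"".join(out)


def record_hash(slots: Sequence[Tuple[str, SlotValue]]) -> str:
    """SHA-256 of the canonical serialisation, as a hex string."""
    return hashlib.sha256(serialise_record(slots)).hexdigest()


example = [
    ("subject_id", "http://purl.obolibrary.org/obo/FBbt_00001234"),
    ("predicate_id", "http://www.w3.org/2004/02/skos/core#exactMatch"),
    ("object_id", "http://purl.obolibrary.org/obo/UBERON_0005678"),
    ("mapping_justification",
     "https://w3id.org/semapv/vocab/ManualMappingCuration"),
    ("creator_id", ["https://orcid.org/0000-0000-1234-5678",
                    "https://orcid.org/0000-0000-5678-1234"]),
]

canonical = serialise_record(example)
print(canonical.decode("utf-8"))
print(record_hash(example))
```

Because every atom carries its own length, no escaping or quoting rules are needed, which is what removes the room for variation between implementations.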
Regardless of the exact serialisation format, prior to serialisation and hashing: (1) all CURIEs must be expanded to their full-length form; (2) all propagatable slots must be propagated; (3) all multi-valued slots must be sorted.
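The three pre-processing steps could look roughly like the following Python sketch. Everything here is hypothetical and for illustration only: the function names, the way set-level metadata is passed in, and the use of `mapping_tool` as a propagatable slot. In particular, the sketch treats any value whose prefix appears in the prefix map as a CURIE, whereas a real implementation would presumably rely on the slot types declared in the schema. It also sorts with Python's default string ordering; the spec would need to state the ordering explicitly.

```python
from typing import Dict, Iterable, List, Union

SlotValue = Union[str, List[str]]


def expand_curie(value: str, prefix_map: Dict[str, str]) -> str:
    """Expand 'prefix:local' to a full IRI if the prefix is known."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in prefix_map:
        return prefix_map[prefix] + local
    return value  # not a known CURIE: leave untouched


def normalise_record(
    record: Dict[str, SlotValue],
    set_metadata: Dict[str, SlotValue],
    prefix_map: Dict[str, str],
    propagatable: Iterable[str],
) -> Dict[str, SlotValue]:
    """Apply the pre-hashing steps to one mapping record."""
    merged = dict(record)
    # (2) propagate set-level values the record does not define itself
    for slot in propagatable:
        if slot in set_metadata and slot not in merged:
            merged[slot] = set_metadata[slot]

    result: Dict[str, SlotValue] = {}
    for slot, value in merged.items():
        if isinstance(value, list):
            # (1) expand first, then (3) sort, so that the order depends
            # on the full-length form rather than on the CURIE
            result[slot] = sorted(expand_curie(v, prefix_map) for v in value)
        else:
            result[slot] = expand_curie(value, prefix_map)
    return result


prefix_map = {
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
    "ORCID": "https://orcid.org/",
}
record = {
    "subject_id": "FBbt:00001234",
    "creator_id": ["ORCID:0000-0000-5678-1234", "ORCID:0000-0000-1234-5678"],
}
set_metadata = {"mapping_tool": "example-tool"}  # made-up value

normalised = normalise_record(
    record, set_metadata, prefix_map, propagatable=["mapping_tool"]
)
print(normalised)
```

Doing the expansion before the sort matters: two implementations using different prefix names for the same namespace would otherwise sort the same values differently and end up with different hashes.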