Initial draft of the SSSOM/RDF spec. #469

gouttegd · 2025-07-14T22:18:23Z

Resolves [#421, #457]

docs/ have been added/updated if necessary
make test has been run locally
~~[ ] tests have been added/updated (if applicable)~~
CHANGELOG.md has been updated.

This is the complete proposal for the specification of the SSSOM/RDF serialisation format, according to the current state of the discussions about it.

@nichtich

As noticed by @nichtich:

> the use of `pav:authoredOn` only makes sense if `pav:createdOn` is used as well to differentiate two types of dates, in addition to the publication date. SSSOM only has one type of date so there is no need not to use plain old `dcterms:created`.

closes #457

matentzn

This is a great start. Let's go back and forth over this a bit; I made my first round of comments, the biggest bomb being the suggestion to specify a bespoke serialisation of curie_map.

matentzn · 2025-07-16T17:30:50Z

src/docs/spec-formats-rdf.md

+* the predicate is either:
+    * the property indicated by the `URI` field in the LinkML
+      description of the slot, if such a field is present;
+    * or a property constructed by catenating the


Suggested change

* or a property constructed by catenating the

* or a property constructed by concatenating the

matentzn · 2025-07-16T17:32:49Z

src/docs/spec-formats-rdf.md

+#### For slots typed as `sssom:NonRelativeURI`
+(e.g. `license`, `mapping_provider`, `issue_tracker`…)
+
+The value is rendered as a named RDF resource (IRI).


Why use "is" instead of MUST language here?

matentzn · 2025-07-16T17:36:44Z

src/docs/spec-formats-rdf.md

+
+As an exception to the general principle that slots are represented by a
+single RDF triple, multi-valued slots are represented by as many
+triples as there are values, each value being the object of one triple.


I stumbled over this sentence multiple times. Maybe this could be written more clearly, like:

for each value {v1,v..,vn} represented by a set of individual triples {a,b,v1; a,b,v2,...a,b,vn}.

Also, is this true for the mappings slot as well? Probably not, right?

> for each value {v1,v..,vn} represented by a set of individual triples {a,b,v1; a,b,v2,...a,b,vn}.

Does not sound any clearer to me. On the contrary, it sounds like each value is represented by a set of triples, which is certainly not the case.

> Also, is this true for the mappings slot as well? Probably not, right?

Of course it is. Mappings are represented as follows:

?mappingset sssom:mappings [ a owl:Axiom ; owl:annotatedSource ... ] , [ a owl:Axiom ; owl:annotatedSource ... ] .

which fits the description for multi-valued slots: one triple per value.

This is what SSSOM-Py has always done, so I had assumed you were fine with that.

Let me guess: you are no longer happy with that and want to radically change the format?

No, it seems I misunderstood

> multi-valued slots are represented by as many triples as there are values, each value being the object of one triple.

I thought this literally meant:

> for each value {v1,v..,vn} represented by a set of individual triples {a,b,v1; a,b,v2,...a,b,vn}.

Which is why I was confused.

Maybe just add an example here?
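
A minimal sketch of the "one triple per value" rule discussed above, in plain Python rather than with an RDF library. The helper name `slot_to_triples` and the blank node labels are hypothetical and only serve the illustration; the `sssom:mappings` predicate is the one used in the Turtle snippet earlier in this thread.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def slot_to_triples(subject: str, predicate: str, values: Iterable[str]) -> List[Triple]:
    """Return one (subject, predicate, object) triple per value of the slot."""
    return [(subject, predicate, value) for value in values]


# A mapping set with two mappings: same subject, same predicate,
# and one triple for each value of the multi-valued slot.
triples = slot_to_triples(
    "_:mappingset",                  # hypothetical blank node label
    "sssom:mappings",
    ["_:mapping1", "_:mapping2"],    # hypothetical blank node labels
)
for s, p, o in triples:
    print(s, p, o, ".")
```

Each value ends up as the object of exactly one triple; no value is represented by a set of triples.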

matentzn · 2025-07-16T17:40:40Z

src/docs/spec-formats-rdf.md

+If the `Mapping` object has a `record_id` slot, then the value of that
+slot is used as the named RDF resource that represents the object (and
+consequently, that slot MUST NOT be represented using the [general
+rules](#sssom-slots) for the representation of slots as defined above).


Maybe better to phrase this positively, like "rules don't apply".

matentzn · 2025-07-16T17:51:44Z

src/docs/spec-formats-rdf.md

+format (e.g. `@prefix` declarations in [RDF
+Turtle](https://www.w3.org/TR/turtle/) or [RDF
+TriG](https://www.w3.org/TR/trig/), or `xmlns` namespace declarations in
+[RDF/XML](https://www.w3.org/TR/rdf-syntax-grammar/)).


I am a bit against any relationship between curie map and the various RDF prefix syntaxes you list. cc @cthoyt

My main concern is this: two of the most important serialisations (RDF/XML, OWL/XML) can't even accurately represent a sssom:curie_map. Because in these XML serialisations the prefix system is hooked into the XML namespacing system, the local identifier part has severe syntactic constraints. In particular, it must correspond to an NCName, which means it MUST start with a letter or underscore (not a digit), so you can't actually represent UBERON:123 in RDF/XML.

My vote is to represent the prefix map using the SHACL prefixmap:

    <?xml version="1.0"?>
    <rdf:RDF xmlns:sh="http://www.w3.org/ns/shacl#"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
             xmlns:ex="http://example.org/"
             xmlns:foaf="http://xmlns.com/foaf/0.1/"
             xml:base="http://example.org/shapes">
      <sh:NodeShape rdf:about="#MyPrefixMap">
        <sh:prefixes>
          <sh:PrefixDeclaration>
            <sh:prefix>ex</sh:prefix>
            <sh:namespace>http://example.org/</sh:namespace>
          </sh:PrefixDeclaration>
          <sh:PrefixDeclaration>
            <sh:prefix>foaf</sh:prefix>
            <sh:namespace>http://xmlns.com/foaf/0.1/</sh:namespace>
          </sh:PrefixDeclaration>
        </sh:prefixes>
      </sh:NodeShape>
    </rdf:RDF>

This ensures 100% faithful roundtripping.
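
To make the NCName concern above concrete, here is a rough sketch that checks whether the local part of a CURIE could serve as an XML local name. It is an assumption-laden simplification: the regular expression only covers the ASCII subset of the NCName rule (the real rule also admits many non-ASCII characters), and the helper names are hypothetical.

```python
import re

# ASCII-only approximation of the XML NCName rule (assumption: non-ASCII
# name characters, which the full rule also allows, are ignored here).
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def split_curie(curie: str):
    """Split a CURIE into (prefix, local part) at the first colon."""
    prefix, _, local = curie.partition(":")
    return prefix, local


def is_ascii_ncname(local_part: str) -> bool:
    """True if the whole string matches the ASCII NCName approximation."""
    return NCNAME_ASCII.fullmatch(local_part) is not None


for curie in ("UBERON:123", "skos:exactMatch"):
    prefix, local = split_curie(curie)
    print(curie, "->", "usable" if is_ascii_ncname(local) else "not usable",
          "as an XML qualified name")
```

The local part `123` fails the check because it starts with a digit, while `exactMatch` passes.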

> My main concern is this: two of the most important serialisations (RDF/XML, OWL/XML) can't even accurately represent a sssom:curie_map. Because in these XML serialisations the prefix system is hooked into the XML namespacing system, the local identifier part has severe syntactic constraints. In particular, it must correspond to an NCName, which means it MUST start with a letter or underscore (not a digit), so you can't actually represent UBERON:123 in RDF/XML.

Point taken, but that’s a limitation of RDF/XML. Other prefix-supporting formats such as Turtle don’t have that limitation.

You want to preserve the CURIE map accurately and preserve the ability to roundtrip? Then don’t use RDF/XML.

> My vote is to represent the prefix map using the SHACL prefixmap:

You are joking. You are joking, right?

You are not seriously proposing something completely different from what we currently have? Something for which we don’t have the inkling of the beginning of support in either SSSOM-Py or SSSOM-Java?

Because if you are not joking, I give up.

Design your dream format all you want, and give me a sign when you’ll be done moving the goalposts from one day to the next.

> You are joking. You are joking, right?

I did not expect such a strong disagreement :P

Alright, instead of this suggestion, we should then make SSSOM-RDF explicitly a write-only format. I am totally fine with this as well, and then we don't need to concern ourselves with the curie_map at all. This way we can circumvent the limitations outlined above?

> I did not expect such a strong disagreement :P

Try sparing some thoughts for the people who have to implement your last-minute ideas that you sprout out of nowhere, maybe you’ll understand.

Two days ago you were reluctant to change even the predicates used to represent the core triple (e.g. changing from owl:annotatedSource to sssom:subject_id, which would have been a much smaller and easier change than what you’re suggesting here). I thought we were in agreement that the SSSOM/RDF format should be as close as possible to what we already have, both to avoid needlessly breaking things and to keep the work needed to make SSSOM-Py support the updated format minimal (given that we have very little developer time available to work on SSSOM-Py). The spec I proposed is tailored for that; it deviates only minimally from the existing, de facto SSSOM/RDF format that SSSOM-Py has been producing since forever; it is already completely supported by SSSOM-Java and I am hopeful that SSSOM-Py could be made to support it relatively easily (precisely due to the minimal deviations). You were fine with the initial proposal, on which you said that you had merely “minor stylistic comments”.

And now, you’re here suggesting that we should in fact do something completely different, that has never even been casually mentioned in any discussion related to the SSSOM/RDF format. Because all of a sudden you are concerned about roundtripping back from RDF/XML, even though the RDF/XML produced by SSSOM-Py has never been roundtrippable and AFAIK nobody has ever complained about that.

So yeah, I disagree with your proposal, because it contradicts everything that you seemed to want just two days ago.

So what do you want now?

A. A SSSOM/RDF format that is minimally different from what we already have, that can be supported rapidly (that is already supported by one implementation), but that (oh, the horror!) does not guarantee that a set written in RDF/XML can be roundtripped back to another SSSOM format?

B. A SSSOM/RDF format that is a clean break from the existing stuff, that will initially not be supported by any implementation (and in fact I doubt it will ever be implemented by SSSOM-Py, given the lack of activity on that implementation)?

If you want B, fine. But then I’ll leave you to design the format. I won’t get involved in any of it, I’ll just wait until you have designed the perfect format of your dream, and then I promise I’ll do my best to implement it.

> we should then make SSSOM-RDF explicitly a write-only format.

Now you are throwing the baby out with the bathwater. Just because the RDF/XML concrete serialisation may not guarantee that the prefix map is preserved does not mean that we should give up on SSSOM/RDF being a read/write format.

Again: As currently written, the spec does allow a SSSOM/RDF set to be fully converted back to any other SSSOM format, provided that:

* you do not serialise into RDF/XML;
* you serialise identifiers as CURIEs and make sure to include the appropriate prefix declarations.

As I said in another comment below, I wrote the spec to be flexible (“à la carte”): if you want the ability to roundtrip between RDF and another format, you can have it; if you are not interested in that ability, you can ignore it.

> * you do not serialise into RDF/XML;
> * you serialise identifiers as CURIEs and make sure to include the appropriate prefix declarations.

In fact, even if you do serialise into RDF/XML and write identifiers as IRIs, you will still be able to convert back to SSSOM/TSV, unless you happen to run your RDF/XML file through an RDF processor that decides to strip away any unused namespace declarations. Not sure if that is a common behaviour among RDF tools, but Jena and RDFLib do not seem to do it – they are happy to let unused namespace declarations pass through unchanged.

Let's tackle the NCName issue with a comment in the documentation: XML formats might not be able to roundtrip.

(@gouttegd says: longest URI expansion will win)

Maybe just document a rule on conflicting URI prefixes for roundtrip.

Can't control CURIEfication in RDF-based formats.

Add a "longest URI expansion" assumption to canonicalisation and refer to canonicalisation as "preprocessing" for RDF generation.
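
The "longest URI expansion wins" rule noted above could look roughly like the sketch below. This is only an illustration of the idea, not what SSSOM-Py or SSSOM-Java actually do; the function name `compact` and the `OBO` prefix entry are hypothetical.

```python
from typing import Dict, Optional


def compact(iri: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Contract an IRI into a CURIE, preferring the longest matching expansion."""
    best_prefix = None
    best_expansion = ""
    for prefix, expansion in prefix_map.items():
        # Keep the candidate whose expansion matches and is the longest so far.
        if iri.startswith(expansion) and len(expansion) > len(best_expansion):
            best_prefix, best_expansion = prefix, expansion
    if best_prefix is None:
        return None  # no prefix applies; the caller keeps the full IRI
    return f"{best_prefix}:{iri[len(best_expansion):]}"


# Two conflicting prefixes: both expansions match the same IRI.
prefix_map = {
    "OBO": "http://purl.obolibrary.org/obo/",            # hypothetical entry
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}
print(compact("http://purl.obolibrary.org/obo/UBERON_123", prefix_map))
```

With both prefixes declared, the IRI is contracted with `UBERON` rather than `OBO`, because the `UBERON` expansion is the longer match.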

matentzn · 2025-07-16T17:52:55Z

src/docs/spec-formats-rdf.md

+> Non-normative notes
+>
+> 1. The CURIE map may not be needed at all if all named resources and
+>    predicates are always serialised as full-length IRIs.


This is only true if we decide that SSSOM-RDF is an export-only format. I know we said this for SSSOM-OWL; I forgot where we ended up with RDF.

This is why there is a subsection “serialisation of identifiers” in the “Special considerations” section.

The SSSOM/RDF format can be both a read/write format that is equivalent to SSSOM/TSV or SSSOM/JSON (meaning that one can roundtrip between all those formats without loss of information) and an export format. It all depends on what you want to do with the output file – something the spec cannot know in advance.

In a sense, SSSOM/RDF is an “à la carte” format. You want to preserve the ability to roundtrip back to SSSOM/TSV? You can, just make sure to include the CURIE map and any extension definitions. You are not interested in being able to come back (say, because all you want is to load the set into a graph database and forget about it)? Then you don’t have to worry about the CURIE map or extension definitions at all.

matentzn · 2025-07-16T17:55:53Z

src/docs/spec-formats-rdf.md

+>    declared (using the appropriate mechanism for the chosen concrete
+>    syntax) takes precedence over the possibility of omitting the
+>    declarations of prefix names that are considered
+>    [built-in](spec-intro.md#iri-prefixes) in the context of SSSOM.


I would compact this sentence which has many redundant parts (the RDF requirement that all used prefix names must be declared) to:

Prefixes considered built-in by other serialisations MUST be rendered using their fully qualified name (IRI / CURIE).

matentzn · 2025-07-16T17:58:02Z

src/docs/spec-formats-rdf.md

+A `ExtensionDefinition` object has no identifier of any kind and is
+always represented by a blank node.
+
+## Special considerations for serialising to RDF


mark as not normative?

cthoyt · 2025-08-05T21:47:20Z

src/docs/spec-formats-rdf.md

+      sssom:object_label "Gala apple (whole)";
+      sssom:subject_label "gala"
+    ] .
+```


I would expect the triples represented by the axioms to also show up somewhere in the RDF

Suggested change

```
KF_FOOD:F001 skos:exactMatch FOODON:00002473 .
KF_FOOD:F002 skos:exactMatch FOODON:00003348 .
```

though it's not clear what to do for negated triples

They appear, but only as reified OWL axioms.

This has been the RDF output produced by SSSOM-Py since the beginning.

It's a valid question though. Without the direct triple, triple stores might not be able to return all terms mapped to ?x with a simple triple pattern. This was only possible in many of the pipelines I had built (which are now gathering dust) because I loaded the SSSOM file into ROBOT and saved it, which I believe automatically injects that triple.

In the OWL serialisation, it seems I injected them in SSSOM-Py:
https://github.com/mapping-commons/sssom-py/blob/master/tests/validate_data/cob-to-external.tsv.owl

> Without the direct triple, triple stores might not be able to return all terms mapped to ?x with a simple triple pattern

If that is needed, once the set has been exported to RDF it shouldn’t be hard to process it with some SPARQL to construct a ?subject_id ?predicate_id ?object_id triple from every ?mapping owl:annotatedSource ?subject_id ; owl:annotatedPredicate ?predicate_id ; owl:annotatedTarget ?object_id set of triples, before loading it into a triple store.

I wouldn’t mind adding that as an OPTIONAL behaviour for RDF writers, on the condition that it is really optional – that is, if those ?subject_id ?predicate_id ?object_id triples are absent the set must still be accepted by a SSSOM/RDF reader, as long as the ?mapping owl:annotatedSource ?subject_id ; owl:annotatedPredicate ?predicate_id ; owl:annotatedTarget ?object_id triples are present.

Something like:

> A SSSOM/RDF writer MAY additionally inject for every mapping record a triple of the form ?subject_id ?predicate_id ?object_id. A SSSOM/RDF reader MUST NOT expect the presence of such triples.

If we do allow that, this raises the question (as hinted by @cthoyt) of what to do about negated mappings. Two possibilities:

(A) Don’t care. Negated mappings are treated in the same way as any other mappings. If users don’t want ?subject_id ?predicate_id ?object_id triples for negated mappings, it’s up to them to filter out negated mappings before exporting the set to RDF.

(B) Explicitly exclude negated mappings, as in

> A SSSOM/RDF writer MAY additionally inject for every mapping record a triple of the form ?subject_id ?predicate_id ?object_id, only for mapping records that do not have a predicate_modifier of Not.

There is also the question of mappings with sssom:NoTermFound, e.g.

| record_id | subject_id | predicate_id | object_id | object_source |
| --- | --- | --- | --- | --- |
| MYMAP:1 | HP:1234 | skos:exactMatch | sssom:NoTermFound | obo:doid.owl |

Rendering them as

MYMAP:1 a owl:Axiom ; owl:annotatedSource HP:1234 ; owl:annotatedPredicate skos:exactMatch ; owl:annotatedTarget sssom:NoTermFound ; sssom:object_source obo:doid.owl .

should be perfectly fine, but do we also want a

HP:1234 skos:exactMatch sssom:NoTermFound .

triple as well?

I’d say, we should do for them the same thing as we do for negated mappings (the two possibilities outlined in my previous message).

I am now on board with:

> I wouldn’t mind adding that as an OPTIONAL behaviour for RDF writers, on the condition that it is really optional – that is, if those ?subject_id ?predicate_id ?object_id triples are absent the set must still be accepted by a SSSOM/RDF reader, as long as the ?mapping owl:annotatedSource ?subject_id ; owl:annotatedPredicate ?predicate_id ; owl:annotatedTarget ?object_id triples are present.

With regards to the special cases:

> I’d say, we should do for them the same thing as we do for negated mappings (the two possibilities outlined in my previous message).

I agree. Because these are very different use cases, I would favour implementations deciding whether to inject direct triples for sssom:NoTermFound mappings and for negated mappings based on separate conditionals; I would probably never add either one, but there may be some use cases for doing so.
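
As a rough sketch of where this thread landed, the optional injection of direct triples with two separate switches could look like the following. Everything here is illustrative: the function name, the dictionary-based mapping records, and the default of excluding both special cases are assumptions, since the thread leaves the choice between options (A) and (B) open. The negated record is made up for the example.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def direct_triples(
    mappings: Iterable[Dict[str, str]],
    include_negated: bool = False,          # assumption: excluded by default
    include_no_term_found: bool = False,    # assumption: excluded by default
) -> List[Triple]:
    """Build optional ?subject_id ?predicate_id ?object_id triples."""
    triples = []
    for m in mappings:
        negated = m.get("predicate_modifier") == "Not"
        no_term = "sssom:NoTermFound" in (m["subject_id"], m["object_id"])
        # The two special cases are controlled by separate conditionals.
        if negated and not include_negated:
            continue
        if no_term and not include_no_term_found:
            continue
        triples.append((m["subject_id"], m["predicate_id"], m["object_id"]))
    return triples


mappings = [
    {"subject_id": "KF_FOOD:F001", "predicate_id": "skos:exactMatch",
     "object_id": "FOODON:00002473"},
    # Hypothetical negated record, for illustration only.
    {"subject_id": "KF_FOOD:F002", "predicate_id": "skos:exactMatch",
     "object_id": "FOODON:00003348", "predicate_modifier": "Not"},
    {"subject_id": "HP:1234", "predicate_id": "skos:exactMatch",
     "object_id": "sssom:NoTermFound"},
]
for s, p, o in direct_triples(mappings):
    print(s, p, o, ".")
```

A reader would ignore these triples entirely; only the reified `owl:Axiom` form is required for a set to be accepted.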

cthoyt · 2025-08-05T21:48:19Z

src/docs/spec-formats-rdf.md

+as an alternative to
+
+```ttl
+?mapping sssom:predicate_modifier sssom:NegatedPredicate .


why not just decide on one and say "this is the standard"?

That’s what we did. This:

?mapping sssom:predicate_modifier sssom:NegatedPredicate .

is the standard.

But the decision to standardize that form was only made a few weeks ago. Before that, both SSSOM-Java and SSSOM-Py have been producing the string literal form (SSSOM-Java for the past 8 months – since version 1.1, which introduced RDF support – and SSSOM-Py for as long as it has existed). In fact SSSOM-Py still produces the string literal form to this day.

So for backwards compatibility (which is the entire point of this section, “compatibility with pre-standard representations”), implementations MAY support the old string literal form, even though it is not the standard form.

Makes sense.
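
A small sketch of how a reader might accept both forms discussed above. The standard form is the `sssom:NegatedPredicate` resource; the exact text of the legacy string literal is not shown in this excerpt, so the value `"Not"` below is an assumption, as are the function and constant names.

```python
STANDARD_NEGATION = "sssom:NegatedPredicate"
LEGACY_NEGATION_LITERAL = "Not"  # assumption: text of the pre-standard literal


def is_negated(value: str, is_literal: bool, accept_legacy: bool = True) -> bool:
    """Decide whether a predicate_modifier object marks a negated mapping.

    `is_literal` tells whether the RDF object was a string literal (legacy
    form) or a named resource written as a CURIE (standard form).
    """
    if not is_literal:
        return value == STANDARD_NEGATION
    # Legacy support is optional (MAY), so it can be switched off.
    return accept_legacy and value == LEGACY_NEGATION_LITERAL


print(is_negated("sssom:NegatedPredicate", is_literal=False))
print(is_negated("Not", is_literal=True))
print(is_negated("Not", is_literal=True, accept_legacy=False))
```

A writer following the standard would only ever emit the resource form; the literal branch exists purely for reading older files.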

Initial draft of the SSSOM/RDF spec.

0c917ec

gouttegd self-assigned this Jul 14, 2025

gouttegd requested a review from matentzn July 14, 2025 22:23

gouttegd mentioned this pull request Jul 15, 2025

Change mapping_date to be dcterms:created #457

Open

matentzn reviewed Jul 16, 2025

cthoyt reviewed Aug 5, 2025
