Conversation

gouttegd
Contributor

@gouttegd gouttegd commented Jul 14, 2025

Resolves [#421, #457]

  • docs/ have been added/updated if necessary
  • make test has been run locally
  • [ ] tests have been added/updated (if applicable)
  • CHANGELOG.md has been updated.

This is the complete proposal for the specification of the SSSOM/RDF serialisation format, according to the current state of the discussions about it.

@gouttegd gouttegd self-assigned this Jul 14, 2025
@gouttegd gouttegd requested a review from matentzn July 14, 2025 22:23
As noticed by @nichtich:

> the use of `pav:authoredOn` only makes sense if `pav:createdOn` is
> used as well to differentiate two types of dates, in addition to the
> publication date. SSSOM only has one type of date so there is no reason
> not to use plain old `dcterms:created`.

closes #457
Collaborator

@matentzn matentzn left a comment

This is a great start. Let's go back and forth over this a bit. I have made my first round of comments; the biggest bombshell is the suggestion to specify a bespoke serialisation of curie_map.

* the predicate is either:
* the property indicated by the `URI` field in the LinkML
description of the slot, if such a field is present;
* or a property constructed by catenating the
Collaborator

Suggested change
- * or a property constructed by catenating the
+ * or a property constructed by concatenating the

#### For slots typed as `sssom:NonRelativeURI`
(e.g. `license`, `mapping_provider`, `issue_tracker`…)

The value is rendered as a named RDF resource (IRI).
Collaborator

why use "is" instead of MUST BE lingo here?


As an exception to the general principle that slots are represented by a
single RDF triple, multi-valued slots are represented by as many
triples as there are values, each value being the object of one triple.
Collaborator

I stumbled over this sentence multiple times. Maybe it could be written more clearly, like:

for each value {v1,v..,vn} represented by a set of individual triples {a,b,v1; a,b,v2,...a,b,vn}.

Collaborator

Also, is this true for the mappings slot as well? Probably not, right?

Contributor Author

> for each value {v1,v..,vn} represented by a set of individual triples {a,b,v1; a,b,v2,...a,b,vn}.

Does not sound any clearer to me. On the contrary, it sounds like each value is represented by a set of triples, which is certainly not the case.

> Also, is this true for the mappings slot as well? Probably not, right?

Of course it is. Mappings are represented as follows:

?mappingset sssom:mappings [ a owl:Axiom ;
                               owl:annotatedSource ...
                           ] ,
                           [ a owl:Axiom ;
                               owl:annotatedSource ...
                           ] .

which fits the description for multi-valued slots: one triple per value.

This is what SSSOM-Py has always done, so I had assumed you were fine with that.

Let me guess: you are no longer happy with that and want to radically change the format?

Collaborator

No, it seems I misunderstood

> multi-valued slots are represented by as many
> triples as there are values, each value being the object of one triple.

I thought this literally meant:

> for each value {v1,v..,vn} represented by a set of individual triples {a,b,v1; a,b,v2,...a,b,vn}.

Which is why I was confused.

Maybe just add an example here?
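
A minimal sketch of the multi-valued rule quoted above, in plain Python with triples modelled as tuples. The function name `slot_to_triples` and the `ex:` identifiers are hypothetical, chosen only for illustration; they come neither from the spec nor from SSSOM-Py.

```python
def slot_to_triples(subject, predicate, value):
    """Return one (subject, predicate, object) tuple per slot value."""
    # A single-valued slot is treated as a one-element list.
    values = value if isinstance(value, (list, tuple)) else [value]
    return [(subject, predicate, v) for v in values]


# Hypothetical multi-valued slot with three values: three triples are
# produced, all sharing the same subject and predicate.
triples = slot_to_triples("ex:set", "ex:multi_valued_slot", ["v1", "v2", "v3"])
```

Each value ends up as the object of exactly one triple, which is the reading the spec text intends; no value is represented by more than one triple.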

If the `Mapping` object has a `record_id` slot, then the value of that
slot is used as the named RDF resource that represents the object (and
consequently, that slot MUST NOT be represented using the [general
rules](#sssom-slots) for the representation of slots as defined above).
Collaborator

Maybe better to phrase this positively, like "rules don't apply".

format (e.g. `@prefix` declarations in [RDF
Turtle](https://www.w3.org/TR/turtle/) or [RDF
TriG](https://www.w3.org/TR/trig/), or `xmlns` namespace declarations in
[RDF/XML](https://www.w3.org/TR/rdf-syntax-grammar/)).
Collaborator

I am a bit against any relationship between curie map and the various RDF prefix syntaxes you list. cc @cthoyt

My main concern is this: two of the most important serialisations (RDF/XML, OWL/XML) can't even accurately represent a sssom:curie_map. Because in these XML serializations the prefix system is hooked into the XML namespacing system, the local identifier part has severe syntactic constraints. In particular, it must correspond to an NCName, which means it MUST start with a letter (not a number, so you can't actually represent UBERON:123 in RDF/XML).

My vote is to represent the prefix map using the SHACL prefixmap:

<?xml version="1.0"?>
<rdf:RDF
    xmlns:sh="http://www.w3.org/ns/shacl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:ex="http://example.org/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xml:base="http://example.org/shapes">

  <!-- Define the prefix map -->
  <sh:NodeShape rdf:about="#MyPrefixMap">
    <sh:prefixes>
      <sh:PrefixDeclaration>
        <sh:prefix>ex</sh:prefix>
        <sh:namespace>http://example.org/</sh:namespace>
      </sh:PrefixDeclaration>
      <sh:PrefixDeclaration>
        <sh:prefix>foaf</sh:prefix>
        <sh:namespace>http://xmlns.com/foaf/0.1/</sh:namespace>
      </sh:PrefixDeclaration>
    </sh:prefixes>
  </sh:NodeShape>

</rdf:RDF>

This ensures 100% faithful roundtripping.
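
The NCName constraint raised at the start of this comment can be illustrated with a small check. This is only a sketch: the regular expression is a simplified, ASCII-only approximation of the XML NCName production (the real production also allows many non-ASCII characters), and the function name `can_be_xml_qname` is hypothetical.

```python
import re

# Simplified, ASCII-only approximation of an XML NCName: a letter or
# underscore, followed by letters, digits, '.', '-' or '_'.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def can_be_xml_qname(curie):
    """Tell whether a CURIE could be written as an XML qualified name."""
    prefix, sep, local = curie.partition(":")
    if not sep:
        return False
    # Both the prefix and the local part must look like NCNames.
    return bool(NCNAME_ASCII.fullmatch(prefix)) and bool(NCNAME_ASCII.fullmatch(local))


ok = can_be_xml_qname("skos:exactMatch")   # local part starts with a letter
bad = can_be_xml_qname("UBERON:123")       # local part starts with a digit
```

The check only concerns the abbreviated (prefixed) form of an identifier; it says nothing about whether the full-length IRI can be written.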

Contributor Author

@gouttegd gouttegd Jul 16, 2025

> My main concern is this: two of the most important serialisations (RDF/XML, OWL/XML) can't even accurately represent a sssom:curie_map. Because in these XML serializations the prefix system is hooked into the XML namespacing system, the local identifier part has severe syntactic constraints. In particular, it must correspond to an NCName, which means it MUST start with a letter (not a number, so you can't actually represent UBERON:123 in RDF/XML).

Point taken, but that’s a limitation of RDF/XML. Other prefix-supporting formats such as Turtle don’t have that limitation.

You want to preserve the CURIE map accurately and preserve the ability to roundtrip? Then don’t use RDF/XML.

> My vote is to represent the prefix map using the SHACL prefixmap:

You are joking. You are joking, right?

You are not seriously proposing something completely different from what we currently have? Something for which we don’t have the inkling of the beginning of support in either SSSOM-Py or SSSOM-Java?

Because if you are not joking, I give up.

Design your dream format all you want, and give me a sign when you’re done moving the goalposts from one day to the next.

Collaborator

> You are joking. You are joking, right?

I did not expect such a strong disagreement :P

Alright, instead of this suggestion, we should then make SSSOM-RDF explicitly a write-only format. I am totally fine with this as well, and then we don't need to concern ourselves with the curie_map at all. This way we can circumvent the limitations outlined above?

Contributor Author

> I did not expect such a strong disagreement :P

Try sparing a thought for the people who have to implement the last-minute ideas that you pull out of nowhere; maybe you’ll understand.

Two days ago you were reluctant to merely change the predicates used to represent the core triple (e.g. changing from owl:annotatedSource to sssom:subject_id, which would have been a much smaller and easier change than what you’re suggesting here). I thought we were in agreement that the SSSOM/RDF format should be as close as possible to what we already have, both to avoid needlessly breaking things and to keep the work needed to make SSSOM-Py support the updated format minimal (given that we have very little developer time available to work on SSSOM-Py). The spec I proposed is tailored for that; it deviates only minimally from the existing, de facto SSSOM/RDF format that SSSOM-Py has been producing since forever; it is already completely supported by SSSOM-Java and I am hopeful that SSSOM-Py could be made to support it relatively easily (precisely due to the minimal deviations). You were fine with the initial proposal, on which you said that you had merely “minor stylistic comments”.

And now, you’re here suggesting that we should in fact do something completely different, that has never even been casually mentioned in any discussion related to the SSSOM/RDF format. Because all of a sudden you are concerned about roundtripping back from RDF/XML, even though the RDF/XML produced by SSSOM-Py has never been roundtrippable and AFAIK nobody has ever complained about that.

So yeah, I disagree with your proposal, because it contradicts everything that you seemed to want just two days ago.

So what do you want now?

A. A SSSOM/RDF format that is minimally different from what we already have, that can be supported rapidly (that is already supported by one implementation), but that (oh, the horror!) does not guarantee that a set written in RDF/XML can be roundtripped back to another SSSOM format?

B. A SSSOM/RDF format that is a clean break from the existing stuff, that will initially not be supported by any implementation (and in fact I doubt it will ever be implemented by SSSOM-Py, given the lack of activity on that implementation)?

If you want B, fine. But then I’ll leave you to design the format. I won’t get involved in any of it, I’ll just wait until you have designed the perfect format of your dreams, and then I promise I’ll do my best to implement it.

Contributor Author

> we should then make SSSOM-RDF explicitly a write-only format.

Now you are throwing the baby out with the bathwater. Just because the RDF/XML concrete serialisation may not guarantee that the prefix map is preserved does not mean that we should give up on SSSOM/RDF being a read/write format.

Again: As currently written, the spec does allow a SSSOM/RDF set to be fully converted back to any other SSSOM format, provided that:

  • you do not serialise into RDF/XML;
  • you serialise identifiers as CURIEs and make sure to include the appropriate prefix declarations.

As I said in another comment below, I wrote the spec to be flexible (“à la carte”): if you want the ability to roundtrip between RDF and another format, you can have it; if you are not interested in that ability, you can ignore it.

Contributor Author

>   • you do not serialise into RDF/XML;
>   • you serialise identifiers as CURIEs and make sure to include the appropriate prefix declarations.

In fact, even if you do serialise into RDF/XML and write identifiers as IRIs, you will still be able to convert back to SSSOM/TSV, unless you happen to run your RDF/XML file through an RDF processor that decides to strip away any unused namespace declarations. Not sure if that is a common behaviour among RDF tools, but Jena and RDFLib do not seem to do it – they are happy to let unused namespace declarations pass through unchanged.

Collaborator

Let's tackle the NCName issue with a note in the documentation: XML formats might not be able to roundtrip.

(@gouttegd says: longest URI expansion will win)

Maybe just document a rule on conflicting URI prefixes for roundtripping:

  • we can't control CURIEfication in RDF-based formats
  • add a "longest URI expansion" assumption to canonicalisation and refer to canonicalisation as "preprocessing" for RDF generation.
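
A minimal sketch of what a "longest URI expansion wins" rule could look like when compacting an IRI into a CURIE. The function name `compact_iri` and the example prefix map are assumptions made for illustration; this is not the behaviour of any existing SSSOM implementation.

```python
def compact_iri(iri, curie_map):
    """Compact an IRI using the prefix whose expansion is the longest match."""
    best_prefix = None
    best_expansion = ""
    for prefix, expansion in curie_map.items():
        # Keep the candidate only if it matches and is longer than the best so far.
        if iri.startswith(expansion) and len(expansion) > len(best_expansion):
            best_prefix, best_expansion = prefix, expansion
    if best_prefix is None:
        return iri  # no prefix applies: leave the IRI untouched
    return best_prefix + ":" + iri[len(best_expansion):]


# Hypothetical map with two overlapping expansions.
curie_map = {
    "OBO": "http://purl.obolibrary.org/obo/",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}
curie = compact_iri("http://purl.obolibrary.org/obo/UBERON_0000001", curie_map)
```

With overlapping expansions, both prefixes match the IRI, and the rule picks `UBERON` because its expansion is longer.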

> Non-normative notes
>
> 1. The CURIE map may not be needed at all if all named resources and
> predicates are always serialised as full-length IRIs.
Collaborator

This is only true if we decide that SSSOM-RDF is an export-only format. I know we said this for SSSOM-OWL, but I forgot where we ended up with RDF.

Contributor Author

This is why there is a subsection “serialisation of identifiers” in the “Special considerations” section.

The SSSOM/RDF format can be both a read/write format that is equivalent to SSSOM/TSV or SSSOM/JSON (meaning that one can roundtrip between all those formats without loss of information) and an export format. It all depends on what you want to do with the output file – something the spec cannot know in advance.

In a sense, SSSOM/RDF is an “à la carte” format. You want to preserve the ability to roundtrip back to SSSOM/TSV? You can, just make sure to include the CURIE map and any extension definitions. You are not interested in being able to come back (say, because all you want is to load the set into a graph database and forget about it)? Then you don’t have to worry about the CURIE map or extension definitions at all.

> declared (using the appropriate mechanism for the chosen concrete
> syntax) takes precedence over the possibility of omitting the
> declarations of prefix names that are considered
> [built-in](spec-intro.md#iri-prefixes) in the context of SSSOM.
Collaborator

I would compact this sentence which has many redundant parts (the RDF requirement that all used prefix names must be declared) to:

Prefixes considered built-in by other serialisations MUST be rendered using their fully qualified name (IRI / CURIE).

An `ExtensionDefinition` object has no identifier of any kind and is
always represented by a blank node.

## Special considerations for serialising to RDF
Collaborator

mark as not normative?

sssom:object_label "Gala apple (whole)";
sssom:subject_label "gala"
] .
```
Member

@cthoyt cthoyt Aug 5, 2025

I would expect the triples represented by the axioms to also show up somewhere in the RDF.

Suggested change
```
KF_FOOD:F001 skos:exactMatch FOODON:00002473 .
KF_FOOD:F002 skos:exactMatch FOODON:00003348 .
```

though it's not clear what to do for negated triples

Contributor Author

They appear, but only as reified OWL axioms.

This has been the RDF output produced by SSSOM-Py since the beginning.

Collaborator

It's a valid question though. Without the direct triple, triple stores might not be able to return all terms mapped to ?x with a simple triple pattern. This was only possible in many of the pipelines I had built (which are now gathering dust) because I loaded the SSSOM file into ROBOT and saved it, which I believe automatically injects that triple.

In the OWL serialisation, it seems I have injected them in SSSOM-Py:
https://github.com/mapping-commons/sssom-py/blob/master/tests/validate_data/cob-to-external.tsv.owl

Contributor Author

> Without the direct triple, triple stores might not be able to return all terms mapped to ?x with a simple triple pattern

If that is needed, once the set has been exported to RDF it shouldn’t be hard to process it with some SPARQL to construct a `?subject_id ?predicate_id ?object_id` triple from every `?mapping owl:annotatedSource ?subject_id ; owl:annotatedPredicate ?predicate_id ; owl:annotatedTarget ?object_id` set of triples, before loading it into a triple store.

I wouldn’t mind adding that as an OPTIONAL behaviour for RDF writers, on the condition that it is really optional – that is, if those `?subject_id ?predicate_id ?object_id` triples are absent the set must still be accepted by a SSSOM/RDF reader, as long as the `?mapping owl:annotatedSource ?subject_id ; owl:annotatedPredicate ?predicate_id ; owl:annotatedTarget ?object_id` triples are present.

Something like:

A SSSOM/RDF writer MAY additionally inject for every mapping record a triple of the form ?subject_id ?predicate_id ?object_id. A SSSOM/RDF reader MUST NOT expect the presence of such triples.

If we do allow that, this raises the question (as hinted by @cthoyt) of what to do about negated mappings. Two possibilities:

(A) Don’t care. Negated mappings are treated in the same way as any other mappings. If users don’t want ?subject_id ?predicate_id ?object_id triples for negated mappings, it’s up to them to filter out negated mappings before exporting the set to RDF.

(B) Explicitly exclude negated mappings, as in

A SSSOM/RDF writer MAY additionally inject, for every mapping record that does not have a predicate_modifier of Not, a triple of the form ?subject_id ?predicate_id ?object_id.

Contributor Author

There is also the question of mappings with sssom:NoTermFound, e.g.

| record_id | subject_id | predicate_id | object_id | object_source |
| --- | --- | --- | --- | --- |
| MYMAP:1 | HP:1234 | skos:exactMatch | sssom:NoTermFound | obo:doid.owl |

Rendering them as

MYMAP:1 a owl:Axiom ;
           owl:annotatedSource HP:1234 ;
           owl:annotatedPredicate skos:exactMatch ;
           owl:annotatedTarget sssom:NoTermFound ;
           sssom:object_source obo:doid.owl .

should be perfectly fine, but do we also want a

HP:1234 skos:exactMatch sssom:NoTermFound .

triple as well?

I’d say, we should do for them the same thing as we do for negated mappings (the two possibilities outlined in my previous message).

Collaborator

I am now on board with:

> I wouldn’t mind adding that as an OPTIONAL behaviour for RDF writers, on the condition that it is really optional – that is, if those `?subject_id ?predicate_id ?object_id` triples are absent the set must still be accepted by a SSSOM/RDF reader, as long as the `?mapping owl:annotatedSource ?subject_id ; owl:annotatedPredicate ?predicate_id ; owl:annotatedTarget ?object_id` triples are present.

With regards to the special cases:

> I’d say, we should do for them the same thing as we do for negated mappings (the two possibilities outlined in my previous message).

I agree. However, because these are very different use cases, I would prefer that implementations control the injection of triples for sssom:NoTermFound mappings and for negated mappings with separate conditionals. I would probably never inject either, but there may be use cases for doing so.
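
A minimal sketch of the optional direct-triple injection discussed in this thread, with the two special cases behind separate switches. Mapping records are modelled as plain dictionaries and triples as tuples; the function name `direct_triples`, the two flag names, their defaults, and the `ex:` record are assumptions made for illustration, not part of the spec or of any implementation.

```python
NEGATED = "Not"                      # predicate_modifier value for negated mappings
NO_TERM_FOUND = "sssom:NoTermFound"  # placeholder entity for "no match exists"


def direct_triples(mappings, include_negated=False, include_no_term_found=False):
    """Build optional (subject_id, predicate_id, object_id) triples."""
    triples = []
    for m in mappings:
        subject, predicate, obj = m["subject_id"], m["predicate_id"], m["object_id"]
        # Each special case is controlled by its own conditional.
        if m.get("predicate_modifier") == NEGATED and not include_negated:
            continue
        if NO_TERM_FOUND in (subject, obj) and not include_no_term_found:
            continue
        triples.append((subject, predicate, obj))
    return triples


records = [
    {"subject_id": "KF_FOOD:F001", "predicate_id": "skos:exactMatch",
     "object_id": "FOODON:00002473"},
    {"subject_id": "HP:1234", "predicate_id": "skos:exactMatch",
     "object_id": "sssom:NoTermFound"},
    # Hypothetical negated mapping, added only to exercise the first switch.
    {"subject_id": "ex:A", "predicate_id": "skos:exactMatch",
     "object_id": "ex:B", "predicate_modifier": "Not"},
]
default_output = direct_triples(records)
```

Because the injected triples are optional, a reader built along these lines would still have to rely on the reified `owl:Axiom` form for the mappings themselves.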

as an alternative to

```ttl
?mapping sssom:predicate_modifier sssom:NegatedPredicate .
Member

why not just decide on one and say "this is the standard"?

Contributor Author

@gouttegd gouttegd Aug 5, 2025

That’s what we did. This:

?mapping sssom:predicate_modifier sssom:NegatedPredicate .

is the standard.

But the decision to standardize that form was only made a few weeks ago. Before that, both SSSOM-Java and SSSOM-Py had been producing the string literal form (SSSOM-Java for the past 8 months – since version 1.1, which introduced RDF support – and SSSOM-Py for as long as it has existed). In fact SSSOM-Py still produces the string literal form to this day.

So for backwards compatibility (which is the entire point of this section, “compatibility with pre-standard representations”), implementations MAY support the old string literal form, even though it is not the standard form.
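
A minimal sketch of how a reader might accept both forms. It assumes that the pre-standard string literal is `"Not"` (the thread does not spell out the literal value) and writes the standard IRI form as a CURIE; the function name `is_negated` and the `accept_legacy` flag are hypothetical.

```python
STANDARD_FORM = "sssom:NegatedPredicate"  # standard form: a named resource
LEGACY_LITERAL = "Not"                    # assumed pre-standard string literal


def is_negated(value, is_literal, accept_legacy=True):
    """Interpret the object of a sssom:predicate_modifier triple."""
    if is_literal:
        # Old form: only honoured when backwards compatibility is enabled.
        return accept_legacy and value == LEGACY_LITERAL
    return value == STANDARD_FORM


standard = is_negated("sssom:NegatedPredicate", is_literal=False)
legacy = is_negated("Not", is_literal=True)
strict = is_negated("Not", is_literal=True, accept_legacy=False)
```

Keeping the legacy branch behind a flag mirrors the MAY in the comment above: support for the old form is optional, while the standard form is always recognised.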

Collaborator

Makes sense.

3 participants