Improvements to the SSSOM workflows#1208

Merged
gouttegd merged 11 commits into master from sssom-merge-sets
Apr 15, 2025

@gouttegd
Contributor

This PR primarily adds a new value for the maintenance field of SSSOM mapping sets declared in the sssom_mappingset_group: merged.

Given the following configuration:

sssom_mappingset_group:
  products:
    - id: my-set-a
    - id: my-set-b
    - id: my-merged-set
      maintenance: merged
      source_mappings:
        - my-set-a
        - my-set-b

The my-merged-set.sssom.tsv set will be created by merging the my-set-a.sssom.tsv and my-set-b.sssom.tsv sets.

If source_mappings is not set for a merged set, the default behaviour is to use all the other sets as source, so the example above is actually equivalent to:

sssom_mappingset_group:
  products:
    - id: my-set-a
    - id: my-set-b
    - id: my-merged-set
      maintenance: merged

(This is as discussed in #106.)
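
The default described above can be sketched in Python. This is only an illustration, not the code from odk.py: the `MappingSet` class, its `manual` default and the `resolve_sources` helper are hypothetical names, and the exclusion of other merged sets from the default follows the commit message for this feature.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MappingSet:
    """Hypothetical stand-in for one entry under 'products'."""
    id: str
    maintenance: str = "manual"  # assumed default, for illustration only
    source_mappings: Optional[List[str]] = None


def resolve_sources(products: List[MappingSet]) -> None:
    """Fill in 'source_mappings' for every merged set, in place."""
    known = {p.id for p in products}
    for product in products:
        if product.maintenance != "merged":
            continue
        if product.source_mappings is None:
            # Default: merge all the other sets that are not themselves merged.
            product.source_mappings = [
                p.id for p in products if p.maintenance != "merged"
            ]
        else:
            # Explicit list: every name must refer to a declared set.
            for source in product.source_mappings:
                if source not in known:
                    raise Exception(f"Unknown source mapping set '{source}'")


products = [
    MappingSet("my-set-a"),
    MappingSet("my-set-b"),
    MappingSet("my-merged-set", maintenance="merged"),
]
resolve_sources(products)
print(products[2].source_mappings)  # ['my-set-a', 'my-set-b']
```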

This PR also brings a few other changes to the SSSOM workflows:

A. Another maintenance value, custom, to explicitly declare that a set is to be built using a custom rule in the custom Makefile (similarly to the module_type: custom for import modules).

B. The possibility to use the xref-extract command of the SSSOM ROBOT plugin, rather than sssom-py parse, to extract mappings from an ontology. This is done with a new group-level option, mapping_extractor. The default value for that option is sssom-py, which preserves the existing behaviour of using sssom-py parse. If set to robot, mappings would be extracted using robot sssom:xref-extract.

C. New helper rules to force mapping sets to be re-serialized, similarly to the normalize_src rule (normalize-sssom-X to re-serialize mapping set X, normalize_mappings to re-serialize all sets).

closes #106

Re-format the SSSOM section in the Makefile to make the Jinja
conditionals easier to follow.

Also move the validate_mappings target to the top of the section, right
after the validate-sssom-% target that it uses.
The auto-generated SSSOM mapping set template contains a bogus
declaration for the semapv prefix name (associated with the
<https://w3id.org/semapv/> prefix instead of the expected
<https://w3id.org/semapv/vocab/>). This is an error as per the SSSOM
spec (which forbids associating a built-in prefix name to another
prefix), so we fix that.
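
As an illustration of the rule being enforced, here is a minimal Python sketch that flags a prefix map in which a built-in prefix name is bound to an unexpected prefix. The `BUILTIN_PREFIXES` table and the `find_redefined_builtins` helper are hypothetical; only the semapv entry is taken from the message above, and a real check would cover every built-in prefix listed in the SSSOM specification.

```python
# Only the semapv entry comes from the commit message; the table is
# deliberately incomplete and serves as an example.
BUILTIN_PREFIXES = {
    "semapv": "https://w3id.org/semapv/vocab/",
}


def find_redefined_builtins(curie_map):
    """Return the built-in prefix names bound to an unexpected prefix."""
    return sorted(
        name
        for name, prefix in curie_map.items()
        if name in BUILTIN_PREFIXES and prefix != BUILTIN_PREFIXES[name]
    )


# The bogus declaration from the old template is reported...
print(find_redefined_builtins({"semapv": "https://w3id.org/semapv/"}))
# ...while the corrected one is not.
print(find_redefined_builtins({"semapv": "https://w3id.org/semapv/vocab/"}))
```
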
Add a new value for the 'maintenance' field of a SSSOM mapping set
product: 'merged'.

A "merged" mapping set is obtained simply by merging other mapping sets
together. The sets that are to be merged should be listed in a new
'source_mappings' key; if that key is absent, then the set is made by
merging all other (non-merged) sets.
For consistency with other product types, add an explicit 'custom' type
of SSSOM mapping set product.

The generated rule for a 'custom' SSSOM set does nothing but emit an
error message, reminding the user that they must override the rule in
their custom Makefile.

(Users could also declare the set as being of type 'manual', which
creates a rule that does nothing but touch the file, but the 'custom'
type makes it more explicit that the set is to be generated by some
custom rule.)
When building a SSSOM mapping set by extracting mappings from an
ontology (maintenance type set to 'extract'), if the ontology to extract
the mappings from is not specified (no 'source_file' key), we default to
extracting from the preprocessed edit file.

We do not change that logic, but we move it from the Makefile template
to the odk.py script, so that the template can just dereference
'mapping.source_file' without having to check whether that field had
been set or not.
Add a new option to the SSSOM mapping set group: 'mapping_extractor'.
That option determines the tool to be used to extract mappings from an
ontology, to create the mapping sets of type 'extract'. The default
extractor is 'sssom-py', which corresponds to the existing behaviour.

The new value is 'robot', which causes the mappings to be extracted with
the sssom:xref-extract command of the SSSOM ROBOT plugin.
Add a helper rule, 'normalize-sssom-%', to force the re-serialisation of
a SSSOM mapping set using SSSOM-CLI. Can be useful especially for
manually maintained sets.

Also add a 'normalize_mappings' rule, which merely runs the above rule
on all declared mapping sets.
@gouttegd gouttegd self-assigned this Apr 12, 2025
@gouttegd gouttegd added this to the 1.6 milestone Apr 12, 2025
Since SSSOM-Java can now be used in a standard workflow, it belongs in
the ODKLite image.

Also upgrade to the latest 1.4.0 version.
The sssom-py mapping extractor generates a bogus mapping set if it does
not find any mapping to extract (no header line), which prevents
SSSOM-CLI from merging it with the other sets.
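
To make the problem concrete, here is a small Python sketch that detects such a header-less file. It is not the fix applied in this PR: `has_header_line` is a hypothetical helper, and it assumes the usual SSSOM/TSV layout, in which metadata lines start with '#' and the first other non-blank line is the column header.

```python
def has_header_line(tsv_text: str) -> bool:
    """Return True if the text has a line that is neither blank nor '#' metadata."""
    for line in tsv_text.splitlines():
        if line.strip() and not line.startswith("#"):
            return True
    return False


# A file with only a metadata block, as produced when nothing is extracted.
empty_set = "#curie_map:\n#  semapv: https://w3id.org/semapv/vocab/\n"
# A file with a metadata block followed by a column header.
valid_set = empty_set + "subject_id\tpredicate_id\tobject_id\n"

print(has_header_line(empty_set))  # False
print(has_header_line(valid_set))  # True
```
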
@gouttegd gouttegd requested a review from matentzn April 12, 2025 22:30
When using sssom:xref-extract to extract mappings from an ontology,
preserve any existing metadata in the original SSSOM file.
matentzn previously approved these changes Apr 14, 2025
Contributor

@matentzn matentzn left a comment

Looks fantastic. Two optional comments!

        raise Exception(f"Unknown source mapping set '{source}'")
elif product.maintenance == "extract":
    if product.source_file is None:
        product.source_file = "$(EDIT_PREPROCESSED)"
Contributor

I am assuming you thought this through - I recently struggled hard with a pipeline that provided a merged sssom mapping set which was partially extracted from upstream, but partially generated by the pipeline itself. So, something like: build a bridge, then extract a mapping set from that, then merge that into a mapping set you publish. This caused some circularity. I don't think this is the case here, just wanted to share my experience.

Contributor Author

I am not sure I understand your concern, and how it applies here.

In any case, the only change here is that the logic to handle the absence of a source_file key has been moved from within the Makefile template to the odk.py script. This is so the template can simply say

$(MAPPINGDIR)/{{ mapping.id }}.sssom.tsv: {{ mapping.source_file }}

instead of

$(MAPPINGDIR)/{{ mapping.id }}.sssom.tsv: {% if mapping.source_file is not none %}{{ mapping.source_file }}{% else %}$(EDIT_PREPROCESSED){% endif %}

This is functionally identical to what we had before, only with a (slightly) more readable template.
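
The effect of moving the default can be sketched in Python. This is a simplified stand-in, not the actual odk.py code: the `ExtractedSet` class and the two helpers are hypothetical, and `rule_header` only mimics what the Jinja template renders once `source_file` is guaranteed to be set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedSet:
    """Hypothetical stand-in for a mapping set product."""
    id: str
    maintenance: str = "extract"
    source_file: Optional[str] = None


def apply_default_source(product: ExtractedSet) -> None:
    """Set the default source once, before the template is rendered."""
    if product.maintenance == "extract" and product.source_file is None:
        product.source_file = "$(EDIT_PREPROCESSED)"


def rule_header(product: ExtractedSet) -> str:
    """Mimic the simplified template line: no conditional is needed."""
    return f"$(MAPPINGDIR)/{product.id}.sssom.tsv: {product.source_file}"


product = ExtractedSet("my-set")
apply_default_source(product)
print(rule_header(product))
# $(MAPPINGDIR)/my-set.sssom.tsv: $(EDIT_PREPROCESSED)
```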

odk/odk.py Outdated
"""If set to True, mappings are copied to the release directory."""

mapping_extractor : str = "sssom-py"
"""The tool to use to extract mappings from an ontology ('sssom-py' or 'robot')."""
Contributor

Maybe to be more future proof this could be sssom-py and sssom-java (in case someone implements a stellar mapping extraction method in robot itself :P). Ok fantasising now, but.. suggestion.

Contributor Author

> in case someone implements a stellar mapping extraction method in robot itself

I think that’s highly unlikely :D but OK, I have no objection to that.

Since the SSSOM-Py-based mapping extraction process is called
'sssom-py', we should call the SSSOM-Java-based one 'sssom-java',
instead of 'robot'.
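
With that rename, the option accepts 'sssom-py' (the default) or 'sssom-java'. The Python sketch below shows one way such a field could be validated. The `MappingSetGroup` class and `ALLOWED_EXTRACTORS` tuple are hypothetical, and whether odk.py itself rejects unknown values is an assumption not confirmed by this PR.

```python
from dataclasses import dataclass

# Values after the rename; 'robot' is no longer accepted in this sketch.
ALLOWED_EXTRACTORS = ("sssom-py", "sssom-java")


@dataclass
class MappingSetGroup:
    """Hypothetical stand-in for the SSSOM mapping set group options."""
    mapping_extractor: str = "sssom-py"

    def __post_init__(self) -> None:
        # Reject anything other than the two documented extractor names.
        if self.mapping_extractor not in ALLOWED_EXTRACTORS:
            raise ValueError(
                f"Unknown mapping extractor '{self.mapping_extractor}'"
            )


print(MappingSetGroup().mapping_extractor)              # sssom-py
print(MappingSetGroup("sssom-java").mapping_extractor)  # sssom-java
```
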
@gouttegd gouttegd merged commit b5608d9 into master Apr 15, 2025
1 check passed
@gouttegd gouttegd deleted the sssom-merge-sets branch April 15, 2025 10:24

Development

Successfully merging this pull request may close these issues.

Add a standard pipeline step to export xrefs/mappings in SSSOM format
