Contributor

@gouttegd gouttegd commented Sep 1, 2025

Add a new infer_cardinality method to the MappingSetDataFrame to fill the mapping_cardinality slot with computed cardinality values.

The approach used here is more or less a direct Python translation of my existing implementation in SSSOM-Java. (As such, it may not be as “Pythonic” as the rest of SSSOM-Py’s codebase, but it does the job.)

The gist of it is that we iterate over the entire set of records a first time to populate two hash tables: one that associates a subject to all the different objects it is mapped to, and one that associates an object to all the different subjects it is mapped to. Then we can iterate over the records a second time, and for every record we can immediately get (1) the number of different objects mapped to the same subject and (2) the number of different subjects mapped to the same object; the combination of those two values gives us the cardinality we are looking for.
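The two passes described above can be sketched as follows. This is a minimal illustration and not the code from this PR: the function name `infer_cardinalities`, the use of plain `(subject, object)` tuples instead of a data frame, and the convention that the left side of the result reflects the number of subjects and the right side the number of objects are all assumptions made for the example.

```python
from collections import defaultdict


def infer_cardinalities(records):
    """Return one cardinality string per (subject, object) pair.

    Hypothetical helper for illustration only; `records` is a list of
    (subject, object) tuples rather than a MappingSetDataFrame.
    """
    objects_by_subject = defaultdict(set)
    subjects_by_object = defaultdict(set)

    # First pass: populate the two hash tables.
    for subject, obj in records:
        objects_by_subject[subject].add(obj)
        subjects_by_object[obj].add(subject)

    # Second pass: look up both counts for each record and combine them.
    result = []
    for subject, obj in records:
        n_objects = len(objects_by_subject[subject])   # objects for this subject
        n_subjects = len(subjects_by_object[obj])      # subjects for this object
        left = "1" if n_subjects == 1 else "n"
        right = "1" if n_objects == 1 else "n"
        result.append(f"{left}:{right}")
    return result


example = [("A", "X"), ("A", "Y"), ("B", "Z"), ("C", "W"), ("D", "W")]
cardinalities = infer_cardinalities(example)
```

Because sets are used as the table values, duplicate records do not inflate the counts, and each lookup in the second pass is a constant-time operation.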

To deal with the concept of "scope", the "subjects" and "objects" that we use to fill the hash tables are not made of only the "subject_id" slot or the "object_id" slot, but also of all the slots that define the scope. For example, if the scope is ["predicate_id"], then given the following records:

  subject_id   predicate_id      object_id
  DO:1234      skos:exactMatch   HP:5678
  DO:1234      skos:broadMatch   MONDO:5678

  • for the first record the "subject" string will contain both DO:1234 and skos:exactMatch, and the "object" string will contain both HP:5678 and skos:exactMatch;
  • for the second record the "subject" string will contain both DO:1234 and skos:broadMatch, and the "object" string will contain both MONDO:5678 and skos:broadMatch.

This way, the "subject" and "object" strings of these records will occupy different entries in the hash tables, thereby ensuring that they are counted separately (as they should since they are in different "scopes").
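To make the effect of the scope concrete, here is a small sketch applied to the two records above. It is an assumption-laden illustration, not the PR's code: the function `scoped_cardinalities` is hypothetical, records are plain dictionaries, and tuples stand in for the composite "subject" and "object" strings mentioned in the description.

```python
from collections import defaultdict


def scoped_cardinalities(records, scope=()):
    """Compute a cardinality string per record, within the given scope.

    Hypothetical helper for illustration; `scope` is a sequence of slot
    names whose values are folded into the hash-table keys.
    """

    def key(record, id_slot):
        # The ID plus the values of every scope slot form the table key.
        return (record[id_slot],) + tuple(record.get(slot) for slot in scope)

    objects_by_subject = defaultdict(set)
    subjects_by_object = defaultdict(set)
    for record in records:
        subject, obj = key(record, "subject_id"), key(record, "object_id")
        objects_by_subject[subject].add(obj)
        subjects_by_object[obj].add(subject)

    result = []
    for record in records:
        subject, obj = key(record, "subject_id"), key(record, "object_id")
        left = "1" if len(subjects_by_object[obj]) == 1 else "n"
        right = "1" if len(objects_by_subject[subject]) == 1 else "n"
        result.append(f"{left}:{right}")
    return result


records = [
    {"subject_id": "DO:1234", "predicate_id": "skos:exactMatch", "object_id": "HP:5678"},
    {"subject_id": "DO:1234", "predicate_id": "skos:broadMatch", "object_id": "MONDO:5678"},
]
unscoped = scoped_cardinalities(records)
scoped = scoped_cardinalities(records, scope=["predicate_id"])
```

Without a scope, `DO:1234` is seen as mapped to two objects; with `predicate_id` in the scope, each record lands in its own table entry and is counted on its own.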

@gouttegd gouttegd self-assigned this Sep 1, 2025
Add a new `infer_cardinality` method to the `MappingSetDataFrame` to
fill the `mapping_cardinality` slot with computed cardinality values.

The approach used here is more or less a direct Python translation of my
existing implementation in SSSOM-Java.

The gist of it is that we iterate over the entire set of records a first
time to populate two hash tables: one that associates a subject to all
the different objects it is mapped to, and one that associates an object
to all the different subjects it is mapped to. Then we can iterate over
the records a second time, and for every record we can immediately get
(1) the number of different objects mapped to the same subject and (2)
the number of different subjects mapped to the same object; the
combination of those two values gives us the cardinality we are looking
for.

To deal with the concept of "scope", the "subjects" and "objects" that
we use to fill the hash tables are not made of only the "subject_id"
slot or the "object_id" slot, but also of all the slots that define the
scope. For example, if the scope is `["predicate_id"]`, then for the
following record:

  subject_id   predicate_id      object_id
  DO:1234      skos:exactMatch   HP:5678

the "subject" string will contain both `DO:1234` and `skos:exactMatch`,
and the "object" string will contain both `HP:5678` and
`skos:exactMatch`.
@gouttegd gouttegd marked this pull request as ready for review September 2, 2025 18:55
# objects mapped to each subject and vice versa
for _, row in self.df.iterrows():
    if (
        row.get("subject_id") == "sssom:NoTermFound"
Member

Can you use constants for the fields and for sssom:NoTermFound? I think many are already in the sssom.constants module.

Contributor Author

Frankly, I am not convinced it brings any real benefit (it’s not as if those values could change), but I am OK with that, if only for consistency with the rest of the code. 👍

Contributor Author

(Aside: is the sssom.constants module entirely hand-written? Seems to me that most of those constant declarations could, and arguably should, be generated from the LinkML schema…)
