Commit cbbbe48

Add the infer_cardinality method.
Add a new `infer_cardinality` method to the `MappingSetDataFrame` to fill the `mapping_cardinality` slot with computed cardinality values.

The approach used here is more or less a direct Python translation of my existing implementation in SSSOM-Java.

The gist of it is that we iterate over the entire set of records a first time to populate two hash tables: one that associates a subject to all the different objects it is mapped to, and one that associates an object to all the different subjects it is mapped to. Then we can iterate over the records a second time, and for every record we can immediately get (1) the number of different objects mapped to the same subject and (2) the number of different subjects mapped to the same object; the combination of those two values gives us the cardinality we are looking for.

To deal with the concept of "scope", the "subjects" and "objects" that we use to fill the hash tables are not made of only the "subject_id" slot or the "object_id" slot, but also of all the slots that define the scope. For example, if the scope is `["predicate_id"]`, then for the following record:

    subject_id  predicate_id     object_id
    DO:1234     skos:exactMatch  HP:5678

the "subject" string will contain both `DO:1234` and `skos:exactMatch`, and the "object" string will contain both `HP:5678` and `skos:exactMatch`.
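The two-pass idea described above can be sketched independently of pandas. This is a minimal illustration, not the code from this commit: the function names `scoped_key` and `infer_cardinalities` are hypothetical, records are plain dictionaries, and the identifiers `DO:9999` and `HP:0001` are made up for the example. Literal mappings and `sssom:NoTermFound` are deliberately left out.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scoped_key(record: Dict[str, str], side: str, scope: Sequence[str]) -> Tuple[str, ...]:
    """Combine the subject (or object) ID with the values of all scope slots."""
    return (record[f"{side}_id"],) + tuple(record.get(slot, "") for slot in scope)


def infer_cardinalities(records: List[Dict[str, str]], scope: Sequence[str] = ()) -> List[str]:
    """Return one cardinality string per record (hypothetical helper)."""
    objects_by_subject = defaultdict(set)  # distinct objects for each subject
    subjects_by_object = defaultdict(set)  # distinct subjects for each object

    # First pass: fill both hash tables.
    for rec in records:
        subj = scoped_key(rec, "subject", scope)
        obj = scoped_key(rec, "object", scope)
        objects_by_subject[subj].add(obj)
        subjects_by_object[obj].add(subj)

    # Second pass: look up both counts and combine them.
    result = []
    for rec in records:
        n_subjects = len(subjects_by_object[scoped_key(rec, "object", scope)])
        n_objects = len(objects_by_subject[scoped_key(rec, "subject", scope)])
        left = "1" if n_subjects == 1 else "n"
        right = "1" if n_objects == 1 else "n"
        result.append(f"{left}:{right}")
    return result


records = [
    {"subject_id": "DO:1234", "predicate_id": "skos:exactMatch", "object_id": "HP:5678"},
    {"subject_id": "DO:1234", "predicate_id": "skos:broadMatch", "object_id": "HP:0001"},
    {"subject_id": "DO:9999", "predicate_id": "skos:exactMatch", "object_id": "HP:5678"},
]

print(infer_cardinalities(records))                          # ['n:n', '1:n', 'n:1']
print(infer_cardinalities(records, scope=["predicate_id"]))  # ['n:1', '1:1', 'n:1']
```

With the `predicate_id` scope, the `skos:broadMatch` record no longer counts against the `skos:exactMatch` records, which is why the first record goes from `n:n` to `n:1`.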
1 parent cf75d07 commit cbbbe48

1 file changed: +94 -0 lines changed
src/sssom/util.py

Lines changed: 94 additions & 0 deletions
@@ -393,6 +393,100 @@ def condense(self) -> List[str]:
         self.df.drop(columns=condensed, inplace=True)
         return condensed
 
+    def infer_cardinality(self, scope: List[str] = None) -> None:
+        """Infer cardinality values in the set.
+
+        This method will automatically fill the `mapping_cardinality` slot for
+        all records in the set, overwriting any pre-existing values.
+
+        See <https://mapping-commons.github.io/sssom/spec-model/#mapping-cardinality-and-cardinality-scope>
+        for more information about cardinality computation,
+        <https://mapping-commons.github.io/sssom/spec-model/#literal-mappings>
+        for how to deal with literal mapping records, and
+        <https://mapping-commons.github.io/sssom/spec-model/#representing-unmapped-entities>
+        for how to deal with mapping records involving `sssom:NoTermFound`.
+
+        :param scope: A list of slot names that defines the subset of the
+                      records in which cardinality will be computed. For
+                      example, with a scope of `['predicate_id']`, for any
+                      given record the cardinality will be computed relatively
+                      to the subset of records that have the same predicate.
+                      The default is an empty list, meaning that cardinality is
+                      computed relatively to the entire set of records.
+        """
+        if scope is None:
+            scope = []
+        subjects_by_object = {}  # Unique subjects for any given object
+        objects_by_subject = {}  # Unique objects for any given subject
+
+        # Helper function to transform a row into a string that represents
+        # a subject (or object) in a given scope; `side` is either `subject`
+        # or `object`.
+        def _to_string(row, side):
+            # We prepend a one-letter code (`L` or `E`) to the actual subject
+            # or object so that literal and non-literal mapping records are
+            # always distinguishable and can be counted separately.
+            if row.get(f"{side}_type") == "rdfs literal":
+                s = "L\0" + row.get(f"{side}_label", "")
+            else:
+                s = "E\0" + row.get(f"{side}_id", "")
+            for slot in scope:
+                s += "\0" + row.get(slot, "")
+            return s
+
+        # We iterate over the records a first time to collect the different
+        # objects mapped to each subject and vice versa
+        for _, row in self.df.iterrows():
+            if (
+                row.get("subject_id") == "sssom:NoTermFound"
+                or row.get("object_id") == "sssom:NoTermFound"
+            ):
+                # Mappings to sssom:NoTermFound are ignored for cardinality computations
+                continue
+
+            subj = _to_string(row, "subject")
+            obj = _to_string(row, "object")
+
+            subjects_by_object.setdefault(obj, set()).add(subj)
+            objects_by_subject.setdefault(subj, set()).add(obj)
+
+        # Second iteration to compute the actual cardinality values. Since we
+        # must not modify a row while we are iterating over the dataframe, we
+        # collect the values in a separate array.
+        cards = []
+        for _, row in self.df.iterrows():
+            # Special cases involving sssom:NoTermFound on either side
+            if row.get("subject_id") == "sssom:NoTermFound":
+                if row.get("object_id") == "sssom:NoTermFound":
+                    cards.append("0:0")
+                else:
+                    cards.append("0:1")
+            elif row.get("object_id") == "sssom:NoTermFound":
+                cards.append("1:0")
+            else:
+                # General case
+                n_subjects = len(subjects_by_object[_to_string(row, "object")])
+                n_objects = len(objects_by_subject[_to_string(row, "subject")])
+
+                if n_subjects == 1:
+                    if n_objects == 1:
+                        cards.append("1:1")
+                    else:
+                        cards.append("1:n")
+                else:
+                    if n_objects == 1:
+                        cards.append("n:1")
+                    else:
+                        cards.append("n:n")
+
+        # Add the computed values to the dataframe
+        self.df["mapping_cardinality"] = cards
+        if len(scope) > 0:
+            self.df["cardinality_scope"] = "|".join(scope)
+        else:
+            # No scope, so remove any pre-existing "cardinality_scope" column
+            self.df.drop(columns="cardinality_scope", inplace=True, errors="ignore")
+
 
 def _standardize_curie_or_iri(curie_or_iri: str, *, converter: Converter) -> str:
     """Standardize a CURIE or IRI, returning the original if not possible.

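To see why the `_to_string` helper in the diff prepends a type code and joins its parts with NUL characters, here is a standalone restatement of the same key construction over plain dictionaries. The name `make_key` and the sample rows are hypothetical and only serve the illustration; the slot names and the `"rdfs literal"` value are the ones used in the diff.

```python
from typing import Mapping, Sequence

SEP = "\0"  # NUL is assumed never to occur in slot values, so it is a safe separator


def make_key(row: Mapping[str, str], side: str, scope: Sequence[str] = ()) -> str:
    """Build the string that identifies a subject or object within a scope."""
    if row.get(f"{side}_type") == "rdfs literal":
        # Literal side: identified by its label, tagged with "L"
        key = "L" + SEP + row.get(f"{side}_label", "")
    else:
        # Entity side: identified by its ID, tagged with "E"
        key = "E" + SEP + row.get(f"{side}_id", "")
    for slot in scope:
        key += SEP + row.get(slot, "")
    return key


# An entity and a literal that happen to share the same text get different keys.
entity = {"subject_id": "HP:5678"}
literal = {"subject_type": "rdfs literal", "subject_label": "HP:5678"}
assert make_key(entity, "subject") != make_key(literal, "subject")

# The separator keeps ("A", "BC") and ("AB", "C") apart; plain concatenation
# would yield "ABC" for both.
row_a = {"subject_id": "A", "predicate_id": "BC"}
row_b = {"subject_id": "AB", "predicate_id": "C"}
assert make_key(row_a, "subject", ["predicate_id"]) != make_key(row_b, "subject", ["predicate_id"])
```

A tuple would work equally well as a dictionary key in Python; the string form is closer to what a direct translation from another language would produce.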