Commit 9ecdb2e

Add the infer_cardinality method.
Add a new `infer_cardinality` method to the `MappingSetDataFrame` to fill the `mapping_cardinality` slot with computed cardinality values.

The approach used here is more or less a direct Python translation of my existing implementation in SSSOM-Java. The gist of it is that we iterate over the entire set of records a first time to populate two hash tables: one that associates a subject to all the different objects it is mapped to, and one that associates an object to all the different subjects it is mapped to.

Then we can iterate over the records a second time, and for every record we can immediately get (1) the number of different objects mapped to the same subject and (2) the number of different subjects mapped to the same object; the combination of those two values gives us the cardinality we are looking for.

To deal with the concept of "scope", the "subjects" and "objects" that we use to fill the hash tables are not made of only the "subject_id" slot or the "object_id" slot, but also of all the slots that define the scope. For example, if the scope is `["predicate_id"]`, then for the following record:

    subject_id  predicate_id     object_id
    DO:1234     skos:exactMatch  HP:5678

the "subject" string will contain both `DO:1234` and `skos:exactMatch`, and the "object" string will contain both `HP:5678` and `skos:exactMatch`.
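The two-pass idea described in the message can be sketched on plain dictionaries, without pandas. This is only an illustration under stated assumptions, not the committed code: the names `scoped_key` and `infer_cardinalities` are hypothetical, the second record (`HP:9999`, `skos:broadMatch`) is made up for the example, and literal mappings and `sssom:NoTermFound` are deliberately left out.

```python
from collections import defaultdict


def scoped_key(record, side, scope):
    """Hypothetical helper: build the "subject" or "object" string for a record.

    The key joins the `<side>_id` value with the values of all scope slots,
    so two records only share a key if they agree on every scope slot.
    """
    parts = [record.get(f"{side}_id", "")]
    parts.extend(record.get(slot, "") for slot in scope)
    return "\0".join(parts)


def infer_cardinalities(records, scope=()):
    """Return one cardinality string per record (sketch of the two-pass approach)."""
    objects_by_subject = defaultdict(set)  # subject key -> distinct object keys
    subjects_by_object = defaultdict(set)  # object key -> distinct subject keys

    # First pass: populate both hash tables.
    for rec in records:
        subj = scoped_key(rec, "subject", scope)
        obj = scoped_key(rec, "object", scope)
        objects_by_subject[subj].add(obj)
        subjects_by_object[obj].add(subj)

    # Second pass: look up both counts and combine them.
    result = []
    for rec in records:
        n_subjects = len(subjects_by_object[scoped_key(rec, "object", scope)])
        n_objects = len(objects_by_subject[scoped_key(rec, "subject", scope)])
        left = "1" if n_subjects == 1 else "n"
        right = "1" if n_objects == 1 else "n"
        result.append(f"{left}:{right}")
    return result


# Illustrative data; the second record is invented for this example.
records = [
    {"subject_id": "DO:1234", "predicate_id": "skos:exactMatch", "object_id": "HP:5678"},
    {"subject_id": "DO:1234", "predicate_id": "skos:broadMatch", "object_id": "HP:9999"},
]

print(infer_cardinalities(records))                          # ['1:n', '1:n']
print(infer_cardinalities(records, scope=["predicate_id"]))  # ['1:1', '1:1']
```

With an empty scope, `DO:1234` is mapped to two different objects, so both records come out as `1:n`. Once `predicate_id` is part of the scope, the two records no longer share a "subject" string and each becomes `1:1`.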
1 parent cf75d07 commit 9ecdb2e

1 file changed: +92 -0 lines changed
src/sssom/util.py

Lines changed: 92 additions & 0 deletions
@@ -393,6 +393,98 @@ def condense(self) -> List[str]:
         self.df.drop(columns=condensed, inplace=True)
         return condensed
 
+    def infer_cardinality(self, scope: List[str] = []) -> None:
+        """Infer cardinality values in the set.
+
+        This method will automatically fill the `mapping_cardinality` slot for
+        all records in the set, overwriting any pre-existing values.
+
+        See <https://mapping-commons.github.io/sssom/spec-model/#mapping-cardinality-and-cardinality-scope>
+        for more information about cardinality computation,
+        <https://mapping-commons.github.io/sssom/spec-model/#literal-mappings>
+        for how to deal with literal mapping records, and
+        <https://mapping-commons.github.io/sssom/spec-model/#representing-unmapped-entities>
+        for how to deal with mapping records involving `sssom:NoTermFound`.
+
+        :param scope: A list of slot names that defines the subset of the
+                      records in which cardinality will be computed. For
+                      example, with a scope of `['predicate_id']`, for any
+                      given record the cardinality will be computed relatively
+                      to the subset of records that have the same predicate.
+                      The default is an empty list, meaning that cardinality is
+                      computed relatively to the entire set of records.
+        """
+        subjects_by_object = {}  # Unique subjects for any given object
+        objects_by_subject = {}  # Unique objects for any given subject
+
+        # Helper function to transform a row into a string that represents
+        # a subject (or object) in a given scope; `side` is either `subject`
+        # or `object`.
+        def _to_string(row, side):
+            # We prepend a one-letter code (`L` or `E`) to the actual subject
+            # or object so that literal and non-literal mapping records are
+            # always distinguishable and can be counted separately.
+            if row.get(f"{side}_type") == "rdfs literal":
+                s = "L\0" + row.get(f"{side}_label", "")
+            else:
+                s = "E\0" + row.get(f"{side}_id", "")
+            for slot in scope:
+                s += "\0" + row.get(slot, "")
+            return s
+
+        # We iterate over the records a first time to collect the different
+        # objects mapped to each subject and vice versa
+        for _, row in self.df.iterrows():
+            if (
+                row.get("subject_id") == "sssom:NoTermFound"
+                or row.get("object_id") == "sssom:NoTermFound"
+            ):
+                # Mappings to sssom:NoTermFound are ignored for cardinality computations
+                continue
+
+            subj = _to_string(row, "subject")
+            obj = _to_string(row, "object")
+
+            subjects_by_object.setdefault(obj, set()).add(subj)
+            objects_by_subject.setdefault(subj, set()).add(obj)
+
+        # Second iteration to compute the actual cardinality values. Since we
+        # must not modify a row while we are iterating over the dataframe, we
+        # collect the values in a separate array.
+        cards = []
+        for _, row in self.df.iterrows():
+            # Special cases involving sssom:NoTermFound on either side
+            if row.get("subject_id") == "sssom:NoTermFound":
+                if row.get("object_id") == "sssom:NoTermFound":
+                    cards.append("0:0")
+                else:
+                    cards.append("0:1")
+            elif row.get("object_id") == "sssom:NoTermFound":
+                cards.append("1:0")
+            else:
+                # General case
+                n_subjects = len(subjects_by_object[_to_string(row, "object")])
+                n_objects = len(objects_by_subject[_to_string(row, "subject")])
+
+                if n_subjects == 1:
+                    if n_objects == 1:
+                        cards.append("1:1")
+                    else:
+                        cards.append("1:n")
+                else:
+                    if n_objects == 1:
+                        cards.append("n:1")
+                    else:
+                        cards.append("n:n")
+
+        # Add the computed values to the dataframe
+        self.df["mapping_cardinality"] = cards
+        if len(scope) > 0:
+            self.df["cardinality_scope"] = "|".join(scope)
+        else:
+            # No scope, so remove any pre-existing "cardinality_scope" column
+            self.df.drop(columns="cardinality_scope", inplace=True, errors="ignore")
+
 
 def _standardize_curie_or_iri(curie_or_iri: str, *, converter: Converter) -> str:
     """Standardize a CURIE or IRI, returning the original if not possible.
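
The two special cases handled in the hunk above (literal mappings and `sssom:NoTermFound`) can be looked at in isolation. The sketch below works on plain dictionaries rather than dataframe rows. The function names `side_key` and `no_term_found_cardinality` are hypothetical and do not exist in the library, and the sample records are invented.

```python
NO_TERM_FOUND = "sssom:NoTermFound"


def side_key(record, side):
    """Hypothetical helper mirroring the `L`/`E` prefix used in the diff.

    A literal is keyed on its label and an entity on its ID; the one-letter
    prefix keeps the two kinds apart even when the strings are identical.
    """
    if record.get(f"{side}_type") == "rdfs literal":
        return "L\0" + record.get(f"{side}_label", "")
    return "E\0" + record.get(f"{side}_id", "")


def no_term_found_cardinality(record):
    """Return the fixed cardinality for a record involving sssom:NoTermFound.

    Returns None for ordinary records, which go through the general
    counting logic instead.
    """
    subject_missing = record.get("subject_id") == NO_TERM_FOUND
    object_missing = record.get("object_id") == NO_TERM_FOUND
    if subject_missing and object_missing:
        return "0:0"
    if subject_missing:
        return "0:1"
    if object_missing:
        return "1:0"
    return None


# Invented records: the same string used once as a literal label, once as an ID.
literal = {"subject_type": "rdfs literal", "subject_label": "DO:1234"}
entity = {"subject_id": "DO:1234"}
print(side_key(literal, "subject") != side_key(entity, "subject"))  # True

unmapped = {"subject_id": "DO:1234", "object_id": NO_TERM_FOUND}
print(no_term_found_cardinality(unmapped))  # 1:0
```

Keeping the `sssom:NoTermFound` check separate from the counting step matches the first loop of the diff, which skips such records so that they never inflate the counts of ordinary mappings.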