Add the `infer_cardinality` method. #605
base: master
9ecdb2e to cbbbe48
Add a new `infer_cardinality` method to the `MappingSetDataFrame` to fill the `mapping_cardinality` slot with computed cardinality values.

The approach used here is more or less a direct Python translation of my existing implementation in SSSOM-Java.

The gist of it is that we iterate over the entire set of records a first time to populate two hash tables: one that associates a subject to all the different objects it is mapped to, and one that associates an object to all the different subjects it is mapped to. Then we can iterate over the records a second time, and for every record we can immediately get (1) the number of different objects mapped to the same subject and (2) the number of different subjects mapped to the same object; the combination of those two values gives us the cardinality we are looking for.

To deal with the concept of "scope", the "subjects" and "objects" that we use to fill the hash tables are not made of only the "subject_id" slot or the "object_id" slot, but also of all the slots that define the scope. For example, if the scope is `["predicate_id"]`, then for the following record:

| subject_id | predicate_id | object_id |
|---|---|---|
| DO:1234 | skos:exactMatch | HP:5678 |

the "subject" string will contain both `DO:1234` and `skos:exactMatch`, and the "object" string will contain both `HP:5678` and `skos:exactMatch`.
cbbbe48 to 697922b
src/sssom/util.py (outdated)

    # objects mapped to each subject and vice versa
    for _, row in self.df.iterrows():
        if (
            row.get("subject_id") == "sssom:NoTermFound"
Can you use constants for the fields and `sssom:NoTermFound`? I think many are already in the `sssom.constants` module.
Not frankly convinced it brings any real benefit (it’s not as if those values could change), but OK with that, if only for consistency with the rest of the code. 👍
(Aside: is the `sssom.constants` module entirely hand-written? Seems to me that most of those constant declarations could, and arguably should, be generated from the LinkML schema…)
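For illustration, the reviewer's suggestion could look roughly like the sketch below. The constants are defined locally as stand-ins; the actual names exported by `sssom.constants` may differ, and `has_no_term_found` is a hypothetical helper. The excerpt under review is truncated after the subject check, so the object-side check here is an assumption.

```python
# Local stand-ins for constants; the real names in sssom.constants may differ.
SUBJECT_ID = "subject_id"
OBJECT_ID = "object_id"
NO_TERM_FOUND = "sssom:NoTermFound"


def has_no_term_found(row):
    """Return True if either side of the record is the 'no term found' marker.

    The object-side check is an assumption: the reviewed excerpt only shows
    the subject-side comparison.
    """
    return row.get(SUBJECT_ID) == NO_TERM_FOUND or row.get(OBJECT_ID) == NO_TERM_FOUND
```

`row` can be a plain dict or a pandas row, since both support `.get()`.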
Add a new `infer_cardinality` method to the `MappingSetDataFrame` to fill the `mapping_cardinality` slot with computed cardinality values.

The approach used here is more or less a direct Python translation of my existing implementation in SSSOM-Java. (As such, it may not be as “Pythonic” as the rest of SSSOM-Py’s codebase, but it does the job.)
The gist of it is that we iterate over the entire set of records a first time to populate two hash tables: one that associates a subject to all the different objects it is mapped to, and one that associates an object to all the different subjects it is mapped to. Then we can iterate over the records a second time, and for every record we can immediately get (1) the number of different objects mapped to the same subject and (2) the number of different subjects mapped to the same object; the combination of those two values gives us the cardinality we are looking for.
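The two passes described above can be sketched as follows. This is a minimal standalone illustration, not the code in this PR: it works on plain `(subject, object)` pairs rather than a data frame, ignores scope and `sssom:NoTermFound` handling, and assumes the usual reading of the cardinality strings (left side counts subjects per object, right side counts objects per subject). The function name `two_pass_cardinality` is hypothetical.

```python
from collections import defaultdict


def two_pass_cardinality(pairs):
    """Return one cardinality string ("1:1", "1:n", "n:1", "n:n") per pair."""
    objects_by_subject = defaultdict(set)
    subjects_by_object = defaultdict(set)

    # First pass: record every distinct object per subject and vice versa.
    for subject, obj in pairs:
        objects_by_subject[subject].add(obj)
        subjects_by_object[obj].add(subject)

    # Second pass: look up both counts for each record and combine them.
    result = []
    for subject, obj in pairs:
        left = "1" if len(subjects_by_object[obj]) == 1 else "n"
        right = "1" if len(objects_by_subject[subject]) == 1 else "n"
        result.append(f"{left}:{right}")
    return result


pairs = [("A", "X"), ("A", "Y"), ("B", "Z"), ("C", "W"), ("D", "W")]
print(two_pass_cardinality(pairs))
# ['1:n', '1:n', '1:1', 'n:1', 'n:1']
```

Using sets rather than counters means duplicate records do not inflate the counts, which matches the "number of different objects/subjects" wording above.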
To deal with the concept of "scope", the "subjects" and "objects" that we use to fill the hash tables are not made of only the "subject_id" slot or the "object_id" slot, but also of all the slots that define the scope. For example, if the scope is `["predicate_id"]`, then given the following records:

| subject_id | predicate_id | object_id |
|---|---|---|
| DO:1234 | skos:exactMatch | HP:5678 |
| DO:1234 | skos:broadMatch | MONDO:5678 |

- for the first record, the "subject" string will contain both `DO:1234` and `skos:exactMatch`, and the "object" string will contain both `HP:5678` and `skos:exactMatch`;
- for the second record, the "subject" string will contain both `DO:1234` and `skos:broadMatch`, and the "object" string will contain both `MONDO:5678` and `skos:broadMatch`.

This way, the "subject" and "object" strings of these records will occupy different entries in the hash tables, thereby ensuring that they are counted separately (as they should since they are in different "scopes").
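To make the effect of the scope concrete, here is a small sketch using the two `DO:1234` records from the example. It is an illustration under assumptions, not the PR's implementation: records are plain dicts, keys are tuples rather than the concatenated strings described above, and `scoped_keys` and `scoped_cardinality` are hypothetical names.

```python
from collections import defaultdict


def scoped_keys(record, scope):
    """Build the hash-table keys: the ID plus the values of all scope slots."""
    scope_values = tuple(record[slot] for slot in scope)
    subject_key = (record["subject_id"],) + scope_values
    object_key = (record["object_id"],) + scope_values
    return subject_key, object_key


def scoped_cardinality(records, scope=()):
    """Two-pass cardinality inference where counting is restricted to a scope."""
    objects_by_subject = defaultdict(set)
    subjects_by_object = defaultdict(set)
    for record in records:
        s_key, o_key = scoped_keys(record, scope)
        objects_by_subject[s_key].add(o_key)
        subjects_by_object[o_key].add(s_key)

    result = []
    for record in records:
        s_key, o_key = scoped_keys(record, scope)
        left = "1" if len(subjects_by_object[o_key]) == 1 else "n"
        right = "1" if len(objects_by_subject[s_key]) == 1 else "n"
        result.append(f"{left}:{right}")
    return result


records = [
    {"subject_id": "DO:1234", "predicate_id": "skos:exactMatch", "object_id": "HP:5678"},
    {"subject_id": "DO:1234", "predicate_id": "skos:broadMatch", "object_id": "MONDO:5678"},
]
print(scoped_cardinality(records))                          # ['1:n', '1:n']
print(scoped_cardinality(records, scope=["predicate_id"]))  # ['1:1', '1:1']
```

Without a scope, `DO:1234` is seen as mapped to two objects; with `predicate_id` in the scope, each record lands in its own hash-table entry and is counted as 1:1.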