Open
Description
KGs should conform to the Biolink "type system". This would let us catch systematic errors in the KG early on (at either ingestion or integration).
A few suggestions:
1. Minimally, each node's `category` (and each edge's `predicate`) should be a valid Biolink class (Biolink predicate). Surprisingly, that's not the case: e.g. `biolink:Vitamin` exists in our KG but is NOT a valid Biolink class, and similarly for the predicate `biolink:contraindicated_for`. Also, abstract classes should probably not exist in the `category` of a node.
2. Edge types in Biolink have a specific `domain` and `range`, i.e. subject/object types, which we should enforce in the graph, e.g. `biolink:in_taxon` edges can only connect a `ThingWithTaxon` to an `OrganismTaxon`. This is frequently violated at the moment.
3. Nodes often have more than one `category` (see `all_categories`). Often these are superclass-subclass relations, e.g. a node has `all_categories = ["Protein", "Polypeptide"]`, which is valid. However, certain combinations of `category` on the same node point to errors (e.g. a node can't really be a `Disease` and a `Gene`; the gene might be mutated in the disease, but those are still different concepts that should not be mixed up).
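To make the second point concrete, here is a minimal sketch of a domain/range check that also accepts subclasses. It is not tied to any library: the `PARENTS` and `PREDICATE_TYPES` tables are hand-written, simplified stand-ins for the real Biolink model, and the helper names `ancestors` and `edge_is_valid` are made up for illustration. In practice the hierarchy would presumably be read from the Biolink model itself.

```python
from typing import Dict, List, Set, Tuple

# Toy, simplified hierarchy: class -> direct parents (is_a and mixins).
# Illustrative only; NOT the real Biolink model.
PARENTS: Dict[str, List[str]] = {
    "Gene": ["BiologicalEntity"],
    "BiologicalEntity": ["NamedThing", "ThingWithTaxon"],
    "SmallMolecule": ["ChemicalEntity"],
    "ChemicalEntity": ["NamedThing"],
    "OrganismTaxon": ["NamedThing"],
    "NamedThing": [],
    "ThingWithTaxon": [],
}

# Toy predicate constraints: predicate -> (domain, range).
PREDICATE_TYPES: Dict[str, Tuple[str, str]] = {
    "in_taxon": ("ThingWithTaxon", "OrganismTaxon"),
}


def ancestors(cls: str) -> Set[str]:
    """Return cls together with all of its transitive parents."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def edge_is_valid(subject_cls: str, predicate: str, object_cls: str) -> bool:
    """True if the subject/object classes satisfy the predicate's domain/range."""
    domain, range_ = PREDICATE_TYPES[predicate]
    return domain in ancestors(subject_cls) and range_ in ancestors(object_cls)


if __name__ == "__main__":
    print(edge_is_valid("Gene", "in_taxon", "OrganismTaxon"))
    print(edge_is_valid("SmallMolecule", "in_taxon", "OrganismTaxon"))
```

Checking membership in the ancestor set (rather than equality with the domain) is what lets any subclass of `ThingWithTaxon` pass as a subject.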
Comments
1. is easy, something like:

   ```python
   import bmt
   import polars as pl

   B = bmt.Toolkit('https://raw.githubusercontent.com/biolink/biolink-model/refs/heads/master/biolink-model.yaml')
   valid_classes = [B.get_element(el_name)['class_uri'] for el_name in B.get_all_classes()]

   # df_nodes: the KG's node table, with a list column "all_categories"
   df_nodes.with_columns(
       # not sure how to check for set equality in polars, this one works though:
       valid_biolink=pl.col("all_categories").list.set_intersection(valid_classes).list.len() == pl.col("all_categories").list.len(),
   )
   ```

2. fairly easy too. One just needs to take inheritance into account, e.g. if an edge type has `domain == "ThingWithTaxon"`, any subclass is a valid subject.
3. Harder. Anything that adheres to the Biolink class hierarchy is definitely valid, but the rest is tricky (if the node's `all_categories` violates the class hierarchy, it's not necessarily wrong), e.g.:
   - Genes and proteins are often mixed in a single node (I guess the type should really be `GeneOrGeneProduct` then)
   - A node might be both a `Protein` and a `Drug` (e.g. antibody)
   - Proteins are sometimes `SmallMolecule`s (if it's just a few AAs) ...

   We'd need to come up with a "blacklist" of category combinations that are "wrong" (e.g. `Disease`, `Gene`).
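
The "blacklist" idea could look roughly like the sketch below. Only the (`Disease`, `Gene`) pair comes from this thread; the names `FORBIDDEN_PAIRS` and `forbidden_combinations` are hypothetical, and the category strings are assumed to be unprefixed class names (adjust if the KG stores them as `biolink:` CURIEs).

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set

# Hypothetical blacklist of category pairs that should never share a node.
# Only (Disease, Gene) is taken from the discussion; extend as needed.
FORBIDDEN_PAIRS: Set[FrozenSet[str]] = {
    frozenset({"Disease", "Gene"}),
}


def forbidden_combinations(categories: Iterable[str]) -> List[FrozenSet[str]]:
    """Return every blacklisted category pair present on a single node."""
    pairs = (frozenset(pair) for pair in combinations(set(categories), 2))
    return [pair for pair in pairs if pair in FORBIDDEN_PAIRS]


if __name__ == "__main__":
    # A node that mixes Disease and Gene is flagged ...
    print(forbidden_combinations(["Disease", "Gene", "NamedThing"]))
    # ... while Protein + Drug (e.g. an antibody) is not.
    print(forbidden_combinations(["Protein", "Drug"]))
```

Using an explicit blacklist keeps the tolerated mixtures (gene/protein, protein/drug, protein/small molecule) valid by default, so only combinations someone has deliberately listed are reported.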