Validate KGs against biolink schema #27

@redst4r

Description

KGs should conform to the biolink "type system". This would let us catch systematic errors in the KG early on (either at ingestion or at integration).

A few suggestions:

  1. Minimally, each node's category should be a valid biolink class, and each edge's predicate a valid biolink predicate. Surprisingly, that's not the case: e.g. biolink:Vitamin exists in our KG but is NOT a valid biolink class, and the same goes for the predicate biolink:contraindicated_for. Also, abstract classes should probably not appear among a node's categories.
  2. Edge types in biolink have a specific domain and range, i.e. subject/object types, which we should enforce in the graph. E.g. biolink:in_taxon edges can only connect a ThingWithTaxon to an OrganismTaxon. This is frequently violated at the moment.
  3. Nodes often have more than one category (see all_categories). Often these are superclass-subclass relations, e.g. a node has all_categories = ["Protein", "Polypeptide"], which is valid. However, certain combinations of categories on the same node point to errors (e.g. a node can't really be both a Disease and a Gene; the gene might be mutated in the disease, but those are still different concepts that should not be mixed up).
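To make suggestion 2 concrete, here is a minimal sketch of a domain/range check that respects inheritance. The `PARENTS` table is a hand-written, simplified toy hierarchy (not a faithful copy of the biolink model), and `PARENTS`, `SIGNATURES`, `ancestors` and `edge_is_valid` are hypothetical names used only for illustration. In a real check the hierarchy and the predicate signatures would presumably be read from the biolink model itself rather than hard-coded.

```python
# Toy class hierarchy: class -> direct parents (is_a parents and mixins).
# ASSUMPTION: simplified, hand-written data, not the real biolink model.
PARENTS = {
    "biolink:Gene": ["biolink:BiologicalEntity"],
    "biolink:BiologicalEntity": ["biolink:NamedThing", "biolink:ThingWithTaxon"],
    "biolink:OrganismTaxon": ["biolink:NamedThing"],
    "biolink:Publication": ["biolink:NamedThing"],
    "biolink:NamedThing": [],
    "biolink:ThingWithTaxon": [],
}

# predicate -> (domain, range); the in_taxon signature is the one quoted above.
SIGNATURES = {
    "biolink:in_taxon": ("biolink:ThingWithTaxon", "biolink:OrganismTaxon"),
}


def ancestors(cls):
    """Return cls itself plus all of its transitive parents."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def edge_is_valid(subject_category, predicate, object_category):
    """True if subject/object are the domain/range class or a subclass of it."""
    if predicate not in SIGNATURES:
        return False  # unknown predicates are treated as invalid in this sketch
    domain, range_ = SIGNATURES[predicate]
    return (
        domain in ancestors(subject_category)
        and range_ in ancestors(object_category)
    )


print(edge_is_valid("biolink:Gene", "biolink:in_taxon", "biolink:OrganismTaxon"))
print(edge_is_valid("biolink:Publication", "biolink:in_taxon", "biolink:OrganismTaxon"))
```

In this toy hierarchy the Gene edge passes because Gene inherits ThingWithTaxon through BiologicalEntity, while the Publication edge fails.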

Comments

  1. is easy, something like
import bmt
import polars as pl
B = bmt.Toolkit('https://raw.githubusercontent.com/biolink/biolink-model/refs/heads/master/biolink-model.yaml')
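# URIs (e.g. "biolink:Protein") of every class defined in the biolink model
valid_classes = [B.get_element(el_name)['class_uri'] for el_name in B.get_all_classes()]

# df_nodes: polars DataFrame of the KG nodes, with a list column "all_categories"
df_nodes.with_columns(
    # not sure how to check for a subset relation in polars directly, but this works:
    # a node is valid if intersecting its categories with valid_classes drops nothing
    valid_biolink=pl.col("all_categories").list.set_intersection(valid_classes).list.len() == pl.col("all_categories").list.len(),
)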
  2. Fairly easy too. One just needs to take inheritance into account, e.g. if an edge type has domain == "ThingWithTaxon", any subclass is a valid subject.
  3. Harder. Anything that adheres to the biolink class hierarchy is definitely valid, but the rest is tricky: if a node's all_categories violates the class hierarchy, it's not necessarily wrong, e.g.:
    • Genes and proteins are often mixed in a single node (I guess the type should really be GeneOrGeneProduct then).
    • A node might be both a Protein and a Drug (e.g. an antibody).
    • Proteins are sometimes SmallMolecules (if it's just a few AAs) ...
      We'd need to come up with a "blacklist" of category combinations that are "wrong" (e.g. Disease, Gene).
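
The "blacklist" idea could be sketched roughly as follows. This is only an illustration: `FORBIDDEN_PAIRS` and `forbidden_combinations` are hypothetical names, and the single entry in the blacklist is the Disease/Gene example from above, not a curated list.

```python
from itertools import combinations

# ASSUMPTION: hypothetical blacklist of unordered category pairs that should
# never co-occur on one node; only the Disease/Gene pair comes from this issue.
FORBIDDEN_PAIRS = {
    frozenset({"Disease", "Gene"}),
}


def forbidden_combinations(all_categories):
    """Return the blacklisted category pairs present on a single node."""
    pairs = (frozenset(p) for p in combinations(set(all_categories), 2))
    return sorted(tuple(sorted(p)) for p in pairs if p in FORBIDDEN_PAIRS)


# Superclass/subclass combinations and tolerated mixes are not flagged.
print(forbidden_combinations(["Protein", "Polypeptide"]))
print(forbidden_combinations(["Protein", "Drug"]))
# A node tagged as both Gene and Disease is reported.
print(forbidden_combinations(["Gene", "Disease", "NamedThing"]))
```

Storing pairs as frozensets keeps the check independent of the order in which categories appear in all_categories. Anything not on the blacklist passes, so Protein/Drug and similar mixes stay valid until someone decides otherwise.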
