
Comprehensive edge pattern semantic assessment (65 unique patterns) #441

@turbomam


Overview

Systematic semantic assessment of all 65 unique edge patterns in kg-microbe transformed data, analyzing subject category + predicate + object category combinations for Biolink Model compliance.

Analysis Date: 2025-11-20 (build from 2025-11-11, commit 77c42d8)

Data Files

  • kg_microbe_edge_patterns.tsv - Complete list of patterns with counts by source
  • edge_pattern_assessment.md - Detailed semantic assessment of each pattern

Pattern Format

source | subject_category | subject_prefix | predicate | object_category | object_prefix | count
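A row in this layout can be read into a typed record, roughly as sketched below. The `EdgePattern` and `parse_pattern` names are hypothetical, the sample row is invented for illustration, and the sketch assumes the pipe-delimited rendering shown above rather than the tab-separated file itself.

```python
from typing import NamedTuple


class EdgePattern(NamedTuple):
    """One row of the pattern table (hypothetical helper type)."""
    source: str
    subject_category: str
    subject_prefix: str
    predicate: str
    object_category: str
    object_prefix: str
    count: int


def parse_pattern(line: str) -> EdgePattern:
    """Split a 'a | b | ... | count' line into an EdgePattern."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != len(EdgePattern._fields):
        raise ValueError(
            f"expected {len(EdgePattern._fields)} fields, got {len(parts)}"
        )
    *labels, count = parts
    # Counts may be written with thousands separators, e.g. "186,197".
    return EdgePattern(*labels, int(count.replace(",", "")))


# Invented sample row, not taken from kg_microbe_edge_patterns.tsv.
row = parse_pattern(
    "example_source | biolink:OrganismTaxon | NCBITaxon | biolink:consumes"
    " | biolink:ChemicalEntity | CHEBI | 1,234"
)
print(row.predicate, row.count)
```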

Key Findings

High-Priority Issues (3 patterns, ~239K edges)

  1. organism --capable_of--> quality (186,197 edges) → see #438 (biolink:capable_of used with wrong object type: PhenotypicQuality instead of Occurrent)

    • EC code nodes are miscategorized as qualities; they should be categorized as processes
  2. organism --occurs_in--> medium (52,995 edges) → see #440 (biolink:occurs_in used incorrectly for organism-medium relationships)

    • Organisms don't occur in media; growth processes do
  3. chemical --occurs_in--> assay (340 edges)

    • Wrong direction and predicate; the assay should be the subject, e.g. assay --uses/has_input--> chemical
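The three high-priority patterns can be flagged mechanically once patterns are reduced to (subject, predicate, object) triples. The following is a minimal sketch: the short category labels follow the notation of this issue rather than Biolink CURIEs, `HIGH_PRIORITY` and `flag_patterns` are hypothetical names, and the counts are the ones reported above (the valid pattern's count is the rounded ~6K figure).

```python
# Triples this issue marks as high priority, with a short reason for each.
HIGH_PRIORITY = {
    ("organism", "capable_of", "quality"): "object should be a process, see #438",
    ("organism", "occurs_in", "medium"): "growth processes occur in media, see #440",
    ("chemical", "occurs_in", "assay"): "wrong direction/predicate",
}

# (subject, predicate, object, edge count) as reported in this issue.
patterns = [
    ("organism", "capable_of", "quality", 186_197),
    ("organism", "occurs_in", "medium", 52_995),
    ("chemical", "occurs_in", "assay", 340),
    ("organism", "capable_of", "process", 6_000),  # valid pattern, rounded count
]


def flag_patterns(rows):
    """Return (subject, predicate, object, count, reason) for flagged rows."""
    flagged = []
    for subj, pred, obj, count in rows:
        reason = HIGH_PRIORITY.get((subj, pred, obj))
        if reason is not None:
            flagged.append((subj, pred, obj, count, reason))
    return flagged


flagged = flag_patterns(patterns)
total_edges = sum(item[3] for item in flagged)
print(len(flagged), total_edges)  # 3 239532
```

The total of 239,532 edges matches the "~239K" figure in the heading.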

Data Quality Issues (~85K edges)

Multiple patterns have (unknown) or (empty) categories.

Node Categorization Issues

Non-Standard Predicates (23,289 edges)

  • biolink:produces (12,523 edges) - should be has_output
  • biolink:associated_with_resistance_to (10,297 edges) - not in Biolink Model
  • biolink:is_assessed_by (112 edges) - not in Biolink Model
  • biolink:has_chemical_role (357 edges) - may be valid
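A tally like the one above can be produced by summing edge counts for every predicate outside an accepted set. This is a sketch under assumptions: `ACCEPTED` contains only the predicates that appear in this issue's list of valid patterns (it is not the full Biolink Model), `tally_nonstandard` is a hypothetical helper, and the `consumes` count is the rounded 429K figure.

```python
from collections import Counter

# Assumption: predicates treated as acceptable, taken from this issue only.
ACCEPTED = {
    "biolink:consumes",
    "biolink:has_phenotype",
    "biolink:location_of",
    "biolink:subclass_of",
    "biolink:capable_of",
}

# (predicate, edge count) pairs using the counts reported in this issue.
rows = [
    ("biolink:produces", 12_523),
    ("biolink:associated_with_resistance_to", 10_297),
    ("biolink:is_assessed_by", 112),
    ("biolink:has_chemical_role", 357),
    ("biolink:consumes", 429_000),  # accepted predicate, rounded count
]


def tally_nonstandard(pairs, accepted=ACCEPTED):
    """Sum edge counts per predicate for predicates not in `accepted`."""
    totals = Counter()
    for predicate, count in pairs:
        if predicate not in accepted:
            totals[predicate] += count
    return totals


nonstandard = tally_nonstandard(rows)
print(sum(nonstandard.values()))  # 23289
```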

Valid Patterns (~1.1M edges)

18 patterns are semantically valid, including:

  • organism --consumes--> chemical (429K edges)
  • organism --has_phenotype--> quality (210K edges)
  • environment --location_of--> organism (168K edges)
  • organism --subclass_of--> organism (173K edges)
  • organism --capable_of--> process (6K edges)

Recommendations

Priority 1: Invalid Patterns

  1. Fix EC node categories and capable_of usage (#438: biolink:capable_of used with wrong object type, PhenotypicQuality instead of Occurrent)
  2. Fix organism-medium relationship (#440: biolink:occurs_in used incorrectly for organism-medium relationships)
  3. Reverse chemical-assay relationship
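Reversing the chemical-assay relationship could look roughly like the sketch below. It assumes edges are dicts with `subject`, `predicate`, and `object` keys and that node categories are available in a separate lookup; `reverse_chemical_assay`, the `EX:` identifiers, and the short category labels are all hypothetical. `biolink:has_input` is used because it is one of the replacements this issue suggests.

```python
def reverse_chemical_assay(edge: dict, category_of: dict) -> dict:
    """Turn chemical --occurs_in--> assay into assay --has_input--> chemical.

    Edges that do not match the pattern are returned unchanged.
    """
    matches = (
        edge["predicate"] == "biolink:occurs_in"
        and category_of.get(edge["subject"]) == "chemical"
        and category_of.get(edge["object"]) == "assay"
    )
    if not matches:
        return edge
    # Swap subject and object, and replace the predicate.
    return {
        **edge,
        "subject": edge["object"],
        "predicate": "biolink:has_input",
        "object": edge["subject"],
    }


# Made-up identifiers, for illustration only.
category_of = {"EX:chem1": "chemical", "EX:assay1": "assay", "EX:org1": "organism"}
edge = {"subject": "EX:chem1", "predicate": "biolink:occurs_in", "object": "EX:assay1"}
fixed = reverse_chemical_assay(edge, category_of)
print(fixed["subject"], fixed["predicate"], fixed["object"])
```

Returning a new dict rather than mutating the input keeps the original edge available for auditing the change.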

Priority 2: Data Quality

  1. Assign missing categories (#439: METPO phenotype nodes in madin_etal have missing categories)
  2. Review misclassified nodes
  3. Fix PATO-as-location patterns

Priority 3: Standardization

  1. Map produces to has_output
  2. Request associated_with_resistance_to in Biolink Model
  3. Review non-standard predicates
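The predicate mapping can be applied as a simple lookup over an edge list, with unmapped non-standard predicates collected for manual review. This is a minimal sketch following this issue's recommendation; `PREDICATE_MAP`, `NEEDS_REVIEW`, `standardize`, and the `EX:` identifiers are hypothetical names introduced for illustration.

```python
# Mapping recommended in this issue.
PREDICATE_MAP = {"biolink:produces": "biolink:has_output"}

# Predicates this issue lists as non-standard but does not remap.
NEEDS_REVIEW = {
    "biolink:associated_with_resistance_to",
    "biolink:is_assessed_by",
    "biolink:has_chemical_role",
}


def standardize(edges):
    """Return (rewritten edges, sorted predicates that still need review)."""
    rewritten = []
    review = set()
    for edge in edges:
        predicate = edge["predicate"]
        if predicate in PREDICATE_MAP:
            edge = {**edge, "predicate": PREDICATE_MAP[predicate]}
        elif predicate in NEEDS_REVIEW:
            review.add(predicate)
        rewritten.append(edge)
    return rewritten, sorted(review)


# Made-up edges, for illustration only.
edges = [
    {"subject": "EX:org1", "predicate": "biolink:produces", "object": "EX:chem1"},
    {"subject": "EX:org1", "predicate": "biolink:is_assessed_by", "object": "EX:assay1"},
]
rewritten, review = standardize(edges)
print(rewritten[0]["predicate"], review)
```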

Methodology

Assessment based on:

  • Biolink Model predicate definitions and constraints
  • Semantic appropriateness of subject-predicate-object combinations
  • Standard usage patterns in biomedical knowledge graphs
  • Domain knowledge of microbiology and biological processes

Related Issues

Files

See the attached analysis files in the metpo repository for complete assessment details.
