Overview
Systematic semantic assessment of all 65 unique edge patterns in kg-microbe transformed data, analyzing subject category + predicate + object category combinations for Biolink Model compliance.
Analysis Date: 2025-11-20 (build from 2025-11-11, commit 77c42d8)
Data Files
- kg_microbe_edge_patterns.tsv - Complete list of patterns with counts by source
- edge_pattern_assessment.md - Detailed semantic assessment of each pattern
Pattern Format
source | subject_category | subject_prefix | predicate | object_category | object_prefix | count
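A minimal sketch of aggregating the pattern file, assuming it is tab-separated with a header row whose column names match the fields above. The sample rows and the `edges_per_pattern` helper are made up for illustration and are not taken from the real data.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the assumed layout; values are illustrative only.
SAMPLE_TSV = "\n".join([
    "source\tsubject_category\tsubject_prefix\tpredicate\tobject_category\tobject_prefix\tcount",
    "source_a\torganism\tNCBITaxon\tcapable_of\tquality\tEC\t10",
    "source_b\torganism\tNCBITaxon\tcapable_of\tquality\tEC\t5",
    "source_a\torganism\tNCBITaxon\tconsumes\tchemical\tCHEBI\t7",
])


def edges_per_pattern(handle):
    """Sum edge counts per (subject_category, predicate, object_category)."""
    totals = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["subject_category"], row["predicate"], row["object_category"])
        totals[key] += int(row["count"])
    return totals


totals = edges_per_pattern(io.StringIO(SAMPLE_TSV))
for (subject, predicate, obj), count in totals.most_common():
    print(f"{subject} --{predicate}--> {obj}: {count}")
```

Collapsing across sources this way gives one row per unique subject/predicate/object combination, which is the unit the assessment below works with.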
Key Findings
High-Priority Issues (3 patterns, ~239K edges)
- organism --capable_of--> quality (186,197 edges) → biolink:capable_of used with wrong object type (PhenotypicQuality instead of Occurrent) #438
  - EC codes miscategorized; should be processes not qualities
- organism --occurs_in--> medium (52,995 edges) → biolink:occurs_in used incorrectly for organism-medium relationships #440
  - Organisms don't occur in media; growth processes do
- chemical --occurs_in--> assay (340 edges)
  - Wrong direction/predicate; should be assay uses/has_input chemical
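As a rough illustration, the three high-priority patterns could be flagged with a simple lookup. This sketch uses the short labels from this issue (organism, quality, medium, and so on) rather than full Biolink category CURIEs, and `split_by_priority` is a hypothetical helper, not part of kg-microbe.

```python
# Patterns flagged as high priority in this issue, using its short labels.
HIGH_PRIORITY = {
    ("organism", "capable_of", "quality"),
    ("organism", "occurs_in", "medium"),
    ("chemical", "occurs_in", "assay"),
}

# (subject, predicate, object, edge_count); the first three counts come from this issue.
patterns = [
    ("organism", "capable_of", "quality", 186_197),
    ("organism", "occurs_in", "medium", 52_995),
    ("chemical", "occurs_in", "assay", 340),
    ("organism", "capable_of", "process", 6_000),  # approximate: reported as 6K
]


def split_by_priority(rows):
    """Separate rows whose pattern is in HIGH_PRIORITY from all other rows."""
    flagged, other = [], []
    for subject, predicate, obj, count in rows:
        target = flagged if (subject, predicate, obj) in HIGH_PRIORITY else other
        target.append((subject, predicate, obj, count))
    return flagged, other


flagged, other = split_by_priority(patterns)
flagged_edges = sum(count for *_, count in flagged)
print(f"{len(flagged)} flagged patterns, {flagged_edges:,} edges")
```

The flagged total of 239,532 edges matches the ~239K figure in the heading above.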
Data Quality Issues (~85K edges)
Multiple patterns with (unknown) or (empty) categories:
- METPO nodes missing categories (44,081 edges) → METPO phenotype nodes in madin_etal have missing categories #439
- ENVO environmental nodes (14,888 edges)
- isolation_source nodes (6,775 edges)
- Various NCBITaxon nodes (~2,000 edges)
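One way to total the edges affected by missing categories is sketched below. It assumes pattern rows are dicts with `subject_category`, `object_category`, and `count` keys, and that a missing category shows up as `(unknown)`, `(empty)`, or a blank string. The rows and counts are hypothetical.

```python
# Placeholder values treated as "no category"; the blank string is an assumption.
MISSING_MARKERS = {"", "(unknown)", "(empty)"}


def missing_category_edges(rows):
    """Sum counts for rows where either category is a missing-value marker."""
    total = 0
    for row in rows:
        categories = (row["subject_category"].strip(), row["object_category"].strip())
        if any(category in MISSING_MARKERS for category in categories):
            total += row["count"]
    return total


# Hypothetical rows, for illustration only.
rows = [
    {"subject_category": "organism", "object_category": "(unknown)", "count": 40},
    {"subject_category": "(empty)", "object_category": "organism", "count": 7},
    {"subject_category": "organism", "object_category": "chemical", "count": 100},
]

affected = missing_category_edges(rows)
print(f"{affected} edges touch a node without a category")
```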
Node Categorization Issues
- EC enzyme codes: duplicate entries with conflicting categories → EC enzyme codes have duplicate node entries with conflicting categories #437
- CHEBI chemicals categorized as EnvironmentalFeature (910 edges)
- NCBITaxon organisms categorized as EnvironmentalFeature (624 edges)
- PATO qualities used as locations (569 edges)
Non-Standard Predicates (23,289 edges)
- biolink:produces (12,523 edges) - should be has_output
- biolink:associated_with_resistance_to (10,297 edges) - not in Biolink Model
- biolink:is_assessed_by (112 edges) - not in Biolink Model
- biolink:has_chemical_role (357 edges) - may be valid
Valid Patterns (~1.1M edges)
18 patterns are semantically valid, including:
- organism --consumes--> chemical (429K edges)
- organism --has_phenotype--> quality (210K edges)
- environment --location_of--> organism (168K edges)
- organism --subclass_of--> organism (173K edges)
- organism --capable_of--> process (6K edges)
Recommendations
Priority 1: Invalid Patterns
- Fix EC node categories and capable_of usage (biolink:capable_of used with wrong object type (PhenotypicQuality instead of Occurrent) #438)
- Fix organism-medium relationship (biolink:occurs_in used incorrectly for organism-medium relationships #440)
- Reverse chemical-assay relationship
Priority 2: Data Quality
- Assign missing categories (METPO phenotype nodes in madin_etal have missing categories #439)
- Review misclassified nodes
- Fix PATO-as-location patterns
Priority 3: Standardization
- Map produces to has_output
- Request associated_with_resistance_to in Biolink Model
- Review non-standard predicates
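The Priority 3 items could be applied during transformation roughly as follows. This is a sketch only: it assumes edges are dicts with a `predicate` key, the `EX:` identifiers are placeholders, and `standardize` is a hypothetical helper rather than an existing kg-microbe function.

```python
# Mapping recommended in this issue.
PREDICATE_MAP = {"biolink:produces": "biolink:has_output"}

# Predicates this issue lists as non-standard and still needing a decision.
NEEDS_REVIEW = {
    "biolink:associated_with_resistance_to",
    "biolink:is_assessed_by",
    "biolink:has_chemical_role",
}


def standardize(edge):
    """Return (copy of edge with mapped predicate, whether it needs review)."""
    predicate = edge["predicate"]
    updated = dict(edge, predicate=PREDICATE_MAP.get(predicate, predicate))
    return updated, predicate in NEEDS_REVIEW


# Placeholder edges; the identifiers are not real CURIEs.
edges = [
    {"subject": "EX:organism1", "predicate": "biolink:produces", "object": "EX:chemical1"},
    {"subject": "EX:organism1", "predicate": "biolink:is_assessed_by", "object": "EX:assay1"},
    {"subject": "EX:organism1", "predicate": "biolink:consumes", "object": "EX:chemical2"},
]

results = [standardize(edge) for edge in edges]
for updated, needs_review in results:
    marker = " (review)" if needs_review else ""
    print(f"{updated['subject']} {updated['predicate']} {updated['object']}{marker}")
```

Returning a copy keeps the original edge intact, so the mapped and unmapped versions can be compared before anything is written back.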
Methodology
Assessment based on:
- Biolink Model predicate definitions and constraints
- Semantic appropriateness of subject-predicate-object combinations
- Standard usage patterns in biomedical knowledge graphs
- Domain knowledge of microbiology and biological processes
Related Issues
- EC enzyme codes have duplicate node entries with conflicting categories #437 - EC enzyme code duplicate categories
- biolink:capable_of used with wrong object type (PhenotypicQuality instead of Occurrent) #438 - capable_of predicate range violation
- METPO phenotype nodes in madin_etal have missing categories #439 - Missing METPO categories
- biolink:occurs_in used incorrectly for organism-medium relationships #440 - occurs_in semantic violation
- Sphinx sources addition #30 - Broader KGX format compliance
Files
See the attached analysis files in the metpo repository for complete assessment details.