Add METPO predicate pairs for high-volume unresolved MetaTraits categories #353

@turbomam

Problem

The in-sheet resolver (resolve-metatraits-in-sheets) can't map 14 composed trait categories. The initial explanation was that metpo-properties.tsv lacks predicate pairs for them, but see the refined diagnosis below, which points mainly to missing CHEBI objects. These 14 categories account for ~2.85M of 5.34M total assertions (53%) in the MetaTraits NCBI species-level data, which makes this the single most impactful gap in KGX expressibility.

The resolver currently covers 77% of trait types but only 47% of assertions by volume, because the unresolved categories are the highest-frequency ones.

Assertion volume by unresolved category

Source: metatraits.ncbi_species_summary collection (54,654 species, 5,343,971 trait summary items).

| Category | Unresolved cards | Species-level assertions |
|---|---:|---:|
| growth | 27 | 1,104,805 |
| enzyme activity | 132 | 430,726 |
| degradation | 13 | 272,746 |
| hydrolysis | 13 | 271,774 |
| reduction | 4 | 196,983 |
| produces | 232 | 162,157 |
| oxidation | 6 | 153,029 |
| aerobic growth | 2 | 77,314 |
| utilizes | 80 | 75,998 |
| carbon source | 20 | 60,959 |
| builds acid from | 16 | 24,658 |
| assimilation | 29 | 13,085 |
| energy source | 6 | 1,035 |
| nitrogen source | 6 | 345 |
| Total unresolved | 586 | ~2,845,614 (53%) |

For comparison, resolved composed categories:

| Category | Assertions |
|---|---:|
| electron acceptor | 386,615 |
| denitrification | 115,944 |
| respiration | 78,549 |
| fermentation | 45,667 |
| builds gas from | 306 |
| builds base from | 14 |
| aerobic catabolization | 10 |
| anaerobic catabolization | 4 |
| Total resolved composed | ~627,109 (12%) |

The remaining ~35% are base/boolean traits (gram stain, sporulation, oxygen preference, motility, etc.) which are resolved via biolink:has_phenotype.
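
As a quick arithmetic check on the shares quoted above, here is a minimal sketch that recomputes them from the totals in this issue. It assumes the three groups (unresolved composed, resolved composed, base/boolean) partition the 5,343,971 summary items, so the base/boolean count is inferred by subtraction rather than taken from the source data.

```python
# Totals copied from the tables in this issue.
TOTAL_ASSERTIONS = 5_343_971
UNRESOLVED_COMPOSED = 2_845_614
RESOLVED_COMPOSED = 627_109

# Assumption: everything that is not a composed category is a base/boolean trait.
BASE_BOOLEAN = TOTAL_ASSERTIONS - UNRESOLVED_COMPOSED - RESOLVED_COMPOSED


def share(count: int) -> float:
    """Return count as a percentage of all assertions, rounded to one decimal."""
    return round(100 * count / TOTAL_ASSERTIONS, 1)


unresolved_pct = share(UNRESOLVED_COMPOSED)
resolved_composed_pct = share(RESOLVED_COMPOSED)
base_boolean_pct = share(BASE_BOOLEAN)
# Resolver coverage by volume = resolved composed + base/boolean.
covered_pct = share(RESOLVED_COMPOSED + BASE_BOOLEAN)

print(f"unresolved composed: {unresolved_pct}%")
print(f"resolved composed:   {resolved_composed_pct}%")
print(f"base/boolean:        {base_boolean_pct}%")
print(f"covered by resolver: {covered_pct}%")
```

The covered share computed this way should land near the 47% figure quoted in the Problem section.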

What needs to happen

For each unresolved category, add a positive/negative predicate pair to src/templates/metpo-properties.tsv, following the existing pattern (e.g., ferments / does not ferment). Some categories like growth and enzyme activity may need modeling discussion before predicate assignment.

Priority order (by assertion volume)

  1. growth — 1.1M assertions, needs "uses for growth" / "does not use for growth" (NOTE: predicates METPO:2000012/METPO:2000038 already exist but the resolver reports these cards as unresolved — likely a CHEBI object gap, not a predicate gap)
  2. enzyme activity — 431K assertions, predicates exist (METPO:2000302/METPO:2000303) but resolution fails for 132 cards (missing CHEBI/GO objects in source data)
  3. degradation — 273K, predicates exist (METPO:2000007/METPO:2000033) — similar CHEBI gap
  4. hydrolysis — 272K, predicates exist (METPO:2000013/METPO:2000039) — similar CHEBI gap
  5. reduction — 197K, predicates exist (METPO:2000017/METPO:2000044) — similar CHEBI gap
  6. produces — 162K, predicates exist (METPO:2000202/METPO:2000222) — similar CHEBI gap
  7. oxidation — 153K, predicates exist (METPO:2000016/METPO:2000042) — similar CHEBI gap
  8. Lower-volume categories (aerobic growth, utilizes, carbon source, etc.)
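
To show why the first seven categories are listed individually, the sketch below ranks the unresolved categories by assertion volume and prints the cumulative share of the unresolved total. The counts are copied from the table above; the script itself is illustrative and not part of the resolver.

```python
# Species-level assertion counts per unresolved category (from the table above).
unresolved = {
    "growth": 1_104_805,
    "enzyme activity": 430_726,
    "degradation": 272_746,
    "hydrolysis": 271_774,
    "reduction": 196_983,
    "produces": 162_157,
    "oxidation": 153_029,
    "aerobic growth": 77_314,
    "utilizes": 75_998,
    "carbon source": 60_959,
    "builds acid from": 24_658,
    "assimilation": 13_085,
    "energy source": 1_035,
    "nitrogen source": 345,
}

total_unresolved = sum(unresolved.values())

# Rank by volume and track the running share of the unresolved total.
ranked = []
running = 0
for category, count in sorted(unresolved.items(), key=lambda kv: kv[1], reverse=True):
    running += count
    ranked.append((category, count, 100 * running / total_unresolved))

for rank, (category, count, cumulative_pct) in enumerate(ranked, start=1):
    print(f"{rank:2d}. {category:<18} {count:>9,} {cumulative_pct:6.1f}%")
```

By this count the seven categories named above cover roughly 91% of the unresolved volume, so the lower-volume group can reasonably wait.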

Refined diagnosis: Many of these categories do have predicate pairs already. The resolution failure is primarily due to missing CHEBI object mappings for the composed substrates. The fix is likely a combination of:

  • Adding CHEBI mappings for substrates that MetaTraits doesn't provide ontology CURIEs for
  • Extending the resolver's object normalization to handle more source formats

Resolution table reference

  • Current resolution table: data/mappings/metatraits_in_sheet_resolution.tsv
  • Coverage report: data/mappings/metatraits_in_sheet_resolution_report.md
  • Property template: src/templates/metpo-properties.tsv
