Description
Problem
The in-sheet resolver (resolve-metatraits-in-sheets) can't map 14 composed trait categories because metpo-properties.tsv lacks predicate pairs for them. These 14 categories account for ~2.85M assertions out of 5.34M total (53%) in the MetaTraits NCBI species-level data — making this the single most impactful gap in KGX expressibility.
The resolver currently covers 77% of trait types but only 47% of assertions by volume, because the unresolved categories are the highest-frequency ones.
Assertion volume by unresolved category
Source: metatraits.ncbi_species_summary collection (54,654 species, 5,343,971 trait summary items).
| Category | Unresolved cards | Species-level assertions |
|---|---|---|
| growth | 27 | 1,104,805 |
| enzyme activity | 132 | 430,726 |
| degradation | 13 | 272,746 |
| hydrolysis | 13 | 271,774 |
| reduction | 4 | 196,983 |
| produces | 232 | 162,157 |
| oxidation | 6 | 153,029 |
| aerobic growth | 2 | 77,314 |
| utilizes | 80 | 75,998 |
| carbon source | 20 | 60,959 |
| builds acid from | 16 | 24,658 |
| assimilation | 29 | 13,085 |
| energy source | 6 | 1,035 |
| nitrogen source | 6 | 345 |
| Total unresolved | 586 | ~2,845,614 (53%) |
For comparison, resolved composed categories:
| Category | Assertions |
|---|---|
| electron acceptor | 386,615 |
| denitrification | 115,944 |
| respiration | 78,549 |
| fermentation | 45,667 |
| builds gas from | 306 |
| builds base from | 14 |
| aerobic catabolization | 10 |
| anaerobic catabolization | 4 |
| Total resolved composed | ~627,109 (12%) |
The remaining ~35% are base/boolean traits (gram stain, sporulation, oxygen preference, motility, etc.) which are resolved via biolink:has_phenotype.
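The percentages quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below copies the per-category counts from the two tables. Treating the base/boolean share as whatever remains after subtracting both composed groups is an assumption, and the variable names are illustrative only.

```python
# Counts copied from the tables in this issue; names are illustrative.
TOTAL_ASSERTIONS = 5_343_971

unresolved = {
    "growth": 1_104_805, "enzyme activity": 430_726, "degradation": 272_746,
    "hydrolysis": 271_774, "reduction": 196_983, "produces": 162_157,
    "oxidation": 153_029, "aerobic growth": 77_314, "utilizes": 75_998,
    "carbon source": 60_959, "builds acid from": 24_658,
    "assimilation": 13_085, "energy source": 1_035, "nitrogen source": 345,
}
resolved_composed = {
    "electron acceptor": 386_615, "denitrification": 115_944,
    "respiration": 78_549, "fermentation": 45_667, "builds gas from": 306,
    "builds base from": 14, "aerobic catabolization": 10,
    "anaerobic catabolization": 4,
}

unresolved_total = sum(unresolved.values())
resolved_composed_total = sum(resolved_composed.values())
# Assumption: everything not in a composed category is a base/boolean trait.
base_total = TOTAL_ASSERTIONS - unresolved_total - resolved_composed_total


def share(count: int) -> float:
    """Percentage of all species-level assertions."""
    return 100 * count / TOTAL_ASSERTIONS


print(f"unresolved composed: {unresolved_total:,} ({share(unresolved_total):.1f}%)")
print(f"resolved composed:   {resolved_composed_total:,} ({share(resolved_composed_total):.1f}%)")
print(f"base/boolean:        {base_total:,} ({share(base_total):.1f}%)")
print(f"resolved overall:    {share(resolved_composed_total + base_total):.1f}%")
```

Under that assumption the resolved composed and base/boolean shares together account for the roughly 47% by-volume coverage quoted in the Problem section.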
What needs to happen
For each unresolved category, add a positive/negative predicate pair to src/templates/metpo-properties.tsv, following the existing pattern (e.g., ferments / does not ferment). Some categories like growth and enzyme activity may need modeling discussion before predicate assignment.
Priority order (by assertion volume)
- `growth`: 1.1M assertions, needs "uses for growth" / "does not use for growth" (NOTE: predicates `METPO:2000012`/`METPO:2000038` already exist but the resolver reports these cards as unresolved; likely a CHEBI object gap, not a predicate gap)
- `enzyme activity`: 431K assertions, predicates exist (`METPO:2000302`/`METPO:2000303`) but resolution fails for 132 cards (missing CHEBI/GO objects in source data)
- `degradation`: 273K, predicates exist (`METPO:2000007`/`METPO:2000033`), similar CHEBI gap
- `hydrolysis`: 272K, predicates exist (`METPO:2000013`/`METPO:2000039`), similar CHEBI gap
- `reduction`: 197K, predicates exist (`METPO:2000017`/`METPO:2000044`), similar CHEBI gap
- `produces`: 162K, predicates exist (`METPO:2000202`/`METPO:2000222`), similar CHEBI gap
- `oxidation`: 153K, predicates exist (`METPO:2000016`/`METPO:2000042`), similar CHEBI gap
- Lower-volume categories (`aerobic growth`, `utilizes`, `carbon source`, etc.)
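As a rough illustration of how this ordering could be derived, the sketch below ranks categories by assertion volume and carries along the predicate pairs named in this issue. The `Category` type and `priority_order` helper are hypothetical. A `None` pair means only that this issue does not list predicates for that category, not that none exist in `metpo-properties.tsv`.

```python
from typing import List, NamedTuple, Optional, Tuple


class Category(NamedTuple):
    name: str
    assertions: int
    # (positive, negative) METPO CURIEs as quoted in this issue, else None.
    predicates: Optional[Tuple[str, str]]


CATEGORIES = [
    Category("utilizes", 75_998, None),
    Category("oxidation", 153_029, ("METPO:2000016", "METPO:2000042")),
    Category("growth", 1_104_805, ("METPO:2000012", "METPO:2000038")),
    Category("hydrolysis", 271_774, ("METPO:2000013", "METPO:2000039")),
    Category("produces", 162_157, ("METPO:2000202", "METPO:2000222")),
    Category("aerobic growth", 77_314, None),
    Category("enzyme activity", 430_726, ("METPO:2000302", "METPO:2000303")),
    Category("reduction", 196_983, ("METPO:2000017", "METPO:2000044")),
    Category("degradation", 272_746, ("METPO:2000007", "METPO:2000033")),
    Category("carbon source", 60_959, None),
]


def priority_order(categories: List[Category]) -> List[Category]:
    """Highest assertion volume first."""
    return sorted(categories, key=lambda c: c.assertions, reverse=True)


ranked = priority_order(CATEGORIES)
for rank, cat in enumerate(ranked, start=1):
    pair = "/".join(cat.predicates) if cat.predicates else "not listed here"
    print(f"{rank:2d}. {cat.name:<16} {cat.assertions:>9,}  predicates: {pair}")
```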
Refined diagnosis: Many of these categories do have predicate pairs already. The resolution failure is primarily due to missing CHEBI object mappings for the composed substrates. The fix is likely a combination of:
- Adding CHEBI mappings for substrates that MetaTraits doesn't provide ontology CURIEs for
- Extending the resolver's object normalization to handle more source formats
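The distinction between a predicate gap and an object gap can be sketched as follows. This is not the resolver's actual code: `normalize_object`, `diagnose`, the accepted input formats, and the label table are all hypothetical, and the two ChEBI entries are illustrative examples that should be verified against ChEBI before reuse.

```python
import re
from typing import Dict, Optional, Tuple

# Predicate pair quoted in this issue; the dict layout is an assumption.
PREDICATE_PAIRS: Dict[str, Tuple[str, str]] = {
    "hydrolysis": ("METPO:2000013", "METPO:2000039"),
}
# Hypothetical label-to-CURIE table for substrates without a source CURIE.
LABEL_TO_CURIE: Dict[str, str] = {
    "glucose": "CHEBI:17234",
    "urea": "CHEBI:16199",
}

# Accept "CHEBI:123", "chebi_123", "GO:0001" and similar spellings.
CURIE_RE = re.compile(r"^(CHEBI|GO)[:_](\d+)$", re.IGNORECASE)


def normalize_object(raw: Optional[str], labels: Dict[str, str]) -> Optional[str]:
    """Return a CURIE for the substrate, or None if it cannot be mapped."""
    if raw is None:
        return None
    text = raw.strip()
    match = CURIE_RE.match(text)
    if match:
        return f"{match.group(1).upper()}:{match.group(2)}"
    # Fall back to a case- and whitespace-insensitive label lookup.
    return labels.get(" ".join(text.lower().split()))


def diagnose(category: str, raw_object: Optional[str],
             pairs: Dict[str, Tuple[str, str]], labels: Dict[str, str]) -> str:
    """Classify why a composed trait card would or would not resolve."""
    if category not in pairs:
        return "predicate gap"
    if normalize_object(raw_object, labels) is None:
        return "object gap"
    return "resolvable"


print(diagnose("hydrolysis", "  Urea ", PREDICATE_PAIRS, LABEL_TO_CURIE))
print(diagnose("hydrolysis", "esculin", PREDICATE_PAIRS, LABEL_TO_CURIE))
print(diagnose("made-up category", "glucose", PREDICATE_PAIRS, LABEL_TO_CURIE))
```

Counting cards per outcome would show how much of the gap closes by adding mappings alone and how much needs new predicate pairs.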
Resolution table reference
- Current resolution table: `data/mappings/metatraits_in_sheet_resolution.tsv`
- Coverage report: `data/mappings/metatraits_in_sheet_resolution_report.md`
- Property template: `src/templates/metpo-properties.tsv`
Cross-references
- METPO predicated for Madin and bactotraits Knowledge-Graph-Hub/kg-microbe#458 — METPO predicates for Madin/bactotraits
- Use synonym mappings for data sources from 'source mappings' tab in METPO sheet Knowledge-Graph-Hub/kg-microbe#480 — synonym mappings from METPO sheet
- define quality metrics esp regarding how mappings are used in KG-Microbe #204 — quality metrics for mapping coverage