Anthea handoff: MetaTraits API download strategy and KGX transform integration #354

@turbomam

Context

The metpo repo now has a working deterministic resolver pipeline for MetaTraits -> KGX:

  1. fetch-metatraits — scrapes the MetaTraits trait catalog (2,860 cards)
  2. resolve-metatraits-in-sheets — resolves traits to METPO predicates/objects using metpo_sheet.tsv + metpo-properties.tsv
  3. demo-metatraits-mongo-to-kgx — demonstrates the full transform from MongoDB records to KGX TSV

This issue tracks the operational handoff to Anthea for production MetaTraits API integration and KGX transform in her external repos.

Key handoff artifacts

Taxon list scoping

Anthea's current approach queries ~2.7M taxon IDs against the MetaTraits API, but MetaTraits species-level data covers only ~55K NCBI species and ~65K GTDB species. Her query set should be scoped to the ~120K IDs in the union of the NCBI and GTDB species in MetaTraits, rather than the full NCBI taxonomy.

Reference crosswalk data is available in the local MongoDB at metatraits.ncbi2gtdb (92,711 entries).
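
As a rough illustration of the scoping step, the query set can be built as a de-duplicated union of the two species lists. This is only a sketch: the function name `scope_taxon_ids` and the identifier formats are hypothetical placeholders, not the actual MetaTraits ID conventions.

```python
def scope_taxon_ids(ncbi_species_ids, gtdb_species_ids):
    """Return the de-duplicated, sorted union of both species-ID lists."""
    return sorted(set(ncbi_species_ids) | set(gtdb_species_ids))


# Placeholder identifiers; the real ID formats are an assumption here.
ncbi_ids = ["NCBI:0001", "NCBI:0002", "NCBI:0002"]
gtdb_ids = ["GTDB:example_a", "GTDB:example_b"]

scoped = scope_taxon_ids(ncbi_ids, gtdb_ids)
print(len(scoped))  # 4
```

Sorting the result keeps the download order stable between runs, which makes interrupted downloads easier to resume.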

Implementation ownership split

metpo repo provides (deterministic, tested):

  • Trait -> predicate routing via resolution table
  • Predicate positive/negative pair selection
  • Object CURIE resolution (CHEBI, GO, EC)
  • KGX edge schema with Biolink compliance
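
A minimal sketch of what exact-lookup routing with positive/negative pair selection could look like. Everything here is illustrative: `Route`, `RESOLUTION_TABLE`, `UnresolvedTraitError`, and the CURIEs are hypothetical placeholders and do not reflect the actual metpo resolver code or real METPO identifiers.

```python
from typing import NamedTuple, Optional, Tuple


class Route(NamedTuple):
    positive_predicate: str
    negative_predicate: str
    object_curie: Optional[str]  # None when the trait has no substrate/object


# Hypothetical resolution table keyed by the exact trait name.
RESOLUTION_TABLE = {
    "example trait": Route("METPO:pos_placeholder",
                           "METPO:neg_placeholder",
                           "CHEBI:placeholder"),
}


class UnresolvedTraitError(KeyError):
    """Raised instead of silently falling back to a generic predicate."""


def resolve(trait_name: str, is_positive: bool) -> Tuple[str, Optional[str]]:
    """Exact dictionary lookup only: no fuzzy or partial matching."""
    try:
        route = RESOLUTION_TABLE[trait_name]
    except KeyError:
        raise UnresolvedTraitError(trait_name) from None
    predicate = route.positive_predicate if is_positive else route.negative_predicate
    return predicate, route.object_curie
```

Raising on a missing key, rather than returning a default, is one way to keep unresolved categories visible.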

Anthea owns (in kg-microbe or KG-Microbe-search):

  • API client design (endpoint selection, batching, rate limiting, retries)
  • Taxon list scoping and download orchestration
  • Persistence of API payloads
  • Integration tests against fixture records
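
One possible shape for the batching and retry part of the client, sketched without assuming anything about the MetaTraits API itself. The `fetch_batch` callable is a hypothetical stand-in for whatever HTTP call is eventually chosen, and the batch size, attempt count, and delays are arbitrary example values.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(ids: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_with_retries(fetch_batch: Callable, batch: Sequence[str],
                       max_attempts: int = 3, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep):
    """Call fetch_batch(batch), retrying with exponential backoff.

    A real client should catch only transient errors (timeouts, HTTP 429/5xx);
    the broad `Exception` here is just to keep the sketch short.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_batch(batch)
        except Exception:
            if attempt == max_attempts:
                raise  # give up loudly after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter lets fixture-based tests run without real delays.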

Architecture constraints

Per the handoff doc:

  1. Do NOT reuse legacy kg-microbe transform code/config as the implementation base
  2. Preferred layered architecture: acquire -> normalize -> resolve -> emit
  3. No fuzzy matching in predicate routing; use deterministic exact lookup
  4. Preserve positive/negative predicate distinction
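
The layered architecture above could be sketched as four small functions composed in order. The payload field names (`taxon_id`, `trait`, `positive`), the table layout, and the three-column output are assumptions made for illustration; a Biolink-compliant KGX edge file would carry more columns than `subject`, `predicate`, and `object`.

```python
import csv
import io


def acquire(raw_payloads):
    """Stand-in for reading persisted API payloads."""
    return list(raw_payloads)


def normalize(payload):
    """Trim and coerce fields into one internal record shape."""
    return {"taxon": payload["taxon_id"].strip(),
            "trait": payload["trait"].strip(),
            "positive": bool(payload["positive"])}


def resolve(record, table):
    """Exact lookup; a missing trait raises KeyError rather than falling back."""
    pos_pred, neg_pred, obj = table[record["trait"]]
    predicate = pos_pred if record["positive"] else neg_pred
    return {"subject": record["taxon"], "predicate": predicate, "object": obj}


def emit(edges):
    """Serialize edges as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["subject", "predicate", "object"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(edges)
    return buf.getvalue()


def run_pipeline(raw_payloads, table):
    records = [normalize(p) for p in acquire(raw_payloads)]
    return emit([resolve(r, table) for r in records])
```

Keeping each layer a plain function means the resolve step can be swapped for the metpo resolver without touching acquisition or serialization.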

Acceptance criteria

  1. Deterministic outputs for fixed fixture input
  2. Composed traits preserve substrate/object when present
  3. Positive/negative assay outcomes route to correct predicate pair
  4. No fallback to generic predicates for unresolved categories (fail loudly)
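
The first criterion could be checked with a small helper that runs a transform several times on the same fixture and compares output digests. This is a sketch only: `transform` is any hypothetical callable that takes fixture records and returns the serialized output as a string.

```python
import hashlib


def output_digest(transform, fixture):
    """SHA-256 hex digest of the transform's serialized output."""
    return hashlib.sha256(transform(fixture).encode("utf-8")).hexdigest()


def assert_deterministic(transform, fixture, runs=3):
    """Raise AssertionError if repeated runs produce different output."""
    digests = {output_digest(transform, fixture) for _ in range(runs)}
    if len(digests) != 1:
        raise AssertionError(f"non-deterministic output: {len(digests)} distinct digests")
    return digests.pop()
```

The returned digest can also be stored next to the fixture so that unintended output changes show up in review.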

Cross-references
