Open
Description
Problem
SSSOM consumers and producers in this repo currently parse object_id with a mix of ad hoc string heuristics (hostname checks, OBO-only regexes). This is brittle and diverges from how SSSOM-native tooling handles identifiers.
Why this matters
- We have already observed IRI-only, CURIE-only, and mixed identifier forms in legacy mapping files.
- Heuristic parsing can silently misclassify prefixes and break downstream analysis.
Proposal
- Add curies and sssom-py as dependencies for mapping workflows.
- Build one shared utility for:
  - reading curie_map from SSSOM metadata,
  - expansion/compaction through curies.Converter,
  - strict normalization rules for subject_id/predicate_id/object_id.
- Replace remaining heuristic parsing in scripts under:
  - metpo/analysis/
  - metpo/presentations/
  - metpo/scripts/
  - metpo/pipeline/
- Add a validation step that fails when unresolved or ambiguous CURIE prefixes remain.
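To make the normalization and validation rules above concrete, here is a minimal standard-library sketch. It is only an illustration of the intended behaviour: in the proposed design, expansion and compaction would be delegated to curies.Converter rather than hand-rolled. The function names (`normalize_to_curie`, `validate`) and the "longest URI prefix wins" and "two prefixes sharing one URI prefix is ambiguous" rules are assumptions, not something the repo already defines.

```python
from typing import Optional


def normalize_to_curie(identifier: str, curie_map: dict[str, str]) -> Optional[str]:
    """Return a CURIE for an IRI or CURIE, or None if no declared prefix applies.

    Hypothetical helper: stands in for what curies.Converter would do.
    """
    if identifier.startswith(("http://", "https://")):
        # Collect every declared URI prefix that matches; the longest one wins.
        matches = [
            (len(uri_prefix), prefix, uri_prefix)
            for prefix, uri_prefix in curie_map.items()
            if identifier.startswith(uri_prefix)
        ]
        if not matches:
            return None
        _, prefix, uri_prefix = max(matches)
        return f"{prefix}:{identifier[len(uri_prefix):]}"

    # Otherwise treat it as a CURIE and require a declared prefix.
    prefix, sep, local_id = identifier.partition(":")
    if sep and local_id and prefix in curie_map:
        return identifier
    return None


def validate(identifiers: list[str], curie_map: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the check passes."""
    problems = []

    # Assumed ambiguity rule: two prefixes pointing at the same URI prefix.
    first_prefix_for_uri: dict[str, str] = {}
    for prefix, uri_prefix in curie_map.items():
        if uri_prefix in first_prefix_for_uri:
            other = first_prefix_for_uri[uri_prefix]
            problems.append(f"ambiguous: {other} and {prefix} both map to {uri_prefix}")
        else:
            first_prefix_for_uri[uri_prefix] = prefix

    # Any identifier that cannot be normalized is reported as unresolved.
    for identifier in identifiers:
        if normalize_to_curie(identifier, curie_map) is None:
            problems.append(f"unresolved: {identifier}")
    return problems
```

A pipeline step could call `validate` on all subject_id/predicate_id/object_id values and exit non-zero whenever the returned list is non-empty.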
Acceptance criteria
- No script relies on hostname substring matching for prefix classification.
- SSSOM files round-trip via SSSOM tooling without identifier shape regressions.
- The curie_map is complete for every CURIE prefix emitted in outputs.
- CI includes at least one check using SSSOM-native parser/validator.
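The curie_map completeness criterion could be checked per file along the following lines. This is a rough sketch under stated assumptions: it assumes the usual SSSOM/TSV layout with a `#`-commented YAML metadata header, and it uses a deliberately naive header reader as a stand-in for an SSSOM-native parser, which the real CI check should use instead. The function names `read_curie_map` and `undeclared_prefixes` are hypothetical.

```python
import csv
import io

ID_COLUMNS = ("subject_id", "predicate_id", "object_id")


def read_curie_map(sssom_text: str) -> dict[str, str]:
    """Naively extract curie_map entries from the '#'-commented header.

    Stand-in only: a real check should rely on an SSSOM-native parser.
    """
    curie_map: dict[str, str] = {}
    in_map = False
    for line in sssom_text.splitlines():
        if not line.startswith("#"):
            break  # the metadata header has ended
        body = line[1:]
        if body.startswith(" "):
            body = body[1:]  # drop the single space after '#'
        if body.rstrip() == "curie_map:":
            in_map = True
        elif in_map and body[:1].isspace() and ": " in body:
            # Indented "prefix: uri_prefix" entry inside curie_map.
            prefix, _, uri_prefix = body.strip().partition(": ")
            curie_map[prefix] = uri_prefix.strip("\"'")
        else:
            in_map = False  # another top-level key closes the block
    return curie_map


def undeclared_prefixes(sssom_text: str) -> list[str]:
    """Return CURIE prefixes used in the ID columns but missing from curie_map."""
    curie_map = read_curie_map(sssom_text)
    table = "\n".join(
        line for line in sssom_text.splitlines() if not line.startswith("#")
    )
    used = set()
    for row in csv.DictReader(io.StringIO(table), delimiter="\t"):
        for column in ID_COLUMNS:
            value = row.get(column) or ""
            prefix, sep, _ = value.partition(":")
            # Bare IRIs are not CURIEs, so they are skipped here.
            if sep and not value.startswith(("http://", "https://")):
                used.add(prefix)
    return sorted(used - set(curie_map))
```

A CI job could run `undeclared_prefixes` over each emitted mapping file and fail when any file returns a non-empty list.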
Related
- Follow-up to current in-branch curie_map-driven cleanup in chromadb_semantic_mapper.py and key consumers.