Parsing HPOA per row problematic for multiple entries for same OMIM-HPO association

@hansenp I finally managed to solve the issue we discussed today! TL;DR is: the problem is probably in pyboqa, but let us make sure. 

I noticed that frequencies like 0/7 are not the problem in themselves; the problem is that some HPOA lines look like this:

OMIM:191100	Tuberous sclerosis-1		HP:0012798	PMID:10852420	PCS		20/78	FEMALE		P	HPO:probinson[2019-11-28]
OMIM:191100	Tuberous sclerosis-1		HP:0012798	PMID:16485546	PCS		HP:0040284	MALE		P	HPO:probinson[2019-11-28]
OMIM:191100	Tuberous sclerosis-1		HP:0012798	PMID:29196670	PCS		0/5			P	ORCID:0000-0002-0736-9199[2024-06-06]

In DiseaseDataParseIngest each line is read on its own, so because of the first line the HPO term HP:0012798 is added to the included set, and because of the third line (0/5) it is also added to the excluded set. This means that we are correctly including the term. If I forcibly remove it by hand when creating diseaseData, the counts match pyboqa perfectly. I strongly suspect that pyboqa treats this term as excluded, and that this is the origin of the mismatch.
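To make the effect concrete, here is a toy reconstruction of per-row parsing in Python. It is not the actual DiseaseDataParseIngest code: the function names are made up for illustration, and the rule "a frequency of 0/m excludes the term" is my assumption about how rows are classified. The column positions (HPO id in column 4, frequency in column 8) follow the HPOA rows above.

```python
# The three HPOA rows quoted above, as tab-separated lines.
ROWS = [
    "OMIM:191100\tTuberous sclerosis-1\t\tHP:0012798\tPMID:10852420\tPCS\t\t20/78\tFEMALE\t\tP\tHPO:probinson[2019-11-28]",
    "OMIM:191100\tTuberous sclerosis-1\t\tHP:0012798\tPMID:16485546\tPCS\t\tHP:0040284\tMALE\t\tP\tHPO:probinson[2019-11-28]",
    "OMIM:191100\tTuberous sclerosis-1\t\tHP:0012798\tPMID:29196670\tPCS\t\t0/5\t\t\tP\tORCID:0000-0002-0736-9199[2024-06-06]",
]


def is_excluded(frequency: str) -> bool:
    """Assumed rule: a row excludes the term when its frequency is 0/m."""
    if "/" in frequency:
        numerator, _denominator = frequency.split("/")
        return int(numerator) == 0
    return False


def parse_per_row(rows):
    """Classify every row independently (hypothetical helper)."""
    included, excluded = set(), set()
    for row in rows:
        fields = row.split("\t")
        hpo_id, frequency = fields[3], fields[7]
        if is_excluded(frequency):
            excluded.add(hpo_id)
        else:
            included.add(hpo_id)
    return included, excluded


included, excluded = parse_per_row(ROWS)
# The same term ends up in both sets.
print(included & excluded)  # {'HP:0012798'}
```

Any consumer that checks the excluded set first would then drop the term, while one that checks the included set first would keep it.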

We should make sure that when multiple HPOA lines contain the same HPO term for the same disease but cite different PMIDs as sources, we create one single frequency term.
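One possible way to merge the rows, sketched in Python under stated assumptions: n/m frequencies for the same (disease, HPO term) pair are pooled by summing numerators and denominators, and frequencies given as HPO terms (such as HP:0040284) are kept aside unchanged. The pooling rule, the function name `merge_frequencies` and the shape of the result are hypothetical, not what phenol or pyboqa do.

```python
from collections import defaultdict

# The three HPOA rows for OMIM:191100 / HP:0012798, tab-separated.
ROWS = [
    "OMIM:191100\tTuberous sclerosis-1\t\tHP:0012798\tPMID:10852420\tPCS\t\t20/78\tFEMALE\t\tP\tHPO:probinson[2019-11-28]",
    "OMIM:191100\tTuberous sclerosis-1\t\tHP:0012798\tPMID:16485546\tPCS\t\tHP:0040284\tMALE\t\tP\tHPO:probinson[2019-11-28]",
    "OMIM:191100\tTuberous sclerosis-1\t\tHP:0012798\tPMID:29196670\tPCS\t\t0/5\t\t\tP\tORCID:0000-0002-0736-9199[2024-06-06]",
]


def merge_frequencies(rows):
    """Build one frequency entry per (disease, HPO term) pair."""
    counts = defaultdict(lambda: [0, 0])  # (disease, hpo) -> [n, m]
    terms = defaultdict(list)             # frequencies given as HPO terms
    for row in rows:
        fields = row.split("\t")
        key = (fields[0], fields[3])
        frequency = fields[7]
        if "/" in frequency:
            n, m = (int(part) for part in frequency.split("/"))
            counts[key][0] += n
            counts[key][1] += m
        elif frequency:
            terms[key].append(frequency)

    merged = {}
    for key in set(counts) | set(terms):
        n, m = counts.get(key, (0, 0))
        term_frequencies = terms.get(key, [])
        merged[key] = {
            "numerator": n,
            "denominator": m,
            "term_frequencies": term_frequencies,
            # Assumed rule: excluded only if the pooled count is 0/m
            # and no other row reports the term with a frequency term.
            "excluded": m > 0 and n == 0 and not term_frequencies,
        }
    return merged


merged = merge_frequencies(ROWS)
print(merged[("OMIM:191100", "HP:0012798")])
```

With the rows above this pools 20/78 and 0/5 into 20/83, so the term is not excluded. How to fold a term-valued frequency into the pooled count is left open here.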

I think eventually we want to use phenol for parsing phenotype-disease associations, but before deleting this issue I want to check:
1. That this problem does not turn up there as well.
2. What disease-phenotype data will look like in Exomiser.
