Description
@hansenp I finally managed to solve the issue we discussed today! TL;DR is: the problem is probably in pyboqa, but let us make sure.
I noticed that the problem is not a frequency like 0/7 by itself. Rather, HPOA contains several lines for the same disease and HPO term, like:
OMIM:191100 Tuberous sclerosis-1 HP:0012798 PMID:10852420 PCS 20/78 FEMALE P HPO:probinson[2019-11-28]
OMIM:191100 Tuberous sclerosis-1 HP:0012798 PMID:16485546 PCS HP:0040284 MALE P HPO:probinson[2019-11-28]
OMIM:191100 Tuberous sclerosis-1 HP:0012798 PMID:29196670 PCS 0/5 P ORCID:0000-0002-0736-9199[2024-06-06]
In DiseaseDataParseIngest each line is read on its own, so the first line puts the HPO term HP:0012798 in the included set, and the third line (0/5) also puts it in the excluded set. This means that we are correctly including the term. If I forcibly remove it by hand when creating diseaseData, the counts match pyboqa perfectly. I strongly suspect that pyboqa treats this term as excluded, and that this is the origin of the mismatch.
We should make sure that, when multiple HPOA lines contain the same HPO term but cite different PMIDs as sources, we create a single frequency term.
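To make the idea concrete, here is a minimal sketch of one way to merge such lines. It assumes that pooling the n/m counts across sources is the merge we want, which still has to be decided. The names `pool_frequencies` and `is_excluded` are made up for illustration and are not taken from DiseaseDataParseIngest, pyboqa, or phenol. Non-ratio frequencies (such as an HPO frequency term) are only collected here, not converted to counts.

```python
import re
from collections import defaultdict

# Matches ratio frequencies such as "20/78" or "0/5".
RATIO = re.compile(r"^(\d+)/(\d+)$")


def pool_frequencies(rows):
    """Pool frequencies per (disease, HPO term) pair.

    rows: iterable of (disease_id, hpo_id, frequency) tuples, one per HPOA line.
    Ratio frequencies are summed; anything else is kept as-is under "other".
    """
    pooled = defaultdict(lambda: {"observed": 0, "total": 0, "other": []})
    for disease_id, hpo_id, frequency in rows:
        entry = pooled[(disease_id, hpo_id)]
        match = RATIO.match(frequency)
        if match:
            entry["observed"] += int(match.group(1))
            entry["total"] += int(match.group(2))
        elif frequency:
            entry["other"].append(frequency)
    return dict(pooled)


def is_excluded(entry):
    """Assumption: excluded only if every source reports 0/m and nothing else."""
    return entry["total"] > 0 and entry["observed"] == 0 and not entry["other"]


# The three lines from this issue, reduced to the relevant columns.
rows = [
    ("OMIM:191100", "HP:0012798", "20/78"),
    ("OMIM:191100", "HP:0012798", "HP:0040284"),
    ("OMIM:191100", "HP:0012798", "0/5"),
]
entry = pool_frequencies(rows)[("OMIM:191100", "HP:0012798")]
print(entry["observed"], entry["total"], is_excluded(entry))  # 20 83 False
```

With this rule the term ends up in exactly one set (included, with a pooled 20/83) instead of in both.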
I think eventually we want to use phenol for parsing phenotype-disease associations, but before deleting this issue I want to check:
- Whether this problem turns up there, too.
- What disease-phenotype data will look like in Exomiser.