
Commit 227a014

update experiment evidence codes as per DeepGo SE
- #36 (comment)
1 parent 3a4e007 commit 227a014

3 files changed: 10 additions, 5 deletions

chebai/preprocessing/datasets/go_uniprot.py

Lines changed: 6 additions & 1 deletion
@@ -43,6 +43,11 @@
 "IEP",
 "TAS",
 "IC",
+"HTP",
+"HDA",
+"HMP",
+"HGI",
+"HEP",
 }

 # https://github.com/bio-ontology-research-group/deepgo/blob/d97447a05c108127fee97982fd2c57929b2cf7eb/aaindex.py#L8
@@ -414,7 +419,7 @@ def _get_swiss_to_go_mapping(self) -> pd.DataFrame:

 Quote from the DeepGo Paper:
 `We select proteins with annotations having experimental evidence codes
-(EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC) and filter the proteins by a
+`EXPERIMENTAL_EVIDENCE_CODES` and filter the proteins by a
 maximum length of 1002, ignoring proteins with ambiguous amino acid codes
 (B, O, J, U, X, Z) in their sequence.`


chebai/preprocessing/datasets/protein_pretraining.py

Lines changed: 2 additions & 2 deletions
@@ -96,8 +96,8 @@ def _download_required_data(self) -> str:
 def _parse_protein_data_for_pretraining(self) -> pd.DataFrame:
 """
 Parses the Swiss-Prot data and returns a DataFrame containing Swiss-Prot proteins which does not have any valid
-Gene Ontology(GO) label. A valid GO label is the one which has one of the following evidence code
-(EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC).
+Gene Ontology(GO) label. A valid GO label is the one which has one of the following evidence code defined in
+`EXPERIMENTAL_EVIDENCE_CODES`.

 The DataFrame includes the following columns:
 - "swiss_id": The unique identifier for each Swiss-Prot record.
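Because the pretraining set is defined as the proteins without any valid GO label, widening the evidence code set should shrink it. The sketch below illustrates that effect on plain dictionaries rather than the DataFrame the real method returns. The names `OLD_CODES`, `NEW_CODES`, `has_valid_go_label`, and `pretraining_ids` are hypothetical, and the record ids and GO ids are placeholders.

```python
# Illustrative copies of the code set before and after this commit.
OLD_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC"}
NEW_CODES = OLD_CODES | {"HTP", "HDA", "HMP", "HGI", "HEP"}


def has_valid_go_label(annotations, valid_codes):
    """True if any (go_id, evidence_code) pair uses an accepted code."""
    return any(code in valid_codes for _go_id, code in annotations)


def pretraining_ids(records, valid_codes):
    """Return ids of records that have no valid GO label at all."""
    return [
        swiss_id
        for swiss_id, annotations in records.items()
        if not has_valid_go_label(annotations, valid_codes)
    ]


# Placeholder records: id -> list of (go_id, evidence_code) pairs.
records = {
    "P_exp": [("GO:0000001", "IDA")],
    "P_htp": [("GO:0000002", "HDA")],
    "P_iea": [("GO:0000003", "IEA")],
    "P_none": [],
}

print(pretraining_ids(records, OLD_CODES))  # ['P_htp', 'P_iea', 'P_none']
print(pretraining_ids(records, NEW_CODES))  # ['P_iea', 'P_none']
```

In this toy data, the record annotated only with `HDA` moves out of the pretraining pool once the high-throughput codes are accepted.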

tests/unit/mock_data/ontology_mock_data.py

Lines changed: 2 additions & 2 deletions
@@ -668,8 +668,8 @@ def get_UniProt_raw_data() -> str:
 - **Swiss_Prot_11**: Has only Invalid GO class but lacks a sequence.

 Note:
-A valid GO label is the one which has one of the following evidence code
-(EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC).
+A valid GO label is the one which has one of the following evidence code defined in
+`EXPERIMENTAL_EVIDENCE_CODES`.

 Returns:
 str: The raw UniProt data in string format.

0 commit comments
