Commit f73161f

relevant comment max seq len ignore
1 parent ed9247a commit f73161f

1 file changed: +8 -0 lines changed

chebai_proteins/preprocessing/datasets/deepGO/go_uniprot.py

Lines changed: 8 additions & 0 deletions
@@ -457,6 +457,14 @@ def _get_swiss_to_go_mapping(self) -> pd.DataFrame:

             if not record.sequence or len(record.sequence) > self.max_sequence_length:
                 # Consider protein with only sequence representation and seq. length not greater than max seq. length
+
+                # DeepGO1 paper ignores proteins with sequence length greater than 1002: https://github.com/bio-ontology-research-group/deepgo/blob/master/aaindex.py#L9-L14
+                # But DeepGO2 paper truncates the sequence to 1000: https://github.com/bio-ontology-research-group/deepgo2/blob/main/deepgo/aminoacids.py#L26-L33
+                # Latest Discussion: https://github.com/ChEB-AI/python-chebai/issues/36#issuecomment-2385693976
+                # So, we ignore proteins with sequence length greater than max_sequence_length
+                # The rationale is that with only a partial representation of the protein sequence, the model may not learn effectively.
+                # Also, proteins longer than 1002 are only 3.32% of the total proteins in Swiss-Prot dataset.
+                # https://github.com/ChEB-AI/python-chebai/issues/36#issuecomment-2431460448
                 continue

             if any(aa in AMBIGUOUS_AMINO_ACIDS for aa in record.sequence):
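
The comments added in this hunk contrast two ways of handling long proteins: dropping them (the behaviour kept here) and truncating them (the behaviour the comment attributes to DeepGO2). The following is a minimal sketch of that difference on toy data, not the repository's implementation. The function names, the toy sequences, and the default thresholds are hypothetical and chosen only to mirror the numbers quoted in the comments.

```python
from typing import Iterable, List

# Assumed threshold, mirroring the 1002 figure quoted in the commit comment.
MAX_SEQUENCE_LENGTH = 1002


def drop_long_sequences(
    sequences: Iterable[str], max_len: int = MAX_SEQUENCE_LENGTH
) -> List[str]:
    """Keep only non-empty sequences whose length does not exceed max_len."""
    return [seq for seq in sequences if seq and len(seq) <= max_len]


def truncate_sequences(sequences: Iterable[str], max_len: int = 1000) -> List[str]:
    """Keep every non-empty sequence, cut to at most max_len residues."""
    return [seq[:max_len] for seq in sequences if seq]


# Hypothetical toy input: a short protein, one exactly at the limit,
# one over the limit, and an empty sequence.
toy = ["MKT", "A" * 1002, "A" * 1500, ""]

kept = drop_long_sequences(toy)   # the over-length and empty entries are skipped
cut = truncate_sequences(toy)     # only the empty entry is skipped

print([len(s) for s in kept])
print([len(s) for s in cut])
```

Dropping keeps every retained sequence complete at the cost of a smaller dataset, while truncation keeps more proteins but feeds the model partial sequences, which is the trade-off the commit comment describes.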
