Did you compare your results with BioVec by EhsaneddinAsgari #10
Thank you so much for your great work.
I read the paper "DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences" by Asgari, E., Poerner, N., McHardy, A., & Mofrad, M. (https://github.com/ehsanasgari/DeepPrime2Sec).
In the paper, the authors use five kinds of features to predict protein secondary structure from the primary sequence. These five features are:
1. One-hot vector representation (onehot, length: 21): a vector indicating which amino acid occurs at each position, where each index in the vector marks the presence or absence of that amino acid.
2. ProtVec embedding (protvec, length: 50): a representation trained with a Skip-gram neural network on protein amino acid sequences (ProtVec). The only difference is character-level training instead of n-gram based training.
3. Contextualized embedding (elmo, length: 300): the contextualized embedding of the amino acids, trained in the course of language modeling and known as ELMo, used as a new feature for the secondary structure task. The contextualized embedding is the concatenation of the hidden states of a deep bidirectional language model. The main difference between the ProtVec embedding and the ELMo embedding is that the ProtVec embedding for a given amino acid or amino acid k-mer is fixed, so the representation is the same in different sequences. The contextualized embedding, as its name suggests, changes based on the context. The ELMo embedding of amino acids is trained on the UniRef50 dataset with a dimension size of 300.
4. Position Specific Scoring Matrix (PSSM) features (pssm, length: 21): amino acid substitution scores calculated from a multiple sequence alignment of homologous sequences, for each position in the protein sequence.
5. Biophysical features (biophysical, length: 16): for each amino acid, a normalized vector of its biophysical properties, e.g., flexibility, instability, surface accessibility, kd-hydrophobicity, and hydrophilicity.
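To check that I understand the simplest of these features, here is a minimal sketch of the one-hot representation. It assumes the length of 21 comes from the 20 standard amino acids plus one catch-all index for unknown residues; the alphabet order and the name `one_hot_encode` are my own choices, not taken from the DeepPrime2Sec code.

```python
# Assumption: 20 standard amino acids plus one catch-all index for
# unknown or non-standard residues, which gives the quoted length of 21.
# The alphabet and ordering actually used by DeepPrime2Sec may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNKNOWN_INDEX = len(AMINO_ACIDS)  # index 20
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 21-dimensional 0/1 vector per residue in the sequence."""
    encoded = []
    for residue in sequence.upper():
        vector = [0] * (len(AMINO_ACIDS) + 1)
        # Residues outside the alphabet fall into the catch-all slot.
        vector[AA_TO_INDEX.get(residue, UNKNOWN_INDEX)] = 1
        encoded.append(vector)
    return encoded


features = one_hot_encode("MKTAYIAKQR")
print(len(features), len(features[0]))
```

The result is a list with one row per residue, so a sequence of length L becomes an L x 21 matrix.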
However, the paper does not show how to extract these features. I am not sure whether you compared your embedding with this work.
By the way, in my ML project I want to embed a protein into a vector and then use DL models for drug-protein interaction prediction. Do you have an example showing how to use your embedding in a way similar to RDKit, e.g.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=512)?
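To make the request concrete, the kind of call I have in mind is sketched below. `embed_protein` is a made-up name, not a function from this repository, and its body is only an amino acid composition stand-in so that the snippet runs. What I am looking for is the equivalent one-line call that returns your embedding.

```python
# Hypothetical interface sketch. The body is a placeholder (normalized
# amino acid composition), NOT the embedding from this repository.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed_protein(sequence):
    """Map a protein sequence of any length to a fixed-length vector (20)."""
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        # Empty sequence or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


# One call per protein, analogous to computing one fingerprint per molecule.
protein_vector = embed_protein("MKTAYIAKQR")
print(len(protein_vector))
```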
Many thanks!