
PREFIX.pergeno.aa_mutations.csv

Xiaolong Cao edited this page Apr 1, 2021 · 5 revisions

This file contains amino acid change annotations for each protein.

Files are in CSV format, with a tab (\t) as the separator. Shown transposed, the table looks like:

protein_id ENSP00000437362 ENSP00000446015 ENSP00000450505 ENSP00000451203 ENSP00000452431
protein_id_fasta ENSP00000437362.1 ENSP00000446015.1 ENSP00000450505.1 ENSP00000451203.1 ENSP00000452431.1
seqname 14 14 14 14 14
strand + + + + +
frameChange
stopGain
AA_stopGain
stopLoss
stopLoss_pos
n_variant_AA 1.0 2.0 2.0 1.0 1.0
n_deletion_AA
n_insertion_AA
variant_AA F70S(14-21888371-T-C) P50Q(14-21924450-C-A);Q76E(14-21924527-C-G) E77K(14-21979008-G-A);S78G(14-21979011-A-G) S103L(14-22086905-C-T) T25P(14-22124030-A-C)
insertion_AA
deletion_AA
len_ref_AA 113 116 113 121 109
len_alt_AA 113.0 116.0 113.0 121.0 109.0
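As a minimal sketch of reading such a file, the snippet below uses Python's standard csv module with a tab delimiter. It assumes the file on disk has one row per protein and a header row with the column names listed on this page (the table above is shown transposed). The function name read_aa_mutations and the in-memory sample are hypothetical and only for illustration.

```python
import csv
import io


def read_aa_mutations(handle):
    """Read a tab-separated aa_mutations table into a list of dicts (one per protein)."""
    # The files are described as CSV with "\t" as the separator.
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


# Tiny in-memory sample built from three of the columns in the table above.
sample = (
    "protein_id\tseqname\tvariant_AA\n"
    "ENSP00000437362\t14\tF70S(14-21888371-T-C)\n"
    "ENSP00000446015\t14\tP50Q(14-21924450-C-A);Q76E(14-21924527-C-G)\n"
)

rows = read_aa_mutations(io.StringIO(sample))
for row in rows:
    print(row["protein_id"], row["variant_AA"])
```

For a real file, the same function can be given an open file handle, for example `open(path, newline="")`, where `path` points to a PREFIX.pergeno.aa_mutations.csv file.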

Columns:

  • protein_id: protein ID used in the perGeno analysis
  • protein_id_fasta: protein ID as stored in the FASTA file
  • seqname: chromosome name
  • strand: + or -. The chromosome strand on which the protein is encoded.
  • frameChange: True or False. Whether there is a frame-changing variation.
    • A frame change is defined by the final reading frame. For example, if there is a single-nucleotide insertion and a single-nucleotide deletion, the reading frame is considered unchanged because the final reading frame is unchanged.
  • stopGain: True or False. Whether there is a stop-gain variation.
  • AA_stopGain: A string describing the amino acid (AA) that is changed to a stop codon. For example, E90*(chr18-63712604-G-T) means that the 90th amino acid, E, in the reference protein sequence is mutated to a stop codon, and that the causal variation is chr18-63712604-G-T. A value such as -103*(chr8-18067100-T-TGCACCTGTGCTGTATATCTAAGACATACA) means that the variations change the protein sequence in a complex way, so that the reference protein sequence is shorter or no AA can be assigned at this site. Here, 103 is simply the codon number counted from the start codon in the transcript sequence. In this example, the insertion introduced a stop codon.
  • stopLoss: True or False. Whether there is a stop-loss mutation.
  • stopLoss_pos: A string describing the position of the stop loss in the protein sequence. For example, 179(chr17-7254884-T-G) means that the AA before the stop codon is the 179th in the reference protein sequence, and that variant chr17-7254884-T-G caused the stop loss. A value such as 187() indicates a stop loss caused by variants other than a substitution.
  • nonStandardStopCodon: The value is 1 if translation stops at a position that is not a stop codon.
  • n_variant_AA: number of AA substitutions.
  • n_deletion_AA: number of AA deletions.
  • n_insertion_AA: number of AA insertions.
  • variant_AA: A string describing the substituted AAs. For example, G44E(chr22-25763322-G-A);W547C(chr22-25770933-G-C);W661R(chr22-25777694-T-C);H1119Q(chr22-25843883-C-A). G44E means that the G at position 44 is changed to E, and that this is caused by variant chr22-25763322-G-A.
  • insertion_AA: A string describing the inserted AAs. For example, -287P(chr1-47438996-T-TCCGCAC);-287H(chr1-47438996-T-TCCGCAC) means that two AAs, P and H, were inserted after the 287th AA, caused by variant chr1-47438996-T-TCCGCAC.
  • deletion_AA: A string describing the deleted AAs. For example, G1122-(chr21-45504511-CGGCCCCCCA-C);P1123-(chr21-45504511-CGGCCCCCCA-C);P1124-(chr21-45504511-CGGCCCCCCA-C) means that three AAs, GPP, were deleted due to variant chr21-45504511-CGGCCCCCCA-C.
  • len_ref_AA: length of the provided protein sequence
  • len_alt_AA: length of the changed protein sequence
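The annotation strings in AA_stopGain, variant_AA, insertion_AA and deletion_AA share one visible pattern: a reference AA (or "-"), a position, an altered AA (or "*" or "-"), and the causal variant in parentheses, with entries joined by ";". The sketch below parses that pattern with a regular expression. The names AAChange and parse_aa_changes are hypothetical, and the pattern is inferred only from the examples on this page, so real files may contain forms it does not cover. It does not handle stopLoss_pos, which has a different shape, such as 179(chr17-7254884-T-G).

```python
import re
from typing import List, NamedTuple


class AAChange(NamedTuple):
    ref: str      # reference AA, or "-" when no AA can be assigned
    pos: int      # position in the reference protein (or codon number)
    alt: str      # altered AA, "*" for a stop codon, or "-" for a deletion
    variant: str  # causal variant, e.g. "chr22-25763322-G-A"; may be empty


# Assumed entry shape: <ref><pos><alt>(<variant>)
_ENTRY = re.compile(r"^([A-Z-])(\d+)([A-Z*-])\((.*)\)$")


def parse_aa_changes(cell: str) -> List[AAChange]:
    """Split a ';'-joined annotation cell into structured changes."""
    changes = []
    for entry in cell.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # empty cell or trailing separator
        match = _ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognized annotation: {entry!r}")
        ref, pos, alt, variant = match.groups()
        changes.append(AAChange(ref, int(pos), alt, variant))
    return changes


for change in parse_aa_changes(
    "G44E(chr22-25763322-G-A);W547C(chr22-25770933-G-C)"
):
    print(change.ref, change.pos, change.alt, change.variant)
```

Returning named tuples keeps the position as an integer, which makes it straightforward to sort or filter changes by site.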

Note:

  • Some cells may be empty, which usually means False or 0.
  • Currently, frame-shift variations are annotated as a series of deletion_AA and insertion_AA entries.
  • For insertion_AA and deletion_AA annotations, if the change is caused by INDELs, the annotation string may look like "-443A()", where "-" means no AA.
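Because empty cells usually stand for False or 0, and counts appear as floats such as 1.0 in the example table, a reader of the file may want to normalize cells before using them. The helpers below are a small sketch of that idea. The names as_bool and as_count are hypothetical, and the assumption that boolean columns hold the literal text True or False comes from the column descriptions above rather than from inspecting a real file.

```python
def as_bool(cell: str) -> bool:
    """Interpret a boolean cell; an empty cell is treated as False."""
    # Assumption: True is written as the literal string "True".
    return cell.strip() == "True"


def as_count(cell: str) -> int:
    """Interpret a count cell such as '2.0'; an empty cell is treated as 0."""
    cell = cell.strip()
    return int(float(cell)) if cell else 0


print(as_bool(""), as_bool("True"))
print(as_count(""), as_count("2.0"))
```

Since the notes say empty cells "usually" mean False or 0, these defaults are a convenience and not a guarantee.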
