Outputs of PrecisionProDB
Xiaolong Cao edited this page Dec 23, 2020
·
4 revisions
PrecisionProDB generates different output files depending on the input settings.
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m gnomAD.variant.txt.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o text_variant

Three files will be generated in the examples folder.
- text_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
- text_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
- text_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
Note:
- Protein names and descriptions in the FASTA files are the same as in the input protein file, followed by a Tab symbol (\t) and changed or unchanged to indicate whether the protein sequence is altered.
- e.g.,
  ENSP00000328207.6|ENST00000328596.10|ENSG00000186891.14|OTTHUMG00000001414|OTTHUMT00000004085.1|TNFRSF18-201|TNFRSF18|255 unchanged
  ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144 changed
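The header convention in the note above can be read back programmatically. The following is a minimal sketch, assuming the status is always the last tab-separated field of the header line; the function name `split_status` is hypothetical and not part of PrecisionProDB.

```python
def split_status(header):
    """Split a FASTA header into (description, status).

    Assumption: the status ("changed" or "unchanged") is the last
    tab-separated field, as described in the note above.
    """
    line = header.rstrip("\n").lstrip(">")
    description, sep, status = line.rpartition("\t")
    if not sep or status not in ("changed", "unchanged"):
        # No recognizable status field: return the header as-is.
        return line, None
    return description, status


# Header taken from the example above, with an explicit tab before the status.
example = (
    ">ENSP00000424920.1|ENST00000502739.5|ENSG00000162458.13|"
    "OTTHUMG00000003079|OTTHUMT00000368044.1|FBLIM1-210|FBLIM1|144\tchanged"
)
description, status = split_status(example)
print(status)
```

Headers without a tab-separated status, such as those in the original input file, are returned with a status of `None`.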
cd Path_of_PrecisionProDB/examples
python ../src/PrecisionProDB.py -m celline.vcf.gz -g GENCODE.genome.fa.gz -p GENCODE.protein.fa.gz -f GENCODE.gtf.gz -o vcf_variant

Five files will be generated in the examples folder.
- vcf_variant.pergeno.aa_mutations.csv: annotations of amino acid changes.
- vcf_variant.pergeno.protein_all.fa: all proteins after incorporating the variants.
- vcf_variant.pergeno.protein_changed.fa: all proteins that differ from the input protein sequences after incorporating the variants.
- vcf_variant.vcf2mutation_1.tsv: variants extracted from the VCF file in text format, containing the first alternative alleles.
- vcf_variant.vcf2mutation_2.tsv: variants extracted from the VCF file in text format, containing the second alternative alleles.
Note:
- For altered proteins, __1, __2, or __12 will be added to the ID of the protein.
  - __1 and __2 mean that the allele of the protein comes from the first or second variant file, respectively.
  - __12 means that the first and second alleles that alter the protein sequence are the same.
- e.g.,
  >ENSP00000308367.7|ENST00000312413.10|ENSG00000011021.23|OTTHUMG00000002299|-|CLCN6-201|CLCN6|847__12 changed
  ENSP00000263934.6|ENST00000263934.10|ENSG00000054523.18|OTTHUMG00000001817|OTTHUMT00000005103.1|KIF1B-201|KIF1B|1770__2 changed
  ENSP00000332771.4|ENST00000331433.5|ENSG00000186510.12|OTTHUMG00000009529|OTTHUMT00000026326.1|CLCNKA-201|CLCNKA|687__1 changed
  ENSP00000493376.2|ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094|OTTHUMT00000003223.1|OR4F5-202|OR4F5|326 unchanged
- The variant file looks like:
  chr   pos    ref  alt
  chr1  52238  T    G
  chr1  53138  TAA  T
  chr1  55249  C    CTATGG
  chr1  55299  C    T
  chr1  61442  A    G
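The `__1`, `__2`, and `__12` suffixes described in the note above can be separated from the protein description with a short helper. This is only a sketch, assuming the suffix sits at the very end of the description (before the tab and status field); `parse_allele_suffix` is a hypothetical name, not a PrecisionProDB function.

```python
import re

# Try "12" first so that "__12" is not mistaken for "__1".
_ALLELE_SUFFIX = re.compile(r"__(12|1|2)$")


def parse_allele_suffix(description):
    """Return (base_description, allele).

    allele is "1", "2", or "12" for altered proteins, and None when
    no suffix is present (unchanged proteins).
    """
    match = _ALLELE_SUFFIX.search(description)
    if match is None:
        return description, None
    return description[:match.start()], match.group(1)


# Descriptions shortened from the examples above (status field removed).
for desc in ("CLCN6-201|CLCN6|847__12", "KIF1B-201|KIF1B|1770__2", "OR4F5-202|OR4F5|326"):
    print(parse_allele_suffix(desc))
```

An allele value of `"12"` indicates that both variant files yield the same altered sequence, so a single record stands for both alleles.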
The output is the same as above. Three additional files will be generated.
When running PrecisionProDB for UniProt proteins, Ensembl models are used first to generate the changed proteins, and the UniProt proteins are then linked to the Ensembl proteins. UniProt proteins without an identical Ensembl model will not be changed.
- PREFIX.uniprot_all.fa: all UniProt proteins after incorporating the variants.
- PREFIX.uniprot_changed.fa: all UniProt proteins that differ from the input protein sequences after incorporating the variants.
- PREFIX.uniprot_changed.tsv: links between UniProt_ID and other protein_id. It looks like:
  uniprot_id                        ref_id
  tr|A0A075B6H5|A0A075B6H5_HUMAN    ENSP00000368747.3
  sp|A0A075B6H8|KVD42_HUMAN         ENSP00000374813.3
  sp|A0A075B6I1|LV460_HUMAN         ENSP00000374819.2
  sp|A0A075B6I3|LVK55_HUMAN         ENSP00000374821.3
  sp|A0A075B6I4|LVX54_HUMAN         ENSP00000374822.2
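The link table can be loaded into a dictionary for lookups. The sketch below assumes the file is tab-delimited with a header row containing `uniprot_id` and `ref_id`, as the excerpt above suggests; `load_uniprot_links` is a hypothetical helper, and the sample data is copied from the rows shown above.

```python
import csv
import io


def load_uniprot_links(handle):
    """Map each uniprot_id to its linked ref_id.

    Assumption: tab-delimited text with a header row that contains
    the columns "uniprot_id" and "ref_id".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["uniprot_id"]: row["ref_id"] for row in reader}


# In-memory stand-in for PREFIX.uniprot_changed.tsv.
sample = io.StringIO(
    "uniprot_id\tref_id\n"
    "tr|A0A075B6H5|A0A075B6H5_HUMAN\tENSP00000368747.3\n"
    "sp|A0A075B6H8|KVD42_HUMAN\tENSP00000374813.3\n"
)
links = load_uniprot_links(sample)
print(links["sp|A0A075B6H8|KVD42_HUMAN"])
```

For a real run, the `io.StringIO` object would be replaced by an open file handle, for example `open("PREFIX.uniprot_changed.tsv", newline="")`.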