-
Notifications
You must be signed in to change notification settings - Fork 8
Tutorial
In the following examples we will start with a file of DNA sequences for a set of genes of interest and we will end up with a mapping of GO terms for each sequence.
Starting with a file of DNA sequences (called "genes.fasta"), we first want to get the longest ORF for each gene and translate those sequences. In the simplest case, we can just specify the input and output files. For example,
hmmer2go getorf -i genes.fasta -o genes_orfs.faa --verbose
There are additional options that we could specify with this command. One useful option is to specify the minimum length of ORFs to report. The default minimum ORF length is 80 nt, but would could increase this.
hmmer2go getorf -i genes.fasta -o genes_orfs.faa -l 100 --verbose
For more information on this command, and for an explanation of each option, read the documentation with:
perldoc hmmer2go getorf
With the ORFs obtained from the hmmer2go getorf command we can now search for domain matches.
hmmer2go search -i genes_orfs.faa -d Pfam-A.hmm
The above command will create three files: genes_orfs_hmmscan-pfamA.out, genes_orfs_hmmscan-pfamA.domtblout, and genes_orfs_hmmscan-pfamA.tblout
We will now use the table of domain matches to map GO terms. To do this we first need to download the Pfam->Gene Ontology mappings. This can be done with a single command:
hmmer2go fetch
The above command creates the file: pfam2go.
Now we can map the protein domain matches to GO terms.
hmmer2go mapterms -i genes_orfs_hmmscan-pfamA.tblout -p pfam2go -o genes_orfs_hmmscan-pfamA_GO.tsv --map
This last command will create two output files: genes_orfs_hmmscan-pfamA_GO.tsv, and genes_orfs_hmmscan-pfamA_GO_GOterm_mapping.tblout
The first output file is a tab-delimited table with a description of each domain, including the GO terms and the associated functions. The last file is a two column table with the sequence name in the first column and the GO terms associated with that sequence in the second column.