Skip to content
Evan Staton edited this page Jun 9, 2014 · 21 revisions

In the following examples we will start with a file of DNA sequences for a set of genes of interest and we will end up with a mapping of GO terms for each sequence.


Finding Open Reading Frames (ORFs)

Starting with a file of DNA sequences (called "genes.fasta"), we first want to get the longest ORF for each gene and translate those sequences. In the simplest case, we can just specify the input and output files. For example,

hmmer2go getorf -i genes.fasta -o genes_orfs.faa --verbose

There are additional options that we could specify with this command. One useful option is to specify the minimum length of ORFs to report. The default minimum ORF length is 80 nt, but would could increase this.

hmmer2go getorf -i genes.fasta -o genes_orfs.faa -l 100 --verbose

For more information on this command, and for an explanation of each option, read the documentation with:

perldoc hmmer2go getorf

Finding protein domain matches

With the ORFs obtained from the hmmer2go getorf command we can now search for domain matches.

hmmer2go search -i genes_orfs.faa -d Pfam-A.hmm 

The above command will create three files: genes_orfs_hmmscan-pfamA.out, genes_orfs_hmmscan-pfamA.domtblout, and genes_orfs_hmmscan-pfamA.tblout

Obtaining Pfam->GO mappings

We will now use the table of domain matches to map GO terms. To do this we first need to download the Pfam->Gene Ontology mappings. This can be done with a single command:

hmmer2go fetch

The above command creates the file: pfam2go.

Mapping domain matches to GO terms

Now we can map the protein domain matches to GO terms.

hmmer2go mapterms -i genes_orfs_hmmscan-pfamA.tblout -p pfam2go -o genes_orfs_hmmscan-pfamA_GO.tsv --map

This last command will create two output files: genes_orfs_hmmscan-pfamA_GO.tsv, and genes_orfs_hmmscan-pfamA_GO_GOterm_mapping.tblout

The first output file is a tab-delimited table with a description of each domain, including the GO terms and the associated functions. The last file is a two column table with the sequence name in the first column and the GO terms associated with that sequence in the second column.

Clone this wiki locally