-
Notifications
You must be signed in to change notification settings - Fork 8
Tutorial
In the following examples we will start with a file of DNA sequences for a set of genes of interest and we will end up with a mapping of GO terms for each sequence.
Starting with a file of DNA sequences (called "genes.fasta"), we first want to get the longest ORF for each gene and translate those sequences. In the simplest case, we can just specify the input and output files. For example,
hmmer2go getorf -i genes.fasta -o genes_orfs_aa.faa --verbose
We can easily get the nucleotides for those translated ORFs by specifying the output type with the "-t" option:
hmmer2go getorf -i genes.fasta -o genes_orfs_nt.fas -t 2 --verbose
There are additional options that we could specify with this command. One useful option is to specify the minimum length of ORFs to report. The default minimum ORF length is 80 nt, but would could increase this.
hmmer2go getorf -i genes.fasta -o genes_orfs.faa -l 100 --verbose
For some data sets, it may be of interest to find all the ORFs for a particular sequence, like a genomic contig, for example. That can be accomplished with the option to specify all ORFs should be returned.
hmmer2go getorf -i contigs.fasta -o contigs_orfs_all.faa --all --verbose
For more information on this command, and for an explanation of each option, read the documentation with:
hmmer2go getorf --man
With the ORFs obtained from the hmmer2go getorf command we can now search for domain matches.
hmmer2go search -i genes_orfs.faa -d Pfam-A.hmm -n 4
The above command specifies 4 CPUs to be used, and when completed will produce three files:
genes_orfs_Pfam-A.out,
genes_orfs_Pfam-A.domtblout,
and genes_orfs_Pfam-A.tblout
We will now use the table of domain matches to map GO terms. To do this we first need to download the Pfam->Gene Ontology mappings. This can be done with a single command:
hmmer2go fetchmap -o pfam2go
The above command creates the file: pfam2go.
Now we can map the protein domain matches to GO terms.
hmmer2go mapterms -i genes_orfs_Pfam-A.tblout -p pfam2go -o genes_orfs_Pfam-A_GO.tsv --map
This last command will create two output files:
genes_orfs_Pfam-A_GO.tsv,
and genes_orfs_Pfam-A_GO_GOterm_mapping.tsv
The first output file is a tab-delimited table with a description of each domain, including the GO terms and the associated functions. The last file is a two column table with the sequence name in the first column and the GO terms associated with that sequence in the second column.
Some statistical analysis packages, such as Ontologizer, require a GAF file as input. This can be created with the hmmer2go map2gaf command.
hmmer2go map2gaf -i genes_orfs_Pfam-A_GO_GOterm_mapping.tsv -o genes_orfs_Pfam-A_GO_GOterm_mapping.gaf -s 'Helianthus annuus'
The species name is required and it should be given in quotes as shown above.