EDirect Wiki!

manny809dr edited this page Nov 28, 2016 · 1 revision

EDirect

Introduction

Entrez Direct (EDirect) is an advanced method for accessing the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a UNIX terminal window. Functions take search terms from command-line arguments. Individual operations are combined to build multi-step queries. Record retrieval and formatting normally complete the process. EDirect also provides an argument-driven function that simplifies the extraction of data from document summaries or other results that are returned in structured XML format. This can eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect commands and UNIX utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
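
As a rough illustration of how a query can pass from EDirect commands into ordinary UNIX utilities, the sketch below wraps a search, retrieval, and extraction pipeline in a shell function and hands the result to sort and uniq. The function name journal_counts is hypothetical and not part of EDirect; it assumes the PubMed document summary exposes the journal in a Source element.

```shell
# Hypothetical helper (not part of EDirect): tally the journals behind a PubMed query.
# Steps: esearch takes the search terms from the argument, efetch retrieves
# XML document summaries, xtract pulls out the Source element, and standard
# UNIX utilities count and rank the values.
# Needs EDirect on PATH and network access when it is actually called.
journal_counts() {
  esearch -db pubmed -query "$1" |
  efetch -format docsum |
  xtract -pattern DocumentSummary -element Source |
  sort | uniq -c | sort -nr
}

# Example call (left commented out because it contacts NCBI):
# journal_counts "opsin gene conversion"
```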

Installation

EDirect will run on UNIX and Macintosh computers that have the Perl language installed, and under the Cygwin UNIX-emulation environment on Windows PCs. To install the EDirect software, copy the following commands and paste them into a terminal window:

cd ~
perl -MNet::FTP -e \
'$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
$ftp->login; $ftp->binary;
$ftp->get("/entrez/entrezdirect/edirect.zip");'
unzip -u -q edirect.zip
rm edirect.zip
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh

This downloads several scripts into an "edirect" folder in the user's home directory, and allows immediate execution of programs in that location.

The setup.sh script then downloads any missing Perl modules, and may print an additional command for updating the PATH environment variable in the user's configuration file. Copy that command, if present, and paste it into the terminal window to complete the installation process. The editing instructions will look something like:

echo "export PATH=\$PATH:\$HOME/edirect" >> $HOME/.bash_profile

If the EDirect scripts will be moved to another location, the configuration file can instead be modified manually using a text editor.
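
One way to confirm that the installation is reachable from the current shell is to look up each program on the PATH. The sketch below is an assumption about how such a check might look; the function name check_edirect_tools is hypothetical, and the tool list covers only the commands used in the examples on this page.

```shell
# Hypothetical check (not part of EDirect): report which commands are on PATH.
check_edirect_tools() {
  for tool in esearch elink efilter efetch xtract; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

check_edirect_tools
```

If any line reports "missing", the PATH update described above has probably not taken effect in this terminal session.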

Examples

Searching and Filtering

esearch -db pubmed -query "opsin gene conversion" |
elink -related |
efilter -query "tetrachromacy"

XML Document Summaries

esearch -db pubmed -query "Garber ED [AUTH] AND PNAS [JOUR]" |
elink -related |
efilter -query "mouse" |
efetch -format docsum

This query will generate an XML document summary set:

<eSummaryResult>
<DocumentSummarySet status="OK">
  <DbBuild>Build150407-2207m.3</DbBuild>
  <DocumentSummary>
    <Id>19650888</Id>
    <PubDate>2009 Aug 3</PubDate>
    <EPubDate>2009 Aug 3</EPubDate>
    <Source>BMC Microbiol</Source>
    <Authors>
      <Author>
        <Name>Cano V</Name>
        <AuthType>Author</AuthType>
        <ClusterID></ClusterID>
      </Author>
      <Author>
        <Name>Moranta D</Name>
        ...
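
Given the structure shown above, a short xtract call could flatten each DocumentSummary into one row. The sketch below is illustrative only: the function name docsum_authors is hypothetical, and it assumes the element names Id, Source, Author, and Name appear as in the excerpt.

```shell
# Hypothetical helper: read docsum XML on stdin and print one row per record,
# with the record Id and Source followed by the Name of each Author block.
docsum_authors() {
  xtract -pattern DocumentSummary -element Id Source \
    -block Author -element Name
}

# Example use (commented out because it contacts NCBI):
# esearch -db pubmed -query "Garber ED [AUTH] AND PNAS [JOUR]" |
# efetch -format docsum |
# docsum_authors
```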

Gene Aliases

esearch -db gene -query "Liver cancer AND Homo sapiens" |
efetch -format docsum |
xtract -pattern DocumentSummary -element Name OtherAliases OtherDesignations

Genomic sequence fastas from RefSeq assembly for specified taxonomic designation

wget `esearch -db assembly -query "Leptospira alstonii" |
  efetch -format docsum |
  xtract -pattern FtpPath -sep "\n" -element FtpPath |
  grep GCF |
  awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}'`

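The awk step in the command above turns an assembly directory path into the path of its genomic FASTA file by repeating the last path component as the file name prefix. The sketch below runs that same awk program on a single made-up path so the transformation can be seen without contacting NCBI; the host and assembly name are placeholders, not real records.

```shell
# Made-up assembly directory path, for illustration only (not a real record).
ftp_path="ftp://ftp.example.org/genomes/all/GCF_000000000.1_EXAMPLE"

# Same awk program as in the wget command: split on "/", then append
# "/<last path component>_genomic.fna.gz" to the whole line.
genomic_url=$(printf '%s\n' "$ftp_path" |
  awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}')

echo "$genomic_url"
```

Note that this relies on the path having no trailing slash; with one, the last field would be empty and the file name would be wrong.
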
Get organellar contigs from GenBank

esearch -db nuccore -query "LKAM01" | efetch -format fasta

Retrieve fasta for genes on a specified sequence

asn2fasta -id NC_000023 -feats gene_fasta
efetch -db nuccore -id NC_000023 -format gene_fasta

Complete taxonomy (KPCOFG) for taxids.

efetch -db taxonomy -id 9606,1234,81726 -format xml |
xtract -pattern Taxon -tab "," -first TaxId ScientificName \
  -group Taxon -KING "(-)" -PHYL "(-)" -CLSS "(-)" -ORDR "(-)" -FMLY "(-)" -GNUS "(-)" \
  -block "*/Taxon" -match "Rank:kingdom" -KING ScientificName \
  -block "*/Taxon" -match "Rank:phylum" -PHYL ScientificName \
  -block "*/Taxon" -match "Rank:class" -CLSS ScientificName \
  -block "*/Taxon" -match "Rank:order" -ORDR ScientificName \
  -block "*/Taxon" -match "Rank:family" -FMLY ScientificName \
  -block "*/Taxon" -match "Rank:genus" -GNUS ScientificName \
  -group Taxon -tab "," -element "&KING" "&PHYL" "&CLSS" "&ORDR" "&FMLY" "&GNUS"
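
The -id argument above takes a comma-separated list. If the taxids are instead held one per line (in a file, for example), a standard utility such as paste could join them first. This is a minimal sketch under that assumption; the variable name taxid_list is illustrative, and the three IDs are the ones from the example above.

```shell
# Input: one taxid per line (here supplied inline with printf).
# paste -s joins all lines into one; -d, uses a comma as the separator.
taxid_list=$(printf '%s\n' 9606 1234 81726 | paste -s -d, -)

echo "$taxid_list"

# The joined list could then be passed on (commented out, contacts NCBI):
# efetch -db taxonomy -id "$taxid_list" -format xml
```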

All the protein sequences from both the bacterial and archaeal complete genome sequences [User request.]

esearch -db assembly -query '("Bacteria"[Organism] OR "Archaea"[Organism]) AND (latest[filter] AND "complete genome"[filter] AND all[filter] NOT anomalous[filter])' |
elink -target nuccore -batch |
elink -target protein -batch |
efetch -db protein -format fasta

WARNING: Large result set.
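
Before fetching a result set of this size, it may be worth checking how many records the search matched. The sketch below assumes that esearch writes an ENTREZ_DIRECT message containing a Count element, and reads it with xtract. The function names hit_count and small_enough and the limit of 10000 are hypothetical choices for illustration.

```shell
# Hypothetical helper: read esearch output on stdin and print the hit count.
# Assumes the message contains <Count> inside <ENTREZ_DIRECT>.
hit_count() {
  xtract -pattern ENTREZ_DIRECT -element Count
}

# Hypothetical threshold check: succeeds only if count ($1) <= limit ($2).
small_enough() {
  [ "$1" -le "$2" ]
}

# Example use (commented out because it contacts NCBI):
# n=$(esearch -db assembly -query '"Archaea"[Organism] AND latest[filter]' | hit_count)
# if small_enough "$n" 10000; then echo "ok to fetch $n records"; fi
```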

Resources

Comments