Munge GenBank files into FASTA sequences and tab-separated metadata.
This little C program will extract the following information from a GenBank file:
- name
- accession
- length
- submission date
- host
- country (supports both `/country` and `/geo_loc_name` qualifiers)
- collection date
In addition, dates are reformatted (e.g. 31-DEC-2001 becomes 2001-12-31), which makes them more digestible to downstream software like BEAST, and country names are cleaned and matched to ISO3 codes.
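The date rewriting described above can be sketched in C roughly as follows. This is a minimal illustration, not the code used in gbmunge: the function names `month_number` and `reformat_date` are hypothetical, and it assumes only three input shapes (`DD-MON-YYYY`, `MON-YYYY`, and `YYYY`), keeping whatever precision the input has.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Map a three-letter month abbreviation (any case) to 1..12, or 0 if unknown. */
static int month_number(const char *mon)
{
    static const char *names[] = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"};
    for (int i = 0; i < 12; i++) {
        if (toupper((unsigned char)mon[0]) == names[i][0] &&
            toupper((unsigned char)mon[1]) == names[i][1] &&
            toupper((unsigned char)mon[2]) == names[i][2])
            return i + 1;
    }
    return 0;
}

/* Hypothetical helper: rewrite a GenBank-style date as YYYY-MM-DD,
 * YYYY-MM, or YYYY, depending on the precision of the input.
 * Returns 1 on success, 0 if the input is not recognised. */
int reformat_date(const char *in, char *out, size_t outlen)
{
    int day, year, month, n;
    char mon[4];
    char extra; /* catches trailing characters, which make the match fail */

    if (sscanf(in, "%2d-%3[A-Za-z]-%4d%c", &day, mon, &year, &extra) == 3) {
        month = month_number(mon);
        if (month == 0 || day < 1 || day > 31)
            return 0;
        n = snprintf(out, outlen, "%04d-%02d-%02d", year, month, day);
    } else if (sscanf(in, "%3[A-Za-z]-%4d%c", mon, &year, &extra) == 2) {
        month = month_number(mon);
        if (month == 0)
            return 0;
        n = snprintf(out, outlen, "%04d-%02d", year, month);
    } else if (sscanf(in, "%4d%c", &year, &extra) == 1) {
        n = snprintf(out, outlen, "%04d", year);
    } else {
        return 0;
    }
    return n >= 0 && (size_t)n < outlen;
}
```

Calling `reformat_date("31-DEC-2001", buf, sizeof buf)` fills `buf` with `2001-12-31`. A month-only input such as `Apr-2012` yields `2012-04`.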
`gbmunge [-h] -i <Genbank_file> -f <sequence_output> -o <metadata_output> [-t] [-s]`

- `Genbank_file`: filename of GenBank-formatted sequence file (normally downloaded as `sequence.gb`)
- `sequence_output`: filename of FASTA output
- `metadata_output`: filename of tab-separated metadata
- `-t`: flag to
  - only output sequences with collection dates (of any precision)
  - name sequences as `{accession}_{collection_date}`
- `-s`: flag to include sequences in tab-delimited file
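As an illustration of the `-t` behaviour, the sketch below builds a `{accession}_{collection_date}` name and signals when a record has no collection date and should be skipped. The function `dated_name` is hypothetical, and treating an empty string or `NA` as "no date" is an assumption, not something taken from the gbmunge source.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the documented -t behaviour.
 * Writes "{accession}_{collection_date}" into out and returns 1 when a
 * collection date is present; returns 0 (skip the record) otherwise.
 * Assumption: a missing date is NULL, empty, or the literal "NA". */
int dated_name(const char *accession, const char *collection_date,
               char *out, size_t outlen)
{
    int n;

    if (collection_date == NULL || collection_date[0] == '\0' ||
        strcmp(collection_date, "NA") == 0)
        return 0;

    n = snprintf(out, outlen, "%s_%s", accession, collection_date);
    return n >= 0 && (size_t)n < outlen; /* fail if the name was truncated */
}
```

For example, `dated_name("KC776174", "2012-04", buf, sizeof buf)` produces `KC776174_2012-04`, so a date with only month precision still qualifies.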
To build from source:

    git clone https://github.com/sdwfrost/gbmunge
    cd gbmunge
    make

This will build `gbmunge` in the `src/` directory. Add the directory to the path, or move the executable somewhere.

There are several options for building on Windows:

- Using WSL (Windows Subsystem for Linux) (Recommended):

      # Install WSL with Ubuntu, then in the WSL terminal:
      sudo apt update
      sudo apt install build-essential
      cd gbmunge
      make

- Using MSYS2/MinGW:

      # Install MSYS2, then in MSYS2 terminal:
      pacman -S mingw-w64-x86_64-gcc make
      cd gbmunge
      make

- Using Visual Studio with vcpkg: Building natively with MSVC requires a POSIX-compatible regex library:

      # Install PCRE2 via vcpkg
      vcpkg install pcre2:x64-windows
      # Compile with PCRE2 support (modify Makefile or compile manually)
      cl /DGBMUNGE_USE_PCRE2 /I<vcpkg_include_path> gbfp.c gbmunge.c /link pcre2-8.lib

- Using TRE regex library: Download TRE from https://github.com/laurikari/tre and place the POSIX-compatible `regex.h` in the `src/` directory.

A GenBank file of MERS coronavirus sequences is provided in the `test/` directory.

    cd test
    ../src/gbmunge -i sequence.gb -f sequence.fas -o sequence.txt -t

Here are the first few lines of output in `sequence.txt`:

| name | accession | length | submission_date | host | country_original | country | countrycode | collection_date |
|---|---|---|---|---|---|---|---|---|
| JX869059_2012-06-13 | JX869059 | 30119 | 2012-12-04 | Homo sapiens | NA | NA | NA | 2012-06-13 |
| KC164505_2012-09-11 | KC164505 | 30111 | 2013-07-12 | Homo sapiens | United Kingdom | United Kingdom | GBR | 2012-09-11 |
| KC667074_2012-09-19 | KC667074 | 30112 | 2013-04-30 | Homo sapiens | United Kingdom: England | United Kingdom | GBR | 2012-09-19 |
| KC776174_2012-04 | KC776174 | 30030 | 2013-03-25 | Homo sapiens | Jordan | Jordan | JOR | 2012-04 |
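
The `country_original`, `country`, and `countrycode` columns in the sample output suggest a two-step process: strip any subdivision after the colon, then look up the ISO3 code. The sketch below illustrates that idea only. The names `clean_country` and `iso3_code` and the three-row lookup table are hypothetical, and the real program may clean names differently.

```c
#include <stddef.h>
#include <string.h>

struct iso3_entry {
    const char *name;
    const char *code;
};

/* Tiny illustrative table; a real table would cover every country. */
static const struct iso3_entry ISO3_TABLE[] = {
    {"United Kingdom", "GBR"},
    {"Jordan", "JOR"},
    {"Saudi Arabia", "SAU"},
};

/* Copy the part of raw before the first ':' into country,
 * dropping trailing spaces, e.g. "United Kingdom: England" -> "United Kingdom". */
void clean_country(const char *raw, char *country, size_t len)
{
    size_t n = strcspn(raw, ":");

    if (len == 0)
        return;
    if (n >= len)
        n = len - 1; /* truncate rather than overflow the buffer */
    memcpy(country, raw, n);
    while (n > 0 && country[n - 1] == ' ')
        n--;
    country[n] = '\0';
}

/* Return the ISO3 code for a cleaned country name, or "NA" if unmatched. */
const char *iso3_code(const char *country)
{
    size_t i;

    for (i = 0; i < sizeof ISO3_TABLE / sizeof ISO3_TABLE[0]; i++) {
        if (strcmp(country, ISO3_TABLE[i].name) == 0)
            return ISO3_TABLE[i].code;
    }
    return "NA";
}
```

With this sketch, `United Kingdom: England` is cleaned to `United Kingdom` and maps to `GBR`, matching the third data row of the table. An unmatched name falls back to `NA`.
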
This code uses a slightly modified version of the GBParsy parser downloaded from the Google Code Archive, because I found that its parsing of the LOCUS field wasn't working properly.