gbmunge

Munge GenBank files into FASTA sequences and tab-separated metadata.

This little C program will extract the following information from a GenBank file:

name
accession
length
submission date
host
country (supports both /country and /geo_loc_name qualifiers)
collection date

In addition to extracting this information, gbmunge reformats dates (e.g. 31-DEC-2001 becomes 2001-12-31), which makes them more digestible to downstream software like BEAST. It also cleans country names and matches them to three-letter ISO3 codes.
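The date conversion can be illustrated with a small C sketch. This is not the code gbmunge itself uses; the function names `month_number` and `reformat_date` are hypothetical, and the sketch assumes dates arrive as DD-MON-YYYY, MON-YYYY, or YYYY. It does not check that the day is valid for the month.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Map a three-letter month abbreviation (any case) to 1..12, or 0 if unknown. */
static int month_number(const char *mon)
{
    static const char *names[] = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                                  "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"};
    char up[4];
    int i;

    if (strlen(mon) != 3)
        return 0;
    for (i = 0; i < 3; i++)
        up[i] = (char)toupper((unsigned char)mon[i]);
    up[3] = '\0';
    for (i = 0; i < 12; i++)
        if (strcmp(up, names[i]) == 0)
            return i + 1;
    return 0;
}

/* Hypothetical helper: rewrite a GenBank-style date as YYYY-MM-DD,
   YYYY-MM, or YYYY, keeping whatever precision the input has.
   Returns 0 on success and -1 if the input is not recognised. */
static int reformat_date(const char *in, char *out, size_t outlen)
{
    int day, year, m;
    char mon[4];
    char extra; /* only used to detect trailing characters */

    /* Full precision: DD-MON-YYYY */
    if (sscanf(in, "%2d-%3[A-Za-z]-%4d%c", &day, mon, &year, &extra) == 3) {
        m = month_number(mon);
        if (m == 0)
            return -1;
        snprintf(out, outlen, "%04d-%02d-%02d", year, m, day);
        return 0;
    }
    /* Month precision: MON-YYYY */
    if (sscanf(in, "%3[A-Za-z]-%4d%c", mon, &year, &extra) == 2) {
        m = month_number(mon);
        if (m == 0)
            return -1;
        snprintf(out, outlen, "%04d-%02d", year, m);
        return 0;
    }
    /* Year precision: YYYY */
    if (sscanf(in, "%4d%c", &year, &extra) == 1) {
        snprintf(out, outlen, "%04d", year);
        return 0;
    }
    return -1;
}
```

With a 16-byte buffer, `reformat_date("31-DEC-2001", buf, sizeof buf)` writes 2001-12-31 into `buf`, and a month-precision input such as Apr-2012 becomes 2012-04.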

Usage

gbmunge [-h] -i <Genbank_file> -f <sequence_output> -o <metadata_output> [-t] [-s]

Genbank_file: filename of GenBank-formatted sequence file (normally downloaded as sequence.gb)
sequence_output: filename of FASTA output
metadata_output: filename of tab-separated metadata
-t: flag to
- output only sequences with collection dates (of any precision)
- name sequences as {accession}_{collection_date}
-s: flag to include sequences in the tab-separated metadata file
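The effect of -t can be sketched in C as follows. This is an illustration rather than gbmunge's own code: the function name `dated_name` is hypothetical, and the sketch assumes that a missing collection date is represented as NULL, an empty string, or the string NA.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the -t behaviour described above.
   Builds "{accession}_{collection_date}" in out.
   Returns 1 if the record has a collection date and the name fits,
   0 if the record should be skipped. */
static int dated_name(const char *accession, const char *collection_date,
                      char *out, size_t outlen)
{
    int written;

    /* Assumption: undated records are marked by NULL, "" or "NA". */
    if (collection_date == NULL || collection_date[0] == '\0' ||
        strcmp(collection_date, "NA") == 0)
        return 0;

    written = snprintf(out, outlen, "%s_%s", accession, collection_date);
    /* Treat a truncated name as a failure rather than emit a partial one. */
    return written >= 0 && (size_t)written < outlen;
}
```

For example, accession KC776174 with collection date 2012-04 gives the name KC776174_2012-04, while a record without a collection date is skipped.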

Building

Linux and macOS

git clone https://github.com/sdwfrost/gbmunge
cd gbmunge
make

This will build gbmunge in the src/ directory. Add that directory to your PATH, or move the executable to a directory that is already on your PATH.

Windows

There are several options for building on Windows:

Using WSL (Windows Subsystem for Linux) (Recommended):

# Install WSL with Ubuntu, then in the WSL terminal:
sudo apt update
sudo apt install build-essential
cd gbmunge
make

Using MSYS2/MinGW:

# Install MSYS2, then in MSYS2 terminal:
pacman -S mingw-w64-x86_64-gcc make
cd gbmunge
make

Using Visual Studio with vcpkg: Building natively with MSVC requires a POSIX-compatible regex library:

# Install PCRE2 via vcpkg
vcpkg install pcre2:x64-windows

# Compile with PCRE2 support (modify Makefile or compile manually)
cl /DGBMUNGE_USE_PCRE2 /I<vcpkg_include_path> gbfp.c gbmunge.c /link pcre2-8.lib

Using TRE regex library: Download TRE from https://github.com/laurikari/tre and place the POSIX-compatible regex.h in the src/ directory.

Testing

A GenBank file of MERS Coronavirus sequences is provided in the test/ directory.

cd test
../src/gbmunge -i sequence.gb -f sequence.fas -o sequence.txt -t

Here are the first few lines of output in sequence.txt:

name	accession	length	submission_date	host	country_original	country	countrycode	collection_date
JX869059_2012-06-13	JX869059	30119	2012-12-04	Homo sapiens	NA	NA	NA	2012-06-13
KC164505_2012-09-11	KC164505	30111	2013-07-12	Homo sapiens	United Kingdom	United Kingdom	GBR	2012-09-11
KC667074_2012-09-19	KC667074	30112	2013-04-30	Homo sapiens	United Kingdom: England	United Kingdom	GBR	2012-09-19
KC776174_2012-04	KC776174	30030	2013-03-25	Homo sapiens	Jordan	Jordan	JOR	2012-04
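The relationship between the country_original, country and countrycode columns can be sketched in C. This is an assumption about the general approach, not gbmunge's actual implementation: the names `clean_country`, `iso3_lookup` and `iso3_table` are hypothetical, the table holds only the two countries that appear in the output above, and the sketch assumes any region detail follows a colon.

```c
#include <stdio.h>
#include <string.h>

struct iso3_entry {
    const char *name;
    const char *code;
};

/* Illustrative subset only; a real table would cover every country. */
static const struct iso3_entry iso3_table[] = {
    {"United Kingdom", "GBR"},
    {"Jordan", "JOR"},
};

/* Hypothetical helper: copy the part of original before the first ':'
   into country, dropping trailing spaces. An empty result becomes "NA". */
static void clean_country(const char *original, char *country, size_t len)
{
    size_t n = strcspn(original, ":");

    if (len == 0)
        return;
    while (n > 0 && original[n - 1] == ' ')
        n--;
    if (n == 0) {
        snprintf(country, len, "NA");
        return;
    }
    if (n >= len)
        n = len - 1; /* truncate rather than overflow the buffer */
    memcpy(country, original, n);
    country[n] = '\0';
}

/* Hypothetical helper: return the three-letter code for a cleaned
   country name, or "NA" when the name is not in the table. */
static const char *iso3_lookup(const char *country)
{
    size_t i;

    for (i = 0; i < sizeof iso3_table / sizeof iso3_table[0]; i++)
        if (strcmp(country, iso3_table[i].name) == 0)
            return iso3_table[i].code;
    return "NA";
}
```

Under these assumptions, United Kingdom: England is cleaned to United Kingdom and matched to GBR, and a record with no country information keeps NA in all three columns.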

Credits

This code uses a slightly modified version of the GBParsy parser downloaded from the Google Code Archive. I modified it because I found that the parsing of the LOCUS field wasn't working properly.
