This script allows detection of enriched composite motifs in genomic sequence data.
- Python 2.7.x, with the following packages installed:
  - pip
  - numpy and scipy (accounted for by the instructions below)
  - MOODS v1.0.2.1 (ditto)
- JASPAR-formatted motifs
- bedtools-derived FASTA DNA sequence file(s)
MOODS 1.9.x and Python 3 are not currently supported due to breaking changes in the MOODS programming interface.
If you have multiple Python versions on your system, please ensure that the first `python` and `pip` in your search path are the Python 2.7 versions. In a typical HPC environment, your module system (e.g. Environment Modules) should handle this for you.
- Clone the source from GitHub (MOODS v1.0.2.1 is provided as a submodule):
  git clone --recursive https://github.com/weirauchlab/cosmo/cosmo.git
  cd cosmo
  - As an alternative, download the latest release archive from GitHub, then unpack it into a local directory; see the DETAILED INSTALLATION section for instructions on downloading and building MOODS from source.
- If you have Docker:
  docker build . -t cosmo  # be patient, builds Python 2.7 from source!
  docker run --rm -it cosmo cosmo --help
  docker run --rm -it cosmo cosmostats --help
  docker run --rm -it -v .:/src cosmo make -j4 test
  This method requires the least amount of work on your part, but it's the least tested. Use a bind mount (`-v` switch) if you want access to the sample data from the repository to run `make test` inside the container.
- If you want to use a local Python installation instead, make sure you have a version of `pip` that works with Python 2.7:
  wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
  python get-pip.py
  Use pip to install virtualenv if necessary, then create a Python 2.7 virtual environment and activate it:
  # in the 'cosmo' subdirectory from 'git clone' above
  python -m virtualenv venv
  . venv/bin/activate
  If you have some other Python 2.7 environment (such as Conda or Environment Modules), you probably know what to do on your own. If you have trouble with this step, try the Docker method described above.
- Next, build the MOODS C library and install the Python module dependencies into the virtualenv:
  # in the 'cosmo' subdirectory from 'git clone' above
  make deps
- Finally, to make sure everything works, run the `make test` target in the included `Makefile` (assumes a Unix environment):
  make test -j4  # run parallel tasks on up to 4 CPU cores
See DETAILED INSTALLATION below if you have any problems with the instructions above or with running the scripts.
If you have MOODS and COSMO's dependencies already installed, you can just copy `cosmo.py` and `cosmostats.py` to a directory in your shell's search path and call it good. However, there's an `install` target in the included Makefile that will handle the details for you.
If you are on a Unix/Linux system, run `make install`. The default installation prefix is `/usr/local` (with scripts being installed to `/usr/local/bin`), so you will likely need to become root with `sudo` or similar.
A simpler option is to install to your home directory:
make install PREFIX=$HOME/.local
Most Linux distributions already include `~/.local/bin` in your search path by default. You may need to log out and back in again for this to take effect. How to update your shell's `PATH` variable is beyond the scope of this document.
If this is successful, you can run `cosmo` or `cosmostats` from any directory on your filesystem, without needing to specify relative pathnames like `./cosmo.py` in the examples below.
Windows is not supported by the method our `setup.py` currently uses. However, if you have success building MOODS on Windows and would like to have a go at getting COSMO working, too, a patch or pull request would be welcome.
The `cosmo.py` script does the actual scanning of the FASTA, and `cosmostats.py` compiles summary statistics into a file named `stats.tab` in your current working directory.
`cosmo.py` supports the following command-line options:
Option | Description |
---|---|
`-fa PATH` | path to FASTA sequence file |
`-t` | log-odds score threshold (S/Smax) (default is `0.6`) |
`-P` | (optional) pseudocount for MOODS to use (default is `1`) |
`-p PATH` | path to JASPAR-format PWMs (default is `./jpwm`) |
`-d` | maximum allowed distance between motifs (default is `10`) |
`-s` | boolean flag to dinucleotide shuffle the input sequence |
`-N` | background run number |
`-C` | boolean flag to save coordinates rather than counts |
COSMO writes counts for stereopairs to the local directory in the file `cosmo.counts.tab`. Background scans (with parameters `-s` and `-N <x>`) are placed into sequential files named `cosmo.counts.tab.<x>`. Coordinates (with the `-C` option, explained below) are saved into a BED-formatted file, `cosmo.coords.bed`.
Example: scan a FASTA file in the current working directory, with a specific log-odds threshold score and maximum allowed distance between motifs:
# (the defaults are 0.6 and 10, respectively)
./cosmo.py -fa h3k27ac.fa -t 0.75 -d 20
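Use `-N <number>` to give each background run its number, which becomes the suffix of its output file (`cosmo.counts.tab.<number>`); run the command once per background run.
Use `-s` to dinucleotide-shuffle the input sequences.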
./cosmo.py -fa h3k27ac.fa -s -N 1
./cosmo.py -fa h3k27ac.fa -s -N 2
⋮
./cosmo.py -fa h3k27ac.fa -s -N <n>
For a large number of background runs, this is best accomplished in a 'for' loop in your favorite shell. Assuming Bash or Z shell:
runs=100
for (( i=1; i<=runs; i++ )); do
./cosmo.py -fa h3k27ac.fa -s -N $i
done
The `-C` option produces outputs that are genomic coordinates in BED format, rather than counts:
./cosmo.py -fa h3k27ac.fa -C
# combine existing 'cosmo.counts.tab*' files into summary stats
./cosmostats.py -N 100
Combined with the example above, for 100 background scans:
runs=100
# assuming Bash or Z shell…
for (( i=1; i<=runs; i++ )); do
./cosmo.py -fa h3k27ac.fa -s -N $i
done
./cosmostats.py -N $runs
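If you would rather drive the runs from Python than from a shell loop, the same workflow can be sketched as below. This is a minimal sketch and not part of COSMO: the helper names `background_commands` and `run_all` are hypothetical, and it assumes `cosmo.py` and `cosmostats.py` sit in the current directory and accept the options described above.

```python
import subprocess


def background_commands(fasta, runs, script="./cosmo.py"):
    """Build one argument list per background run (-s shuffles, -N numbers the run)."""
    return [[script, "-fa", fasta, "-s", "-N", str(i)]
            for i in range(1, runs + 1)]


def run_all(fasta, runs):
    """Run the background scans one after another, then summarize them."""
    for argv in background_commands(fasta, runs):
        subprocess.check_call(argv)  # raises CalledProcessError on failure
    subprocess.check_call(["./cosmostats.py", "-N", str(runs)])


if __name__ == "__main__":
    # Only print the commands here; call run_all() to actually execute them.
    for argv in background_commands("h3k27ac.fa", 3):
        print(" ".join(argv))
```

Building the argument lists separately from running them makes it easy to inspect the commands first, or to hand them to a job scheduler instead.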
If you have problems with the QUICK START section (e.g. MOODS fails to build), here's a fully-manual installation, spelled out.
Did you forget to `git clone --recursive`? If you didn't do that, you don't have the MOODS submodule. Do this:
cd cosmo # if not already there
if test -d .git; then
git submodule init && git submodule update
make moods
else
echo "Oops, this isn't a Git repository." >&2
fi
When you are reminded, run `source venv/bin/activate` to switch on the Python virtual environment; this is how COSMO finds MOODS.
At this point, you should be able to run `./cosmo.py` and get a usage message (but no Python tracebacks).
Skip to "Running on example FASTA."
You will need to download the MOODS sources from GitHub first:
# remove existing 'MOODS' dir; it's where the Git submodule would go
rmdir MOODS
# or 'curl -LOJ' if you don't have 'wget'
wget https://github.com/jhkorhonen/MOODS/archive/v1.0.2.1.zip
unzip v1.0.2.1.zip && rm -i v1.0.2.1.zip
# move the unpacked directory to 'MOODS', where the Makefile expects it
mv MOODS-1.0.2.1 MOODS
You should be able to `make moods` at this point, and the Makefile will guide you through the rest of the steps. But here's the completely manual way to reproduce what the Makefile does:
pushd MOODS/src
make
cd ../python
python setup.py build
popd
# install NumPy and SciPy (MOODS dependencies)
pip install -r requirements.txt
# add just-built MOODS library to the PYTHONPATH for this login session
pyplatform=$(python -c 'from distutils.util import get_platform
print(get_platform())')
# distutils names the build directory lib.<platform>-<X.Y>, e.g. lib.linux-x86_64-2.7
moodspath=$PWD/MOODS/python/build/lib.$pyplatform-2.7
export PYTHONPATH=$moodspath${PYTHONPATH:+:$PYTHONPATH}
At this point, you should be able to run `./cosmo.py` and get a usage message (but no Python tracebacks).
First, unpack the example FASTA file if necessary, and run several background scans (in this example, three), specifying the `-C` (save coordinates) option with the last one:
cd examples
test -f example.fa || gunzip example.fa.gz
# vary these parameters to your liking (see USAGE section, above)
defaultargs="-fa example.fa -t 0.6 -d 10 -p jpwm"
../cosmo.py $defaultargs &>1.log &
../cosmo.py $defaultargs &>2.log &
../cosmo.py $defaultargs -C &>3.log &
Wait for all the background jobs to finish, then run a coordinates scan, using the results from the three background scans (`-N 3`):
../cosmo.py $defaultargs -s -N 3
Finally, compute statistics for the three scans (`-N 3`) and redirect this output into a file named `stats.tab`:
../cosmostats.py -N 3 > stats.tab
The output `stats.tab` is tab-delimited, and may be viewed in the terminal, e.g., with `column -t`, or opened in a spreadsheet program such as Excel, Google Sheets, or LibreOffice.
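Because the file is plain tab-delimited text, it can also be loaded with Python's standard `csv` module. The sketch below makes no assumption about which columns `cosmostats.py` writes; the function name `read_stats` is hypothetical, and it simply returns every row as a list of strings.

```python
import csv
import os


def read_stats(path="stats.tab"):
    """Return all rows of a tab-delimited file as lists of strings."""
    with open(path) as handle:
        return list(csv.reader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Print the first few rows if a stats.tab exists in this directory.
    if os.path.exists("stats.tab"):
        for row in read_stats()[:5]:
            print(row)
```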
If you define an environment variable named `COSMO_PWMDIR`, it becomes the default for the `-p` / `--pwmdir` option. Typically, this would be an absolute path starting at `/`, but you can get creative.
This can be useful, for example, when used with Environment Modules, to define a system-wide directory containing the JASPAR matrices for all users.
This variable can also be defined in your login scripts, e.g. your `~/.bash_profile` or `~/.profile`; note that the variable set by a `setenv` statement in a modulefile would still take precedence in that case.
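The usual way to get this behavior in a Python script is to read the environment variable when the option's default is set. The following is an illustrative sketch only, not COSMO's actual source: `build_parser` and the `cosmo-sketch` program name are hypothetical, and the `./jpwm` fallback is taken from the options table above.

```python
import argparse
import os


def build_parser(environ=None):
    """Sketch: -p/--pwmdir defaults to $COSMO_PWMDIR if set, else ./jpwm."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="cosmo-sketch")
    parser.add_argument("-p", "--pwmdir",
                        default=environ.get("COSMO_PWMDIR", "./jpwm"),
                        help="path to JASPAR-format PWMs")
    return parser


if __name__ == "__main__":
    # An explicit -p on the command line always wins over the environment.
    print(build_parser().parse_args([]).pwmdir)
```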
The short answer is:
make module
For members of the Weirauch Lab, this will just do the Right Thing™. To avoid errors here, purge all your modules, deactivate any virtualenvs, and re-load `python/2.7.18-wrl` or a comparable Python 2.7.x module.
For others, this will install the module to `/usr/local/modules/cosmo/x.y.z` (where `x.y.z` is the currently checked-out version of COSMO) and put the modulefile in `/usr/local/modules/modulefiles/cosmo/x.y.z`.
For you to be able to `module load cosmo`, you will need to have run `module use /usr/local/modules/modulefiles` in your current shell session or login scripts, or to have added that path to your sitewide configuration files, e.g. `/etc/environment-modules/modulespath` on Debian/Ubuntu systems.
See the definitions of `MODULEDESTROOT` and `MODULEFILEDEST` in the Makefile for customization options. For example, if you have custom modules in `~/modules` and modulefiles in `~/modules/modulefiles`, you can:
make module MODULEDESTROOT=$HOME/modules
Further help with Environment Modules is beyond the scope of this document. See its homepage at https://modules.sf.net for more information.
You can use the included `Dockerfile` to simplify local development; it builds a minimal Debian Linux container with GNU Make and the latest release of Python 2.7 inside.
To use it, build the image locally, then bind mount the repository to `/src` inside the container before running commands inside it. For example:
docker build . -t cosmo
# make sure it works
docker run --rm -it cosmo cosmo --help
docker run --rm -it cosmo cosmostats --help
# run tests on sample data included with repository, on 4 CPU cores
docker run --rm -it -v .:/src cosmo make -j4 test
If you're changing the code, make sure `make test` passes before committing to the master/main branch, or at least work out why it failed and explain the reason in your commit message.
For breaking changes (e.g. removing a command-line option or changing the input or output formats in a non-backward-compatible way), you must:
- increment the whole number (major) part of the version in `setup.py`
- …and `git tag vX.Y.Z`, where `X.Y.Z` is the new version number.
See semver.org for more information.
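Under Semantic Versioning, incrementing the major number also resets the minor and patch numbers to zero. A minimal sketch of that rule, using a hypothetical helper `bump_major` that is not part of COSMO:

```python
def bump_major(version):
    """Return the next major version: X.Y.Z becomes (X+1).0.0."""
    major, _minor, _patch = (int(part) for part in version.split("."))
    return "{0}.0.0".format(major + 1)


if __name__ == "__main__":
    print(bump_major("1.4.2"))
```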
- FASTA inputs must have headers in `chrN:<start>-<end>` format, where `N` is the chromosome number; the nucleotide sequences must also be on a single line.
- Given FASTA inputs above about 100 MB, COSMO takes a long time to finish.
  - As a workaround, split large FASTAs into multiple files before the `>` sequence header lines and concatenate the results from COSMO.
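Both limitations above can be handled with a short pre-processing script. The sketch below is not part of COSMO, and the names `HEADER_RE`, `header_ok`, and `split_records` are hypothetical. The regular expression is an assumed reading of the `chrN:<start>-<end>` header form; the exact set of chromosome names COSMO accepts is not stated here.

```python
import re

# Assumed pattern for headers such as ">chr1:100-200".
HEADER_RE = re.compile(r"^>chr(\w+):(\d+)-(\d+)$")


def header_ok(line):
    """Return True if a FASTA header line matches the assumed pattern."""
    return HEADER_RE.match(line.strip()) is not None


def split_records(lines, records_per_chunk):
    """Group FASTA lines into chunks, starting new chunks only at '>' headers."""
    chunks, current, count = [], [], 0
    for line in lines:
        if line.startswith(">"):
            if count == records_per_chunk:
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    fasta = [">chr1:100-200", "ACGT", ">chr2:5-9", "GG", ">chrX:1-4", "TTTT"]
    print(all(header_ok(l) for l in fasta if l.startswith(">")))
    print(len(split_records(fasta, 2)))
```

Each chunk can then be written to its own file and scanned separately; because chunks break only at header lines, no sequence is ever separated from its header.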
Name | Email | Role |
---|---|---|
Jeremy Riddell | [email protected] | Primary author |
Kevin Ernst | [email protected] | Contributor |
Matthew Weirauch, PhD | [email protected] | Principal Investigator |
The rights holders are Cincinnati Children's Hospital Medical Center and the contributors.
The software's license is GPLv3, to match that of MOODS. See `LICENSE.txt` for details.