Commit ece97c9

committed
modisco
1 parent eaa0fe5 commit ece97c9

7 files changed

+574
-0
lines changed

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Scripts to run MODISCO on the DeepSHAP output of ChromBPNet and generate an HTML link for the outputs with TOMTOM annotations

The scripts in this folder perform the following three steps: (1) de novo motif discovery on the DeepSHAP output of ChromBPNet using MODISCO (run_modisco.py), (2) annotation of the motifs using TOMTOM (fetch_tomtom.py), and (3) a summary of the output in HTML format (visualize_motif_matches.py).
A requirement for these scripts is therefore that the generated outputs are HTML hostable. If you do not want to host the results online, you can remove visualize_motif_matches.py from the run.sh script below. An example HTML link looks like this: http://mitra.stanford.edu/kundaje/oak/projects/chromatin-atlas-2022/modisco/DNASE/ENCSR000EMA/ranked_feb15/profile.motifs.html

## Usage

```
modisco.sh [scores_prefix] [output_dir] [score_type] [seqlets] [crop] [meme_db] [meme_logos] [vier_logos] [vier_html] [html_link]
```

The script makes the following assumptions. Make changes accordingly if they do not hold.

- The scripts are run on the output of `chrombpnet_deepshap`.

## Example Usage

```
modisco.sh /path/to/deepshap_scores/ /path/to/store/output/ counts_or_profile 200000 1000 [meme_db] [meme_logos] [vier_logos] [vier_html] [html_link]
```

## Input Format

- scores_prefix: The `output_prefix` used with `chrombpnet_deepshap`.
- output_dir: Path to a directory to store the output files. The script assumes that the directory already exists. See the Output Format section below for the files generated.
- score_type: Either `counts` or `profile`.
- seqlets: Number of seqlets to use for the MODISCO run. If using the most recent dev version of MODISCO, this can be set to 200K. If using an older version, set it to 50K. Because this value determines the runtime, you can test the script with a much smaller value.
- crop: An integer giving the crop length to apply to the ChromBPNet input. ChromBPNet produces contribution scores for a 2114-length input, but MODISCO is run on only a 1000-length input, so the default value for this parameter is 1000. This avoids picking up "AT"-rich nucleosome motifs, which occur more frequently on the flanks of the 2114-length input.
- meme_db: Path to a text file containing the letter-probability matrices of the motifs in MEME format. File to download: http://mitra.stanford.edu/kundaje/surag/resources/motif_archetypes/pfm_meme_format/motifs.meme.txt
- meme_logos: Path to a directory containing the MEME PFMs. Directory to download: http://mitra.stanford.edu/kundaje/surag/resources/motif_archetypes/pfm/
- vier_logos: A directory with images for the PFMs provided in `meme_logos`. If an image is not already present, the script creates it, so it is okay if this is an empty directory. This can be a database of MEME images shared across projects. Make sure that this directory is hostable if you are interested in viewing the results through an HTML link.
- vier_html: An HTML link to the `vier_logos` folder. These links are used in the `score_type.motifs.html` page described in the Output Format section below.
- html_link: An HTML link to `untrimmed_logos`, which is one of the outputs generated by run.sh. You need to either make sure output_dir/trimmed_logos is HTML hostable or copy `untrimmed_logos` to the directory behind the HTML link you set here.

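The `crop` parameter can be pictured as a slice of the contribution-score array. The sketch below is an illustration only: `center_crop` is a hypothetical helper, and both the symmetric crop around the center and the `(regions, positions, bases)` array layout are assumptions, not something confirmed from run_modisco.py.

```python
import numpy as np


def center_crop(scores, crop=1000, axis=1):
    """Keep the central `crop` positions of `scores` along `axis`.

    Hypothetical helper; the symmetric crop is an assumption about
    how the flanks of the 2114-length input are discarded.
    """
    length = scores.shape[axis]
    if crop > length:
        raise ValueError(f"crop ({crop}) exceeds input length ({length})")
    start = (length - crop) // 2
    index = [slice(None)] * scores.ndim
    index[axis] = slice(start, start + crop)
    return scores[tuple(index)]


# Toy stand-in for contribution scores: 2 regions x 2114 positions x 4 bases.
toy_scores = np.arange(2 * 2114 * 4, dtype=float).reshape(2, 2114, 4)
cropped = center_crop(toy_scores, crop=1000)
print(cropped.shape)
```

With a 2114-length input and `crop=1000`, this drops 557 positions from each flank.
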
## Output Format

- modisco_results_allChroms_`score_type`.hdf5: MODISCO output `hdf5` file.
- seqlets_`score_type`.txt:
- `score_type`.tomtom.tsv: TOMTOM output.
- `score_type`.motifs.html: An HTML page in which the MODISCO motifs are ranked by their frequency. This page displays the images hosted at `html_link` and `vier_html`.
- `untrimmed_logos_profile` or `untrimmed_logos_counts`: This directory is created in the `output_dir` path provided, with the name depending on `score_type`. It stores the motif images that are not trimmed.
- `trimmed_logos`: This directory is created in the `output_dir` path provided. It stores the motif images that are trimmed based on the default threshold of 0.3 used in the scripts. Make sure this directory is hostable.

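As a rough guide to the layout of `score_type`.tomtom.tsv, the sketch below reads the columns that fetch_tomtom.py writes: `Pattern`, `Num_Seqlets`, then repeated `Match_i` / `q-value` pairs, padded with empty fields when fewer matches are found. `read_tomtom_tsv` is a hypothetical helper, and the motif names and values in the example are invented for illustration.

```python
import csv
import io


def read_tomtom_tsv(handle):
    """Parse the tomtom summary TSV into a list of dictionaries.

    The q-value column name repeats, so rows are read positionally
    instead of with csv.DictReader.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    rows = []
    for fields in reader:
        if not fields:
            continue
        matches = []
        # Match name / q-value pairs start at column 2.
        for i in range(2, len(fields) - 1, 2):
            if fields[i]:  # empty fields are padding for missing matches
                matches.append((fields[i], float(fields[i + 1])))
        rows.append({
            "pattern": fields[0],
            "num_seqlets": int(fields[1]),
            "matches": matches,
        })
    return rows


# Invented example content with the same column layout.
example = (
    "Pattern\tNum_Seqlets\tMatch_1\tq-value\tMatch_2\tq-value\n"
    "metacluster_0.pattern_0\t1200\tCTCF\t0.001\tZNF143\t0.2\n"
    "metacluster_0.pattern_1\t300\tSP1\t0.05\t\t\n"
)
rows = read_tomtom_tsv(io.StringIO(example))
for row in rows:
    print(row["pattern"], row["num_seqlets"], row["matches"])
```
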

chrombpnet/evaluation/modisco/__init__.py

Whitespace-only changes.
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
from weasyprint import HTML, CSS
import argparse

def main(input_html, output_pdf):
    css = CSS(string='''
        @page {
            size: 1800mm 1300mm;
            margin: 0in 0in 0in 0in;
        }
    ''')
    HTML(input_html).write_pdf(output_pdf, stylesheets=[css])

if __name__=="__main__":
    parser = argparse.ArgumentParser(description='Convert html to pdf')
    parser.add_argument('-html','--input_html', required=True, type=str, help='input file path to html')
    parser.add_argument('-pdf','--output_pdf', required=True, type=str, help='output file path to pdf')
    args = parser.parse_args()

    main(args.input_html, args.output_pdf)
Lines changed: 156 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,156 @@
import numpy as np
import subprocess
import argparse
import h5py
import tempfile
import os

def fetch_tomtom_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--modisco_h5py", required=True, type=str, help="path to the output .h5py file generated by the run_modisco.py script")
    parser.add_argument("-o", "--output_prefix", required=True, type=str, help="Path and name of the TSV file to store the tomtom output")
    parser.add_argument("-d", "--meme_motif_db", required=True, type=str, help="path to motif database")
    parser.add_argument("-n", "--top_n_matches", type=int, default=3, help="Max number of matches to return from TomTom")
    parser.add_argument("-tt", "--tomtom_exec", type=str, default='tomtom', help="Command to use to execute tomtom")
    parser.add_argument("-th", "--trim_threshold", type=float, default=0.3, help="Trim threshold for trimming long motif, trim to those with at least prob trim_threshold on both ends")
    parser.add_argument("-tm", "--trim_min_length", type=int, default=3, help="Minimum acceptable length of motif after trimming")
    args = parser.parse_args()
    return args


def write_meme_file(ppm, bg, fname):
    # Write a single-motif file in MEME format to use as the tomtom query.
    with open(fname, 'w') as f:
        f.write('MEME version 4\n\n')
        f.write('ALPHABET= ACGT\n\n')
        f.write('strands: + -\n\n')
        f.write('Background letter frequencies (from unknown source):\n')
        f.write('A %.3f C %.3f G %.3f T %.3f\n\n' % tuple(list(bg)))
        f.write('MOTIF 1 TEMP\n\n')
        f.write('letter-probability matrix: alength= 4 w= %d nsites= 1 E= 0e+0\n' % ppm.shape[0])
        for s in ppm:
            f.write('%.5f %.5f %.5f %.5f\n' % tuple(s))


def fetch_tomtom_matches(ppm, cwm, background=(0.25, 0.25, 0.25, 0.25), tomtom_exec_path='tomtom', motifs_db='HOCOMOCOv11_core_HUMAN_mono_meme_format.meme', n=5, trim_threshold=0.3, trim_min_length=3):

    """Fetches top matches from a motifs database using TomTom.
    Args:
        ppm: position probability matrix - numpy matrix of dimension (N,4)
        cwm: contribution weight matrix - numpy matrix of dimension (N,4),
            used to decide where the ppm is trimmed
        background: list with ACGT background probabilities
        tomtom_exec_path: path to TomTom executable
        motifs_db: path to motifs database in meme format
        n: number of top matches to return, ordered by p-value
        trim_threshold: the ppm is trimmed from the left up to the first position
            whose summed absolute contribution is >= trim_threshold times the
            maximum over positions. Similarly from the right.
        trim_min_length: minimum acceptable length of the motif after trimming
    Returns:
        list: a list of up to n results returned by tomtom, each entry is a
            dictionary with keys 'Target_ID', 'p-value', 'E-value', 'q-value'
    """

    score = np.sum(np.abs(cwm), axis=1)
    trim_thresh = np.max(score) * trim_threshold  # Cut off anything below trim_threshold (default 30%) of the max score
    pass_inds = np.where(score >= trim_thresh)[0]

    # no position passes when the scores are NaN
    if len(pass_inds) == 0:
        return []

    trimmed = ppm[np.min(pass_inds): np.max(pass_inds) + 1]

    # skip motifs that are too short after trimming
    # (the original check, `trimmed is None`, could never be true)
    if len(trimmed) < trim_min_length:
        return []

    # prepare meme file; close the descriptor returned by mkstemp so it does not leak
    fd, fname = tempfile.mkstemp()
    os.close(fd)
    write_meme_file(trimmed, background, fname)

    # run tomtom and remove the temp file even if tomtom fails
    cmd = '%s -no-ssc -oc . -verbosity 1 -text -min-overlap 5 -mi 1 -dist pearson -evalue -thresh 10.0 %s %s' % (tomtom_exec_path, fname, motifs_db)
    #print(cmd)
    try:
        out = subprocess.check_output(cmd, shell=True)
    finally:
        os.remove(fname)

    # prepare output
    dat = [x.split('\t') for x in out.decode('utf-8').split('\n')]
    schema = dat[0]

    # meme v4 vs v5:
    if 'Target ID' in schema:
        tget_idx = schema.index('Target ID')
    else:
        tget_idx = schema.index('Target_ID')

    pval_idx, eval_idx, qval_idx = schema.index('p-value'), schema.index('E-value'), schema.index('q-value')

    r = []
    for t in dat[1:min(1+n, len(dat)-1)]:
        if t[0]=='':
            break

        mtf = {}
        mtf['Target_ID'] = t[tget_idx]
        mtf['p-value'] = float(t[pval_idx])
        mtf['E-value'] = float(t[eval_idx])
        mtf['q-value'] = float(t[qval_idx])
        r.append(mtf)

    return r


def main():
    args = fetch_tomtom_args()

    modisco_results = h5py.File(args.modisco_h5py, 'r')

    # get pfms
    ppms = []
    cwms = []
    seqlet_tally = []
    names = []

    for metacluster_name in modisco_results['metacluster_idx_to_submetacluster_results']:
        metacluster = modisco_results['metacluster_idx_to_submetacluster_results'][metacluster_name]
        all_pattern_names = [x.decode("utf-8") for x in list(metacluster["seqlets_to_patterns_result"]["patterns"]["all_pattern_names"][:])]

        for pattern_name in all_pattern_names:

            ppm = np.array(metacluster['seqlets_to_patterns_result']['patterns'][pattern_name]['sequence']['fwd'])
            num_seqlets = len(metacluster['seqlets_to_patterns_result']['patterns'][pattern_name]['seqlets_and_alnmts']['seqlets'])
            cwm = np.array(metacluster['seqlets_to_patterns_result']['patterns'][pattern_name]["task0_contrib_scores"]['fwd'])

            ppms.append(ppm)
            seqlet_tally.append(num_seqlets)
            cwms.append(cwm)
            names.append(metacluster_name + '.' + pattern_name)

    modisco_results.close()

    res = []

    for i,x in enumerate(ppms):
        res.append(fetch_tomtom_matches(x, cwms[i], tomtom_exec_path=args.tomtom_exec, motifs_db=args.meme_motif_db,
            n=args.top_n_matches, trim_threshold=args.trim_threshold, trim_min_length=args.trim_min_length))

    # write output. Patterns that disappear after trimming or have no matches get empty match columns
    with open(args.output_prefix, 'w') as f:
        # write header
        f.write("Pattern")
        f.write("\tNum_Seqlets")

        for i in range(args.top_n_matches):
            f.write("\tMatch_{}\tq-value".format(i+1))
        f.write("\n")

        assert len(res) == len(names)

        for i,r in enumerate(res):
            f.write(names[i])
            f.write("\t{}".format(seqlet_tally[i]))
            for match in r:
                f.write("\t{}\t{}".format(match['Target_ID'], match['q-value']))

            # when fewer than n matches are found
            if len(r) != args.top_n_matches:
                f.write("\t\t"*(args.top_n_matches-len(r)))
            f.write("\n")

if __name__=="__main__":
    main()
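The trimming rule in `fetch_tomtom_matches` is easier to follow on a toy motif. The sketch below isolates that step under stated assumptions: `trim_by_contribution` is a hypothetical helper, the toy matrices are invented, and returning `None` for motifs shorter than `trim_min_length` is one possible way to honor that argument, not necessarily the authors' intent.

```python
import numpy as np


def trim_by_contribution(ppm, cwm, trim_threshold=0.3, trim_min_length=3):
    """Trim both ends of `ppm` using per-position contribution in `cwm`.

    A position passes when its summed absolute contribution is at least
    trim_threshold times the maximum over all positions.
    """
    score = np.sum(np.abs(cwm), axis=1)
    pass_inds = np.where(score >= np.max(score) * trim_threshold)[0]
    if len(pass_inds) == 0:
        return None
    trimmed = ppm[pass_inds.min(): pass_inds.max() + 1]
    if len(trimmed) < trim_min_length:
        return None
    return trimmed


# Invented 6-position motif; per-position scores are
# 0.01, 0.5, 1.0, 0.4, 0.6, 0.02, so the threshold is 0.3.
cwm = np.array([
    [0.01, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.2, 0.0, 0.0, -0.2],
    [0.0, 0.0, 0.0, 0.6],
    [0.0, 0.02, 0.0, 0.0],
])
ppm = np.full((6, 4), 0.25)

trimmed = trim_by_contribution(ppm, cwm)
print(trimmed.shape)
```

Only the ends are trimmed: every position between the first and last passing index is kept, even if its own score falls below the threshold.
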
Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
#!/bin/bash

# exit when any command fails
set -e

# keep track of the last executed command
trap 'last_command=$current_command; current_command=$BASH_COMMAND' DEBUG

cleanup() {
    exit_code=$?
    if [ "${exit_code}" -eq 0 ]
    then
        echo "Completed execution"
    else
        echo "\"${last_command}\" failed with exit code ${exit_code}."
    fi
}

# report the outcome before exiting
trap 'cleanup' EXIT INT TERM

scores_prefix=${1?param missing - scores_prefix}
output_dir=${2?param missing - output_dir}
score_type=${3?param missing - score_type}
seqlets=${4?param missing - seqlets}
crop=${5?param missing - crop}
meme_db=${6?param missing - meme_db}
meme_logos=${7?param missing - meme_logos}
vier_logos=${8?param missing - vier_logos}
vier_html=${9?param missing - vier_html}
html_link=${10?param missing - html_link}

# per the README, the modisco output is named after score_type
# (the file name was previously hard-coded to "counts")
modisco_h5="$output_dir/modisco_results_allChroms_$score_type.hdf5"

chrombpnet_modisco -s "$scores_prefix" -p "$score_type" -o "$output_dir" -m "$seqlets" -c "$crop"
chrombpnet_tomtom_hits -m "$modisco_h5" -o "$output_dir/$score_type.tomtom.tsv" -d "$meme_db" -n 10 -th 0.3
chrombpnet_visualize_motif_matches -m "$modisco_h5" -t "$output_dir/$score_type.tomtom.tsv" -o "$output_dir" \
    -vd "$vier_logos" -th 0.3 -hl "$html_link" -vhl "$vier_html" \
    -s "$score_type" -d "$meme_logos"
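To check how the ten positional arguments map onto the three commands without running anything, a dry-run sketch like the one below can help. `build_commands` is a hypothetical helper and the paths and links in the example call are placeholders. The flags are copied from the script above, and the HDF5 file name follows the `modisco_results_allChroms_<score_type>.hdf5` pattern given in the README, which is assumed to hold for both `counts` and `profile`.

```python
import os
import shlex


def build_commands(scores_prefix, output_dir, score_type, seqlets, crop,
                   meme_db, meme_logos, vier_logos, vier_html, html_link):
    """Return the three pipeline commands as argument lists (nothing is run)."""
    # Assumed naming pattern, taken from the README's Output Format section.
    modisco_h5 = os.path.join(
        output_dir, f"modisco_results_allChroms_{score_type}.hdf5")
    tomtom_tsv = os.path.join(output_dir, f"{score_type}.tomtom.tsv")
    return [
        ["chrombpnet_modisco", "-s", scores_prefix, "-p", score_type,
         "-o", output_dir, "-m", str(seqlets), "-c", str(crop)],
        ["chrombpnet_tomtom_hits", "-m", modisco_h5, "-o", tomtom_tsv,
         "-d", meme_db, "-n", "10", "-th", "0.3"],
        ["chrombpnet_visualize_motif_matches", "-m", modisco_h5,
         "-t", tomtom_tsv, "-o", output_dir, "-vd", vier_logos,
         "-th", "0.3", "-hl", html_link, "-vhl", vier_html,
         "-s", score_type, "-d", meme_logos],
    ]


# Placeholder paths and links, for illustration only.
commands = build_commands(
    "scores/prefix", "out", "profile", 200000, 1000,
    "motifs.meme.txt", "pfm_dir", "vier_logos_dir",
    "http://example.org/vier_logos", "http://example.org/untrimmed_logos")
for command in commands:
    print(shlex.join(command))
```
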
