kundajelab
diff --git a/‎chrombpnet/evaluation/modisco/README.md‎
Lines changed: 45 additions & 0 deletions b/‎chrombpnet/evaluation/modisco/README.md‎
diff --git a/‎chrombpnet/evaluation/modisco/__init__.py‎ b/‎chrombpnet/evaluation/modisco/__init__.py‎
diff --git a/‎chrombpnet/evaluation/modisco/convert_html_to_pdf.py‎
Lines changed: 19 additions & 0 deletions b/‎chrombpnet/evaluation/modisco/convert_html_to_pdf.py‎
diff --git a/‎chrombpnet/evaluation/modisco/fetch_tomtom.py‎
Lines changed: 156 additions & 0 deletions b/‎chrombpnet/evaluation/modisco/fetch_tomtom.py‎
diff --git a/‎chrombpnet/evaluation/modisco/modisco.sh‎
Lines changed: 38 additions & 0 deletions b/‎chrombpnet/evaluation/modisco/modisco.sh‎
@@ -0,0 +1,45 @@
+
+# Scripts to run MODISCO on the DeepSHAP output of ChromBPNet and generate an HTML page of the results with TOMTOM annotations
+
+The scripts in this folder perform three steps: (1) de novo motif discovery on the DeepSHAP output of ChromBPNet using MODISCO (run_modisco.py), (2) motif annotation using TOMTOM (fetch_tomtom.py), and (3) summarizing the output as an HTML page (visualize_motif_matches.py).
+A requirement for this script is therefore that the outputs generated from step 2 are HTML-hostable. If you don't want to host the results online, you can remove the visualize_motif_matches.py step from the modisco.sh script below. An example HTML link looks like this: http://mitra.stanford.edu/kundaje/oak/projects/chromatin-atlas-2022/modisco/DNASE/ENCSR000EMA/ranked_feb15/profile.motifs.html.
+
+## Usage
+
+```
+modisco.sh [scores_prefix] [output_dir] [score_type] [seqlets] [crop] [meme_db] [meme_logos] [vier_logos] [vier_html] [html_link]
+```
+
+This script makes the following assumptions; adjust it accordingly if they don't hold.
+
+- The scripts are run on the output of `chrombpnet_deepshap`.
+
+## Example Usage
+
+```
+modisco.sh /path/to/deepshap_scores/ /path/to/store/output/ counts_or_profile 200000 1000 [meme_db] [meme_logos] [vier_logos] [vier_html] [html_link]
+```
+
+## Input Format
+
+- scores_prefix: The `output_prefix` used with `chrombpnet_deepshap`.
+- output_dir: Path to a directory to store the output files. The script assumes that the directory already exists. See the Output Format section below for the files generated.
+- score_type: Either `counts` or `profile`.
+- seqlets: Number of seqlets to use for the MODISCO run. With the most recent dev version of MODISCO this can be set to 200K; with older versions, set it to 50K. Because this value determines the runtime of the script, you can test that the script works with a much smaller value.
+- crop: An integer crop length to apply to the ChromBPNet input. ChromBPNet produces contribution scores for a 2114-length input, but MODISCO is run on only a 1000-length input, so the default value for this parameter is 1000. This avoids picking up AT-rich nucleosome motifs, which occur more frequently on the flanks of the 2114-length input.
+- meme_db: Path to a text file containing the motif letter-probability matrices in MEME format. Text file to download: http://mitra.stanford.edu/kundaje/surag/resources/motif_archetypes/pfm_meme_format/motifs.meme.txt
+- meme_logos: Path to a directory containing the MEME PFMs. Directory to download: http://mitra.stanford.edu/kundaje/surag/resources/motif_archetypes/pfm/
+- vier_logos: A directory with images for the PFMs provided in `meme_logos`. If an image is not already present the script creates it, so it is okay if this is an empty directory. This can be a database of MEME images shared across projects. Make sure that this directory is hostable if you are interested in viewing the results through an HTML link.
+- vier_html: An HTML link to the `vier_logos` folder. These links are used in the `score_type.motifs.html` page described in the Output Format section below.
+- html_link: An HTML link to `untrimmed_logos`. This is one of the outputs generated by modisco.sh, so you need to either make sure output_dir/trimmed_logos is HTML-hostable or copy `untrimmed_logos` to the directory behind the HTML link you set here.
+
+
+## Output Format
+
+- modisco_results_allChroms_`score_type`.hdf5: Modisco output `hdf5` file.
+- seqlets_`score_type`.txt: 
+- `score_type`.tomtom.tsv: TOMTOM output.
+- `score_type`.motifs.html: An HTML page in which the MODISCO motifs are ranked by their frequency. This page displays the images served from `html_link` and `vier_html`.
+- `untrimmed_logos_profile` or `untrimmed_logos_counts`: This directory is created in the provided `output_dir`, with the name depending on `score_type`. It stores the untrimmed motif images.
+- `trimmed_logos`: This directory is created in the provided `output_dir`. It stores the motif images trimmed using the default threshold of 0.3 used in the scripts. Make sure this directory is hostable.
+
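The `crop` parameter described in the README can be illustrated with a minimal sketch. It assumes the contribution scores are held in an array of shape (regions, length, 4) and that the crop is centered; neither the array layout nor the exact cropping performed inside `chrombpnet_modisco` is shown in this diff, so `center_crop` is a hypothetical helper and not the pipeline's code.

```python
import numpy as np


def center_crop(scores, crop):
    """Keep the central `crop` positions along the sequence axis (axis 1).

    Assumption: `scores` has shape (regions, length, 4).
    """
    length = scores.shape[1]
    if crop > length:
        raise ValueError(f"crop ({crop}) exceeds input length ({length})")
    start = (length - crop) // 2
    return scores[:, start:start + crop, :]


# Hypothetical input: 3 regions of 2114 bp, as in the README's description.
scores = np.zeros((3, 2114, 4), dtype=np.float32)
cropped = center_crop(scores, 1000)
print(cropped.shape)  # (3, 1000, 4)
```

With a 2114-length input and `crop` set to 1000, 557 positions are dropped from each flank, which is where the README says the AT-rich nucleosome motifs tend to occur.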
@@ -0,0 +1,19 @@
+from weasyprint import HTML, CSS
+import argparse
+
+def main(input_html,output_pdf):
+	css = CSS(string='''
+		@page {
+    		size: 1800mm 1300mm;
+    		margin: 0in 0in 0in 0in;
+		}
+	''')
+	HTML(input_html).write_pdf(output_pdf, stylesheets=[css])
+
+if __name__=="__main__":
+	parser = argparse.ArgumentParser(description='Convert html to pdf')
+	parser.add_argument('-html','--input_html', required=True, type=str,  help='input file path to html')
+	parser.add_argument('-pdf','--output_pdf', required=True, type=str,  help='output file path to pdf')
+	args = parser.parse_args()
+
+	main(args.input_html,args.output_pdf)
@@ -0,0 +1,156 @@
+import numpy as np
+import subprocess
+import argparse
+import h5py
+import tempfile
+import os
+
+def fetch_tomtom_args():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-m", "--modisco_h5py", required=True, type=str, help="path to the output .h5py file generated by the run_modisco.py script")
+    parser.add_argument("-o", "--output_prefix", required=True, type=str, help="Path and name of the TSV file to store the tomtom output")
+    parser.add_argument("-d", "--meme_motif_db", required=True, type=str, help="path to motif database")
+    parser.add_argument("-n", "--top_n_matches", type=int, default=3, help="Max number of matches to return from TomTom")
+    parser.add_argument("-tt", "--tomtom_exec", type=str, default='tomtom', help="Command to use to execute tomtom")
+    parser.add_argument("-th", "--trim_threshold", type=float, default=0.3, help="Trim threshold for trimming long motif, trim to those with at least prob trim_threshold on both ends")
+    parser.add_argument("-tm", "--trim_min_length", type=int, default=3, help="Minimum acceptable length of motif after trimming")
+    args = parser.parse_args()
+    return args
+
+
+def write_meme_file(ppm, bg, fname):
+    # write a single motif (ppm) with background frequencies bg to fname in MEME format
+    with open(fname, 'w') as f:
+        f.write('MEME version 4\n\n')
+        f.write('ALPHABET= ACGT\n\n')
+        f.write('strands: + -\n\n')
+        f.write('Background letter frequencies (from unknown source):\n')
+        f.write('A %.3f C %.3f G %.3f T %.3f\n\n' % tuple(bg))
+        f.write('MOTIF 1 TEMP\n\n')
+        f.write('letter-probability matrix: alength= 4 w= %d nsites= 1 E= 0e+0\n' % ppm.shape[0])
+        for s in ppm:
+            f.write('%.5f %.5f %.5f %.5f\n' % tuple(s))
+
+
+def fetch_tomtom_matches(ppm, cwm, background=(0.25, 0.25, 0.25, 0.25), tomtom_exec_path='tomtom', motifs_db='HOCOMOCOv11_core_HUMAN_mono_meme_format.meme', n=5, trim_threshold=0.3, trim_min_length=3):
+
+    """Fetches top matches from a motifs database using TomTom.
+    Args:
+        ppm: position probability matrix- numpy matrix of dimension (N,4)
+        cwm: contribution weight matrix- numpy matrix of dimension (N,4), used to decide where to trim
+        background: ACGT background probabilities
+        tomtom_exec_path: path to TomTom executable
+        motifs_db: path to motifs database in meme format
+        n: number of top matches to return, ordered by p-value
+        trim_threshold: flanking positions whose summed absolute contribution is below
+            trim_threshold * (maximum over positions) are trimmed from both ends of the ppm.
+        trim_min_length: minimum acceptable length of the motif after trimming
+    Returns:
+        list: a list of up to n results returned by tomtom, each entry is a
+            dictionary with keys 'Target_ID', 'p-value', 'E-value', 'q-value'
+    """
+
+    score = np.sum(np.abs(cwm), axis=1)
+    trim_thresh = np.max(score) * trim_threshold  # Cut off flanks below trim_threshold (default 30%) of max score
+    pass_inds = np.where(score >= trim_thresh)[0]
+
+    # no position passes when the scores are not finite (e.g. NaN)
+    if len(pass_inds) == 0:
+        return []
+
+    trimmed = ppm[np.min(pass_inds): np.max(pass_inds) + 1]
+
+    # skip motifs that are too short after trimming
+    if len(trimmed) < trim_min_length:
+        return []
+
+    # write the trimmed motif to a temporary meme file
+    fd, fname = tempfile.mkstemp()
+    os.close(fd)  # mkstemp returns an open file descriptor; close it to avoid a leak
+    write_meme_file(trimmed, background, fname)
+
+    # run tomtom
+    cmd = '%s -no-ssc -oc . -verbosity 1 -text -min-overlap 5 -mi 1 -dist pearson -evalue -thresh 10.0 %s %s' % (tomtom_exec_path, fname, motifs_db)
+    #print(cmd)
+    out = subprocess.check_output(cmd, shell=True)
+
+    # prepare output: decode the bytes returned by tomtom, then split into tab-separated rows
+    dat = [x.split('\t') for x in out.decode('utf-8').split('\n')]
+    schema = dat[0]
+
+    # meme v4 vs v5:
+    if 'Target ID' in schema:
+        tget_idx = schema.index('Target ID')
+    else:
+        tget_idx = schema.index('Target_ID')
+
+    pval_idx, eval_idx, qval_idx =schema.index('p-value'), schema.index('E-value'), schema.index('q-value')
+
+    r = []
+    for t in dat[1:min(1+n, len(dat)-1)]:
+        if t[0]=='':
+            break
+
+        mtf = {}
+        mtf['Target_ID'] = t[tget_idx]
+        mtf['p-value'] = float(t[pval_idx])
+        mtf['E-value'] = float(t[eval_idx])
+        mtf['q-value'] = float(t[qval_idx])
+        r.append(mtf)
+
+    os.remove(fname)
+    return r
+
+
+def main(): 
+    args = fetch_tomtom_args()
+
+    modisco_results = h5py.File(args.modisco_h5py, 'r')
+
+    # get pfms
+    ppms = []
+    cwms = []
+    seqlet_tally = []
+    names = []
+
+    for metacluster_name in modisco_results['metacluster_idx_to_submetacluster_results']:
+        metacluster = modisco_results['metacluster_idx_to_submetacluster_results'][metacluster_name]
+        all_pattern_names = [x.decode("utf-8") for x in list(metacluster["seqlets_to_patterns_result"]["patterns"]["all_pattern_names"][:])]
+
+        for pattern_name in all_pattern_names:
+
+            ppm = np.array(metacluster['seqlets_to_patterns_result']['patterns'][pattern_name]['sequence']['fwd'])
+            num_seqlets = len(metacluster['seqlets_to_patterns_result']['patterns'][pattern_name]['seqlets_and_alnmts']['seqlets'])
+            cwm = np.array(metacluster['seqlets_to_patterns_result']['patterns'][pattern_name]["task0_contrib_scores"]['fwd'])
+
+            ppms.append(ppm)
+            seqlet_tally.append(num_seqlets)
+            cwms.append(cwm)
+            names.append(metacluster_name + '.' + pattern_name)
+
+    modisco_results.close()
+
+    res = []
+
+    for i,x in enumerate(ppms):
+        res.append(fetch_tomtom_matches(x, cwms[i], tomtom_exec_path=args.tomtom_exec, motifs_db=args.meme_motif_db,
+                   n=args.top_n_matches, trim_threshold=args.trim_threshold, trim_min_length=args.trim_min_length))
+
+    # write output. Skipping those patterns which disappear after trimming or have no matches
+    with open(args.output_prefix, 'w') as f:
+        # write header
+        f.write("Pattern")
+        f.write("\tNum_Seqlets")
+
+        for i in range(args.top_n_matches):
+            f.write("\tMatch_{}\tq-value".format(i+1))
+        f.write("\n")
+
+        assert len(res) == len(names)
+
+        for i,r in enumerate(res):
+            f.write(names[i])
+            f.write("\t{}".format(seqlet_tally[i]))
+            for match in r:
+                f.write("\t{}\t{}".format(match['Target_ID'], match['q-value']))
+
+            # when fewer than n matches are found
+            if len(r) != args.top_n_matches:
+                f.write("\t\t"*(args.top_n_matches-len(r)))
+            f.write("\n")
+
+
+if __name__=="__main__":
+    main()
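The two data-handling steps in `fetch_tomtom_matches` (trimming the PPM by CWM contribution, then reading TomTom's tab-separated text output) can be tried without a TomTom install. The sketch below mirrors the logic in the diff on made-up inputs: `trim_by_contribution` and `parse_tomtom_text` are hypothetical helper names, the motif IDs are invented, and the sample header uses only a subset of columns, so it is an illustration rather than real TomTom output.

```python
import numpy as np


def trim_by_contribution(ppm, cwm, trim_threshold=0.3):
    """Trim flanks whose summed |contribution| is below threshold * max."""
    score = np.sum(np.abs(cwm), axis=1)
    keep = np.where(score >= np.max(score) * trim_threshold)[0]
    # Interior positions below the threshold are kept; only the flanks go.
    return ppm[keep.min(): keep.max() + 1]


def parse_tomtom_text(text, n=3):
    """Return up to n matches from tab-separated TomTom-style text."""
    rows = [line.split("\t") for line in text.split("\n")]
    header = rows[0]
    # Older MEME versions use 'Target ID', newer ones 'Target_ID'.
    target = "Target ID" if "Target ID" in header else "Target_ID"
    t_idx, q_idx = header.index(target), header.index("q-value")
    matches = []
    for row in rows[1:]:
        if row[0] == "" or len(matches) == n:
            break
        matches.append({"Target_ID": row[t_idx], "q-value": float(row[q_idx])})
    return matches


# Made-up 6-position motif: per-position scores 0.01, 0.5, 1.0, 0.2, 0.4, 0.05.
cwm = np.zeros((6, 4))
cwm[:, 0] = [0.01, 0.5, -1.0, 0.2, 0.4, 0.05]
ppm = np.full((6, 4), 0.25)
trimmed = trim_by_contribution(ppm, cwm)
print(trimmed.shape)  # (4, 4): positions 1..4 survive, including the 0.2 one

# Made-up TomTom-style text (invented IDs, reduced column set).
sample = (
    "Query_ID\tTarget_ID\tp-value\tE-value\tq-value\n"
    "1\tMOTIF_A\t1e-05\t0.002\t0.004\n"
    "1\tMOTIF_B\t3e-04\t0.05\t0.06\n"
    "\n"
    "# trailing comment line\n"
)
matches = parse_tomtom_text(sample, n=3)
print(matches)
```

Note that the trim keeps everything between the first and last passing positions, so a weak position in the middle of a motif is not removed.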
@@ -0,0 +1,38 @@
+#!/bin/bash
+
+# exit when any command fails
+set -e
+
+# keep track of the last executed command
+trap 'last_command=$current_command; current_command=$BASH_COMMAND' DEBUG
+
+cleanup() {
+    exit_code=$?
+    if [ "${exit_code}" -eq 0 ]
+    then
+	echo "Completed execution"
+    else
+	echo "\"${last_command}\" failed with exit code ${exit_code}."
+    fi
+}
+
+# echo an error message before exiting
+trap 'cleanup' EXIT INT TERM
+
+scores_prefix=${1?param missing - scores_prefix}
+output_dir=${2?param missing - output_dir}
+score_type=${3?param missing - score_type}
+seqlets=${4?param missing - seqlets}
+crop=${5?param missing - crop} 
+meme_db=${6?param missing - meme_db}
+meme_logos=${7?param missing - meme_logos}
+vier_logos=${8?param missing - vier_logos}
+vier_html=${9?param missing - vier_html}
+html_link=${10?param missing - html_link}
+
+chrombpnet_modisco -s "$scores_prefix" -p "$score_type" -o "$output_dir" -m "$seqlets" -c "$crop"
+chrombpnet_tomtom_hits -m "$output_dir/modisco_results_allChroms_${score_type}.hdf5" -o "$output_dir/${score_type}.tomtom.tsv" -d "$meme_db" -n 10 -th 0.3
+chrombpnet_visualize_motif_matches -m "$output_dir/modisco_results_allChroms_${score_type}.hdf5" -t "$output_dir/${score_type}.tomtom.tsv" -o "$output_dir" \
+     -vd "$vier_logos" -th 0.3 -hl "$html_link" -vhl "$vier_html" \
+      -s "$score_type" -d "$meme_logos"
+
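The README names the output files in terms of `score_type`, so the paths passed between the three commands can be derived from `output_dir` and `score_type` alone. A minimal sketch of that derivation follows, assuming the file names listed in the README's Output Format section; `expected_outputs` and its dictionary keys are hypothetical and not part of ChromBPNet.

```python
from pathlib import Path

# The README allows exactly these two values for score_type.
VALID_SCORE_TYPES = ("counts", "profile")


def expected_outputs(output_dir, score_type):
    """Build the output paths named in the README for a given score_type."""
    if score_type not in VALID_SCORE_TYPES:
        raise ValueError(f"score_type must be one of {VALID_SCORE_TYPES}")
    out = Path(output_dir)
    return {
        "modisco_h5": out / f"modisco_results_allChroms_{score_type}.hdf5",
        "tomtom_tsv": out / f"{score_type}.tomtom.tsv",
        "motifs_html": out / f"{score_type}.motifs.html",
    }


# Hypothetical output directory, for illustration only.
paths = expected_outputs("/path/to/store/output", "profile")
for key, path in paths.items():
    print(key, path.name)
```

Deriving every path from `score_type` in one place keeps a `profile` run from reading or overwriting the files of a `counts` run.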