minor refactoring

petermchale · petermchale · commit 73ce304c427e · 2021-08-20T17:33:22.000-06:00
diff --git a/.config.json b/.config.json
@@ -1,6 +1,7 @@
 {
   "train": {
-    "default_neutral_regions": "dist/neutral-regions-test.bed.gz",
+    "default_neutral_regions": "dist/neutral-regions.bed.gz",
+    "default_model": "dist/model.json",
     "training_mode": "concurrent"
   }
 }
diff --git a/README.md b/README.md
@@ -9,20 +9,21 @@ bash install.sh
 bash build-vue-app.sh
 ```
 Only installation on Linux x86_64 is currently supported. 
-Tested in the Protected Environment of the Center for High Performance Computing (CHPC) at University of Utah. 
+Tested in the Protected Environment computer cluster of the Center for High Performance Computing (CHPC) at University of Utah. 
 
 ## Quick Start 
 
 Assuming one has access to the protected environment on the CHPC at University of Utah: 
 
 ```
-[sbatch | bash] tests/train.sh $PWD
+bash tests/train.sh $PWD
 ```
 
 Once training is complete, do: 
 ```
 bash tests/visualize.sh $PWD
 ```
+
 Follow the instructions at the command line to view a web app that visualizes observed mutation counts, and those expected under a null model of sequence-dependent mutation (see `model-definition` folder), as a function of genomic coordinate.  
 
 A plot of estimated mutation probabilities of the neutral model can be found here: https://github.com/quinlan-lab/constraint-tools/blob/main/tests/plot_mutation_probabilities.ipynb
@@ -48,38 +49,44 @@ Required arguments for `train` are:
 
 ```
 --genome STR
-      Path to the reference fasta. 
+      Path to a reference fasta. 
       A "samtools faidx" index is expected to be present at the same path. 
 --mutations STR 
       Path to a set of mutations specified in Mutation Annotation Format.
       A "tabix" index is expected to be present at the same path.
 --kmer-size INT
-      Size of kmer to use in model. 
---output STR 
-      Path to a directory to store results in. 
+      Size of kmer of model to be trained. 
+--model STR 
+      Path to a directory to store trained model in. 
 ```
 
 By default the `train` subcommand uses a pre-computed set of putatively neutral regions from the GRCH37 reference. Optionally, the user may change this by specifying the `--regions` argument: 
 
 ```
---regions STR 
-      Bed-format file containing a list of genomic intervals on which the model is trained.
+--regions STR
+      Bed-format file containing a list of genomic intervals on which the model is to be trained.
 ```
 
 This produces a specification of the sequence-dependent neutral mutation model in json format, viewable using, e.g., 
 ```
-${CONSTRAINT_TOOLS}/bin/jq . ${output}/<json file> 
+${CONSTRAINT_TOOLS}/bin/jq . ${model}/<json file> 
 ```
 
 Required arguments for `visualize` are:
 
 ```
---model STR
-      Path to the neutral model produced by the train sub-command (in json format). This model is used to compute the expected mutation counts in the visualization. 
 --port INT 
       The port to serve the web-app on
 ```
-      
+
+By default the `visualize` subcommand uses a pre-computed model. 
+Optionally, the user may change this by specifying the `--model` argument: 
+
+```
+--model STR
+      Path to a neutral model produced by the train sub-command (in json format). This model is used to compute the expected mutation counts in the visualization. 
+```
+
 ## Input Data
 
 Assuming one has access to the protected environment on the CHPC at University of Utah, 
@@ -89,6 +96,12 @@ then sorted, block-compressed, and indexed vcf, maf, gtf and fasta files can be
 /scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data
 ```
 
+## Production model 
+
+In the `/dist` directory, we distribute a model 
+that was trained on a genome-wide set of putatively neutral regions
+(also located in the `/dist` directory).
+
 ## Development 
 
 Changes to the `vue-app` directory necessitate rebuilding the vue app by running 
diff --git a/flask-app/flask-app b/flask-app/flask-app
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash
 
+model="${CONSTRAINT_TOOLS}/$(read-config train default_model)"
+
 # https://devhints.io/bash#miscellaneous
 # put option-fetching before "set -o nounset" so that we can detect flags without arguments
 while [[ "$1" =~ ^- ]]; do 
@@ -11,6 +13,9 @@ while [[ "$1" =~ ^- ]]; do
   shift
 done
 
+info "using the model specified at:"
+info "${model}\n"
+
 set -o errexit
 set -o pipefail
 set -o noclobber
diff --git a/generate-production-model.sh b/generate-production-model.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+#SBATCH --time=3:00:00
+#SBATCH --nodes=1
+# a slurm task is a Linux process:
+# https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html#going-parallel
+#SBATCH --ntasks=16
+# slurm does not allocate resources for more than 16 CPUs per job,
+# so request one CPU per Linux process:
+#SBATCH --cpus-per-task=1 
+#SBATCH --account=quinlan-rw
+#SBATCH --partition=quinlan-shared-rw
+
+set -o errexit
+set -o pipefail
+set -o nounset
+# set -o noclobber
+# set -o xtrace
+
+CONSTRAINT_TOOLS=$1 
+
+PATH=${CONSTRAINT_TOOLS}:$PATH 
+PATH=${CONSTRAINT_TOOLS}/utilities:$PATH 
+PATH=${CONSTRAINT_TOOLS}/bin:$PATH 
+
+mutations="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/icgc/mutations.sorted.maf.gz"
+genome="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/reference/grch37/genome.fa.gz"
+kmer_size="5"
+model="${CONSTRAINT_TOOLS}/dist" # path to directory to store model in
+
+fetch_subset_of_regions () { 
+  less ${CONSTRAINT_TOOLS}/dist/neutral-regions.bed.gz | head -100 
+}
+
+train_on_subset_of_regions () {
+  info "$(fetch_subset_of_regions | awk '{ print $0, $3-$2 }')"
+
+  constraint-tools train \
+    --genome ${genome} \
+    --mutations ${mutations} \
+    --kmer-size ${kmer_size} \
+    --regions <(fetch_subset_of_regions | bgzip) \
+    --model ${model}
+}
+
+train_on_all_regions () {
+  constraint-tools train \
+    --genome ${genome} \
+    --mutations ${mutations} \
+    --kmer-size ${kmer_size} \
+    --model ${model}
+}
+
+# train_on_subset_of_regions
+train_on_all_regions
diff --git a/tests/neutral-regions.bed.gz b/tests/neutral-regions.bed.gz
diff --git a/tests/train.sh b/tests/train.sh
@@ -1,9 +1,4 @@
 #!/bin/bash
-#SBATCH --time=3:00:00
-#SBATCH --nodes=2
-#SBATCH --ntasks=16
-#SBATCH --account=quinlan-rw
-#SBATCH --partition=quinlan-shared-rw
 
 set -o errexit
 set -o pipefail
@@ -15,12 +10,14 @@ CONSTRAINT_TOOLS=$1
 
 mutations="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/icgc/mutations.sorted.maf.gz"
 genome="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/reference/grch37/genome.fa.gz"
-kmer_size="5"
-output="${CONSTRAINT_TOOLS}/tests" 
+kmer_size="3"
+regions="${CONSTRAINT_TOOLS}/tests/neutral-regions.bed.gz"
+model="${CONSTRAINT_TOOLS}/tests" # path to directory to store model in
 
 ${CONSTRAINT_TOOLS}/constraint-tools train \
   --genome ${genome} \
   --mutations ${mutations} \
   --kmer-size ${kmer_size} \
-  --output ${output}
+  --regions ${regions} \
+  --model ${model}
 
diff --git a/tests/visualize.sh b/tests/visualize.sh
@@ -6,11 +6,11 @@ set -o nounset
 
 CONSTRAINT_TOOLS=$1
 
-output="${CONSTRAINT_TOOLS}/tests" 
-model="${output}/model.json"
+model="${CONSTRAINT_TOOLS}/tests/model.json" 
 port="5000"
 
 ${CONSTRAINT_TOOLS}/constraint-tools visualize \
   --model ${model} \
   --port ${port}
 
+  
diff --git a/train-model/estimate_mutation_probabilities b/train-model/estimate_mutation_probabilities
@@ -14,7 +14,7 @@ import copy
 import os, subprocess, multiprocessing
 
 from kmer import check_for_Ns, initialize_kmer_data, fetch_kmer_from_sequence, alternate_bases, middle_base, get_bases, contains_unspecified_bases
-from colorize import print_json, print_string_as_error, print_string_as_info, print_string_as_info_dim, print_unbuffered
+from colorize import print_json, print_string_as_info, print_string_as_info_dim, print_unbuffered
 import color_traceback 
 from fetch_SNVs import fetch_SNVs 
 from pack_unpack import unpack, bed_to_sam_string
@@ -67,9 +67,10 @@ def get_hostname_process_cpu():
     'hostname': hostname,
     'process': pid, 
     'cpu': f'{cpu}/{multiprocessing.cpu_count()}'
-    }
+  }
 
 # https://github.com/pysam-developers/pysam/issues/397#issuecomment-328451288
+@timer 
 def compute_counts_region(region):  
   print_json({'region': region, **get_hostname_process_cpu()})
 
@@ -91,7 +92,7 @@ def parse_arguments():
   parser.add_argument('--genome', type=str, help='')
   parser.add_argument('--regions', type=str, help='')
   parser.add_argument('--number-tumors', type=int, dest='number_tumors', help='')
-  parser.add_argument('--output', type=str, help='')
+  parser.add_argument('--model', type=str, help='')
   parser.add_argument('--mutations', type=str, help='')
   parser.add_argument('--training-mode', type=str, dest='training_mode', help='')
   return parser.parse_args()
@@ -174,7 +175,7 @@ def estimate_mutation_probabilities():
   kmer_data = estimate_mutation_probabilities_core(kmer_data) 
 
   args = parse_arguments()
-  model_path = args.output + '/model.json'
+  model_path = args.model + '/model.json'
   with open(model_path, 'w') as fh:
     json.dump({
       'mutations': args.mutations,
diff --git a/train-model/train-model b/train-model/train-model
@@ -10,12 +10,15 @@ while [[ "$1" =~ ^- ]]; do
     --genome ) shift; [[ ! $1 =~ ^- ]] && genome=$1;;
     --regions ) shift; [[ ! $1 =~ ^- ]] && regions=$1;;
     --kmer-size ) shift; [[ ! $1 =~ ^- ]] && kmer_size=$1;;
-    --output ) shift; [[ ! $1 =~ ^- ]] && output=$1;;
+    --model ) shift; [[ ! $1 =~ ^- ]] && model=$1;;
     *) error "$0: $1 is an invalid flag"; exit 1;;
   esac 
   shift
 done
 
+info "training on regions:"
+info "${regions}\n"
+
 set -o errexit
 set -o pipefail
 set -o noclobber
@@ -47,7 +50,7 @@ estimate_mutation_probabilities \
   --genome ${genome} \
   --regions ${regions} \
   --number-tumors ${number_tumors} \
-  --output ${output} \
+  --model ${model} \
   --mutations ${mutations} \
   --training-mode $(read-config train training_mode)
 

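Note on the new `default_model` setting: after this commit, `flask-app` initializes `model` from the `train.default_model` entry of `.config.json` before parsing flags, so an explicit `--model` argument can still replace it. The repository's `read-config` utility is not shown in this diff, so the following is only a minimal Python sketch of that lookup order, assuming the config layout shown above. The function names `read_config` and `resolve_model_path` are hypothetical and are not part of the commit.

```python
import json
import os


def read_config(config_path, section, key):
    # Hypothetical stand-in for the repository's `read-config` utility:
    # return config[section][key] from a JSON config file.
    with open(config_path) as fh:
        return json.load(fh)[section][key]


def resolve_model_path(root, cli_model=None, config_name=".config.json"):
    # An explicit --model value (cli_model) takes precedence.
    if cli_model is not None:
        return cli_model
    # Otherwise fall back to the default model named in the config,
    # interpreted relative to the installation root (CONSTRAINT_TOOLS).
    relative = read_config(os.path.join(root, config_name), "train", "default_model")
    return os.path.join(root, relative)
```

With the `.config.json` from this commit, calling `resolve_model_path(root)` would give `<root>/dist/model.json`, while passing `cli_model` returns that value unchanged.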