Skip to content

Commit 73ce304

Browse files
committed
minor refactoring
1 parent 0ce0b77 commit 73ce304

File tree

9 files changed

+103
-29
lines changed

9 files changed

+103
-29
lines changed

.config.json

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
{
22
"train": {
3-
"default_neutral_regions": "dist/neutral-regions-test.bed.gz",
3+
"default_neutral_regions": "dist/neutral-regions.bed.gz",
4+
"default_model": "dist/model.json",
45
"training_mode": "concurrent"
56
}
67
}

README.md

Lines changed: 25 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -9,20 +9,21 @@ bash install.sh
99
bash build-vue-app.sh
1010
```
1111
Only installation on Linux x86_64 is currently supported.
12-
Tested in the Protected Environment of the Center for High Performance Computing (CHPC) at University of Utah.
12+
Tested in the Protected Environment computer cluster of the Center for High Performance Computing (CHPC) at University of Utah.
1313

1414
## Quick Start
1515

1616
Assuming one has access to the protected environment on the CHPC at University of Utah:
1717

1818
```
19-
[sbatch | bash] tests/train.sh $PWD
19+
bash tests/train.sh $PWD
2020
```
2121

2222
Once training is complete, do:
2323
```
2424
bash tests/visualize.sh $PWD
2525
```
26+
2627
Follow the instructions at the command line to view a web app that visualizes observed mutation counts, and those expected under a null model of sequence-dependent mutation (see `model-definition` folder), as a function of genomic coordinate.
2728

2829
A plot of estimated mutation probabilities of the neutral model can be found here: https://github.com/quinlan-lab/constraint-tools/blob/main/tests/plot_mutation_probabilities.ipynb
@@ -48,38 +49,44 @@ Required arguments for `train` are:
4849

4950
```
5051
--genome STR
51-
Path to the reference fasta.
52+
Path to a reference fasta.
5253
A "samtools faidx" index is expected to be present at the same path.
5354
--mutations STR
5455
Path to a set of mutations specified in Mutation Annotation Format.
5556
A "tabix" index is expected to be present at the same path.
5657
--kmer-size INT
57-
Size of kmer to use in model.
58-
--output STR
59-
Path to a directory to store results in.
58+
Size of kmer of model to be trained.
59+
--model STR
60+
Path to a directory to store trained model in.
6061
```
6162

6263
By default the `train` subcommand uses a pre-computed set of putatively neutral regions from the GRCH37 reference. Optionally, the user may change this by specifying the `--regions` argument:
6364

6465
```
65-
--regions STR
66-
Bed-format file containing a list of genomic intervals on which the model is trained.
66+
--regions STR
67+
Bed-format file containing a list of genomic intervals on which the model is to be trained.
6768
```
6869

6970
This produces a specification of the sequence-dependent neutral mutation model in json format, viewable using, e.g.,
7071
```
71-
${CONSTRAINT_TOOLS}/bin/jq . ${output}/<json file>
72+
${CONSTRAINT_TOOLS}/bin/jq . ${model}/<json file>
7273
```
7374

7475
Required arguments for `visualize` are:
7576

7677
```
77-
--model STR
78-
Path to the neutral model produced by the train sub-command (in json format). This model is used to compute the expected mutation counts in the visualization.
7978
--port INT
8079
The port to serve the web-app on
8180
```
82-
81+
82+
By default the `visualize` subcommand uses a pre-computed model.
83+
Optionally, the user may change this by specifying the `--model` argument:
84+
85+
```
86+
--model STR
87+
Path to a neutral model produced by the train sub-command (in json format). This model is used to compute the expected mutation counts in the visualization.
88+
```
89+
8390
## Input Data
8491

8592
Assuming one has access to the protected environment on the CHPC at University of Utah,
@@ -89,6 +96,12 @@ then sorted, block-compressed, and indexed vcf, maf, gtf and fasta files can be
8996
/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data
9097
```
9198

99+
## Production model
100+
101+
In the `/dist` directory, we distribute a model
102+
that was trained on a genome-wide set of putatively neutral regions
103+
(also located in the `/dist` directory).
104+
92105
## Development
93106

94107
Changes to the `vue-app` directory necessitate rebuilding the vue app by running

flask-app/flask-app

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
#!/usr/bin/env bash
22

3+
model="${CONSTRAINT_TOOLS}/$(read-config train default_model)"
4+
35
# https://devhints.io/bash#miscellaneous
46
# put option-fetching before "set -o nounset" so that we can detect flags without arguments
57
while [[ "$1" =~ ^- ]]; do
@@ -11,6 +13,9 @@ while [[ "$1" =~ ^- ]]; do
1113
shift
1214
done
1315

16+
info "using the model specified at:"
17+
info "${model}\n"
18+
1419
set -o errexit
1520
set -o pipefail
1621
set -o noclobber

generate-production-model.sh

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
#!/bin/bash
2+
#SBATCH --time=3:00:00
3+
#SBATCH --nodes=1
4+
# a slurm task is a Linux process:
5+
# https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html#going-parallel
6+
#SBATCH --ntasks=16
7+
# slurm does not allocate resources for more than 16 CPUs per job,
8+
# so request one CPU per Linux process:
9+
#SBATCH --cpus-per-task=1
10+
#SBATCH --account=quinlan-rw
11+
#SBATCH --partition=quinlan-shared-rw
12+
13+
set -o errexit
14+
set -o pipefail
15+
set -o nounset
16+
# set -o noclobber
17+
# set -o xtrace
18+
19+
CONSTRAINT_TOOLS=$1
20+
21+
PATH=${CONSTRAINT_TOOLS}:$PATH
22+
PATH=${CONSTRAINT_TOOLS}/utilities:$PATH
23+
PATH=${CONSTRAINT_TOOLS}/bin:$PATH
24+
25+
mutations="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/icgc/mutations.sorted.maf.gz"
26+
genome="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/reference/grch37/genome.fa.gz"
27+
kmer_size="5"
28+
model="${CONSTRAINT_TOOLS}/dist" # path to directory to store model in
29+
30+
fetch_subset_of_regions () {
31+
less ${CONSTRAINT_TOOLS}/dist/neutral-regions.bed.gz | head -100
32+
}
33+
34+
train_on_subset_of_regions () {
35+
info "$(fetch_subset_of_regions | awk '{ print $0, $3-$2 }')"
36+
37+
constraint-tools train \
38+
--genome ${genome} \
39+
--mutations ${mutations} \
40+
--kmer-size ${kmer_size} \
41+
--regions <(fetch_subset_of_regions | bgzip) \
42+
--model ${model}
43+
}
44+
45+
train_on_all_regions () {
46+
constraint-tools train \
47+
--genome ${genome} \
48+
--mutations ${mutations} \
49+
--kmer-size ${kmer_size} \
50+
--model ${model}
51+
}
52+
53+
# train_on_subset_of_regions
54+
train_on_all_regions

tests/train.sh

Lines changed: 5 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,4 @@
11
#!/bin/bash
2-
#SBATCH --time=3:00:00
3-
#SBATCH --nodes=2
4-
#SBATCH --ntasks=16
5-
#SBATCH --account=quinlan-rw
6-
#SBATCH --partition=quinlan-shared-rw
72

83
set -o errexit
94
set -o pipefail
@@ -15,12 +10,14 @@ CONSTRAINT_TOOLS=$1
1510

1611
mutations="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/icgc/mutations.sorted.maf.gz"
1712
genome="/scratch/ucgd/lustre-work/quinlan/u6018199/constraint-tools/data/reference/grch37/genome.fa.gz"
18-
kmer_size="5"
19-
output="${CONSTRAINT_TOOLS}/tests"
13+
kmer_size="3"
14+
regions="${CONSTRAINT_TOOLS}/tests/neutral-regions.bed.gz"
15+
model="${CONSTRAINT_TOOLS}/tests" # path to directory to store model in
2016

2117
${CONSTRAINT_TOOLS}/constraint-tools train \
2218
--genome ${genome} \
2319
--mutations ${mutations} \
2420
--kmer-size ${kmer_size} \
25-
--output ${output}
21+
--regions ${regions} \
22+
--model ${model}
2623

tests/visualize.sh

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,11 +6,11 @@ set -o nounset
66

77
CONSTRAINT_TOOLS=$1
88

9-
output="${CONSTRAINT_TOOLS}/tests"
10-
model="${output}/model.json"
9+
model="${CONSTRAINT_TOOLS}/tests/model.json"
1110
port="5000"
1211

1312
${CONSTRAINT_TOOLS}/constraint-tools visualize \
1413
--model ${model} \
1514
--port ${port}
1615

16+

train-model/estimate_mutation_probabilities

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ import copy
1414
import os, subprocess, multiprocessing
1515

1616
from kmer import check_for_Ns, initialize_kmer_data, fetch_kmer_from_sequence, alternate_bases, middle_base, get_bases, contains_unspecified_bases
17-
from colorize import print_json, print_string_as_error, print_string_as_info, print_string_as_info_dim, print_unbuffered
17+
from colorize import print_json, print_string_as_info, print_string_as_info_dim, print_unbuffered
1818
import color_traceback
1919
from fetch_SNVs import fetch_SNVs
2020
from pack_unpack import unpack, bed_to_sam_string
@@ -67,9 +67,10 @@ def get_hostname_process_cpu():
6767
'hostname': hostname,
6868
'process': pid,
6969
'cpu': f'{cpu}/{multiprocessing.cpu_count()}'
70-
}
70+
}
7171

7272
# https://github.com/pysam-developers/pysam/issues/397#issuecomment-328451288
73+
@timer
7374
def compute_counts_region(region):
7475
print_json({'region': region, **get_hostname_process_cpu()})
7576

@@ -91,7 +92,7 @@ def parse_arguments():
9192
parser.add_argument('--genome', type=str, help='')
9293
parser.add_argument('--regions', type=str, help='')
9394
parser.add_argument('--number-tumors', type=int, dest='number_tumors', help='')
94-
parser.add_argument('--output', type=str, help='')
95+
parser.add_argument('--model', type=str, help='')
9596
parser.add_argument('--mutations', type=str, help='')
9697
parser.add_argument('--training-mode', type=str, dest='training_mode', help='')
9798
return parser.parse_args()
@@ -174,7 +175,7 @@ def estimate_mutation_probabilities():
174175
kmer_data = estimate_mutation_probabilities_core(kmer_data)
175176

176177
args = parse_arguments()
177-
model_path = args.output + '/model.json'
178+
model_path = args.model + '/model.json'
178179
with open(model_path, 'w') as fh:
179180
json.dump({
180181
'mutations': args.mutations,

train-model/train-model

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10,12 +10,15 @@ while [[ "$1" =~ ^- ]]; do
1010
--genome ) shift; [[ ! $1 =~ ^- ]] && genome=$1;;
1111
--regions ) shift; [[ ! $1 =~ ^- ]] && regions=$1;;
1212
--kmer-size ) shift; [[ ! $1 =~ ^- ]] && kmer_size=$1;;
13-
--output ) shift; [[ ! $1 =~ ^- ]] && output=$1;;
13+
--model ) shift; [[ ! $1 =~ ^- ]] && model=$1;;
1414
*) error "$0: $1 is an invalid flag"; exit 1;;
1515
esac
1616
shift
1717
done
1818

19+
info "training on regions:"
20+
info "${regions}\n"
21+
1922
set -o errexit
2023
set -o pipefail
2124
set -o noclobber
@@ -47,7 +50,7 @@ estimate_mutation_probabilities \
4750
--genome ${genome} \
4851
--regions ${regions} \
4952
--number-tumors ${number_tumors} \
50-
--output ${output} \
53+
--model ${model} \
5154
--mutations ${mutations} \
5255
--training-mode $(read-config train training_mode)
5356

0 commit comments

Comments
 (0)