Skip to content

Commit a4c03ad

Browse files
authored
Merge pull request #133 from griffithlab/min_and_average
Min and average
2 parents 7b80c9c + c800cc8 commit a4c03ad

File tree

2 files changed

+74
-39
lines changed

2 files changed

+74
-39
lines changed

docs/workflow.md

Lines changed: 14 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# RegTools example workflow
22

3-
This is an example workflow for running RegTools on a cohort of samples. This analysis requires that there be a vcf and RNA bam/cram file for each samples. The outline described below was used to run our own analysis on TCGA data.
3+
This is an example workflow for running RegTools on a cohort of samples. This analysis requires that there be a VCF and RNA bam file for each sample. The workflow described below was used to run our own analysis on TCGA data.
44

55
By the end of the analysis, the directory structure should look like the example below. The `*` in the example below refers to the tag/parameter used to run `regtools cis-splice-effects identify` with.
66

@@ -40,65 +40,66 @@ By the end of the analysis, the directory structure should look like the example
4040
- junction_pvalues_*.tsv
4141
```
4242

43-
### Set tag and parameter shell arguments
43+
## RegTools commands
44+
45+
**Set tag and parameter shell arguments**
4446

4547
```bash
4648
tag=<tag>
4749
param=<run option>
4850
# (e.g. tag=default param=""; tag=E param="-E"; tag=i20e5 param="-i 20 -e 5")
4951
```
5052

51-
### Run `regtools cis-splice-effects identify` with desired options for selecting variant and window size
53+
**Run `regtools cis-splice-effects identify` with desired options for selecting variant and window size**
5254

5355
```bash
5456
for i in samples/*/; do regtools cis-splice-effects identify $param -o ${i}/output/cse_identify_filtered_$tag.tsv -j ${i}/output/cse_identify_filtered_$tag.bed -v ${i}/output/cse_identify_filtered_$tag.vcf ${i}/variants.per_gene.vep.vcf.gz ${i}/tumor_rna_alignments.bam /reference.fa reference.gtf; done
5557
```
5658

57-
### Make `variant.bed` for each sample
59+
**Make `variant.bed` for each sample**
5860

5961
```bash
6062
for i in samples/*/; do bash variants.sh ${i}/output/cse_identify_filtered_$tag.tsv ${i}/output/variants_$tag.bed; done
6163
```
6264

63-
### Combine each sample's `variant.bed` file per tag to get all variants that were deemed significant to splicing across all samples for a given tag
65+
**Combine each sample's `variant.bed` file per tag to get all variants that were deemed significant to splicing across all samples for a given tag**
6466

6567
```bash
6668
echo -e 'chrom\tstart\tend\tsamples' > all_splicing_variants_$tag.bed
6769
for i in samples/*/; do j=${i##samples/}; uniq ${i}output/variants_$tag.bed | awk -v var=${j%%/} '{print $0 "\t" var}' >> all_splicing_variants_$tag.bed; done
6870
```
6971

70-
### Make vcf of all variants across all samples (from each sample's variants.vcf). Then, compress it and index it
72+
**Make vcf of all variants across all samples (from each sample's variants.vcf). Then, compress it and index it.**
7173

7274
```bash
7375
vcf-concat samples/*/variants.vcf.gz | vcf-sort > all_variants_sorted.vcf
7476

75-
###### optional ######
7677
bgzip all_variants_sorted.vcf
7778

7879
tabix all_variants_sorted.vcf.gz
7980
```
8081

81-
### Run `regtools cis-splice effects identify` on all samples with all variants (with `$tag` options as example)
82+
**Run `regtools cis-splice effects identify` on all samples with all variants (with `$tag` options as example)**
8283

8384
```bash
8485
for i in samples/*/; do bsub -oo $i/logs/regtools_compare_$tag.lsf regtools cis-splice-effects identify $param -o ${i}/output/cse_identify_filtered_compare_$tag.tsv -j ${i}/output/cse_identify_filtered_compare_$tag.bed -v ${i}/output/cse_identify_filtered_compare_$tag.vcf all_variants_sorted.vcf.gz ${i}/tumor_rna_alignments.bam reference.fa reference.gtf; done
8586
```
8687

87-
## Beginning of compare junctions analysis
88+
## Statistical analysis
8889

89-
### Make directory to store comparison results
90+
**Make directory to store comparison results**
9091

9192
```bash
9293
mkdir -p compare_junctions/hist
9394
```
9495

95-
### Run `compare_junctions_hist.R` on sample data
96+
**Run `stats_wrapper.py` on sample data**
9697

9798
```bash
98-
Rscript --vanilla compare_junctions_hist.R <tag>
99+
python3 stats_wrapper.py <tag>
99100
```
100101

101-
### Run `filter_and_BH.R` to adjust p values and filter results
102+
**Run `filter_and_BH.R` to adjust p values and filter results**
102103

103104
```bash
104105
Rscript --vanilla filter_and_BH.R <tag>

scripts/compare_junctions_hist_v2.R

Lines changed: 60 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -9,21 +9,21 @@ library(tidyverse)
99

1010
debug = F
1111

12-
# system.time({
13-
# if (debug){
14-
# tag = paste("_", "default", sep="")
15-
# } else {
16-
# # get options tag
17-
# args = commandArgs(trailingOnly = TRUE)
18-
# tag = args[1]
19-
# input_file = args[2]
20-
# if ( substr(tag, 2, 3) == "--"){
21-
# stop("Please specify an option tag (e.g. \"default\", \"i20e5\")")
22-
# }
23-
# }
24-
25-
tag = 'E'
26-
input_file = 'all_splicing_variants_E.bed'
12+
system.time({
13+
if (debug){
14+
tag = paste("_", "default", sep="")
15+
} else {
16+
# get options tag
17+
args = commandArgs(trailingOnly = TRUE)
18+
tag = args[1]
19+
input_file = args[2]
20+
if ( substr(tag, 2, 3) == "--"){
21+
stop("Please specify an option tag (e.g. \"default\", \"i20e5\")")
22+
}
23+
}
24+
25+
# tag = 'E'
26+
# input_file = '/Users/kcotto/Desktop/CHOL/all_splicing_variants_E.bed'
2727

2828
# All splicing relevant variants (union of rows from variants.bed files; add column with comma-separated list of sample names)
2929
all_splicing_variants = unique(data.table::fread(input_file), sep = '\t', header = T, stringsAsFactors = FALSE)
@@ -187,14 +187,14 @@ get_mean <- function(x){
187187
return(x)
188188
}
189189

190-
x <- mapply(get_mean, regtools_data$norm_scores_non)
191-
regtools_data$mean_norm_score_non <- x
192-
193190
get_sd <- function(x){
194191
x <- sd(as.numeric(x))
195192
return(x)
196193
}
197194

195+
x <- mapply(get_mean, regtools_data$norm_scores_non)
196+
regtools_data$mean_norm_score_non <- x
197+
198198
x <- mapply(get_sd, regtools_data$norm_scores_non)
199199
regtools_data$sd_norm_score_non <- x
200200

@@ -222,21 +222,54 @@ regtools_data$mean_norm_score_variant <- x
222222
x <- mapply(get_sd, regtools_data$norm_scores_variant)
223223
regtools_data$sd_norm_score_variant <- x
224224

225+
get_min <- function(x){
226+
x <- min(as.numeric(x))
227+
return(x)
228+
}
229+
230+
x <- mapply(get_min, regtools_data$norm_scores_variant)
231+
regtools_data$min_norm_score_variant <- x
232+
225233
print("test7")
226234

227235
################ calculate p-values ############################################
228236

237+
# calculate using mean
229238
a <- function(x){
230-
variant_norm_score = as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE)))
239+
variant_norm_score = mean(as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE))))
231240
if(length(x[['norm_scores_non']]) <= 1){
232241
return(0)
233242
}
234243

235244
all_norm_scores = c(x$norm_scores_non, variant_norm_score)
236245
countable = rank(all_norm_scores)
237246
num_samples = str_count(x$norm_scores_variant, ',') + 1
238-
non_variant_norm_scores_ranked = head(countable, (-1 * num_samples))
239-
variant_norm_score_ranked = tail(countable, num_samples)
247+
non_variant_norm_scores_ranked = head(countable, -1)
248+
variant_norm_score_ranked = tail(countable, 1)
249+
histinfo = hist(non_variant_norm_scores_ranked,
250+
breaks = seq(0.5, max(non_variant_norm_scores_ranked)+1.5, by=1), plot=F)
251+
mids = histinfo$mids
252+
cd = cumsum(histinfo$density)
253+
underestimate = max(which(mids <= variant_norm_score_ranked))
254+
pvalue = 1-cd[underestimate]
255+
return(pvalue)
256+
}
257+
258+
# calculate using min
259+
b <- function(x){
260+
variant_norm_score = min(as.numeric(unlist(strsplit(x[['norm_scores_variant']], ',', fixed=TRUE))))
261+
if(variant_norm_score == 0){
262+
return(1)
263+
}
264+
if(length(x[['norm_scores_non']]) <= 1){
265+
return(0)
266+
}
267+
268+
all_norm_scores = c(x$norm_scores_non, variant_norm_score)
269+
countable = rank(all_norm_scores)
270+
num_samples = str_count(x$norm_scores_variant, ',') + 1
271+
non_variant_norm_scores_ranked = head(countable, -1)
272+
variant_norm_score_ranked = tail(countable, 1)
240273
histinfo = hist(non_variant_norm_scores_ranked,
241274
breaks = seq(0.5, max(non_variant_norm_scores_ranked)+1.5, by=1), plot=F)
242275
mids = histinfo$mids
@@ -246,7 +279,8 @@ a <- function(x){
246279
return(pvalue)
247280
}
248281

249-
regtools_data$p_value <- apply(regtools_data, 1, a)
282+
regtools_data$p_value_mean <- apply(regtools_data, 1, a)
283+
regtools_data$p_value_min <- apply(regtools_data, 1, b)
250284
print("Number of rows in data.table")
251285
print(length(regtools_data$samples))
252286

@@ -258,12 +292,12 @@ regtools_data$norm_scores_non <- unlist(lapply(regtools_data$norm_scores_non,pas
258292
columns_to_keep = c('samples', 'variant_info.x', 'genes', 'sample', "chrom.x", "start.x", "end.x", 'strand.x', 'anchor.x', 'info',
259293
'names', 'mean_norm_score_variant', 'sd_norm_score_variant', 'norm_scores_variant',
260294
'total_score_variant', 'mean_norm_score_non', 'sd_norm_score_non', 'norm_scores_non',
261-
'total_score_non', 'p_value')
295+
'total_score_non', 'p_value_mean','p_value_min')
262296
regtools_data = subset(regtools_data, select=columns_to_keep)
263297
colnames(regtools_data) <- c('variant_samples', 'variant_info', 'genes', 'junction_samples', "chrom", "start", "end", 'strand', 'anchor', 'variant_junction_info',
264298
'names', 'mean_norm_score_variant', 'sd_norm_score_variant', 'norm_scores_variant',
265299
'total_score_variant', 'mean_norm_score_non', 'sd_norm_score_non', 'norm_scores_non',
266-
'total_score_non', 'pvalue')
300+
'total_score_non', 'p_value_mean','p_value_min')
267301
regtools_data$sd_norm_score_variant[is.na(regtools_data$sd_norm_score_variant)] = 0
268302
regtools_data$mean_norm_score_non[is.na(regtools_data$mean_norm_score_non)] = 0
269303
regtools_data$sd_norm_score_non[is.na(regtools_data$sd_norm_score_non)] = 0
@@ -274,5 +308,5 @@ regtools_data = regtools_data %>% distinct()
274308

275309

276310
write.table(regtools_data, file=paste(input_file, "_out.tsv", sep=""), quote=FALSE, sep='\t', row.names = F)
277-
#
278-
# })
311+
312+
})

0 commit comments

Comments
 (0)