Commit 7b0fd9b

Merge pull request #315 from bbglab/dev

2 parents: 72dc4d9 + cb18adc

127 files changed: +1777 additions, −1881 deletions


.markdownlint.json

Lines changed: 6 additions & 0 deletions

```json
{
  "MD013": false,
  "MD024": {
    "siblings_only": true
  }
}
```

README.md

Lines changed: 25 additions & 45 deletions
````diff
@@ -1,12 +1,10 @@
+# deepCSA
+
 ## Introduction
 
-**bbglab/deepCSA** is a bioinformatics pipeline that can be used for analyzing targeted DNA sequencing data. It was designed for duplex sequencing data of normal tissues.
+**bbglab/deepCSA** is a bioinformatics pipeline for analyzing the clonal structure information from targeted DNA sequencing data. It was designed for duplex sequencing data of normal tissues.
 
-<!-- TODO nf-core:
-Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-major pipeline sections and the types of output it produces. You're giving an overview to someone new
-to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+![deepCSA workflow overview](docs/images/deepCSA.png)
 
 <!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
 workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
@@ -15,24 +13,21 @@
 <!-- 1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
 2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/)) -->
 
-
 ## Usage
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-Explain what rows and columns represent. For instance (please edit as appropriate):
-
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+sample,vcf,bam
+sample1,sample1.high.filtered.vcf,sample1.sorted.bam
+sample2,sample2.high.filtered.vcf,sample2.sorted.bam
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
+Each row represents a single sample, with a single-sample VCF containing the mutations called in that sample and the BAM file that was used to produce those variant calls. The mutations are obtained from the VCF, and the BAM file is used to compute the sequencing depth at each position for the downstream analysis.
 
--->
+**Make sure that you do not use any '.' in your sample names, and use text-like names for the samples; avoid names that are only numbers.** The numeric case should be handled properly, but string-like names ensure consistency.
 
 Now, you can run the pipeline using:
 
@@ -41,42 +36,36 @@ Now, you can run the pipeline using:
 ```bash
 git clone https://github.com/bbglab/deepCSA.git
 cd deepCSA
-nextflow run main.nf --outdir <OUTDIR> -profile singularity,<DESIRED PROFILE>
-```
-The input should be provided by the `--input` option but it is more recommended to define it within a given profile.
-
-Internally also use the -work-dir option:
-```
--work-dir /workspace/nobackup2/work/deepCSA/<NAME>
+nextflow run main.nf --outdir <OUTDIR> -profile singularity,<DESIRED PROFILE> --input samplesheet.csv
 ```
 
-Also put the following content in an executor.config provided as `-c executor.config`
-```
-process {
-    executor = 'slurm'
-    errorStrategy = 'retry'
-    maxRetries = 2
-}
-```
+The input can be provided with the `--input` option, but it is recommended to define this and all the other parameters in a parameter file that is passed to the pipeline to run the analysis with the specified configuration.
 
+### Warning
 
-:::warning
-Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
-provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
+Please provide pipeline parameters via the Nextflow `-params-file` option or the CLI. Custom config files, including those
+provided by the `-c` Nextflow option, can be used to provide any configuration _**except for parameters**_;
 see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
-:::
 
 ## Credits
 
 bbglab/deepCSA was originally written by Ferriol Calvet.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+* @rblancomi
+* @FedericaBrando
+* @koszulordie
+* @St3451
+* @AxelRosendahlHuber
+* @andrianovam
+* @migrau
 
+<!-- TODO
 ## Contributions and Support
 
 If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
+-->
 
 ## Citations
 
@@ -95,15 +84,6 @@ This pipeline uses code and infrastructure developed and maintained by the [nf-c
 >
 > _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).
 
+## Documentation
 
-
-## (temporary) Documentation
-
-### Custom TSV regions annotation file should contain this
-Document this custom_regions has to be a TSV file with the following columns:
-* chromosome start end gene_name impactful_mutations [neutral_impact] [new_impact]
-* chromosome start and end indicate the region that is being customized
-* gene_name : is the name of the region that is being added, make sure that it does not coincide with the name of any other gene.
-* impactful_mutations : is a comma-separated list of SNVs that need to be labelled with the value indicated in new_impact, format: chr5_1294991_C/T, with pyrimidine based definition
-* neutral_impact : (optional, default; synonymous)
-* new_impact : (optional, default: missense) is the impact that the mutations listed in impactful_mutations will receive.
+Find the documentation ([link to docs](https://github.com/bbglab/deepCSA/tree/main/docs)).
````
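The samplesheet format introduced in the updated README (one row per sample, with its VCF and BAM) is simple enough to generate programmatically. Below is a minimal sketch with hypothetical file names; it also enforces the no-dots rule on sample names that the README warns about:

```python
import csv

# Hypothetical sample records; names are text-like and contain no '.'
samples = [
    ("sample1", "sample1.high.filtered.vcf", "sample1.sorted.bam"),
    ("sample2", "sample2.high.filtered.vcf", "sample2.sorted.bam"),
]

with open("samplesheet.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sample", "vcf", "bam"])  # header expected by deepCSA
    for name, vcf, bam in samples:
        # guard against sample names that could break downstream parsing
        assert "." not in name, f"sample name {name!r} contains '.'"
        writer.writerow([name, vcf, bam])
```

The resulting file matches the `--input samplesheet.csv` invocation shown in the README.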

assets/useful_scripts/deepcsa_maf2samplevcfs.py

Lines changed: 20 additions & 2 deletions
```diff
@@ -10,12 +10,30 @@
 ## If your sample names are NOT in a column called SAMPLE_ID,
 ## you can use the --sample-name-column option to specify it.
 
-# if the maf is from deepCSA, use this one, otherwise use the one below
+# if the maf is from deepCSA, use this one
 # usage: python deepcsa_maf2samplevcfs.py --mutations-file all_samples.somatic.mutations.tsv --output-dir ./test/ --maf-from-deepcsa
 
-# if the maf file is not from deepCSA, use this below
+# if the maf file is not from deepCSA, use this one
 # usage: python deepcsa_maf2samplevcfs.py --mutations-file all_samples.somatic.mutations.tsv --output-dir ./test/
 
+
+
+#######
+# Mandatory columns in input mutations:
+#######
+
+# if the maf is from deepCSA, it must contain the following columns, as they were originally generated
+# ['CHROM', 'POS', 'REF', 'ALT', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE']
+
+# if the maf file is not from deepCSA, then it MUST contain the following columns
+# ['CHROM', 'POS', 'REF', 'ALT', 'DEPTH', 'ALT_DEPTH']
+# where:
+# DEPTH indicates the total number of duplex reads sequenced at the position where the mutation occurs
+# ALT_DEPTH indicates the total number of duplex reads supporting the variant at the same position
+
+
+
+
 import click
 import pandas as pd
 
```
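The mandatory-column lists documented in the script header lend themselves to an explicit check before any processing. A minimal sketch; the helper `missing_maf_columns` is hypothetical and not part of the script:

```python
import pandas as pd

# Column requirements documented in the deepcsa_maf2samplevcfs.py header
DEEPCSA_COLS = ["CHROM", "POS", "REF", "ALT", "FILTER", "INFO", "FORMAT", "SAMPLE"]
GENERIC_COLS = ["CHROM", "POS", "REF", "ALT", "DEPTH", "ALT_DEPTH"]

def missing_maf_columns(maf_df, maf_from_deepcsa):
    """Return the mandatory columns absent from the input MAF."""
    required = DEEPCSA_COLS if maf_from_deepcsa else GENERIC_COLS
    return [col for col in required if col not in maf_df.columns]

# Toy generic MAF that lacks ALT_DEPTH
maf = pd.DataFrame({"CHROM": ["chr1"], "POS": [100], "REF": ["C"],
                    "ALT": ["T"], "DEPTH": [5000]})
print(missing_maf_columns(maf, maf_from_deepcsa=False))  # ['ALT_DEPTH']
```

Failing fast on a missing `DEPTH` or `ALT_DEPTH` column gives a clearer error than a `KeyError` deep inside the conversion.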

bin/compute_mutdensity.py

Lines changed: 200 additions & 0 deletions
```python
#!/usr/bin/env python

"""
Mutation density computation script.
Mutation density is a metric that quantifies the number of mutations per megabase (Mb) of sequenced DNA.
This script computes mutation density for a given sample, all genes, and a list of consequence type groups.
It calculates mutation density per Mb, both adjusted and non-adjusted by the number of sites available for each consequence type.
The results are saved to a TSV file.
"""

import click
import pandas as pd

from read_utils import custom_na_values

# TODO: bump pandas to 2.2.3

# -- Auxiliary functions -- #

# False stands for "all mutation types together"
MUTDENSITY_IMPACT_GROUPS = [False, ["SNV"], ["INSERTION", "DELETION"], ["SNV", "INSERTION", "DELETION"]]


def mutdensity_sample(maf_df, depths_df, depths_adj_df, sample_name):
    """
    Compute a sample's global mutation density. Return the mutation density
    per Mb, non-adjusted and adjusted by panel composition.
    """
    impact_group_results = []

    # mutation density depth information
    sample_features_depth = {
        "DEPTH": depths_df.drop_duplicates(subset=["CHROM", "POS"])[sample_name].sum(),
        "DEPTH_ADJUSTED": depths_adj_df[sample_name].sum(),
    }

    for type_list in MUTDENSITY_IMPACT_GROUPS:
        # filter by mutation type according to type_list
        if not type_list:
            unique_maf = maf_df[["SAMPLE_ID", "MUT_ID", "ALT_DEPTH"]].drop_duplicates()
            types_included = "all_types"
        else:
            unique_maf = maf_df[maf_df["TYPE"].isin(type_list)][["SAMPLE_ID", "MUT_ID", "ALT_DEPTH"]].copy().drop_duplicates()
            types_included = "-".join(sorted(type_list))

        # count the number of mutations and mutated reads in the sample,
        # making sure to count each mutation only once (avoid annotation issues)
        n_muts = unique_maf.shape[0]
        n_muts_per_sample = unique_maf.groupby(by=["SAMPLE_ID", "MUT_ID"]).agg({"ALT_DEPTH": "sum"}).reset_index()
        n_mutated_reads = n_muts_per_sample["ALT_DEPTH"].sum()
        print(n_muts, n_mutated_reads)

        # mutation density metrics
        sample_features = dict(sample_features_depth)
        sample_features["N_MUTS"] = n_muts
        sample_features["N_MUTATED"] = n_mutated_reads

        sample_features["MUTDENSITY_MB"] = float(sample_features["N_MUTS"] / sample_features["DEPTH"] * 1000000)
        sample_features["MUTDENSITY_MB_ADJUSTED"] = float(sample_features["N_MUTS"] / sample_features["DEPTH_ADJUSTED"] * 1000000)
        sample_features["MUTREADSRATE_MB"] = float(sample_features["N_MUTATED"] / sample_features["DEPTH"] * 1000000)
        sample_features["MUTREADSRATE_MB_ADJUSTED"] = float(sample_features["N_MUTATED"] / sample_features["DEPTH_ADJUSTED"] * 1000000)

        sample_features["GENE"] = "ALL_GENES"
        sample_features["MUTTYPES"] = types_included

        impact_group_results.append(pd.DataFrame([sample_features]))

    # concatenate results for all impact groups
    return pd.concat(impact_group_results)


def mutdensity_gene(maf_df, depths_df, depths_adj_df, sample_name):
    """
    Compute each gene's mutation density. Return the mutation density
    per Mb sequenced, both non-adjusted and adjusted by panel composition.
    """
    impact_group_results = []

    for type_list in MUTDENSITY_IMPACT_GROUPS:
        # filter by mutation type according to type_list
        if not type_list:
            unique_maf = maf_df[["SAMPLE_ID", "GENE", "MUT_ID", "ALT_DEPTH"]].drop_duplicates()
            types_included = "all_types"
        else:
            unique_maf = maf_df[maf_df["TYPE"].isin(type_list)][["SAMPLE_ID", "GENE", "MUT_ID", "ALT_DEPTH"]].copy().drop_duplicates()
            types_included = "-".join(sorted(type_list))

        # count the number of mutations and mutated reads per gene,
        # making sure to count each mutation only once (avoid annotation issues)
        n_muts_gene = unique_maf.groupby(by=["GENE"]).agg({"ALT_DEPTH": "count"})
        n_muts_gene.columns = ["N_MUTS"]

        n_mutated_reads = unique_maf.groupby(by=["GENE"]).agg({"ALT_DEPTH": "sum"})
        n_mutated_reads.columns = ["N_MUTATED"]

        depths_gene_df = depths_df.groupby("GENE").agg({sample_name: "sum"})
        depths_gene_df.columns = ["DEPTH"]
        depths_adj_gene_df = depths_adj_df.groupby("GENE").agg({sample_name: "sum"})
        depths_adj_gene_df.columns = ["DEPTH_ADJUSTED"]

        mut_rate_mut_reads_df = n_muts_gene.merge(n_mutated_reads, on="GENE")
        depths_depthsadj_gene_df = depths_gene_df.merge(depths_adj_gene_df, on="GENE")
        # left merge so that mutation density is computed even when the number of mutations is NA (meaning zero)
        mut_depths_df = depths_depthsadj_gene_df.merge(mut_rate_mut_reads_df, on="GENE", how="left")
        mut_depths_df = mut_depths_df.fillna(0)  # genes without mutations get N_MUTS = N_MUTATED = 0

        # mutation density metrics
        mut_depths_df["MUTDENSITY_MB"] = (mut_depths_df["N_MUTS"] / mut_depths_df["DEPTH"] * 1000000).astype(float)
        mut_depths_df["MUTDENSITY_MB_ADJUSTED"] = (mut_depths_df["N_MUTS"] / mut_depths_df["DEPTH_ADJUSTED"] * 1000000).astype(float)
        mut_depths_df["MUTREADSRATE_MB"] = (mut_depths_df["N_MUTATED"] / mut_depths_df["DEPTH"] * 1000000).astype(float)
        mut_depths_df["MUTREADSRATE_MB_ADJUSTED"] = (mut_depths_df["N_MUTATED"] / mut_depths_df["DEPTH_ADJUSTED"] * 1000000).astype(float)

        mut_depths_df["MUTTYPES"] = types_included
        impact_group_results.append(mut_depths_df.reset_index())

    # concatenate results for all impact groups
    return pd.concat(impact_group_results)


def load_n_process_inputs(maf_path, depths_path, annot_panel_path, sample_name):
    # File loading
    maf_df = pd.read_csv(maf_path, sep="\t", na_values=custom_na_values)
    depths_df = pd.read_csv(depths_path, sep="\t")
    depths_df = depths_df.drop("CONTEXT", axis=1)
    annot_panel_df = pd.read_csv(annot_panel_path, sep="\t", na_values=custom_na_values)

    # Subset depths with panel
    # mode 1: each position counts once per gene (beware: a position may be duplicated across genes)
    depths_subset_df = depths_df.merge(annot_panel_df[["CHROM", "POS", "GENE"]].drop_duplicates(),
                                       on=["CHROM", "POS"], how="inner")

    # mode 2 (adjusted): each position counts as many times as it contributes to the panel;
    # the depth per position is divided by 3 because it can contribute to three different mutations
    depths_df[sample_name] = depths_df[sample_name] / 3
    depths_subset_adj_df = depths_df.merge(annot_panel_df[["CHROM", "POS", "GENE"]], on=["CHROM", "POS"], how="inner")

    # mode 3 (adjusted): each position counts as many times as it contributes to the panel, but ONLY ONCE PER SAMPLE
    depths_subset_adj_sample_df = depths_df.merge(
        annot_panel_df.drop_duplicates(subset=["CHROM", "POS", "REF", "ALT"])[["CHROM", "POS"]],
        on=["CHROM", "POS"], how="inner")

    return maf_df, depths_subset_df, depths_subset_adj_df, depths_subset_adj_sample_df


# -- Main function -- #

def compute_mutdensity(maf_path, depths_path, annot_panel_path, sample_name, panel_v):
    """
    Compute mutation density for a given sample based on MAF, depths, and annotation panel files.
    Calculates mutation density per Mb, both adjusted and non-adjusted by the panel composition,
    and saves the results to a TSV file.
    """
    maf_df, depths_subset_df, depths_subset_adj_df, depths_subset_adj_sample_df = load_n_process_inputs(
        maf_path, depths_path, annot_panel_path, sample_name)

    # Compute mutation densities
    # sample-level mutation density
    mutdensity_sample_df = mutdensity_sample(maf_df, depths_subset_df, depths_subset_adj_sample_df, sample_name)

    # per-gene mutation density
    mutdensity_genes_df = mutdensity_gene(maf_df, depths_subset_df, depths_subset_adj_df, sample_name)

    mutdensity_df = pd.concat([mutdensity_sample_df, mutdensity_genes_df])

    mutdensity_df["SAMPLE_ID"] = sample_name
    mutdensity_df["REGIONS"] = panel_v

    # Save
    mutdensity_df[["SAMPLE_ID", "GENE", "REGIONS", "MUTTYPES",
                   "DEPTH",
                   "N_MUTS", "N_MUTATED",
                   "MUTDENSITY_MB", "MUTDENSITY_MB_ADJUSTED",
                   "MUTREADSRATE_MB", "MUTREADSRATE_MB_ADJUSTED",
                   ]].to_csv(f"{sample_name}.{panel_v}.mutdensities.tsv",
                             sep="\t", header=True, index=False)


@click.command()
@click.option('--maf_path', type=click.Path(exists=True), required=True, help='Path to the MAF file.')
@click.option('--depths_path', type=click.Path(exists=True), required=True, help='Path to the depths file.')
@click.option('--annot_panel_path', type=click.Path(exists=True), required=True, help='Path to the annotation panel file.')
@click.option('--sample_name', type=str, required=True, help='Sample name.')
@click.option('--panel_version', type=str, required=True, help='Panel version.')
def main(maf_path, depths_path, annot_panel_path, sample_name, panel_version):
    """CLI entry point for computing mutation densities."""
    compute_mutdensity(maf_path, depths_path, annot_panel_path, sample_name, panel_version)


if __name__ == '__main__':
    main()
```
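At its core, `compute_mutdensity.py` computes `N_MUTS / DEPTH * 1e6`, i.e. mutations per megabase of sequenced duplex DNA. A toy worked example with made-up numbers (the adjusted variant is simplified here; the script additionally re-counts positions per panel annotation):

```python
# Made-up numbers: 12 mutations over a summed duplex depth of 8,000,000 bases
n_muts = 12
depth = 8_000_000

# non-adjusted mutation density, as in the script: N_MUTS / DEPTH * 1e6
mutdensity_mb = n_muts / depth * 1_000_000
print(mutdensity_mb)  # ~1.5 mutations per Mb

# adjusted: each position's depth is divided by 3, since a position can
# contribute to three different single-nucleotide changes
depth_adjusted = depth / 3
mutdensity_mb_adjusted = n_muts / depth_adjusted * 1_000_000
print(mutdensity_mb_adjusted)  # ~4.5 mutations per Mb
```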

0 commit comments
