Commit 7b0fd9b

Merge pull request #315 from bbglab/dev

2 parents: 72dc4d9 + cb18adc

127 files changed: +1777 additions, −1881 deletions


.markdownlint.json

Lines changed: 6 additions & 0 deletions

```json
{
  "MD013": false,
  "MD024": {
    "siblings_only": true
  }
}
```

README.md

Lines changed: 25 additions & 45 deletions
````diff
@@ -1,12 +1,10 @@
+# deepCSA
+
 ## Introduction
 
-**bbglab/deepCSA** is a bioinformatics pipeline that can be used for analyzing targeted DNA sequencing data. It was designed for duplex sequencing data of normal tissues.
+**bbglab/deepCSA** is a bioinformatics pipeline for analyzing the clonal structure information from targeted DNA sequencing data. It was designed for duplex sequencing data of normal tissues.
 
-<!-- TODO nf-core:
-Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-major pipeline sections and the types of output it produces. You're giving an overview to someone new
-to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+![deepCSA workflow overview](docs/images/deepCSA.png)
 
 <!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
 workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
@@ -15,24 +13,21 @@
 <!-- 1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
 2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/)) -->
 
-
 ## Usage
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-Explain what rows and columns represent. For instance (please edit as appropriate):
-
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+sample,vcf,bam
+sample1,sample1.high.filtered.vcf,sample1.sorted.bam
+sample2,sample2.high.filtered.vcf,sample2.sorted.bam
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
+Each row represents a single sample, with a single-sample VCF containing the mutations called in that sample and the BAM file that was used to produce those variant calls. The mutations are obtained from the VCF, and the BAM file is used to compute the sequencing depth at each position for the downstream analysis.
 
--->
+**Make sure that you do not use any '.' in your sample names, and use text-like names for the samples; avoid names that are only numbers.** The numeric case should be handled properly, but string-like names ensure consistency.
 
 Now, you can run the pipeline using:
 
@@ -41,42 +36,36 @@ Now, you can run the pipeline using:
 ```bash
 git clone https://github.com/bbglab/deepCSA.git
 cd deepCSA
-nextflow run main.nf --outdir <OUTDIR> -profile singularity,<DESIRED PROFILE>
-```
-The input should be provided by the `--input` option but it is more recommended to define it within a given profile.
-
-Internally also use the -work-dir option:
-```
--work-dir /workspace/nobackup2/work/deepCSA/<NAME>
+nextflow run main.nf --outdir <OUTDIR> -profile singularity,<DESIRED PROFILE> --input samplesheet.csv
 ```
 
-Also put the following content in an executor.config provided as `-c executor.config`
-```
-process {
-    executor = 'slurm'
-    errorStrategy = 'retry'
-    maxRetries = 2
-}
-```
+The input can be provided with the `--input` option, but it is recommended to define this and all the other parameters in a parameter file that is passed to the pipeline to run the analysis with the specified configuration.
 
+### Warning
 
-:::warning
-Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
-provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
+Please provide pipeline parameters via the Nextflow `-params-file` option or the CLI. Custom config files, including those
+provided by the `-c` Nextflow option, can be used to provide any configuration _**except for parameters**_;
 see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
-:::
 
 ## Credits
 
 bbglab/deepCSA was originally written by Ferriol Calvet.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+* @rblancomi
+* @FedericaBrando
+* @koszulordie
+* @St3451
+* @AxelRosendahlHuber
+* @andrianovam
+* @migrau
 
+<!-- TODO
 ## Contributions and Support
 
 If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
+-->
 
 ## Citations
 
@@ -95,15 +84,6 @@ This pipeline uses code and infrastructure developed and maintained by the [nf-c
 >
 > _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).
 
+## Documentation
 
-
-## (temporary) Documentation
-
-### Custom TSV regions annotation file should contain this
-Document this custom_regions has to be a TSV file with the following columns:
-* chromosome start end gene_name impactful_mutations [neutral_impact] [new_impact]
-* chromosome start and end indicate the region that is being customized
-* gene_name : is the name of the region that is being added, make sure that it does not coincide with the name of any other gene.
-* impactful_mutations : is a comma-separated list of SNVs that need to be labelled with the value indicated in new_impact, format: chr5_1294991_C/T, with pyrimidine based definition
-* neutral_impact : (optional, default; synonymous)
-* new_impact : (optional, default: missense) is the impact that the mutations listed in impactful_mutations will receive.
+Find the documentation ([link to docs](https://github.com/bbglab/deepCSA/tree/main/docs)).
````
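The samplesheet format introduced in the updated README (one row per sample, with its VCF and BAM) is simple enough to generate programmatically. Below is a minimal sketch with hypothetical file names; it also enforces the no-dots rule on sample names that the README warns about:

```python
import csv

# Hypothetical sample records; names are text-like and contain no '.'
samples = [
    ("sample1", "sample1.high.filtered.vcf", "sample1.sorted.bam"),
    ("sample2", "sample2.high.filtered.vcf", "sample2.sorted.bam"),
]

with open("samplesheet.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sample", "vcf", "bam"])  # header expected by deepCSA
    for name, vcf, bam in samples:
        # guard against sample names that could break downstream parsing
        assert "." not in name, f"sample name {name!r} contains '.'"
        writer.writerow([name, vcf, bam])
```

The resulting file matches the `--input samplesheet.csv` invocation shown in the README.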

assets/useful_scripts/deepcsa_maf2samplevcfs.py

Lines changed: 20 additions & 2 deletions
```diff
@@ -10,12 +10,30 @@
 ## If your sample names are NOT in a column called SAMPLE_ID,
 ## you can use the --sample-name-column option to specify it.
 
-# if the maf is from deepCSA, use this one, otherwise use the one below
+# if the maf is from deepCSA, use this one
 # usage: python deepcsa_maf2samplevcfs.py --mutations-file all_samples.somatic.mutations.tsv --output-dir ./test/ --maf-from-deepcsa
 
-# if the maf file is not from deepCSA, use this below
+# if the maf file is not from deepCSA, use this one
 # usage: python deepcsa_maf2samplevcfs.py --mutations-file all_samples.somatic.mutations.tsv --output-dir ./test/
 
+
+
+#######
+# Mandatory columns in input mutations:
+#######
+
+# if the maf is from deepCSA, it must contain the following columns, as they were originally generated
+# ['CHROM', 'POS', 'REF', 'ALT', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE']
+
+# if the maf file is not from deepCSA, then it MUST contain the following columns
+# ['CHROM', 'POS', 'REF', 'ALT', 'DEPTH', 'ALT_DEPTH']
+# where:
+# DEPTH indicates the total number of duplex reads sequenced at the position where the mutation occurs
+# ALT_DEPTH indicates the total number of duplex reads supporting the variant at the same position
+
+
+
+
 import click
 import pandas as pd
 
```
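The mandatory-column lists documented in the script header lend themselves to an explicit check before any processing. A minimal sketch; the helper `missing_maf_columns` is hypothetical and not part of the script:

```python
import pandas as pd

# Column requirements documented in the deepcsa_maf2samplevcfs.py header
DEEPCSA_COLS = ["CHROM", "POS", "REF", "ALT", "FILTER", "INFO", "FORMAT", "SAMPLE"]
GENERIC_COLS = ["CHROM", "POS", "REF", "ALT", "DEPTH", "ALT_DEPTH"]

def missing_maf_columns(maf_df, maf_from_deepcsa):
    """Return the mandatory columns absent from the input MAF."""
    required = DEEPCSA_COLS if maf_from_deepcsa else GENERIC_COLS
    return [col for col in required if col not in maf_df.columns]

# Toy generic MAF that lacks ALT_DEPTH
maf = pd.DataFrame({"CHROM": ["chr1"], "POS": [100], "REF": ["C"],
                    "ALT": ["T"], "DEPTH": [5000]})
print(missing_maf_columns(maf, maf_from_deepcsa=False))  # ['ALT_DEPTH']
```

Failing fast on a missing `DEPTH` or `ALT_DEPTH` column gives a clearer error than a `KeyError` deep inside the conversion.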

bin/compute_mutdensity.py

Lines changed: 200 additions & 0 deletions
```python
#!/usr/bin/env python

"""
Mutation density computation script.
Mutation density is a metric that quantifies the number of mutations per megabase (Mb) of sequenced DNA.
This script computes mutation density for a given sample, all genes, and a list of consequence type groups.
It calculates mutation density per Mb, both adjusted and non-adjusted by the number of sites available for each consequence type.
The results are saved to a TSV file.
"""

import click
import pandas as pd

from read_utils import custom_na_values

# TODO: bump pandas to 2.2.3

# -- Auxiliary functions -- #

# False stands for "all mutation types together"
MUTDENSITY_IMPACT_GROUPS = [False, ["SNV"], ["INSERTION", "DELETION"], ["SNV", "INSERTION", "DELETION"]]


def mutdensity_sample(maf_df, depths_df, depths_adj_df, sample_name):
    """
    Compute a sample's global mutation density. Return the mutation density
    per Mb, non-adjusted and adjusted by panel composition.
    """
    impact_group_results = []

    # mutation density depth information
    sample_features_depth = {
        "DEPTH": depths_df.drop_duplicates(subset=["CHROM", "POS"])[sample_name].sum(),
        "DEPTH_ADJUSTED": depths_adj_df[sample_name].sum(),
    }

    for type_list in MUTDENSITY_IMPACT_GROUPS:
        # filter by mutation type according to type_list
        if not type_list:
            unique_maf = maf_df[["SAMPLE_ID", "MUT_ID", "ALT_DEPTH"]].drop_duplicates()
            types_included = "all_types"
        else:
            unique_maf = maf_df[maf_df["TYPE"].isin(type_list)][["SAMPLE_ID", "MUT_ID", "ALT_DEPTH"]].copy().drop_duplicates()
            types_included = "-".join(sorted(type_list))

        # count the number of mutations and mutated reads in the sample,
        # making sure to count each mutation only once (avoid annotation issues)
        n_muts = unique_maf.shape[0]
        n_muts_per_sample = unique_maf.groupby(by=["SAMPLE_ID", "MUT_ID"]).agg({"ALT_DEPTH": "sum"}).reset_index()
        n_mutated_reads = n_muts_per_sample["ALT_DEPTH"].sum()
        print(n_muts, n_mutated_reads)

        # mutation density metrics
        sample_features = dict(sample_features_depth)
        sample_features["N_MUTS"] = n_muts
        sample_features["N_MUTATED"] = n_mutated_reads

        sample_features["MUTDENSITY_MB"] = float(sample_features["N_MUTS"] / sample_features["DEPTH"] * 1000000)
        sample_features["MUTDENSITY_MB_ADJUSTED"] = float(sample_features["N_MUTS"] / sample_features["DEPTH_ADJUSTED"] * 1000000)
        sample_features["MUTREADSRATE_MB"] = float(sample_features["N_MUTATED"] / sample_features["DEPTH"] * 1000000)
        sample_features["MUTREADSRATE_MB_ADJUSTED"] = float(sample_features["N_MUTATED"] / sample_features["DEPTH_ADJUSTED"] * 1000000)

        sample_features["GENE"] = "ALL_GENES"
        sample_features["MUTTYPES"] = types_included

        impact_group_results.append(pd.DataFrame([sample_features]))

    # concatenate results for all impact groups
    return pd.concat(impact_group_results)


def mutdensity_gene(maf_df, depths_df, depths_adj_df, sample_name):
    """
    Compute each gene's mutation density. Return the mutation density
    per Mb sequenced, both non-adjusted and adjusted by panel composition.
    """
    impact_group_results = []

    for type_list in MUTDENSITY_IMPACT_GROUPS:
        # filter by mutation type according to type_list
        if not type_list:
            unique_maf = maf_df[["SAMPLE_ID", "GENE", "MUT_ID", "ALT_DEPTH"]].drop_duplicates()
            types_included = "all_types"
        else:
            unique_maf = maf_df[maf_df["TYPE"].isin(type_list)][["SAMPLE_ID", "GENE", "MUT_ID", "ALT_DEPTH"]].copy().drop_duplicates()
            types_included = "-".join(sorted(type_list))

        # count the number of mutations and mutated reads per gene,
        # making sure to count each mutation only once (avoid annotation issues)
        n_muts_gene = unique_maf.groupby(by=["GENE"]).agg({"ALT_DEPTH": "count"})
        n_muts_gene.columns = ["N_MUTS"]

        n_mutated_reads = unique_maf.groupby(by=["GENE"]).agg({"ALT_DEPTH": "sum"})
        n_mutated_reads.columns = ["N_MUTATED"]

        depths_gene_df = depths_df.groupby("GENE").agg({sample_name: "sum"})
        depths_gene_df.columns = ["DEPTH"]
        depths_adj_gene_df = depths_adj_df.groupby("GENE").agg({sample_name: "sum"})
        depths_adj_gene_df.columns = ["DEPTH_ADJUSTED"]

        mut_rate_mut_reads_df = n_muts_gene.merge(n_mutated_reads, on="GENE")
        depths_depthsadj_gene_df = depths_gene_df.merge(depths_adj_gene_df, on="GENE")
        # left merge so that mutation density is computed even when the number of mutations is NA (meaning zero)
        mut_depths_df = depths_depthsadj_gene_df.merge(mut_rate_mut_reads_df, on="GENE", how="left")
        mut_depths_df = mut_depths_df.fillna(0)  # genes without mutations get N_MUTS = N_MUTATED = 0

        # mutation density metrics
        mut_depths_df["MUTDENSITY_MB"] = (mut_depths_df["N_MUTS"] / mut_depths_df["DEPTH"] * 1000000).astype(float)
        mut_depths_df["MUTDENSITY_MB_ADJUSTED"] = (mut_depths_df["N_MUTS"] / mut_depths_df["DEPTH_ADJUSTED"] * 1000000).astype(float)
        mut_depths_df["MUTREADSRATE_MB"] = (mut_depths_df["N_MUTATED"] / mut_depths_df["DEPTH"] * 1000000).astype(float)
        mut_depths_df["MUTREADSRATE_MB_ADJUSTED"] = (mut_depths_df["N_MUTATED"] / mut_depths_df["DEPTH_ADJUSTED"] * 1000000).astype(float)

        mut_depths_df["MUTTYPES"] = types_included
        impact_group_results.append(mut_depths_df.reset_index())

    # concatenate results for all impact groups
    return pd.concat(impact_group_results)


def load_n_process_inputs(maf_path, depths_path, annot_panel_path, sample_name):
    # File loading
    maf_df = pd.read_csv(maf_path, sep="\t", na_values=custom_na_values)
    depths_df = pd.read_csv(depths_path, sep="\t")
    depths_df = depths_df.drop("CONTEXT", axis=1)
    annot_panel_df = pd.read_csv(annot_panel_path, sep="\t", na_values=custom_na_values)

    # Subset depths with panel
    # mode 1: each position counts once per gene (beware: a position may be duplicated across genes)
    depths_subset_df = depths_df.merge(annot_panel_df[["CHROM", "POS", "GENE"]].drop_duplicates(),
                                       on=["CHROM", "POS"], how="inner")

    # mode 2 (adjusted): each position counts as many times as it contributes to the panel;
    # the depth per position is divided by 3 because it can contribute to three different mutations
    depths_df[sample_name] = depths_df[sample_name] / 3
    depths_subset_adj_df = depths_df.merge(annot_panel_df[["CHROM", "POS", "GENE"]], on=["CHROM", "POS"], how="inner")

    # mode 3 (adjusted): each position counts as many times as it contributes to the panel, but ONLY ONCE PER SAMPLE
    depths_subset_adj_sample_df = depths_df.merge(
        annot_panel_df.drop_duplicates(subset=["CHROM", "POS", "REF", "ALT"])[["CHROM", "POS"]],
        on=["CHROM", "POS"], how="inner")

    return maf_df, depths_subset_df, depths_subset_adj_df, depths_subset_adj_sample_df


# -- Main function -- #

def compute_mutdensity(maf_path, depths_path, annot_panel_path, sample_name, panel_v):
    """
    Compute mutation density for a given sample based on MAF, depths, and annotation panel files.
    Calculates mutation density per Mb, both adjusted and non-adjusted by the panel composition,
    and saves the results to a TSV file.
    """
    maf_df, depths_subset_df, depths_subset_adj_df, depths_subset_adj_sample_df = load_n_process_inputs(
        maf_path, depths_path, annot_panel_path, sample_name)

    # Compute mutation densities
    # sample-level mutation density
    mutdensity_sample_df = mutdensity_sample(maf_df, depths_subset_df, depths_subset_adj_sample_df, sample_name)

    # per-gene mutation density
    mutdensity_genes_df = mutdensity_gene(maf_df, depths_subset_df, depths_subset_adj_df, sample_name)

    mutdensity_df = pd.concat([mutdensity_sample_df, mutdensity_genes_df])

    mutdensity_df["SAMPLE_ID"] = sample_name
    mutdensity_df["REGIONS"] = panel_v

    # Save
    mutdensity_df[["SAMPLE_ID", "GENE", "REGIONS", "MUTTYPES",
                   "DEPTH",
                   "N_MUTS", "N_MUTATED",
                   "MUTDENSITY_MB", "MUTDENSITY_MB_ADJUSTED",
                   "MUTREADSRATE_MB", "MUTREADSRATE_MB_ADJUSTED",
                   ]].to_csv(f"{sample_name}.{panel_v}.mutdensities.tsv",
                             sep="\t", header=True, index=False)


@click.command()
@click.option('--maf_path', type=click.Path(exists=True), required=True, help='Path to the MAF file.')
@click.option('--depths_path', type=click.Path(exists=True), required=True, help='Path to the depths file.')
@click.option('--annot_panel_path', type=click.Path(exists=True), required=True, help='Path to the annotation panel file.')
@click.option('--sample_name', type=str, required=True, help='Sample name.')
@click.option('--panel_version', type=str, required=True, help='Panel version.')
def main(maf_path, depths_path, annot_panel_path, sample_name, panel_version):
    """CLI entry point for computing mutation densities."""
    compute_mutdensity(maf_path, depths_path, annot_panel_path, sample_name, panel_version)


if __name__ == '__main__':
    main()
```
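At its core, `compute_mutdensity.py` computes `N_MUTS / DEPTH * 1e6`, i.e. mutations per megabase of sequenced duplex DNA. A toy worked example with made-up numbers (the adjusted variant is simplified here; the script additionally re-counts positions per panel annotation):

```python
# Made-up numbers: 12 mutations over a summed duplex depth of 8,000,000 bases
n_muts = 12
depth = 8_000_000

# non-adjusted mutation density, as in the script: N_MUTS / DEPTH * 1e6
mutdensity_mb = n_muts / depth * 1_000_000
print(mutdensity_mb)  # ~1.5 mutations per Mb

# adjusted: each position's depth is divided by 3, since a position can
# contribute to three different single-nucleotide changes
depth_adjusted = depth / 3
mutdensity_mb_adjusted = n_muts / depth_adjusted * 1_000_000
print(mutdensity_mb_adjusted)  # ~4.5 mutations per Mb
```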

0 commit comments
