nf-core
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 1 addition & 0 deletions
diff --git a/conf/test_failed.config b/conf/test_failed.config
Lines changed: 0 additions & 1 deletion
diff --git a/docs/usage.md b/docs/usage.md
Lines changed: 10 additions & 0 deletions
diff --git a/modules/local/filter_samples.nf b/modules/local/filter_samples.nf
Lines changed: 55 additions & 0 deletions
diff --git a/nextflow_schema.json b/nextflow_schema.json
Lines changed: 3 additions & 3 deletions
diff --git a/tests/default.nf.test.snap b/tests/default.nf.test.snap
Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### `Added`
 
 - [#948](https://github.com/nf-core/ampliseq/pull/948) - Decontam as optional decontamination tool.
+- [#949](https://github.com/nf-core/ampliseq/pull/949) - The dataset can be filtered for downstream analysis with the metadata sheet, for example to remove negative control samples meant for Decontam.
 
 ### `Changed`
 
 
@@ -37,7 +37,6 @@ params {
     ignore_failed_filtering = true
 
     //this is to remove low abundance ASVs to reduce runtime of downstream processes
-    min_samples = 2
     min_frequency = 10
 
     // Skipping steps
 
@@ -298,10 +298,20 @@ For example, the tab-separated `regions_multiregion.tsv` may contain:
 | region4 | GGAGCATGTGGWTTAATTCGA | CGTTGCGGGACTTAACCC   | 115           |
 | region5 | GGAGGAAGGTGGGGATGAC   | AAGGCCCGGGAACGTATT   | 150           |
 
+> [!WARNING]
+> Several downstream filtering options are disabled when analysing multi-region data.
+> This affects all ASV post-processing and filtering options that require sequences. Sample
+> subsetting using the metadata sheet is also unavailable (i.e. if provided, the metadata sheet has
+> to include all samples that pass preprocessing).
+
 ### Metadata
 
 Metadata is optional, but for performing downstream analysis such as barplots, diversity indices or differential abundance testing, a metadata file is essential.
 
+> [!TIP]
+> The metadata sheet defines which samples enter downstream analysis. For example, negative controls listed in the samplesheet
+> can be omitted from the metadata sheet and will then not enter downstream analysis with QIIME2.
+
 ```bash
 --metadata "path/to/metadata.tsv"
 ```
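
To illustrate the tip above, a metadata sheet that leaves out negative controls could be generated along the following lines. This is a minimal Python sketch and not part of the pipeline: the sample IDs, the `condition` column, and the helper names `build_metadata_rows` and `to_tsv` are hypothetical. Only the `ID` header of the first column is taken from the pipeline's parameter documentation.

```python
import csv
import io


def build_metadata_rows(samples, controls):
    """Return metadata rows for all samples that are not negative controls."""
    return [
        {"ID": sample_id, "condition": condition}
        for sample_id, condition in samples
        if sample_id not in controls
    ]


def to_tsv(rows):
    """Serialise rows as a tab-separated sheet with `ID` as the first column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["ID", "condition"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical samplesheet content as (sampleID, condition) pairs.
samples = [
    ("sampleA", "treated"),
    ("sampleB", "untreated"),
    ("neg_ctrl1", "blank"),
]
# Hypothetical set of negative controls that should not enter downstream analysis.
controls = {"neg_ctrl1"}

metadata_tsv = to_tsv(build_metadata_rows(samples, controls))
print(metadata_tsv)
```

The resulting text could be saved as `metadata.tsv` and passed with `--metadata`. The negative control stays in the samplesheet, so it is still processed upstream, but it is absent from the metadata sheet.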
 
@@ -0,0 +1,55 @@
+process FILTER_SAMPLES {
+    label 'process_single'
+
+    conda "conda-forge::r-base=4.2.1"
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/r-base:4.2.1' :
+        'biocontainers/r-base:4.2.1' }"
+
+    input:
+    path(metadata, stageAs: 'input/*')
+    path(table, stageAs: 'input/*')
+
+    output:
+    path("metadata.tsv"), emit: metadata
+    path("table.tsv")   , emit: abundances
+    path("*.log")       , emit: log, optional: true
+    path "versions.yml" , emit: versions
+
+    script:
+    """
+    #!/usr/bin/env Rscript
+
+    # first column in meta has sample id
+    meta <- read.table( "$metadata", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
+    # column names are sample ids, but first column is asv id
+    abund <- read.table( "$table", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
+
+    # samples that are not in both files are dropped
+    meta_filtered <- meta[meta[,1] %in% colnames(abund)[2:length(colnames(abund))],]
+    abund_filtered <- abund[,colnames(abund) %in% c( colnames(abund)[1], meta[,1] ) ]
+
+    # write filtered data
+    write.table(meta_filtered, file = "metadata.tsv", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '', sep = "\t")
+    write.table(abund_filtered, file = "table.tsv", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '', sep = "\t")
+
+    # error in case all samples were removed
+    if ( nrow(meta_filtered) == 0 ) {
+        stop("All samples were removed. That means no overlap between the metadata sample IDs and the abundance table sample IDs was found. Make sure that sample IDs match.")
+    }
+
+    # this is in case some samples were lost during preprocessing, i.e. samples in metadata but not in abundance table
+    if ( nrow(meta) > nrow(meta_filtered) ) {
+        log_message = paste("The metadata file rows were reduced from", nrow(meta), "to", nrow(meta_filtered),", because some samples were missing in the abundance table")
+        write.table(log_message, file = paste0(log_message,".log"), row.names = FALSE, col.names = FALSE, quote = FALSE)
+    }
+    # this is in case some samples were not in metadata, i.e. only a subset of samples is entering downstream analysis
+    if ( ncol(abund) > ncol(abund_filtered) ) {
+        log_message = paste("Samples in the abundance file were reduced from", ncol(abund)-1, "to", ncol(abund_filtered)-1,", because the metadata did not contain all samples in the abundance table")
+        write.table(log_message, file = paste0(log_message,".log"), row.names = FALSE, col.names = FALSE, quote = FALSE)
+    }
+
+    # versions
+    writeLines(c("\\"${task.process}\\":", paste0("    R: ", paste0(R.Version()[c("major","minor")], collapse = ".")) ), "versions.yml")
+    """
+}
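
For review purposes, the intersection logic of `FILTER_SAMPLES` can be sketched outside of R. The Python snippet below is an illustrative re-implementation under simplifying assumptions (tables held as in-memory lists, no log files or `versions.yml` written). The function name `filter_samples` and the example data are hypothetical and do not appear in the PR.

```python
def filter_samples(meta_rows, abund_header, abund_rows):
    """Keep only samples present in both the metadata and the abundance table.

    meta_rows: list of rows, the first element of each row is the sample ID.
    abund_header: column names, the first one is the ASV ID column.
    abund_rows: rows aligned with abund_header.
    """
    table_samples = set(abund_header[1:])
    meta_samples = {row[0] for row in meta_rows}

    # Drop metadata rows whose sample is missing from the abundance table.
    meta_filtered = [row for row in meta_rows if row[0] in table_samples]
    if not meta_filtered:
        raise ValueError(
            "No overlap between metadata and abundance table sample IDs."
        )

    # Always keep column 0 (ASV IDs); keep sample columns listed in the metadata.
    keep = [
        i for i, name in enumerate(abund_header)
        if i == 0 or name in meta_samples
    ]
    header_filtered = [abund_header[i] for i in keep]
    rows_filtered = [[row[i] for i in keep] for row in abund_rows]
    return meta_filtered, header_filtered, rows_filtered


# Hypothetical data: s3 is missing from the metadata, s4 from the table.
meta = [["s1", "a"], ["s2", "b"], ["s4", "a"]]
header = ["ASV_ID", "s1", "s2", "s3"]
rows = [["asv1", 10, 0, 5], ["asv2", 3, 7, 1]]

meta_f, header_f, rows_f = filter_samples(meta, header, rows)
print(meta_f)
print(header_f)
print(rows_f)
```

In this example both reductions reported by the process occur at once: the metadata loses `s4` (a sample lost before the abundance table was built) and the abundance table loses `s3` (a sample deliberately left out of the metadata).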
@@ -18,7 +18,7 @@
                     "pattern": "^\\S+\\.(tsv|csv|yml|yaml|txt)$",
                     "fa_icon": "fas fa-dna",
                     "description": "Path to tab-separated sample sheet",
-                    "help_text": "Path to sample sheet, either tab-separated (.tsv), comma-separated (.csv), or in YAML format (.yml/.yaml), that points to compressed fastq files.\n\nThe sample sheet must have two to four tab-separated columns/entries with the following headers: \n- `sampleID` (required): Unique sample IDs, must start with a letter, and can only contain letters, numbers or underscores\n- `forwardReads` (required): Paths to (forward) reads zipped FastQ files\n- `reverseReads` (optional): Paths to reverse reads zipped FastQ files, required if the data is paired-end\n- `run` (optional): If the data was produced by multiple sequencing runs, any string\n\nRelated parameters are:\n- `--pacbio` and `--iontorrent` if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)\n- `--single_end` if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)\n- Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)",
+                    "help_text": "Path to sample sheet, either tab-separated (.tsv), comma-separated (.csv), or in YAML format (.yml/.yaml), that points to compressed fastq files.\n\nThe sample sheet must have at least two entries, the required headers are: \n- `sampleID` (required): Unique sample IDs, must start with a letter, and can only contain letters, numbers or underscores\n- `forwardReads` (required): Paths to (forward) reads zipped FastQ files\n\nOptional headers are: `reverseReads`, `run`, `control`, `quant_reading`; more details are in the usage documentation.\n\nRelated parameters are:\n- `--pacbio` and `--iontorrent` if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)\n- `--single_end` if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)\n- Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)",
                     "schema": "assets/schema_input.json"
                 },
                 "input_fasta": {
@@ -35,7 +35,7 @@
                     "format": "directory-path",
                     "fa_icon": "fas fa-dna",
                     "description": "Path to folder containing zipped FastQ files",
-                    "help_text": "Path to folder containing compressed fastq files.\n\nExample for input data organization from one sequencing run with two samples, paired-end data:\n\n```bash\ndata\n  \u251c\u2500sample1_1_L001_R1_001.fastq.gz\n  \u251c\u2500sample1_1_L001_R2_001.fastq.gz\n  \u251c\u2500sample2_1_L001_R1_001.fastq.gz\n  \u2514\u2500sample2_1_L001_R2_001.fastq.gz\n```\n\nPlease note the following requirements:\n\n1. The path must be enclosed in quotes\n2. The folder must contain gzip compressed demultiplexed fastq files. If the file names do not follow the default (`\"/*_R{1,2}_001.fastq.gz\"`), please check `--extension`.\n3. Sample identifiers are extracted from file names, i.e. the string before the first underscore `_`, these must be unique\n4. If your data is scattered, produce a sample sheet\n5. All sequencing data should originate from one sequencing run, because processing relies on run-specific error models that are unreliable when data from several sequencing runs are mixed. Sequencing data originating from multiple sequencing runs requires additionally the parameter `--multiple_sequencing_runs` and a specific folder structure.\n\nRelated parameters are:\n- `--pacbio` and `--iontorrent` if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)\n- `--single_end` if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)\n- `--multiple_sequencing_runs` if the sequencing data originates from multiple sequencing runs\n- `--extension` if the sequencing file names do not follow the default (`\"/*_R{1,2}_001.fastq.gz\"`)\n- Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)"
+                    "help_text": "Path to folder containing compressed fastq files. Sample identifiers are extracted from file names, i.e. the string before the first underscore `_`, these must be unique. Examples and requirements are in the usage documentation.\n\nRelated parameters: `--extension`, `--multiple_sequencing_runs`, `--pacbio`, `--iontorrent`, `--single_end`."
                 },
                 "FW_primer": {
                     "type": "string",
@@ -53,7 +53,7 @@
                     "type": "string",
                     "format": "file-path",
                     "description": "Path to metadata sheet, when missing most downstream analysis are skipped (barplots, PCoA plots, ...).",
-                    "help_text": "This is optional, but for performing downstream analysis such as barplots, diversity indices or differential abundance testing, a metadata file is essential.\n\nRelated parameter:\n- `--metadata_category` (optional) to choose columns that are used for testing significance\n\nFor example:\n\n```bash\n--metadata \"path/to/metadata.tsv\"\n```\n\nPlease note the following requirements:\n\n1. The path must be enclosed in quotes\n2. The metadata file has to follow the QIIME2 specifications (https://docs.qiime2.org/2021.2/tutorials/metadata/)\n\nThe first column in the tab-separated metadata file is the sample identifier column (required header: `ID`) and defines the sample or feature IDs associated with your study. In addition to the sample identifier column, the metadata file is required to have at least one column with multiple different non-numeric values but not all unique.\n**NB**: without additional columns there might be no groupings for the downstream analyses.\n\nSample identifiers should be 36 characters long or less, and also contain only ASCII alphanumeric characters (i.e. in the range of [a-z], [A-Z], or [0-9]), or the dash (-) character. For downstream analysis, by default all numeric columns, blanks or NA are removed, and only columns with multiple different values but not all unique are selected.\n\nThe columns which are to be assessed can be specified by `--metadata_category`. If `--metadata_category` isn't specified than all columns that fit the specification are automatically chosen.",
+                    "help_text": "This is optional, but for performing downstream analysis such as barplots, diversity indices or differential abundance testing, a metadata file is essential.\n\nRelated parameter:\n- `--metadata_category` (optional) to choose columns that are used for testing significance\n\nFor example:\n\n```bash\n--metadata \"path/to/metadata.tsv\"\n```\n\nThe first column in the tab-separated metadata file is the sample identifier column (required header: `ID`) and defines the sample or feature IDs associated with your study. More details are in the usage documentation.",
                     "fa_icon": "fas fa-file-csv"
                 },
                 "multiregion": {