Commit c36625c

Merge pull request #949 from d4straub/add-metadata-filter
Add metadata filter
2 parents bcefe64 + 1daf72d commit c36625c

10 files changed: +99 -123 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### `Added`

 - [#948](https://github.com/nf-core/ampliseq/pull/948) - Decontam as optional decontamination tool.
+- [#949](https://github.com/nf-core/ampliseq/pull/949) - The dataset can be filtered for downstream analysis with the metadata sheet, for example to remove negative control samples meant for Decontam.

 ### `Changed`

conf/test_failed.config

Lines changed: 0 additions & 1 deletion
@@ -37,7 +37,6 @@ params {
     ignore_failed_filtering = true

     //this is to remove low abundance ASVs to reduce runtime of downstream processes
-    min_samples = 2
     min_frequency = 10

     // Skipping steps

docs/usage.md

Lines changed: 10 additions & 0 deletions
@@ -298,10 +298,20 @@ For example, the tab-separated `regions_multiregion.tsv` may contain:
 | region4 | GGAGCATGTGGWTTAATTCGA | CGTTGCGGGACTTAACCC | 115 |
 | region5 | GGAGGAAGGTGGGGATGAC | AAGGCCCGGGAACGTATT | 150 |

+> [!WARNING]
+> Several downstream filtering options are not allowed or disabled when analysing multi region data.
+> Disabled functions are any ASV postprocessing/filtering options that require sequences and also no
+> sample subsetting using the metadata sheet is available (i.e. if provided, the metadata sheet has
+> to include all samples that pass preprocessing).
+
 ### Metadata

 Metadata is optional, but for performing downstream analysis such as barplots, diversity indices or differential abundance testing, a metadata file is essential.

+> [!TIP]
+> The metadata defines what samples are entering downstream analysis. For example, when having negative controls in the samplesheet,
+> those can be omitted in the metadata sheet and will not enter downstream analysis with QIIME2.
+
 ```bash
 --metadata "path/to/metadata.tsv"
 ```
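
The tip added in this hunk says that samples left out of the metadata sheet do not enter downstream analysis. As a rough illustration only, the following Python sketch builds such a metadata sheet from a list of sample IDs while leaving out negative controls. The sample IDs, the `treatment` column and its values, and the helper `build_metadata_tsv` are all hypothetical; only the `ID` header for the first column comes from the parameter help text in this commit.

```python
import csv
import io

# Hypothetical sample IDs as they might appear in the samplesheet.
samplesheet_ids = ["sampleA", "sampleB", "negctrl1", "negctrl2"]
# Hypothetical negative controls that should stay out of downstream analysis.
negative_controls = {"negctrl1", "negctrl2"}
# Hypothetical grouping column; the values are made up for illustration.
treatment = {"sampleA": "a", "sampleB": "b"}


def build_metadata_tsv(sample_ids, excluded, groups):
    """Return tab-separated metadata text that lists only the kept samples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "treatment"])  # first column holds the sample ID
    for sample_id in sample_ids:
        if sample_id not in excluded:
            writer.writerow([sample_id, groups[sample_id]])
    return buffer.getvalue()


metadata_tsv = build_metadata_tsv(samplesheet_ids, negative_controls, treatment)
print(metadata_tsv, end="")
```

The controls remain in the samplesheet, so they are still available to preprocessing; they are simply absent from the metadata sheet that is passed with `--metadata`.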

modules/local/filter_samples.nf

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+process FILTER_SAMPLES {
+    label 'process_single'
+
+    conda "conda-forge::r-base=4.2.1"
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/r-base:4.2.1' :
+        'biocontainers/r-base:4.2.1' }"
+
+    input:
+    path(metadata, stageAs: 'input/*')
+    path(table, stageAs: 'input/*')
+
+    output:
+    path("metadata.tsv"), emit: metadata
+    path("table.tsv")   , emit: abundances
+    path("*.log")       , emit: log, optional: true
+    path "versions.yml" , emit: versions
+
+    script:
+    """
+    #!/usr/bin/env Rscript
+
+    # first column in meta has sample id
+    meta <- read.table( "$metadata", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
+    # column names are sample ids, but first column is asv id
+    abund <- read.table( "$table", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
+
+    # samples that arent in both files are dropped
+    meta_filtered <- meta[meta[,1] %in% colnames(abund)[2:length(colnames(abund))],]
+    abund_filtered <- abund[,colnames(abund) %in% c( colnames(abund)[1], meta[,1] ) ]
+
+    # write filtered data
+    write.table(meta_filtered, file = "metadata.tsv", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '', sep = "\t")
+    write.table(abund_filtered, file = "table.tsv", row.names = FALSE, col.names = TRUE, quote = FALSE, na = '', sep = "\t")
+
+    # error in case all samples were removed
+    if ( nrow(meta_filtered) == 0 ) {
+        stop("All samples were removed. That means no overlap between the metadata sample IDs and the abundance table sample IDs was found. Make sure that sample IDs match.")
+    }
+
+    # this is in case some samples were lost during preprocessing, i.e. samples in metadata but not in abundance table
+    if ( nrow(meta) > nrow(meta_filtered) ) {
+        log_message = paste("The metadata file rows were reduced from", nrow(meta), "to", nrow(meta_filtered),", because some samples were missing in the abundance table")
+        write.table(log_message, file = paste0(log_message,".log"), row.names = FALSE, col.names = FALSE, quote = FALSE)
+    }
+    # this is in case some samples were not in metadata, i.e. only a subset of samples is entering downstream analysis
+    if ( ncol(abund) > ncol(abund_filtered) ) {
+        log_message = paste("Samples in the abundance file were reduced from", ncol(abund)-1, "to", ncol(abund_filtered)-1,", because the metadata did not contain all samples in the abundance table")
+        write.table(log_message, file = paste0(log_message,".log"), row.names = FALSE, col.names = FALSE, quote = FALSE)
+    }
+
+    # versions
+    writeLines(c("\\"${task.process}\\":", paste0(" R: ", paste0(R.Version()[c("major","minor")], collapse = ".")) ), "versions.yml")
+    """
+}
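
The R script in `FILTER_SAMPLES` keeps only samples present in both the metadata sheet and the abundance table, stops if nothing overlaps, and notes when either side was reduced. The following Python sketch mirrors that logic on small in-memory data; it is not part of the commit. The function name `filter_samples`, the toy values, and the shortened messages are assumptions made for illustration, and it returns the notes as a list rather than writing `.log` files.

```python
def filter_samples(metadata_rows, abundance_header, abundance_rows):
    """Keep only samples found in both the metadata and the abundance table.

    metadata_rows: list of rows whose first field is the sample ID.
    abundance_header: first field names the ASV ID column, the rest are sample IDs.
    abundance_rows: rows laid out like abundance_header.
    """
    table_samples = set(abundance_header[1:])
    meta_samples = {row[0] for row in metadata_rows}

    # Metadata rows survive only if their sample is a column in the table.
    meta_filtered = [row for row in metadata_rows if row[0] in table_samples]
    # The ASV ID column always stays; sample columns stay if listed in the metadata.
    keep = [i for i, name in enumerate(abundance_header)
            if i == 0 or name in meta_samples]
    header_filtered = [abundance_header[i] for i in keep]
    rows_filtered = [[row[i] for i in keep] for row in abundance_rows]

    if not meta_filtered:
        raise ValueError("All samples were removed: no sample ID overlap found.")

    messages = []
    if len(metadata_rows) > len(meta_filtered):
        messages.append(
            f"metadata rows reduced from {len(metadata_rows)} to {len(meta_filtered)}"
        )
    if len(abundance_header) > len(header_filtered):
        messages.append(
            f"abundance samples reduced from {len(abundance_header) - 1} "
            f"to {len(header_filtered) - 1}"
        )
    return meta_filtered, header_filtered, rows_filtered, messages


# Toy data: s3 is missing from the metadata, s4 is missing from the table.
metadata = [["s1", "a"], ["s2", "b"], ["s4", "b"]]
header = ["ASV_ID", "s1", "s2", "s3"]
counts = [["asv1", 10, 0, 5], ["asv2", 3, 7, 1]]

meta_f, header_f, counts_f, notes = filter_samples(metadata, header, counts)
print(header_f)
print(meta_f)
print(notes)
```

In this toy case both notes are produced: `s4` is dropped from the metadata because it has no column in the table, and `s3` is dropped from the table because the metadata does not list it.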

nextflow_schema.json

Lines changed: 3 additions & 3 deletions
@@ -18,7 +18,7 @@
     "pattern": "^\\S+\\.(tsv|csv|yml|yaml|txt)$",
     "fa_icon": "fas fa-dna",
     "description": "Path to tab-separated sample sheet",
-    "help_text": "Path to sample sheet, either tab-separated (.tsv), comma-separated (.csv), or in YAML format (.yml/.yaml), that points to compressed fastq files.\n\nThe sample sheet must have two to four tab-separated columns/entries with the following headers: \n- `sampleID` (required): Unique sample IDs, must start with a letter, and can only contain letters, numbers or underscores\n- `forwardReads` (required): Paths to (forward) reads zipped FastQ files\n- `reverseReads` (optional): Paths to reverse reads zipped FastQ files, required if the data is paired-end\n- `run` (optional): If the data was produced by multiple sequencing runs, any string\n\nRelated parameters are:\n- `--pacbio` and `--iontorrent` if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)\n- `--single_end` if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)\n- Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)",
+    "help_text": "Path to sample sheet, either tab-separated (.tsv), comma-separated (.csv), or in YAML format (.yml/.yaml), that points to compressed fastq files.\n\nThe sample sheet must have at least two entries, the required headers are: \n- `sampleID` (required): Unique sample IDs, must start with a letter, and can only contain letters, numbers or underscores\n- `forwardReads` (required): Paths to (forward) reads zipped FastQ files\n\nOptional headers are: `reverseReads`, `run`, `control`, `quant_reading`; more details are in the usage documentation.\n\nRelated parameters are:\n- `--pacbio` and `--iontorrent` if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)\n- `--single_end` if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)\n- Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)",
     "schema": "assets/schema_input.json"
 },
 "input_fasta": {
@@ -35,7 +35,7 @@
     "format": "directory-path",
     "fa_icon": "fas fa-dna",
     "description": "Path to folder containing zipped FastQ files",
-    "help_text": "Path to folder containing compressed fastq files.\n\nExample for input data organization from one sequencing run with two samples, paired-end data:\n\n```bash\ndata\n \u251c\u2500sample1_1_L001_R1_001.fastq.gz\n \u251c\u2500sample1_1_L001_R2_001.fastq.gz\n \u251c\u2500sample2_1_L001_R1_001.fastq.gz\n \u2514\u2500sample2_1_L001_R2_001.fastq.gz\n```\n\nPlease note the following requirements:\n\n1. The path must be enclosed in quotes\n2. The folder must contain gzip compressed demultiplexed fastq files. If the file names do not follow the default (`\"/*_R{1,2}_001.fastq.gz\"`), please check `--extension`.\n3. Sample identifiers are extracted from file names, i.e. the string before the first underscore `_`, these must be unique\n4. If your data is scattered, produce a sample sheet\n5. All sequencing data should originate from one sequencing run, because processing relies on run-specific error models that are unreliable when data from several sequencing runs are mixed. Sequencing data originating from multiple sequencing runs requires additionally the parameter `--multiple_sequencing_runs` and a specific folder structure.\n\nRelated parameters are:\n- `--pacbio` and `--iontorrent` if the sequencing data is PacBio data or IonTorrent data (default expected: paired-end Illumina data)\n- `--single_end` if the sequencing data is single-ended Illumina data (default expected: paired-end Illumina data)\n- `--multiple_sequencing_runs` if the sequencing data originates from multiple sequencing runs\n- `--extension` if the sequencing file names do not follow the default (`\"/*_R{1,2}_001.fastq.gz\"`)\n- Choose an appropriate reference taxonomy for the type of amplicon (16S/18S/ITS/CO1) (default: DADA2 assignTaxonomy and 16S rRNA sequence database)"
+    "help_text": "Path to folder containing compressed fastq files. Sample identifiers are extracted from file names, i.e. the string before the first underscore `_`, these must be unique. Examples and requirements are in the usage documentation.\n\nRelated parameters: `--extension`, `--multiple_sequencing_runs`, `--pacbio`, `--iontorrent`, `--single_end`."
 },
 "FW_primer": {
     "type": "string",
@@ -53,7 +53,7 @@
     "type": "string",
     "format": "file-path",
     "description": "Path to metadata sheet, when missing most downstream analysis are skipped (barplots, PCoA plots, ...).",
-    "help_text": "This is optional, but for performing downstream analysis such as barplots, diversity indices or differential abundance testing, a metadata file is essential.\n\nRelated parameter:\n- `--metadata_category` (optional) to choose columns that are used for testing significance\n\nFor example:\n\n```bash\n--metadata \"path/to/metadata.tsv\"\n```\n\nPlease note the following requirements:\n\n1. The path must be enclosed in quotes\n2. The metadata file has to follow the QIIME2 specifications (https://docs.qiime2.org/2021.2/tutorials/metadata/)\n\nThe first column in the tab-separated metadata file is the sample identifier column (required header: `ID`) and defines the sample or feature IDs associated with your study. In addition to the sample identifier column, the metadata file is required to have at least one column with multiple different non-numeric values but not all unique.\n**NB**: without additional columns there might be no groupings for the downstream analyses.\n\nSample identifiers should be 36 characters long or less, and also contain only ASCII alphanumeric characters (i.e. in the range of [a-z], [A-Z], or [0-9]), or the dash (-) character. For downstream analysis, by default all numeric columns, blanks or NA are removed, and only columns with multiple different values but not all unique are selected.\n\nThe columns which are to be assessed can be specified by `--metadata_category`. If `--metadata_category` isn't specified than all columns that fit the specification are automatically chosen.",
+    "help_text": "This is optional, but for performing downstream analysis such as barplots, diversity indices or differential abundance testing, a metadata file is essential.\n\nRelated parameter:\n- `--metadata_category` (optional) to choose columns that are used for testing significance\n\nFor example:\n\n```bash\n--metadata \"path/to/metadata.tsv\"\n```\n\nThe first column in the tab-separated metadata file is the sample identifier column (required header: `ID`) and defines the sample or feature IDs associated with your study. More details are in the usage documentation.",
     "fa_icon": "fas fa-file-csv"
 },
 "multiregion": {
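
The shortened help text for the FastQ folder input keeps one rule: sample identifiers are the part of each file name before the first underscore, and they must be unique. A minimal Python sketch of that rule follows. The helper names `sample_id_from_filename` and `duplicated_ids` are hypothetical, the pipeline's own implementation is not shown in this commit, and the file names are taken from the example in the removed help text.

```python
from collections import Counter
from pathlib import PurePosixPath


def sample_id_from_filename(path):
    """Return the part of the file name before the first underscore."""
    return PurePosixPath(path).name.split("_", 1)[0]


def duplicated_ids(forward_read_files):
    """Return sample IDs that are derived from more than one forward-read file."""
    ids = [sample_id_from_filename(f) for f in forward_read_files]
    return sorted(name for name, count in Counter(ids).items() if count > 1)


# Forward-read file names from the help text example, under a "data" folder.
forward_reads = [
    "data/sample1_1_L001_R1_001.fastq.gz",
    "data/sample2_1_L001_R1_001.fastq.gz",
]

print([sample_id_from_filename(f) for f in forward_reads])
print(duplicated_ids(forward_reads))
```

Only forward reads are checked here, since a paired-end sample naturally has two files that share one identifier.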

tests/default.nf.test.snap

Lines changed: 3 additions & 0 deletions
Some generated files are not rendered by default.
