R scripts useful for working with the IBA data.
- Clone the repository
git clone git@github.com:insect-biome-atlas/utils.git
cd utils- If you don't have pixi installed, install it with:
curl -fsSL https://pixi.sh/install.sh | sh- Start a pixi shell
pixi shell- Test that you can run the
clean_asv_data.Rscript:
clean_asv_data.R -hThe Rscript clean_asv_data.R can be used to clean the ASV data. The script can
remove ASV clusters present in control samples (e.g. negative controls, buffer
blanks etc.) as well as clusters that represent spike-ins. The script takes the
following arguments:
-c,--counts: Path to cluster counts file
This should be a tab-separated file with ASV clusters as rows and samples as columns. This represents the raw counts of ASV clusters prior to any filtering.
Example:
| cluster | sample1 | sample2 | sample3 |
|---|---|---|---|
| ASV_cluster1 | 10 | 0 | 5 |
| ASV_cluster3 | 0 | 0 | 15 |
| ASV_cluster3 | 0 | 10 | 0 |
-f,--filtered_counts: Path to filtered cluster counts file
This should be a tab-separated file with ASV clusters as rows and samples as columns. This represents filtered counts of ASV clusters after removing noise such as NUMTs and low abundance clusters.
Example:
| cluster | sample1 | sample2 | sample3 |
|---|---|---|---|
| ASV_cluster1 | 10 | 0 | 5 |
| ASV_cluster3 | 0 | 0 | 15 |
| ASV_cluster3 | 0 | 10 | 0 |
-t,--taxonomy: Path to cluster taxonomy file
This should be a tab-separated file with ASV ids in the first column. The
file must contain a column cluster with ASV cluster designations as well
as a column representative with values either 1 or 0 indicating whether
the ASV is a representative of the cluster.
Example:
| ASV | cluster | representative | Kingdom | Phylum | Class | ... |
|---|---|---|---|---|---|---|
| ASV1 | ASV_cluster1 | 1 | Animalia | Arthropoda | Insecta | ... |
| ASV2 | ASV_cluster1 | 0 | Animalia | Arthropoda | Insecta | ... |
| ASV3 | ASV_cluster2 | 1 | Animalia | Arthropoda | Insecta | ... |
| ASV4 | ASV_cluster3 | 1 | Animalia | Arthropoda | Insecta | ... |
-m,--metadata: Path to metadata file
This should be a tab-separated file with sample ids in the first column. The
file must contain a column that designates the type of sample allowing the
script to discriminate between true samples and controls (see options
--sample_type_column, --sample_types and --control_types below). If
spike-in clusters are to be removed, there must also be a column named
spikein_sample that contains 1 or True for samples to which spike-ins were
added.
Example:
| sample | lab_sample_type | spikein_sample |
|---|---|---|
| sample1 | sample | 0 |
| sample2 | buffer_blank | 0 |
| sample3 | pcr_neg | 0 |
| sample4 | sample | 1 |
-d,--dataset: Only process samples for a specific dataset in the metadata.
The dataset argument allows you to process a subset of samples. This requires that the metadata file contains a dataset column and that you have defined datasets for the different samples, e.g.:
| sample | lab_sample_type | spikein_sample | dataset | | sample1| sample | 0 | dataset1 | | sample2| buffer_blank | 0 | dataset1 | | sample3| pcr_neg | 0 | dataset2 | | sample4| sample | 1 | dataset2 |
--sample_type_column: Column in metadata file that contains sample type (default:lab_sample_type)
Name of column that contains the sample type. This is used to discriminate between
true samples and controls. The default value is lab_sample_type.
--sample_types: Comma-separated list of sample types (default:sample)
Comma-separated list of sample types. These are the sample types that are considered
true samples and not controls. The default value is sample.
--control_types: Comma-separated list of control types (default:buffer_blank,pcr_neg,extraction_neg,buffer_blank_art_spikes)
Comma-separated list of control types. These are the sample types that are considered
controls and which are used to remove ASV clusters based on occurrence in these
samples. The default value is buffer_blank, pcr_neg, extraction_neg, buffer_blank_art_spikes.
--control_cutoff: Threshold for removing control clusters (default: 0.05)
Clusters occurring in more than control_cutoff of control samples will be
removed. The default value is 0.05.
--spikein_cutoff: Threshold for identifying spikein clusters (default: 0.8)
Clusters occurring in more than spikein_cutoff of spikein samples will be
identified as spikeins. The default value is 0.8.
--counts_outfile: Path to output file with counts of cleaned clusters (default:cleaned_filtered_counts.tsv)
This will be a tab-separated file with ASV clusters as rows and samples as columns. This file will contain the filtered counts of ASV clusters after removing noise, controls and spike-ins.
--taxonomy_outfile: Path to output file with taxonomy of cleaned clusters (default:cleaned_cluster_taxonomy.tsv)
This will be a tab-separated file with ASV ids in the first column. The file will contain the taxonomy of ASV clusters after removing noise, controls and spike-ins.
--control_outfile: Path to output file with clusters identified in control samples
This will be a tab-separated file with ASV clusters identified in control samples in the first column and with additional columns containing read statistics and taxonomic assignments.
--spikein_outfile: Path to output file with clusters identified as spike-ins
This will be a tab-separated file with ASV clusters identified as spike-ins in the first column and with additional columns containing read statistics and taxonomic assignments.
Example of usage:
clean_asv_data.R -c cluster_counts.tsv \
-f noise_filtered_cluster_counts.tsv \
-t cluster_taxonomy.tsv \
-m metadata.tsv \
--counts_outfile cleaned_filtered_counts.tsv \
--taxonomy_outfile cleaned_cluster_taxonomy.tsv