Skip to content
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
151 changes: 123 additions & 28 deletions doc/07_taxonomic_discovery_with_sourmash.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ Sourmash doesn't have a big memory or CPU footprint, and can be run on most lapt
Below is a recommended `srun` command to start an interactive session in which to run the `srun` commands.

```
srun -p bmh -J sourmash24 -t 24:00:00 --mem=16gb -c 1 --pty bash
srun -p bmh -J sourmash24 -t 04:00:00 --mem=10gb -c 1 --pty bash
```

Let's install sourmash
Expand Down Expand Up @@ -59,60 +59,152 @@ A signature is a compressed representation of the k-mers in the sequence.
Using this data structure instead of the reads makes sourmash much faster.

```
sourmash sketch -o SRR1976948.sig --name SRR1976948 -p scaled=2000,k=21,k=31,k=51,abund SRR1976948_*fastq.gz
sourmash sketch dna -p scaled=1000,k=21,k=31,k=51,abund SRR1976948_*fastq.gz --name SRR1976948 -o SRR1976948.sig
```

The outputs file, `SRR1976948.sig` holds a representative subset of k-mers from our original sample, as well as their abundance information.
> - `sourmash sketch dna` creates a signature file
> - `-p scaled=1000,k=21,k=31,k=51,abund` are the creation parameters of the signature file
> - `SRR1976948_*fastq.gz --name SRR1976948` creates a single signature from multiple files
> - `-o SRR1976948.sig` outputs the signature to the designated name `SRR1976948.sig`

The outputs file, `SRR1976948.sig` holds a representative subset of k-mers from our original sample, as well as their abundance information.
The k-mers are "hashed", or transformed, into numbers to make selecting, storing, and looking up the k-mers more efficient.
In addition, these techniques remove the necessity to trim the raw fasta files before processing as the likelihood of finding a matching erroneous hash is statistically **extremely** unlikely.

To quickly peak at a signature file, database, or collection created by sourmash. Use:

```
sourmash sig summarize SRR1976948.sig
```

This will output:

```
== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'SRR1976948.sig'
path filetype: MultiIndex
location: SRR1976948.sig
is database? no
has manifest? yes
num signatures: 3
** examining manifest...
total hashes: 3007375
summary of sketches:
1 sketches with DNA, k=21, scaled=1000, abund 838571 total hashes
1 sketches with DNA, k=31, scaled=1000, abund 1006183 total hashes
1 sketches with DNA, k=51, scaled=1000, abund 1162621 total hashes
```

## Sourmash gather

Sourmash provides several methods for estimating the composition of known sequences in a metagenome. `sourmash gather` is the primary technique ([ref](https://www.biorxiv.org/content/10.1101/2022.01.11.475838)) -
it gives strain-level specificity to matches in its output -- e.g. all strains that match any sequences (above a threshold) in your metagenome will be reported, along with the percent of each strain that matches.
This is useful both to estimate the amount of your metagenome that is known, and to estimate the closest strain relative to the thing that is in your metagenome.

Download and unzip the sourmash database:
Download a pre-built [sourmash database](https://sourmash.readthedocs.io/en/latest/databases.html#gtdb-r08-rs214-genomic-representatives-85k):

```
curl -L https://osf.io/4f8n3/download -o genbank-k31.lca.json.gz
gunzip genbank-k31.lca.json.gz
curl -JLO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.lca.json.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming we're running this on farm, you can just point people at /group/ctbrowngrp/sourmash-db/gtdb-rs214!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, with gather, no need to use the LCA database (which is bigger memory, tho maybe faster). could just as easily use the regular zip.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming we're running this on farm, you can just point people at /group/ctbrowngrp/sourmash-db/gtdb-rs214!

Would that be a good time to introduce ln -s as well?

also, with gather, no need to use the LCA database (which is bigger memory, tho maybe faster). could just as easily use the regular zip.

I did add a Bonus goal at the end of the document to use the standard zip database, I think I should switch them around. Introduce the zip db in the document and add a bonus about using other database types...?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming we're running this on farm, you can just point people at /group/ctbrowngrp/sourmash-db/gtdb-rs214!

Would that be a good time to introduce ln -s as well?

sure!

also, with gather, no need to use the LCA database (which is bigger memory, tho maybe faster). could just as easily use the regular zip.

I did add a Bonus goal at the end of the document to use the standard zip database, I think I should switch them around. Introduce the zip db in the document and add a bonus about using other database types...?

zip better. Not sure why other database types needed here, but sure, in the bonus section... :)

```

And then run `gather`

```
sourmash gather -o SRR1976948_gather.csv SRR1976948.sig genbank-k31.lca.json
sourmash gather SRR1976948.sig gtdb-rs214-reps.k31.lca.json.gz -o SRR1976948_gather.csv
```

> - `sourmash gather` is the taxonomic profiler of sourmash
> - `SRR1976948.sig` is the input signature file
> - `gtdb-rs214-reps.k31.lca.json` is the database we are gathering our taxonomic profile from!
> - `-o SRR1976948_gather.csv` is a csv output file with our results

We see an output that looks like this:

```
== This is sourmash version 3.0.1. ==
== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 LCA databases. ksize=31, scaled=10000
selecting specified query k=31
loaded query: SRR1976948... (k=31)
selecting default query k=31.
loaded query: SRR1976948... (k=31, DNA)
--
loaded 85205 total signatures from 1 locations.
after selecting signatures compatible with search, 85205 remain.

Starting prefetch sweep across databases.
Prefetch found 245 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap p_query p_match avg_abund
--------- ------- ------- ---------
2.5 Mbp 5.3% 100.0% 157.0 GCA_003451675.1 Candidatus Atribacteria bacterium, ASM345167v1
2.5 Mbp 0.3% 99.6% 9.5 GCA_003446605.1 Anaerolineaceae bacterium, ASM344660v1
2.5 Mbp 2.1% 100.0% 64.1 GCA_003513005.1 Desulfotomaculum sp., ASM351300v1
2.5 Mbp 2.5% 99.2% 76.3 GCA_002503885.1 Methanoculleus marisnigri, ASM250388v1
2.3 Mbp 0.2% 76.6% 7.6 GCF_002752635.1 Mesotoga sp. Brook.08.105.5.1 strain=Brook.08....
2.2 Mbp 0.4% 84.9% 12.4 GCF_001316325.1 Methanobacterium formicicum JCM 10132 strain=J...
2.1 Mbp 0.1% 65.4% 4.5 GCA_002305765.1 Deltaproteobacteria bacterium UBA1386, ASM2305...
2.1 Mbp 12.4% 100.0% 448.3 GCA_001509375.1 Methanosaeta harundinacea, ASM150937v1

.
.
.

overlap p_query p_match
--------- ------- --------
2.5 Mbp 0.2% 96.9% unassigned Methanomicrobiales archaeon 53_19
2.4 Mbp 0.2% 99.2% Methanobacterium sp. 42_16
2.3 Mbp 0.2% 100.0% Desulfotomaculum sp. 46_80
2.3 Mbp 0.2% 100.0% unassigned Actinobacteria bacterium 66_15
2.1 Mbp 0.2% 97.7% Desulfotomaculum sp. 46_296
2.1 Mbp 0.2% 99.0% Methanosaeta harundinacea
2.0 Mbp 0.2% 99.0% unassigned Marinimicrobia bacterium 46_43
1.9 Mbp 0.2% 100.0% unassigned Bacteroidetes bacterium 38_7
1.9 Mbp 0.2% 55.1% unassigned Thermotogales bacterium EBM-48
50.0 kbp 0.0% 1.4% 3.6 GCA_021372725.1 Chloroflexi bacterium, ASM2137272v1
50.0 kbp 0.0% 1.7% 1.2 GCF_004366375.1 Halanaerobium congolense strain=DSMZ 11287, AS...
50.0 kbp 0.0% 1.2% 1.4 GCF_011174675.1 Bacteroidales bacterium M08MB strain=M08MB, AS...
found less than 50.0 kbp in common. => exiting

found 135 matches total;
the recovered matches hit 49.0% of the abundance-weighted query.
the recovered matches hit 7.8% of the query k-mers (unweighted).

WARNING: final scaled was 10000, vs query scaled of 1000
```

The first column estimates the amount of sequences in our metagenome that are contained in the match, while the second column estimates the amount of the match that is contained within our metagenome.
The first column estimates the amount of sequences in our metagenome that are contained in the matching genome, while the second column estimates the amount of the match that is contained within our metagenome.
These percentages are quite high in the `p_match` column...that's because the authors who originally analyzed this sample deposited metagenome-assembled genomes from this sample into GenBank.

When sourmash is finished running, it tells us that 94% of our sequence was unclassified; i.e. it doesn't match any sequence in the database.
When sourmash is finished running, it tells us that ~50% of our sequence was unclassified; i.e. it doesn't match any sequence in the database.
This is common for metagenomics, particularly for samples that are sequenced from rare environments (like Alaskan oil reservoirs).

## Sourmash taxonomy

A secondary technique to further analyzing the gather output and confirm our findings is the `sourmash tax` command.

Download a prepared lineage dataset:

```
curl -JLO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214.lineages.csv.gz
```

Run `sourmash tax` with the gather output we just made from the metagenome signature.

```
sourmash tax metagenome -g SRR1976948_lca_gather.csv -t gtdb-rs214.lineages.csv.gz --output-format kreport
```

The third line of this output correlates to the previous gather output with ~50% of out metagenome unclassified.

```
== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 gather results from 'SRR1976948_lca_gather.csv'.
loaded results for 1 queries from 1 gather CSVs
Starting summarization up rank(s): strain, species, genus, family, order, class, phylum, superkingdom
31.94 2410420000 0 D d__Bacteria
17.03 1285339999 0 D d__Archaea
51.03 3850760000 3850760000 U unclassified

.
.
.

```


In the next lesson, we will work to improve the percent of sequence in the metagenome that is classifiable.

## Final Thoughts
Expand All @@ -122,11 +214,14 @@ these seem to perform well (albeit with high false positive rates) in situations
Sourmash, by contrast, can estimate which known genomes are actually present, so that you can extract them and map/align to them.
It seems to have a very low false positive rate and is quite sensitive to strains.

## Bonus (optional) sourmash lca gather
## Bonus (optional) sourmash fun

Above, we ran `sourmash lca gather` on our untrimmed data.
94% of the sample did not contain sequence in any GenBank assembly.
Above, we ran `sourmash gather` on our untrimmed data.
50% of the sample did not contain sequence in any GTDB assembly.
A substantial proportion of this sequence could be due to errors.
Run `sourmash lca gather` again on your adapter and k-mer trimmed data.
Run `sourmash gather` again on your adapter and k-mer trimmed data.
How much less of the sequence is unclassifiable when the errors and adapters are removed?
How many species are no longer detected after k-mer and error trimming?

Also, we only used one type of database that was created with a specific set of parameters.
How does the gather output change with the database type and different parameters? Use curl, summarize, and gather to answer.