Commit fcc2a78

chp2 methods
committed (1 parent 2175a74, commit fcc2a78)

File tree

1 file changed: +31 -8 lines


thesis/02-index.Rmd

Lines changed: 31 additions & 8 deletions
@@ -442,17 +442,40 @@ and approaches for increasing the resilience and shareability of biological
 sequencing data,
 described in Chapter [5](#chp-decentralizing).
 
-<!--
 ## Methods
 
 ### Implementation
 
-Focused on the user experience via the command-line interface and Python API,
-it implemented the core data structures in C++ for efficiency and exposed it to
-Python with an extension (written in Cython).
-The Python API allows fast prototyping of new ideas and interoperability with
-the larger scientific Python ecosystem,
-as well as access to better tooling for testing and software distribution.
+`sourmash` is a software package implemented in Python for the command-line
+interface and the API for data exploration,
+and in Rust for the core data structures and performance improvements.
+
+Both _Scaled_ and regular _MinHash_ sketches are available,
+calculated using the _MurmurHash3_ hash function
+(the lower 64 bits of the 128-bit version) with $seed=42$,
+and stored in a sorted vector in memory.
+Serialization and deserialization to JSON are implemented using the `serde` crate,
+and sketches also support abundance tracking for the hashes.
+
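The _Scaled MinHash_ selection rule added in this hunk (keep every hash below a threshold derived from the `scaled` parameter, in a sorted vector) can be sketched in plain Python. This is an illustration, not sourmash's actual API: `hash64` is a stand-in toy hash built on `hashlib.blake2b`, not the MurmurHash3 implementation the Rust core uses, and the class and method names are made up.

```python
import bisect
import hashlib

def hash64(kmer: str, seed: int = 42) -> int:
    # Stand-in 64-bit hash for illustration only; sourmash itself uses
    # the lower 64 bits of the 128-bit MurmurHash3 with seed=42.
    h = hashlib.blake2b(kmer.encode(), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")

class ScaledMinHash:
    """Keep every hash below max_hash = 2**64 / scaled, sorted in memory."""

    def __init__(self, scaled: int):
        self.max_hash = 2**64 // scaled
        self.hashes: list[int] = []  # sorted vector, mirroring the Rust core

    def add_kmer(self, kmer: str) -> None:
        h = hash64(kmer)
        if h < self.max_hash:  # fraction-based selection rule
            pos = bisect.bisect_left(self.hashes, h)
            if pos == len(self.hashes) or self.hashes[pos] != h:
                self.hashes.insert(pos, h)

    def containment(self, other: "ScaledMinHash") -> float:
        common = set(self.hashes) & set(other.hashes)
        return len(common) / len(self.hashes) if self.hashes else 0.0
```

With `scaled=1` every hash is kept, so the sketch degenerates to the full hash set; larger `scaled` values keep roughly a `1/scaled` fraction of all hashes.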
+The _LCA_ and _MHBT_ indices are implemented at the Python level,
+and the _MHBT_ supports multiple storage backends
+(hidden directory, Zip files, IPFS and Redis),
+depending on the use case requirements.
+The _MHBT_ is implemented as a specialization of an _SBT_,
+replacing the Bloom Filters in the leaf nodes of the latter with _Scaled MinHash_
+sketches.
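The SBT-style search that the MHBT specializes can be pictured with a minimal sketch: internal nodes hold the union of the hashes below them, and a query descends only into subtrees that can still meet the threshold. This uses plain Python sets in place of sketches and Bloom filters; the names and structure are illustrative, not sourmash's actual classes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Internal nodes store the union of all hashes below them; in the
    # MHBT, leaves hold Scaled MinHash sketches instead of the Bloom
    # filters used by a classic SBT.
    hashes: set
    children: list = field(default_factory=list)

def build(leaves):
    """Build a binary tree bottom-up, unioning hashes upward."""
    nodes = [Node(set(h)) for h in leaves]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            group = nodes[i:i + 2]
            union = set().union(*(n.hashes for n in group))
            paired.append(Node(union, group))
        nodes = paired
    return nodes[0]

def search(node, query, threshold):
    """Descend only into subtrees that can still reach the threshold."""
    if len(node.hashes & query) / len(query) < threshold:
        return []  # prune: no subtree below here can match
    if not node.children:  # leaf: report the matching sketch
        return [node.hashes]
    hits = []
    for child in node.children:
        hits.extend(search(child, query, threshold))
    return hits
```

The pruning step is why the union at internal nodes matters: if the union cannot reach the threshold, no leaf beneath it can either, so whole subtrees are skipped.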
 
 ### Experiments
--->
+
+Experiments are implemented as `snakemake` workflows and use `conda` for
+managing dependencies,
+allowing reproduction of the results with a single command:
+`snakemake --use-conda`.
+This downloads all data,
+installs dependencies, and generates the data used for analysis.
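The workflow pattern described above can be sketched as a hypothetical minimal Snakefile; the rule names, file paths, script, and environment file here are made up for illustration and are not taken from the actual thesis repository. The `conda:` directive is what lets `snakemake --use-conda` install each rule's dependencies automatically.

```
# Hypothetical Snakefile sketch -- rules, paths and env file are illustrative.
rule all:
    input:
        "outputs/analysis.csv"

rule download_data:
    output:
        "data/raw/genomes.zip"
    shell:
        "curl -L -o {output} https://example.org/genomes.zip"

rule analyze:
    input:
        "data/raw/genomes.zip"
    output:
        "outputs/analysis.csv"
    conda:
        "envs/analysis.yml"  # created on demand by --use-conda
    shell:
        "python scripts/analyze.py {input} > {output}"
```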
+
+The analysis and figure generation code is contained in a Jupyter Notebook
+and can be executed anywhere Jupyter is supported,
+including a local installation or Binder,
+a service that deploys live Jupyter environments on cloud instances.
+Instructions are available at https://doi.org/10.5281/zenodo.4012667
