Whitepaper

What is zoo?

zoo is a tool to compose, manipulate and share units of sequence data, so-called data cells. They are shared through a decentralized peer-to-peer network, implementing an offline-first design. Data cells are indexed in a registry, where both a cell's metadata and its sequence content can be searched (the latter via minhash signatures). zoo is highly scalable: it does not matter whether a cell contains 10 or 10 billion sequences.

zoo mainly intends to (in no particular order):

  • easily compose heterogeneous datasets for rapid prototyping
  • disseminate data quickly, especially during disease outbreaks, while allowing access restrictions where needed
  • make science more reproducible through push-button access to version-controlled data
  • make data more persistent, because many peers can host a single dataset
  • allow searching data under privacy restrictions without exposing the sensitive content itself

Virus wha?

Although zoo focuses on microbes, especially viruses, it is agnostic to the context of any particular sequence. A sequence is the atomic unit of information, and as such can be used to model many known (genomic) complexities [1].

A sequence is embedded in one or more schemas, which in turn make up a document.

Figure: Sequence, template, data cell.

A schema is a nested hash map and serves as a template for the data cell structure. The "base" schema in zoo has four main fields and a unique identifier:

// JSON format
{
    "_id": null,
    "sequence": "",
    "metadata": {},
    "relative": {},
    "derivative": {}
}

A sequence is defined as a string with an arbitrary alphabet, typically RNA, DNA or protein.

Metadata describe how a sequence "came to be known": Where was it sampled, by whom, from which host, and through which sample preparation and sequencing methods?

The relative field holds taxonomic, phylogenetic and linkage information. It addresses how a given sequence compares to others. Alignments and phylogenetic trees are archived here.

The derivative field summarizes or re-expresses the sequence information, e.g. via annotations, minhashes and alternative encodings such as dot-bracket notation for secondary structures. Derived information usually depends heavily on the original sequence. For example, the annotation "open reading frame" (ORF) derives from the positions of a start and a stop codon, which on their own carry no useful information.
Note that all categories interact: e.g. we could use minhash signatures (derivative) to compare a sequence to others in the database, storing the result in a field under relative.
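
To make the base schema concrete, here is a minimal sketch of how such a document might be stored with pymongo (MongoDB is zoo's engine, see below). The database and collection names, as well as the field values, are purely illustrative:

# Python sketch (pymongo); names and values are illustrative
from pymongo import MongoClient

client = MongoClient()           # local MongoDB instance
cell = client["zoo"]["example"]  # a data cell maps onto a collection

doc = {
    "_id": "A0001",
    "sequence": "ATGGCGTTA",
    "metadata": {"host": "Homo sapiens"},
    "relative": {},
    "derivative": {},
}
cell.insert_one(doc)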

Schemas can be composed. For example, we can add annotations:

{
    "_id": null,
    "sequence": "",
    "metadata": {},
    "relative": {},
    "derivative": {
        "annotation": [{
            "location": ["[5:10](-)", "[20:>30](?)"],
            "id": "X0001",
            "name": "RdRp",
            "source": "genbank",
            "synonyms": ["RNA-dependent polymerase"],
            "type": "CDS"
        }]
    }
}
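
One way to picture schema composition is as a recursive merge of nested hash maps. The compose() helper below is a hypothetical illustration, not part of the zoo API:

# Python sketch: composing schemas by recursively merging nested dicts.
# compose() is illustrative, not part of the zoo API.
def compose(base, extension):
    """Return a new schema with `extension` merged into `base`."""
    result = dict(base)
    for key, value in extension.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = compose(result[key], value)
        else:
            result[key] = value
    return result

base = {"_id": None, "sequence": "", "metadata": {}, "relative": {}, "derivative": {}}
annotation = {"derivative": {"annotation": []}}
schema = compose(base, annotation)
# schema["derivative"] == {"annotation": []}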

As a further example, each genome segment of a segmented virus such as Influenza A would be modelled as a separate document. The documents are then linked with entries in the field "relative".

What is a data cell?

A data cell is a collection of documents. A data cell interacts with the outside world through zoo. zoo itself has two components: a (Python) library and a database engine. The engine stores and internally manages the data cell, and the library allows the user to "work on" the data through an intuitive interface, which encapsulates the data.

A set of data cells is called a data zoo.

"To work on" the data means mostly three things:

  1. compose
  2. manipulate
  3. share

Compose

Data cells can be composed like Lego blocks. E.g. a curated data cell on flaviviruses could be updated with new experimental results (in seconds and with a single command), given the new data were available in a data cell, perhaps attached directly to the publication. This would be ideal in many cases, particularly when many new (viral) species are discovered [2].

Figure: Compose.
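
As a rough sketch of what such a single-command update could look like under the hood, assuming both cells live in the same MongoDB instance (the cell names are illustrative):

# Python sketch (pymongo): folding a newly published cell into a curated
# one; upsert keeps the operation idempotent. Cell names are illustrative.
from pymongo import MongoClient

client = MongoClient()
curated = client["zoo"]["flavivirus"]
new = client["zoo"]["new_results"]

for doc in new.find():
    curated.replace_one({"_id": doc["_id"]}, doc, upsert=True)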

Manipulate

A short example: Through zoo's API, we might export a sample of sequences to a fasta file, the sample having been stratified by the information in the "host" field. Given appropriate metadata, very complex queries are possible, such as:

"All segments of virus X from 2007 onwards in a 500 km perimeter of Jena linked to Mosquito hosts, with sequences restricted to annotation Y and sequence similarity to some reference strain Z."

We feed the fasta file into our favourite multiple sequence alignment (MSA) tool, such as Mafft. Note that zoo does not care about which downstream tools are used. Mafft or MUSCLE? RAxML or BEAST? As long as an interface to the data cell is implemented in the zoo library, any application can be employed seamlessly. Back to our MSA: Because we don't want to recompute it, we can import each sequence's alignment coordinates back into the data cell. We can now delete the raw MSA file, because we can reconstruct the MSA with the information in the data cell. The same goes for trees, reference-based alignments etc.
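
A minimal sketch of this round trip, with a hand-rolled fasta parser and an illustrative field layout under relative:

# Python sketch: export sequences to fasta, then import alignment
# coordinates back. The field layout under "relative" is illustrative.
with open("cell.fa", "w") as f:
    for doc in cell.find({}, {"sequence": 1}):
        f.write(f">{doc['_id']}\n{doc['sequence']}\n")

# ... run e.g. `mafft cell.fa > cell.aln.fa` outside of Python ...

def read_fasta(path):
    header, seq = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if header:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header:
            yield header, "".join(seq)

# store each sequence's gap positions; this suffices to rebuild the MSA
for _id, aligned in read_fasta("cell.aln.fa"):
    gaps = [i for i, c in enumerate(aligned) if c == "-"]
    cell.update_one({"_id": _id}, {"$set": {"relative.msa.gaps": gaps}})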

Now we want to feed our MSA into a machine-learning algorithm to predict the "host" class based on the sequence information. zoo's library has functions both to sample test and training sequence sets, and to encode the sequences (e.g. one-hot) into a binary matrix, as commonly used by ML algorithms.
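
For illustration, a one-hot encoding could look like this (numpy; a sketch, not necessarily zoo's own implementation):

# Python sketch: one-hot encoding a sequence into a binary matrix
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a (len(seq), 4) binary matrix."""
    m = np.zeros((len(seq), len(ALPHABET)), dtype=np.uint8)
    for i, base in enumerate(seq):
        if base in ALPHABET:  # ambiguous bases stay all-zero
            m[i, ALPHABET.index(base)] = 1
    return m

one_hot("ACGT")
# array([[1, 0, 0, 0],
#        [0, 1, 0, 0],
#        [0, 0, 1, 0],
#        [0, 0, 0, 1]], dtype=uint8)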

Figure: Manipulate.

Because data cells are shared with the "Dat" protocol (see below), they are implicitly version controlled at all times, and any changes to a data cell are logged.

Figure: Version control.

Share

Data cells are shared with the Dat protocol:

Dat is a new p2p [peer-to-peer] hypermedia protocol. It provides public-key- & sha256-addressed file archives which can be synced securely and browsed on-demand. Dat supports streaming updates and partial on-demand replication, and has plans for versioned URLs and efficient compaction. -- Dat protocol

Among its many advantages, Dat removes the need for a central repository to host data cells; instead, they are exchanged directly between peers.

An example: Let's say we have a file zika.json with some experimental Zika data. Dat creates a unique link for this file, through which we can share it with whomever we please.

dat share .../zika_survey/  # contains zika.json
# Syncing Dat Archive: .../zika_survey
# Link: dat://ff92ce30e1ff6ebd75edeb42f04239367243a58b7838f50706bd995e5dbc5d4c

We can send this link to a colleague or put it in the zoo registry for others to find. It is even possible to configure certain privacy restrictions, so not everybody on the network can access it.

# meanwhile in a faraway place
dat clone ff92ce30e1ff6ebd75edeb42f04239367243a58b7838f50706bd995e5dbc5d4c
ls
# zika.json

Figure: Share.

How can I discover data cells? What is this "registry"?

The registry is not necessary for exchanging data cells. However, we would like to search for useful ones. The registry is a simple ledger that lists all available data cells, with some to-be-determined metadata. Wouldn't it be nice to search through the actual sequence contents of the listed data cells as well? This is made possible (to a certain extent) by including a minhash signature of all sequences of a listed data cell in its registry metadata.

Continuing with the Zika example above: We minhashed all sequences in the Zika data cell, and then queried the resulting minhash signature against a data cell containing all viral reference genomes. Therein, a single species was identified as similar to the Zika data cell signature: the Zika virus reference genome.
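
A sketch of this comparison with sourmash (see Credit below); the k-mer size, signature size and placeholder sequences are illustrative:

# Python sketch (sourmash): minhash signatures for two sequence sets
import sourmash

def minhash(sequences, ksize=31, n=1000):
    mh = sourmash.MinHash(n=n, ksize=ksize)
    for seq in sequences:
        mh.add_sequence(seq, force=True)  # force: skip invalid k-mers
    return mh

zika_cell = minhash(["ATGACCGGTTCA" * 10])  # placeholder sequences
reference = minhash(["ATGACCGGTTCA" * 10])

print(zika_cell.jaccard(reference))  # similarity in [0, 1]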

Figure: Registry.

Note that a data cell registry has not been created yet.

Where does the original sequence information come from?

zoo aims to implement common interfaces between data cells and NCBI, EBI, UniProt and other large databases.

Yeah, but does it scale?

zoo's (database) engine is provided by MongoDB, a de facto industry standard among NoSQL databases. It provides excellent horizontal scalability.

Because MongoDB is a NoSQL (document) database, its internal representation of data as documents and collections maps very well onto zoo's data representation: data cells (= collections) and sequence records (= documents).

MongoDB recently added graph database functionality (the $graphLookup aggregation stage), which allows rich linkage between sequence records; a sketch follows below.
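
For instance, genome segments linked through the relative field could be collected with a $graphLookup stage; the cell name, link field and document id below are illustrative:

# Python sketch (pymongo): traversing links stored under "relative"
from pymongo import MongoClient

cell = MongoClient()["zoo"]["influenza_a"]

pipeline = [
    {"$match": {"_id": "segment-1"}},      # illustrative document id
    {"$graphLookup": {
        "from": "influenza_a",             # traverse within the same cell
        "startWith": "$relative.links",    # ids of directly linked documents
        "connectFromField": "relative.links",
        "connectToField": "_id",
        "as": "linked_segments",
    }},
]
linked = list(cell.aggregate(pipeline))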

A data cell can be moved to a larger compute environment if local resources no longer suffice. Note, however, that we can load all available Influenza A virus sequences (> 700,000) from fasta in less than 10 minutes, and query them in subsecond time on a typical laptop.
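
A bulk load of this kind boils down to batched insert_many calls, sketched here (reusing the read_fasta helper from the Manipulate section; the batch size is arbitrary):

# Python sketch: bulk-loading fasta into a cell with batched insert_many.
# read_fasta() is the simple parser defined in the Manipulate section.
from pymongo import MongoClient

cell = MongoClient()["zoo"]["influenza_a"]

BATCH_SIZE = 10000
batch = []
for header, seq in read_fasta("influenza_a.fa"):
    batch.append({"_id": header.split()[0], "sequence": seq,
                  "metadata": {}, "relative": {}, "derivative": {}})
    if len(batch) == BATCH_SIZE:
        cell.insert_many(batch)
        batch = []
if batch:
    cell.insert_many(batch)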

We plan to make zoo available as a Docker container, further facilitating the transfer to larger compute environments.

What else can I do with it?

Some use cases we are investigating that are feasible with zoo:

  • Query a large (metagenomic) read set against a curated data cell, and classify the reads if they are identifiable from the cell's contents. This could either be a negative filter (excluding all identified reads) if the aim is virus discovery, or a positive one if we want to isolate e.g. RNA-dependent RNA polymerase fragments (see the sketch after this list).
  • Link and search host and pathogen sequence information in a graph-database-like manner.
  • We'll think of more stuff in the near future.
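
To illustrate the first use case, a naive k-mer filter against a cell's contents might look as follows. This is only a sketch of the idea: real implementations would use minhash signatures and sequence Bloom trees (via sourmash, see Credit), and `cell` and `reads` are assumed to exist as in the sketches above:

# Python sketch: a naive k-mer filter against a cell's contents.
K = 21

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# k-mer set of the curated cell (e.g. all RdRp fragments)
cell_kmers = set()
for doc in cell.find({}, {"sequence": 1}):
    cell_kmers |= kmers(doc["sequence"])

def known(read):
    """True if the read shares at least one k-mer with the cell."""
    return not cell_kmers.isdisjoint(kmers(read))

unknown = [r for r in reads if not known(r)]  # negative filter: discovery
hits = [r for r in reads if known(r)]         # positive filter: e.g. RdRp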

Credit

Besides MongoDB, zoo uses "sourmash", a Python library for minhash signature computation and efficient search via sequence Bloom trees [3].


[1]: Stadler, P. F., Prohaska, S. J., Forst, C. V. & Krakauer, D. C. Defining genes: a computational framework. Theory Biosci 128, 165–170 (2009).
[2]: Shi, M. et al. Redefining the invertebrate RNA virosphere. Nature 540, 539–543 (2016).
[3]: Brown, C. T. & Irber, L. sourmash: a library for MinHash sketching of DNA. The Journal of Open Source Software 1 (2016).
