
Whitepaper

Adrian Viehweger edited this page Mar 25, 2017 · 15 revisions

What is zoo?

zoo is a protocol to compose, manipulate and share units of sequence data, so-called data cells. They are shared through a decentralized peer-to-peer network, implementing an offline-first design. Data cells are indexed in a registry, where both a cell's metadata and its sequence content can be searched (the latter via minhash signatures). zoo is highly scalable, so that it does not matter whether a cell contains 10 or 10 billion sequences.

zoo mainly intends to (in no particular order):

  • easily compose heterogeneous datasets for rapid prototyping
  • disseminate data quickly, especially in cases of disease outbreaks, but allow access restrictions if need be
  • increase the amount of reproducible science through push-button access to version-controlled data
  • make data more persistent because many peers can host a single dataset
  • search through data with privacy restrictions without access to the sensitive data itself

Virus wha?

Although zoo focuses on microbes, especially viruses, it is agnostic to the context of any particular sequence. A sequence is the atomic unit of information, and as such can be used to model many known (genomic) complexities {==q==}{>>Stadtler-Prohaska gene concept.<<}. A sequence is embedded in a context that is defined by one or more schemas, which in turn make up a document.

Figure: Sequence, template, data cell.

A schema is a nested hash map and serves as template for the data cell structure. The "base" schema in zoo has four main fields and a unique identifier:

// JSON format
{
    "_id": null,
    "sequence": "",
    "metadata": {},
    "relative": {},
    "derivative": {}
}
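As a minimal sketch of how the base schema could serve as a template, the Python snippet below copies it and fills in a sequence and some metadata. The helper `new_document` and the use of a UUID for `_id` are assumptions made for illustration; they are not taken from the zoo library.

```python
import copy
import json
import uuid

# Base schema with the field names from the text above.
BASE_SCHEMA = {
    "_id": None,
    "sequence": "",
    "metadata": {},
    "relative": {},
    "derivative": {},
}


def new_document(sequence, **metadata):
    """Return a fresh document that follows the base schema (hypothetical helper)."""
    doc = copy.deepcopy(BASE_SCHEMA)  # copy, so the template itself is never mutated
    doc["_id"] = str(uuid.uuid4())    # assumption: any unique string works as identifier
    doc["sequence"] = sequence
    doc["metadata"].update(metadata)
    return doc


doc = new_document("ACGUACGU", host="Homo sapiens")
print(json.dumps(doc, indent=4))
```

The deep copy matters because the schema is nested: a shallow copy would share the inner hash maps between all documents.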

A sequence is defined as a string with an arbitrary alphabet, typically RNA, DNA or protein. Metadata describe the way a sequence "came to be known". Where was it sampled from, who by, from which host, through which sample preparation and sequencing methods?
The relative field includes taxonomic, phylogenetic and linked information. It addresses how a given sequence compares to others. Alignments and phylogenetic trees are archived here.
The derivative field summarizes or reexpresses the sequence information, e.g. via annotations, minhashes and alternative encodings like bracket-dot notation for secondary structures. Derived information is usually heavily dependent on the original sequence. For example, the annotation "open reading frame" (ORF) derives from the positions of a start and a stop codon in the sequence; without the sequence, these positions do not contain any useful information.
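To illustrate what a minhash derivative might look like, here is a small bottom-k sketch in plain Python. It is not zoo's implementation: the k-mer size, the sketch size, the use of SHA-1 and the field name `"minhash"` are all assumptions chosen for the example.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def minhash_signature(sequence, k=4, size=16):
    """Bottom-k sketch: the `size` smallest hash values over all k-mers."""
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in kmers(sequence, k)
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sig_a, sig_b, size=16):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    union = sorted(set(sig_a) | set(sig_b))[:size]  # bottom-k of the merged sketches
    if not union:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in union if h in shared) / len(union)


# Placeholder sequence, not real data.
doc = {"_id": "a", "sequence": "ACGTACGTTAGCCGATAGCTAGCTAGGCTA",
       "metadata": {}, "relative": {}, "derivative": {}}
doc["derivative"]["minhash"] = minhash_signature(doc["sequence"])
```

Two signatures can then be compared with `jaccard_estimate` without touching the underlying sequences, which is what makes a signature usable as a search index.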
Note that all categories interact, e.g. we could use minhash signatures (derivative) to compare a sequence to other ones in the database, storing the result in a field under relative. Schemas can be composed. For example, we can add annotations:

{
    "_id": null,
    "sequence": "",
    "metadata": {},
    "relative": {},
    "derivative": {
        "annotation": [{
            "location": ["[5:10](-)", "[20:>30](?)"],
            "id": "X0001",
            "name": "RdRp",
            "source": "genbank",
            "synonyms": ["RNA-dependent polymerase"],
            "type": "CDS"
        }]
    }
}
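Since schemas are nested hash maps, composing them can be thought of as a recursive merge. The sketch below shows one way to do this in Python; the function `compose` and the `"minhash"` fragment are hypothetical and only illustrate the idea.

```python
import copy


def compose(base, *extensions):
    """Recursively merge schema fragments into a copy of `base`."""
    result = copy.deepcopy(base)
    for extension in extensions:
        for key, value in extension.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                # Both sides are hash maps: descend instead of overwriting.
                result[key] = compose(result[key], value)
            else:
                result[key] = copy.deepcopy(value)
    return result


base = {"_id": None, "sequence": "", "metadata": {}, "relative": {}, "derivative": {}}
annotation = {"derivative": {"annotation": []}}  # fragment from the example above
minhash = {"derivative": {"minhash": []}}        # hypothetical second fragment

schema = compose(base, annotation, minhash)
print(sorted(schema["derivative"]))
```

Merging key by key, rather than replacing whole fields, is what lets two fragments extend the same top-level field without clobbering each other.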

As a further example, each genome segment of a segmented virus such as Influenza A would be modelled as a separate document. The documents are then linked with entries in the field "relative".
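A possible way to express such links is to store the identifiers of the sibling segments under "relative". The following Python sketch assumes a field name `"segments"` and uses short placeholder strings instead of real Influenza A sequences; neither is prescribed by zoo.

```python
import uuid


def link_segments(documents, key="segments"):
    """Store, in each document, the `_id`s of all other documents under "relative"."""
    ids = [doc["_id"] for doc in documents]
    for doc in documents:
        doc["relative"][key] = [i for i in ids if i != doc["_id"]]
    return documents


# One document per genome segment; the sequences are placeholders.
segments = [
    {"_id": str(uuid.uuid4()), "sequence": seq, "metadata": {"segment": n},
     "relative": {}, "derivative": {}}
    for n, seq in enumerate(["AGCAAAAGCAGG", "AGCGAAAGCAGG"], start=1)
]
link_segments(segments)
```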

What is a data cell?

A data cell is a collection of documents. zoo provides two components to work with data cells: a (Python) library and a database engine. As mentioned above, "to work with" means mostly three things:

  • compose

Figure: Compose.

  • manipulate

Figure: Manipulate.

Manipulation includes version control.

Figure: Version control.

  • share

Figure: Share.

  • composability: allow data to be more than the sum of its parts

  • reuse and deprecation: allows monitoring how actively a dataset is used (think of people/academics who generate data for a living: this is normally hard to assess)

  • What is a data cell? A cell holds its data in JSON format. However, its exact structure is determined by schemas. Schemas are composable hash map templates and serve as structural building blocks of a data cell.

  • manipulation? interfaces that isolate/ encapsulate the data but allow maximal flexibility with tool use: as long as there is an interface implemented, a given tool can be used (so in theory all of them), nothing is baked in, really slim/ lean design: one data object, basically nested hash map operations

  • search: minhash

    • raw datasets
    • cell content
    • phyloqueries
    • privacy preserving ("walled garden" of ctb)
  • combination? because of schemas; show base schema + idea: seq, derivative, relative, ...

  • sharing: dat + register, how does indexing work? meta + else

    • what is a p2p network and how does it work
    • How does the registry work?
    • How does sharing work
  • scale? in theory easy

Figure: Registry.

  • access is random, i.e. can access parts of dataset as well, sharing can control data access, registry can define passwords/ access controls