The problem and desiderata

The problem

Extensive sequencing of samples ranging from ocean water [1] to pools of invertebrates [2] continues to uncover a rich diversity of viral genomes. These genomes are staggeringly diverse in structure, and many viruses share a complex, partly overlapping host distribution. New discoveries regularly redefine taxonomic classes.

From a practical point of view, existing data structures need to expand to accommodate this diversity. To paraphrase Sherlock Holmes [3] on the detective work of the microbial bioinformatician: "One's [data structures] must be as broad as Nature, if they are to interpret Nature."

Current solutions

Current solutions fall short in three respects: integration, flexibility and availability.

Integration: The current multitude and heterogeneity of databases make the composition and curation of multidimensional datasets a very tedious exercise. Each database comes with its own format requirements, interfaces and "documentational language". This variation in the database landscape reflects the nature of biological data, which is at best fragmented, inconsistent and simply messy. With newly emerging technologies we expect this problem to only worsen.

Flexibility: A composed dataset usually has a flat structure and does not lend itself to more than a few queries of low complexity. Further iterations in the exploratory data analysis stage usually require crafting new datasets just to allow slightly different queries. This is inefficient.

Availability: Many datasets are either not made available at all, or only on institutional websites. This model has a short half-life (due to loss of interest, funding or maintenance), so links break and data is left inaccessible. Furthermore, access to large databases such as NCBI is usually rate-limited and sometimes unavailable, effectively halting work for many.

Desiderata

From these shortcomings we can directly derive three desiderata of a data structure in microbial bioinformatics:

First, it needs to have a rich data structure, e.g. allowing entries to be nested and/or linked. This is necessary to represent biological data, which in spite of curation efforts remains mostly messy. As already mentioned, this trend towards more entropy is likely to continue, while curation resources remain flat.
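To make "nested and/or linked" concrete, here is a minimal sketch of what one such entry could look like as a plain Python dictionary. All field names, values and the `get_path` helper are assumptions made for illustration; they are not the schema or API of the data cell or of zoo.

```python
# Hypothetical nested record. Field names and values are illustrative
# assumptions, not the schema used by the data cell.
record = {
    "_id": "seq-001",
    "sequence": "ACGTACGTTAGC",
    "metadata": {
        # A partly overlapping host distribution fits naturally in a list.
        "host": ["Aedes aegypti", "Homo sapiens"],
        "location": {"country": "Brazil", "year": 2016},
    },
    "taxonomy": {"family": "Flaviviridae", "genus": "Flavivirus"},
    # A link to another entry, expressed by its identifier.
    "relative": [{"linked_id": "seq-002", "relation": "same_sample"}],
}


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'metadata.location.country' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


country = get_path(record, "metadata.location.country")
missing = get_path(record, "metadata.coverage.mean")  # absent fields yield the default
```

A flat table would need one column per possible field and could not hold a variable number of hosts or links without duplication; the nested form tolerates missing and irregular fields.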

Second, this data structure needs a well-designed API that allows complex queries of the raw data. Ideally, it would also offer convenience functions to combine data in informative ways. For example, we would like to query sequence data by taxonomy as well as phylogeny, in addition to basic aggregation based on sample metadata. The API should link to downstream analyses with third-party tools, e.g. for multiple sequence alignments and machine learning.
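The kind of query described here, selecting by taxonomy and then aggregating on sample metadata, can be sketched over a small in-memory list of nested records. The records and the helpers `lookup`, `find` and `count_by` are hypothetical and only illustrate the idea; they are not the API of zoo.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"_id": "a", "taxonomy": {"family": "Flaviviridae"},
     "metadata": {"host": "mosquito", "year": 2015}},
    {"_id": "b", "taxonomy": {"family": "Flaviviridae"},
     "metadata": {"host": "human", "year": 2016}},
    {"_id": "c", "taxonomy": {"family": "Picornaviridae"},
     "metadata": {"host": "human", "year": 2016}},
]


def lookup(doc, path):
    """Resolve a dotted path in a nested dict; return None if any key is missing."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def find(records, conditions):
    """Return the records for which every dotted-path condition matches exactly."""
    return [
        r for r in records
        if all(lookup(r, path) == value for path, value in conditions.items())
    ]


def count_by(records, path):
    """Aggregate records by the value found at a dotted path."""
    return Counter(lookup(r, path) for r in records)


# Query by taxonomy, then aggregate the hits by a sample metadata field.
flavi = find(records, {"taxonomy.family": "Flaviviridae"})
hosts = count_by(flavi, "metadata.host")
```

Because the conditions are ordinary dictionaries, a slightly different question (say, adding `"metadata.year": 2016`) is a one-line change rather than a new dataset.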

Third, we likely need to share data with collaborators and other stakeholders. Thus, the data structure needs to be implemented in a highly portable way, including easy and quick setup and deletion as well as version control. The data structure is created for use in the context of a project and can be discarded afterwards, provided that all the steps needed to recreate it are logged. Furthermore, we need to create a registry of all these available data collections.
An important property of being portable is the ability to work locally and offline. That is why a data hub should not require access to cloud services at all times. However, any implementation needs to be scalable too: it should not matter whether the data collection contains a thousand or a billion entries, with seamless upscaling in remote compute environments.
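One simple way to check that a discarded collection has been recreated faithfully, or that a collaborator holds the same version, is to compare content digests. The sketch below is an assumption about how this could be done with the Python standard library; the function name `collection_digest` is hypothetical and this is not necessarily how the data cell implements version control.

```python
import hashlib
import json


def collection_digest(records):
    """Return a SHA-256 hex digest of JSON-serialisable records.

    The digest ignores both the order of the records and the order of
    keys inside each record, so it depends on content only.
    """
    # Serialise each record canonically (sorted keys, no extra whitespace),
    # then sort the serialised lines so record order does not matter.
    lines = sorted(
        json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


original = [
    {"_id": "a", "host": "mosquito"},
    {"_id": "b", "host": "human"},
]
# Same content, different record order and key order.
recreated = [
    {"host": "human", "_id": "b"},
    {"host": "mosquito", "_id": "a"},
]
same = collection_digest(original) == collection_digest(recreated)
```

A digest like this works offline and could be stored next to the log of steps used to build the collection.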

We implement such a structure, called data cell, which can be manipulated through a toolset called zoo, as we describe in a whitepaper.

[1]: Brum, J. R. et al. Patterns and ecological drivers of ocean viral communities. Science 348, 1261498 (2015).
[2]: Shi, M. et al. Redefining the invertebrate RNA virosphere. Nature 540, 539–543 (2016).
[3]: Doyle, A. C. A Study in Scarlet, Part 1, Chapter 5 (1887).

