
The problem and desiderata


The problem

Through extensive sequencing efforts of samples from ocean water [1] to pools of invertebrates [2], a rich diversity of viral genomes is continuously being discovered. Their genomes are staggeringly diverse in structure, and a complex and partly overlapping host distribution is a common property. New discoveries regularly redefine taxonomic classes.

From a practical point of view, existing data structures need to expand to accommodate this diversity. The detective work of the microbial bioinformatician demands the same breadth: "One's [data structures] must be as broad as Nature, if they are to interpret Nature", to paraphrase Sherlock Holmes [3].

The unresolved problem is how to move data effectively:

  • between large sequence databases and downstream applications
  • between collaborators
  • through time (i.e. version control)

Effective data

For data to be effective, certain properties are commonly identified. For example, the FAIR Data Principles call for data to be findable, accessible, interoperable and reusable. Others rank desirable properties as a pyramid: sustainably stored data forms the foundation, shared data sits higher up, and trusted data, i.e. data that is comprehensive, reviewed and reproducible, sits close to the tip.

The established public sequence databases of the International Nucleotide Sequence Database Collaboration (INSDC) provide comprehensive and sustainable data infrastructure for the sharing, dissemination and publication of sequence, contextual and derived data, such as genome assemblies, epidemiological data relating to isolates and functional annotation. Indeed, all major publishers and funders require ultimate publication of these data through these databases. While the public databases provide the key technologies and capacity to deal with viral sequence data, they remain generalists across all species, viral and cellular. -- Guy Cochrane, personal communication

What these databases do not provide is a format that bridges the gap between the storage and querying of data on the one hand and the actual data analysis on the other.

Current solutions

Current solutions fall short in three respects: integration, flexibility and availability.

Integration: The current multitude and heterogeneity of databases make the composition and curation of multidimensional datasets a very tedious exercise. Each database comes with its own format requirements, interfaces and "documentation language". This variation in the database landscape reflects the nature of biological data, which is often fragmented, inconsistent and simply messy. With newly emerging technologies we expect this problem to only worsen.

Flexibility: A composed dataset usually has a flat structure and does not lend itself to more than a few queries of low complexity. Further iterations in the exploratory data analysis stage usually require crafting new datasets just to allow slightly different queries. This is inefficient.

Availability: Many datasets are either not made available at all, or only on institutional websites. This model has a short half-life (due to loss of interest, funding or maintenance), and so links break and data becomes inaccessible. Furthermore, access to large databases such as NCBI is usually rate-limited, and sometimes unavailable, effectively halting work for many.

The following projects implement, to some extent, this sort of genomics data structure, which sits between large-scale databases (NCBI, EBI, ...) and the downstream analysis (which they partially include):

The latter curates extensive collections of data from various sources, and some of the results are archetypal candidates for a data cell in zoo:

This project is a collaboration between the many groups that sequenced virus genomes during the 2014-2016 epidemic in West Africa. Most of the sequence data has been deposited in GenBank but this repository represents a comprehensive data set curated and annotated by the groups involved. The data is currently being analysed with the intent to publish a paper on the spatial and temporal dynamics inferred from virus genome sequences. The analyses and the methods and scripts that underly them will be posted to this repository in the spirit of Open Science and we welcome comments. We hope that early access to this data set and the downstream products such as time-calibrated trees and inferred spatial patterns will foster further research in this area. -- source

Desiderata

From these shortcomings we can directly derive three desiderata of a data structure in microbial bioinformatics:

First, it needs to have a rich data structure, e.g. allowing entries to be nested and/or linked. This is necessary to represent biological data, which in spite of curative efforts remains mostly messy. As already mentioned, this trend towards more entropy is likely to continue, while curative resources remain flat.
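To make this concrete, below is a minimal sketch of what such a nested, linked entry could look like. All field names and values (record, annotations, links, etc.) are illustrative assumptions for this sketch, not a prescribed schema.

```python
# Illustrative record for a single viral sequence entry. Field names and
# values are assumptions made for this sketch, not a prescribed schema.
record = {
    "_id": "seg_L_0001",                      # stable identifier of this entry
    "sequence": "ATGCGT...",                  # sequence truncated for readability
    "metadata": {                             # nested sample metadata
        "host": "Aedes aegypti",
        "sampling": {"country": "Ghana", "date": "2015-06-01"},
    },
    "annotations": [                          # nested: many features per entry
        {"type": "CDS", "start": 1, "end": 6000, "product": "RdRp"},
    ],
    "links": {                                # linked: references to other records
        "taxonomy": "Phenuiviridae",
        "relatives": ["seg_M_0001", "seg_S_0001"],
    },
}
```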

Second, this data structure needs a well-designed API, which allows complex queries of the raw data. Ideally, it would also offer convenience functions to combine data in informative ways. For example, we would like to query sequence data by taxonomy as well as phylogeny, in addition to the basic aggregation based on sample metadata. The API should link to downstream analyses with third-party tools, e.g. for multiple sequence alignments and machine learning.
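As an illustration only, the sketch below runs this kind of query with pymongo against a local collection of records like the one above, and writes the hits to a FASTA file for a downstream aligner. The collection name, field names and the query itself are assumptions for this sketch, not the actual zoo API.

```python
# Illustrative only: filter nested records by taxonomy and sampling date,
# then export the matching sequences for a third-party tool (e.g. an aligner).
from pymongo import MongoClient

client = MongoClient("localhost", 27017)          # assumes a local MongoDB instance
collection = client["zoo_example"]["sequences"]   # hypothetical database/collection

query = {
    "links.taxonomy": "Phenuiviridae",                 # taxonomic filter
    "metadata.sampling.date": {"$gte": "2015-01-01"},  # metadata-based filter
}

with open("subset.fasta", "w") as fasta:
    for doc in collection.find(query, {"_id": 1, "sequence": 1}):
        fasta.write(f">{doc['_id']}\n{doc['sequence']}\n")

# subset.fasta can now be passed to external tools such as a multiple
# sequence aligner, closing the loop between query and downstream analysis.
```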

Third, we likely need to share data with collaborators and other stakeholders. Thus, the data structure needs to be implemented in a highly portable way, including easy and quick setup/deletion as well as version control. The data structure is created for use in the context of a project, and can be discarded afterwards, while logging all the steps needed to recreate it if need be. Furthermore, we need to create a registry of all these available data collections.
An important property of being portable is the ability to work locally and offline. That is why a data hub should not require access to cloud services at all times. However, any implementation also needs to be scalable: it should not matter whether the data collection contains a thousand or a billion entries, with seamless upscaling in remote compute environments.

Data packages - a first step

Some progress has already been made in the design of a format that fulfills these desiderata. The concept of a Data package has the potential to bridge the gap between data storage and analysis:

Data Package is a simple container format used to describe and package a collection of data. The format provides a simple contract for data interoperability that supports frictionless delivery, installation and management of data.

Data Packages can be used to package any kind of data. At the same time, for specific common data types such as tabular data it has support for providing important additional descriptive metadata – for example, describing the columns and data types in a CSV.

The following core principles inform our approach:

  • Simplicity
  • Extensibility and customisation by design
  • Metadata that is human-editable and machine-usable
  • Reuse of existing standard formats for data
  • Language, technology and infrastructure agnostic
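For orientation, a minimal sketch of such a package descriptor is given below: a Python snippet that writes a datapackage.json describing a single CSV of sample metadata. The package name, file path and fields are made up for this example and cover only a small subset of the Data Package specification.

```python
# Write a minimal, hypothetical datapackage.json descriptor for one CSV
# resource. Names, paths and fields are made up for illustration.
import json

descriptor = {
    "name": "example-virus-samples",
    "resources": [
        {
            "name": "samples",
            "path": "data/samples.csv",
            "schema": {
                "fields": [
                    {"name": "accession", "type": "string"},
                    {"name": "host", "type": "string"},
                    {"name": "collection_date", "type": "date"},
                ]
            },
        }
    ],
}

with open("datapackage.json", "w") as handle:
    json.dump(descriptor, handle, indent=2)
```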

Incorporating ideas from the data package design, we implement the data cell, which can be manipulated through a toolset called zoo. This is described next in our whitepaper.


[1]: Brum, J. R. et al. Patterns and ecological drivers of ocean viral communities. Science 348, 1261498 (2015).
[2]: Shi, M. et al. Redefining the invertebrate RNA virosphere. Nature 540, 539–543 (2016).
[3]: Doyle, A. C. A Study in Scarlet, Part 1, Chapter 5 (1887).
