
The problem and desiderata


Microbial bioinformatics is detective work, and a remark by one of that craft's great exponents applies to it directly (paraphrased at the end of the next paragraph).

**The problem**

Through extensive sequencing efforts across many ecological systems, from ocean water [^seq1] to invertebrates [^seq2] and other species, a rich diversity of viral genomes is being discovered. Their host ranges partly overlap: a single viral species often infects multiple hosts at varying frequencies, and different viruses share hosts. Genome architecture is just as variable, ranging from small to large genomes and from unsegmented to segmented ones. These discoveries often redefine taxonomic classes. From a practical point of view, existing data structures need to be extended to accommodate this newfound diversity. In the detective work of the microbial bioinformatician, "[one's data structures] must be as broad as Nature, if they are to interpret Nature", to paraphrase Sherlock Holmes [^scarlet].

**Current solutions**

Current solutions fall short in three respects: integration, flexibility and availability.

**Integration:** The sheer multitude and heterogeneity of current databases make integrating data a very tedious exercise. Each database comes with its own format requirements, interfaces and documentation conventions. Combining a dataset from multiple sources thus requires much time, love and effort. This variation in the database landscape reflects the nature of biological data, which are at best fragmented, inconsistent and simply messy. With newly emerging technologies, we expect this problem only to worsen.

**Flexibility:** The data subsets generated from these databases usually have a flat structure and do not lend themselves to more than a few angles on the data; richer representations, whether nested or relational, and newly emerging formats such as the graph-based assembly format (GFA 2.0) are poorly accommodated. Further model iterations then usually require crafting a new data subset, which is inefficient.

**Availability:** Data needs to be shared with collaborators online, yet a database should also be easy to set up and should work locally and offline first, for example in remote areas without a reliable connection. In addition, many existing resources struggle with continuous support and maintenance.

**Desiderata**

From these shortcomings we can directly derive three desiderata for a data structure in (viral) bioinformatics. Ideally, we would like a dynamic view on a subset of the available data space, which we will call a "data hub". To be useful, a data hub should satisfy the following:

**Flexibility:** First, a data hub needs to support rich data structures, e.g. entries that can be nested and/or linked. This is a necessity for representing biological data, which, in spite of curation efforts, remain mostly messy. As already mentioned, this trend towards more entropy is likely to continue, while curation resources remain flat.
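As a minimal sketch of what such a rich entry could look like, consider a virus record with nested genome segments and links to separately stored host entries; the field names and identifiers below are purely illustrative and do not refer to an existing schema:

```python
# A hypothetical nested-and-linked record: the virus entry nests its genome
# segments and references host entries by identifier instead of duplicating them.

virus = {
    "id": "virus:0001",
    "name": "example segmented RNA virus",            # illustrative entry
    "segments": [                                      # nested sub-records
        {"name": "L", "length": 6500, "accession": "ACC_L"},
        {"name": "S", "length": 1700, "accession": "ACC_S"},
    ],
    "hosts": ["host:insect_17", "host:plant_03"],      # links to other entries
}

hosts = {
    "host:insect_17": {"taxon": "Insecta", "frequency": 0.8},
    "host:plant_03": {"taxon": "Viridiplantae", "frequency": 0.2},
}

def resolve_hosts(entry, host_table):
    """Follow the host links of a virus entry and return the linked records."""
    return [host_table[h] for h in entry["hosts"]]

print(resolve_hosts(virus, hosts))
```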

**Integration:** Second, a data hub needs a well-designed API that allows complex queries of the raw data. Ideally, it would also offer convenience functions to combine data in informative ways. For example, we would like to query sequence data by taxonomy as well as by phylogeny, in addition to the basic aggregation based on sample metadata. The API should also link to downstream analyses with third-party tools, such as multiple sequence alignments and machine learning techniques.
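The following sketch illustrates the kind of interface meant here; the `DataHub` class and its methods are hypothetical placeholders rather than an existing API, and only show how a single raw data set could be queried from several angles and exported towards third-party tools:

```python
# Hypothetical query interface for a data hub: select entries by host taxonomy
# or by sample metadata, then export them for downstream tools (e.g. an aligner).

class DataHub:
    def __init__(self, records):
        self.records = records  # nested entries as in the sketch above

    def by_taxonomy(self, taxon):
        """Select entries whose host taxa include the given taxon."""
        return [r for r in self.records if taxon in r.get("host_taxa", [])]

    def by_metadata(self, **criteria):
        """Select entries by sample metadata, e.g. country='DE', year=2016."""
        return [r for r in self.records
                if all(r.get("metadata", {}).get(k) == v for k, v in criteria.items())]

    def to_fasta(self, entries):
        """Convenience export that can feed a multiple sequence aligner."""
        return "\n".join(f">{r['id']}\n{r['sequence']}" for r in entries)

hub = DataHub([
    {"id": "v1", "sequence": "ACGT", "host_taxa": ["Insecta"],
     "metadata": {"country": "DE", "year": 2016}},
])
print(hub.to_fasta(hub.by_taxonomy("Insecta")))
```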

**Availability:** Third, we will likely need to share data with collaborators and other stakeholders. Thus, the data hub needs to be implemented in a highly portable way, including easy and quick setup/deletion as well as version control. A hub is created for use in the context of a project and discarded afterwards, while all the steps needed to recreate it are logged. Furthermore, we need a registry of all available data hubs. An important aspect of portability is the ability to work locally and offline, which is why a data hub should not require access to cloud services at all times. However, any implementation needs to scale as well: it should not matter whether a data hub contains one thousand or one billion entries, with seamless upscaling in remote compute environments.
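To make the portability idea more concrete, here is a minimal sketch using only the Python standard library; the file names and helper functions are assumptions for illustration, not part of any existing tool. The hub lives in a single local SQLite file (so it works offline), every build step is appended to a plain-text log so the hub can be recreated, and discarding it amounts to deleting the file:

```python
# Sketch of a disposable, file-backed data hub: local and offline by default,
# trivial to set up and delete, with a log of the steps needed to rebuild it.

import json
import sqlite3
from pathlib import Path

HUB = Path("project_hub.sqlite")   # hypothetical per-project hub file
LOG = Path("project_hub.log")      # plain-text recipe to recreate the hub

def log_step(step):
    """Append one build step to the recreation log."""
    with LOG.open("a") as fh:
        fh.write(step + "\n")

def create_hub():
    con = sqlite3.connect(str(HUB))
    con.execute("CREATE TABLE IF NOT EXISTS entries (id TEXT PRIMARY KEY, doc TEXT)")
    log_step("create_hub")
    return con

def add_entry(con, entry):
    con.execute("INSERT OR REPLACE INTO entries VALUES (?, ?)",
                (entry["id"], json.dumps(entry)))
    con.commit()
    log_step("add_entry " + entry["id"])

def discard_hub():
    """The hub is disposable; only the log of steps remains."""
    if HUB.exists():
        HUB.unlink()
    log_step("discard_hub")

con = create_hub()
add_entry(con, {"id": "v1", "sequence": "ACGT"})
con.close()
discard_hub()
```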

[^scarlet]: A. C. Doyle, A Study in Scarlet (1887).
[^seq1]: Tara Oceans expedition (ocean water viromes).
[^seq2]: Shi et al., "Redefining the invertebrate RNA virosphere", Nature (2016).
