75 changes: 37 additions & 38 deletions paper/main.md
@@ -43,7 +43,7 @@ Fundamentally, Gen's data model is similar to most graph models where sequences
connecting sequences in various orientations. Gen expands on this common model by defining an edge as a connection
between positions and strands within nodes. This enables Gen to model changes in an append-only mode, where changes to
the graph require only addition of data. This is a significant advantage, as the traditional model requires splitting
nodes into sub-nodes when changes are made, introducing computational overhead for many operations (@fig:graph_model).
nodes into sub-nodes when changes are made, introducing overhead for many operations (@fig:graph_model).

![**Overview of Graph data model**\
**a.** Traditional graph where sequences are encoded as nodes and edges connect two nodes together.
@@ -70,11 +70,12 @@ from the start coordinate of the region being deleted to the end coordinate of t
The first 2 rows of edges correspond to the addition of the TG sequence and the 3rd edge corresponds to the deletion.
](graph_updates/final.svg){#fig:graph_updates}

For modeling real world data, such as chromosomes, Gen represents a contig using one directed
graph of nodes and edges. This grouping is termed a Block Group and a join table between Block Groups and Edges is used
to record all the edges in a block group. Because of the additive nature of insertions, replacements, and deletions,
block groups can only grow over time. The block group represents all possible sequences that can be generated from the
graph.
For a set of sequences, such as a reference genome with multiple chromosomes, Gen represents each contig as one directed
graph of nodes and edges. The set of nodes and edges for a contig is termed a Block Group, and a join table between
Block Groups and Edges records all the edges in a block group. Because insertions, replacements, and deletions are
additive, block groups can only grow over time. The block group represents all possible sequences
that can be generated from a chromosome's graph. Edges can also exist between block groups, which can model events
such as translocations.
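
The append-only behavior described above can be illustrated with a minimal Python sketch. It is not Gen's implementation: the `Edge` and `BlockGroup` classes, their field names, and the coordinate convention (an edge leaves a node after `source_coord` bases and enters the target at `target_coord`) are assumptions made for illustration only. An insertion adds one node and two edges, a deletion adds a single edge that skips the deleted region, and the existing node is never split or rewritten.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A connection between (node, position, strand) pairs. Field names are hypothetical."""
    source_node: str
    source_coord: int
    source_strand: str
    target_node: str
    target_coord: int
    target_strand: str


class BlockGroup:
    """Hypothetical container for one contig's graph. Data is only ever added."""

    def __init__(self, name):
        self.name = name
        self.nodes = {}     # node id -> sequence, never modified after creation
        self.edges = set()  # stands in for the block group / edge join table

    def add_node(self, node_id, sequence):
        if node_id in self.nodes:
            raise ValueError(f"node {node_id!r} already exists")
        self.nodes[node_id] = sequence

    def insert(self, node, position, new_node, sequence):
        """Insert `sequence` at `position` of `node`: one new node, two new edges."""
        self.add_node(new_node, sequence)
        self.edges.add(Edge(node, position, "+", new_node, 0, "+"))
        self.edges.add(Edge(new_node, len(sequence), "+", node, position, "+"))

    def delete(self, node, start, end):
        """Delete [start, end) of `node` with one edge that jumps over the region."""
        self.edges.add(Edge(node, start, "+", node, end, "+"))


bg = BlockGroup("chr1")
bg.add_node("ref", "ATCGATCG")
bg.insert("ref", 4, "ins", "TG")  # adds 1 node and 2 edges
bg.delete("ref", 5, 7)            # adds 1 edge
print(len(bg.nodes), len(bg.edges))
```

Because every change is expressed as new rows, the reference node keeps its original sequence and coordinates regardless of how many edits accumulate.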

## Graph Traversal

@@ -178,21 +179,17 @@ their sequences using a GUI and leverage Gen for tracking of work and graph base

## Sample Lineage

Gen represents lineage across updates and other operations using samples. For instance, importing a fasta file into a
new collection creates the null sample with block groups for each contig. If we then update that collection with a vcf
file, gen creates new samples specified in the vcf, and any new sample in the vcf is created as a child of the null
sample. Gen copies any graph data created by the fasta import in the null sample over to each child sample before
applying the updates specified for each sample.

Using samples seems to be the best fit for tracking lineage, instead of say updating block groups with new information
within an existing sample. A sample can be used to capture the two main use cases for gen operations: Intended edits,
and observed variations. An intended edit can be represented by updating an existing sample using say a fasta file plus
a region and coordinates. That operation would capture the integration of a cassette, and create a new sample with a
path that represents the updated sequence. That new sample would be "virtual" in the sense that it doesn't exist in the
real world, but is treated by gen as separate from any real world samples, which seems appropriate. Observed variations
can be represented by updating an existing sample using say a vcf file. That operation will create new samples, each of
which represents variations observed during sequencing. In this way, gen spans both intended and unintended genetic
changes, and can be used to track changes across repeated cycles of development and testing.
Gen represents lineage across updates and other operations using samples. A sample is a simple text identifier for a
set of block groups. The null sample is used to represent non-sample data, such as a reference genome or proteome.
For example, importing the human reference genome will create a set of block groups under the null sample. If sample
data from VCF files is then added, new samples will be created and derived from the corresponding contigs belonging to
the null sample. If nested changes are made, an explicit parent sample can be specified to support iterative engineering.

A common use of samples is tracking both intended engineering and observed variation. This addresses a common problem
in genetic engineering where only the intended modification is tracked and background variation is not. A sample can
be created with the intended engineering, and then updated with any additional mutations when needed. In this way,
Gen spans both intended and unintended genetic changes, and can be used to track changes across repeated cycles of
development and testing.
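
As a rough illustration of this lineage model, the Python sketch below treats a sample as a text identifier with an optional parent and uses `None` for the null sample. The `Collection` class, its method names, the string edge labels, and the choice to copy the parent's edges into each child are all assumptions for illustration. They are not Gen's API or storage layout.

```python
class Collection:
    """Hypothetical collection that tracks samples and their block groups."""

    def __init__(self):
        self.parents = {}                # sample name -> parent (None = null sample)
        self.block_groups = {None: {}}   # sample -> {contig name: set of edge labels}

    def import_reference(self, contigs):
        """Create one empty block group per contig under the null sample."""
        for contig in contigs:
            self.block_groups[None][contig] = set()

    def create_sample(self, name, parent=None):
        """Derive a new sample from `parent` (assumed here to start as a copy)."""
        self.parents[name] = parent
        self.block_groups[name] = {
            contig: set(edges) for contig, edges in self.block_groups[parent].items()
        }

    def add_edges(self, sample, contig, edges):
        """Record new edges for one sample; other samples are untouched."""
        self.block_groups[sample][contig] |= set(edges)

    def lineage(self, sample):
        """Return the chain from `sample` back to the null sample."""
        chain = [sample]
        while chain[-1] is not None:
            chain.append(self.parents[chain[-1]])
        return chain


collection = Collection()
collection.import_reference(["chr1", "chr2"])

# Intended engineering, derived from the null sample.
collection.create_sample("engineered")
collection.add_edges("engineered", "chr1", {"cassette_insertion"})

# Observed background variation, nested under the engineered sample.
collection.create_sample("sequenced_clone", parent="engineered")
collection.add_edges("sequenced_clone", "chr1", {"background_snp"})

print(collection.lineage("sequenced_clone"))
```

In this sketch the engineered sample records only the intended change, while the nested sample carries both the intended change and the observed mutation.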

## Pooling

@@ -216,30 +213,35 @@ and individual operations can be applied across branches, similar to the git che
](operations_view/final.svg){#fig:operations_view}

A set of operations can be collected into a patch, which is analogous to the git patch which represents a diff of how
the codebase is changed (@fig:dot_example). However, due to the purely
additive data model of
Gen, diffs are much simpler to create as there are no rewrites. Patches are stored as a gzip file and can be shared to
distribute changes. Viewing of patches is possible via the patch-view command, which will render a dot graph of changes
within the patch. By committing these patches and changes into git, this workflow enables many features common to
software development such as code review and continuous integration testing.
the codebase is changed (@fig:dot_example). However, due to the purely additive data model of Gen, diffs are much
simpler to create as there are no rewrites. Patches are stored as a gzip file and can be shared to distribute changes.
Viewing of patches is possible via the patch-view command, which will render a dot graph of changes within the patch.
By committing these patches and changes into git, this workflow enables many features common to software development
such as code review and continuous integration testing.

![**Changes to a sample created by patch-view and viewed in GraphViz.**\
**a.** A simple basepair change, a sequence swap, and a deletion are shown. A dashed line indicates the path along the
reference sequence whereas a solid line indicates paths adding sequences from other non-reference sources.
**b.** Viewing a combinatorial assembly of 6 parts across two
segments.](dot_example/final.svg){#fig:dot_example}

## Viewing Operations and Graphs

Demo of the view ui and operations editing/etc.

## Translating coordinate schemes

Annotations can be propagated through the graph structure. Thus, annotations on the reference genome can be translated
into the coordinates of new samples. Coordinates are translated with the following rules:

- If the start and end of the annotation range on the reference map to valid coordinates on the sample sequence, gen
translates the entire annotation to the later sample. Gen does this even if there is a gap in the middle of the later
sequence that doesn't correspond to anything in the reference sequence, say if a subsequence were replaced.
- If say the start of the range translates to a valid coordinate on the later sample, but the end does not (say because
a replacement wiped out the stretch of the reference sequence that contained the end), gen truncates the translated
annotation range to only include the stretch of sequence that the reference shares with the later sample.
translates the entire annotation to the later sample.
- If the start or end position is not present on the child sequence, the annotation is truncated to only include
the shared positions.

Annotations may also be translated using node identifiers instead of contig names, which we believe may be useful
for overlaying annotations in graph visualization software. Currently, BED and GFF formats are supported for
propagation and translation.
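
The two translation rules listed above can be sketched in Python as follows. This is a simplified illustration, not Gen's implementation. It assumes half-open, forward-strand coordinates and a hypothetical list of shared blocks `(ref_start, ref_end, child_start)`, sorted by `ref_start`, describing which reference stretches survive in the child sample. When an endpoint is missing, it truncates to the span between the first and last shared positions inside the annotation, which is one possible reading of the rule.

```python
from typing import List, Optional, Tuple

# (ref_start, ref_end, child_start): reference interval [ref_start, ref_end) is
# present in the child starting at child_start. Hypothetical representation.
Block = Tuple[int, int, int]


def map_position(blocks: List[Block], ref_pos: int) -> Optional[int]:
    """Map one reference position to the child, or None if it is not shared."""
    for ref_start, ref_end, child_start in blocks:
        if ref_start <= ref_pos < ref_end:
            return child_start + (ref_pos - ref_start)
    return None


def translate_range(blocks: List[Block], start: int, end: int) -> Optional[Tuple[int, int]]:
    """Translate the reference annotation [start, end) into child coordinates."""
    first = map_position(blocks, start)
    last = map_position(blocks, end - 1)
    if first is not None and last is not None:
        # Rule 1: both endpoints map, so the whole annotation is carried over,
        # even if the child has unrelated sequence in the middle.
        return first, last + 1

    # Rule 2: keep only the positions shared between reference and child.
    overlaps = [
        (max(start, ref_start), min(end, ref_end), ref_start, child_start)
        for ref_start, ref_end, child_start in blocks
        if max(start, ref_start) < min(end, ref_end)
    ]
    if not overlaps:
        return None  # nothing of the annotation survives in the child
    lo_ref, _, lo_block_start, lo_child = overlaps[0]
    _, hi_ref, hi_block_start, hi_child = overlaps[-1]
    return lo_child + (lo_ref - lo_block_start), hi_child + (hi_ref - hi_block_start)


# Child sample in which reference positions 8..11 were replaced by 6 new bases.
shared = [(0, 8, 0), (12, 20, 14)]

print(translate_range(shared, 2, 16))   # both ends shared
print(translate_range(shared, 5, 10))   # end falls in the replaced stretch
print(translate_range(shared, 9, 11))   # entirely inside the replaced stretch
```

A fuller implementation would also need to handle reverse-strand traversal and rearrangements, which this sketch ignores.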

## Exports

@@ -285,13 +287,14 @@ for importing the reference human genome and updating it with variant data from
Table: my caption {#tbl:test}

| Task | Time | Storage |
| ----------------------------------------------------------------------------- | ----- | ------- |
|-------------------------------------------------------------------------------| ----- | ------- |
| Importing GRCh38 (shallow) | 7s | 496 kB |
| Importing GRCh38 (full) | 49s | 3.0 GB |
| Adding variants from chr22 of 1000 genomes project (HG00096 sample) | 7.7s | 21.8 MB |
| Adding variants from chr22 of 1000 genomes project (HG00097 sample) | 7.2s | 21.7 MB |
| Adding variants from chr22 of 1000 genomes project (HG00096 + HG00097 sample) | 14.9s | 32.6 MB |
| Translating xxx from A to B (gff coordinate example) | xxxs | yyy Mb |
| Propagating xxx from A to B (gff coordinate example) | xxxs | yyy Mb |
| Translating GenCode xxx for chr22 | xxxs | yyy Mb |
| Export of GFA stuff | xxxs | yyy Mb |

Shallow imports of data record only minimal information and do not store the genome sequence. This allows the database
@@ -302,10 +305,6 @@ For updating with variant data, the size of each vcf file was approximately 17.1
within the same graph can lead to improved data compression as variants shared between samples require less data to be
created.
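
The size of this effect can be checked with a short calculation using the storage figures from the table. The snippet below is only a worked example. It assumes each table row reports the storage added by that import, and the variable names are illustrative.

```python
# Storage reported in the table for chr22 variants, in megabytes.
separate_mb = {"HG00096": 21.8, "HG00097": 21.7}
combined_mb = 32.6  # both samples added to the same graph

separate_total = sum(separate_mb.values())
saved = separate_total - combined_mb
fraction = saved / separate_total

print(
    f"{separate_total:.1f} MB separately vs {combined_mb:.1f} MB combined: "
    f"{saved:.1f} MB saved ({fraction:.0%})"
)
# prints "43.5 MB separately vs 32.6 MB combined: 10.9 MB saved (25%)"
```

By contrast, the reported import times are close to additive (7.7s + 7.2s against 14.9s), so in these figures the benefit of sharing a graph shows up in storage rather than in import time.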

translating annotations

data export time

# Discussion

- The lack of a growable database. Minor changes required regenerating the entire graph and re-exporting it. Graph
Binary file modified paper/main.pdf
Binary file not shown.