Commit 46b199f

committed
adding docs
0 parents  commit 46b199f

15 files changed

+754
-0
lines changed

docs/assets/BIO W.png

54.8 KB

docs/assets/MICROBIO B.svg

Lines changed: 190 additions & 0 deletions

docs/assets/attributes_diagram.png

548 KB

docs/assets/pc_specs.png

157 KB

docs/assets/records_diagram.pdf

25.4 KB
Binary file not shown.

docs/assets/records_diagram.png

25.6 KB

docs/assets/system_model.png

55.8 KB

docs/assets/window_code.png

165 KB

docs/formats_and_parsing.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
1+
**File types and Parsing behaviour**
2+
3+
**GenBank (.gbk) & EMBL (.embl)**
4+
5+
6+
- We also provide the ability to convert these formats to GFF3
7+
8+
Each genome file is a plain text file that follows a specific ruleset. GenBank (.gbk) and EMBL (.embl) are similar but differ in the required formatting.
9+
10+
11+
*Structure and Parsing*
12+
13+
The top level is the Records type. There is one Records type per genome file.
14+
Three macro types provide getters and setters for the data: SourceAttributes, FeatureAttributes and SequenceAttributes.
15+
16+
The next level is the Record type. A Records may contain one or many Record entries (up to ~2000, but more usually ~50). Each Record has a DNA sequence, which is calculated on the fly by slicing the total sequence of Records with the start and stop coordinates. Each Record also has a SourceAttributes macro, which stores the ID and the total start and stop of the Record sequence (distinct from the start and stop of the CDS features described below). It also stores the Organism, along with some other database comments.
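The on-the-fly slicing described above can be sketched as follows. This is only an illustration, not the library's implementation: the helper name `slice_record` is hypothetical, and the sketch assumes 1-based inclusive coordinates, as used in GenBank and EMBL files.

```rust
/// Hypothetical helper (not part of the library's API): returns the
/// sub-sequence for one record, assuming 1-based inclusive coordinates.
fn slice_record(total: &str, start: usize, stop: usize) -> Option<&str> {
    if start == 0 || start > stop || stop > total.len() {
        return None; // coordinates fall outside the total sequence
    }
    // Convert the 1-based inclusive range to Rust's 0-based half-open range.
    total.get(start - 1..stop)
}
```

For example, `slice_record("ATGCCGTA", 3, 6)` yields `Some("GCCG")`, and no sequence is copied until the caller needs an owned `String`.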
17+
18+
![explanatory diagram for the file datatypes](assets/records_diagram.png){ loading=lazy }
19+
20+
The full structure of the SourceAttributes, FeatureAttributes and SequenceAttributes is:
21+
22+
![explanatory diagram for the Attribute macros](assets/attributes_diagram.png){ width=500 }
23+
24+
SourceAttributes stores the following in an enum:
25+
26+
```
pub enum SourceAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    Organism { value: String },
    MolType { value: String },
    Strain { value: String },
    CultureCollection { value: String },
    TypeMaterial { value: String },
    DbXref { value: String },
}
```
38+
39+
Where RangeValue can be either of:
40+
41+
```
RangeValue::Exact(value)
RangeValue::LessThan(value)
RangeValue::GreaterThan(value)
```
46+
47+
Most RangeValues are Exact(value); the exceptions usually occur at the start and end of sequences, where they indicate that the sequence is truncated.
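As a rough illustration of how the three variants relate to location strings, the sketch below parses coordinates such as `<1`, `>2000` and `345`. The `u32` payload and the helper names `parse_range_value` and `is_truncated` are assumptions made for this example and may not match the library's own definitions.

```rust
// Assumed shape of RangeValue for this sketch; the payload type is a guess.
#[derive(Debug, Clone, PartialEq)]
enum RangeValue {
    Exact(u32),
    LessThan(u32),
    GreaterThan(u32),
}

/// Hypothetical parser for a single coordinate such as "<1", ">2000" or "345".
fn parse_range_value(text: &str) -> Option<RangeValue> {
    let text = text.trim();
    if let Some(rest) = text.strip_prefix('<') {
        rest.parse::<u32>().ok().map(RangeValue::LessThan)
    } else if let Some(rest) = text.strip_prefix('>') {
        rest.parse::<u32>().ok().map(RangeValue::GreaterThan)
    } else {
        text.parse::<u32>().ok().map(RangeValue::Exact)
    }
}

/// A coordinate that is not Exact marks a truncated end.
fn is_truncated(value: &RangeValue) -> bool {
    !matches!(value, RangeValue::Exact(_))
}
```

Keeping the `<` and `>` markers in the type, instead of discarding them, means downstream code can still tell a complete feature from one that runs off the end of a contig.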
48+
49+
Note that the start and stop of SourceAttributes relate to the sequence of the whole Record.
50+
51+
Each Record can have none, or many hundreds of, coding sequences (CDS), which are stored in the FeatureAttributes.
52+
These are the predicted genes and contain annotation data per gene, such as locus_tag (id), gene (may be empty), start, stop, strand (-1 or +1), codon start (1, 2 or 3) and product.
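Since these formats can be converted to GFF3, it may help to see how the per-gene fields listed above could map onto one GFF3 line. The function `cds_to_gff3` below is a hypothetical sketch, not the library's converter. It assumes plain integer coordinates, writes `.` for the source and score columns, and derives the GFF3 phase as codon start minus one.

```rust
/// Hypothetical sketch: formats one CDS as a tab-separated GFF3 line.
/// Attribute values are written as-is; a real converter would need to
/// escape reserved characters such as ';', '=' and ','.
fn cds_to_gff3(
    seqid: &str,
    locus_tag: &str,
    start: u32,
    stop: u32,
    strand: i8,
    codon_start: u8,
    product: &str,
) -> String {
    let strand_symbol = match strand {
        1 => '+',
        -1 => '-',
        _ => '.', // unknown or unstranded
    };
    // GFF3 phase counts the bases to skip, so codon start 1 becomes phase 0.
    let phase = codon_start.saturating_sub(1);
    format!(
        "{seqid}\t.\tCDS\t{start}\t{stop}\t.\t{strand_symbol}\t{phase}\tID={locus_tag};product={product}"
    )
}
```

The nine columns are, in order: sequence ID, source, feature type, start, end, score, strand, phase and attributes.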
53+
54+
FeatureAttributes stores the following in an enum:
55+
56+
```
pub enum FeatureAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    Gene { value: String },
    Product { value: String },
    CodonStart { value: u8 },
    Strand { value: i8 },
    // ec_number { value: String }
}
```
67+
68+
Currently EC_number is commented out, but it could be added back if there is demand.
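Because each attribute is an enum variant, reading a value comes down to pattern matching. The sketch below is illustrative only: it redefines the enums locally (with an assumed `u32` payload for `RangeValue`), and the helpers `product_of` and `strand_of` are hypothetical stand-ins for the getters that the library's macros provide.

```rust
// Local copies of the enums for this sketch; the RangeValue payload is assumed.
#[derive(Debug, Clone, PartialEq)]
enum RangeValue {
    Exact(u32),
    LessThan(u32),
    GreaterThan(u32),
}

#[derive(Debug, Clone, PartialEq)]
enum FeatureAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    Gene { value: String },
    Product { value: String },
    CodonStart { value: u8 },
    Strand { value: i8 },
}

/// Hypothetical getter: the first Product value among one CDS's attributes.
fn product_of(attrs: &[FeatureAttributes]) -> Option<&str> {
    attrs.iter().find_map(|attr| match attr {
        FeatureAttributes::Product { value } => Some(value.as_str()),
        _ => None,
    })
}

/// Hypothetical getter: the strand (-1 or +1) among one CDS's attributes.
fn strand_of(attrs: &[FeatureAttributes]) -> Option<i8> {
    attrs.iter().find_map(|attr| match attr {
        FeatureAttributes::Strand { value } => Some(*value),
        _ => None,
    })
}
```

Returning `Option` lets a caller handle a missing attribute, such as an empty gene name, without panicking.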
69+
70+
Each CDS also has a DNA sequence .ffn (calculated on the fly from the start, stop and strand) and a protein sequence .faa (translated on the fly from the start, stop, strand and codon_start). The sequences are stored in the SequenceAttributes.
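A minimal sketch of these two on-the-fly calculations is given below, under stated assumptions: coordinates are 1-based and inclusive, strand -1 means the reverse complement is taken, and translation uses the standard codon table, reading from codon start. The names `cds_ffn`, `translate` and `reverse_complement` are hypothetical and do not reflect the library's API.

```rust
// Standard genetic code, indexed with bases in the order T, C, A, G.
const AMINO_ACIDS: &[u8] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

fn base_index(base: u8) -> Option<usize> {
    match base.to_ascii_uppercase() {
        b'T' => Some(0),
        b'C' => Some(1),
        b'A' => Some(2),
        b'G' => Some(3),
        _ => None,
    }
}

fn reverse_complement(dna: &str) -> String {
    dna.bytes()
        .rev()
        .map(|base| match base.to_ascii_uppercase() {
            b'A' => 'T',
            b'T' => 'A',
            b'C' => 'G',
            b'G' => 'C',
            _ => 'N', // ambiguous bases are collapsed to N in this sketch
        })
        .collect()
}

/// DNA of one CDS (.ffn), from 1-based inclusive start/stop and strand.
fn cds_ffn(total: &str, start: usize, stop: usize, strand: i8) -> Option<String> {
    if start == 0 || start > stop {
        return None;
    }
    let slice = total.get(start - 1..stop)?;
    Some(if strand == -1 {
        reverse_complement(slice)
    } else {
        slice.to_string()
    })
}

/// Protein of one CDS (.faa); codon_start is 1, 2 or 3.
fn translate(ffn: &str, codon_start: u8) -> String {
    let offset = usize::from(codon_start.saturating_sub(1));
    ffn.as_bytes()
        .get(offset..)
        .unwrap_or_default()
        .chunks_exact(3) // a trailing partial codon is ignored
        .map(|codon| {
            let index = base_index(codon[0])
                .zip(base_index(codon[1]))
                .zip(base_index(codon[2]));
            match index {
                Some(((a, b), c)) => AMINO_ACIDS[16 * a + 4 * b + c] as char,
                None => 'X', // codon contains a non-ACGT base
            }
        })
        .collect()
}
```

Computing the sequences only when requested avoids holding three copies of every gene in memory, at the cost of repeating the slice and translation on each access.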
71+
72+
SequenceAttributes stores the following in an enum:
73+
74+
```
pub enum SequenceAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    SequenceFfn { value: String },
    SequenceFaa { value: String },
    CodonStart { value: u8 },
    Strand { value: i8 },
}
```
84+
85+
Note that the start, stop, strand and codon start of SequenceAttributes and FeatureAttributes are identical.
86+
87+
Sequences are stored separately in SequenceAttributes for efficiency. Although start, stop, locus_tag, and strand are duplicated in SequenceAttributes and FeatureAttributes, keeping them together may make it easier to slice the sequence and access the specific feature metadata at the same time.
88+
89+
90+
91+
92+
93+
94+
95+
96+
97+
98+

docs/index.md

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
# Welcome to micro**BioRust**
2+
3+
A blazing-fast, sustainable bioinformatics toolkit written in [Rust](https://www.rust-lang.org/) for microbial genomics research, optimised for functions used in data exploration.
4+
5+
---
6+
7+
## Features
8+
9+
- 🦀 Built in Rust programming language for speed and safety
10+
- 🔄 Python bindings _via_ PyO3 for interop - Rust meets Python
11+
- 📦 Open source and community-driven
12+
13+
---
14+
15+
## Get Started!!
16+
See Installation for details on how to install Rust for Linux, macOS and Windows.
17+
Interested in microbiorust-py? Check out the microbiorust-py section for quick-start & more!
18+
19+
Start a new project
20+
```cargo new microBioRust_test```
21+
22+
Add to your Cargo.toml
23+
```cargo add -p microBioRust```
24+
25+
to add the whole workspace, including file parsing, sequence metrics, data visualisation (a heatmap demonstration, coming soon) and Python bindings (microbiorust-py)
26+
27+
```cargo add -p seqmetrics```
28+
```cargo add -p heatmap```
29+
```cargo add -p microbiorust-py```
30+
31+
or clone the repo
32+
```git clone https://github.com/LCrossman/microBioRust.git```
33+
34+
Build the project
35+
```cargo build```
36+
37+
Run the tests
38+
```cargo test```
39+
