Commit 5b2280b

adding detail on file format and parsing
1 parent aee7539 commit 5b2280b

7 files changed

+194
-1
lines changed

docs/assets/attributes_diagram.png

548 KB

docs/assets/records_diagram.pdf

25.4 KB
Binary file not shown.

docs/assets/records_diagram.png

25.6 KB

docs/formats_and_parsing.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
1+
**File types and Parsing behaviour**
2+
3+
**GenBank (.gbk) & EMBL (.embl)**
4+
5+
6+
- These formats can also be converted to GFF3.
7+
8+
Each genome file is a plain-text file that follows a specific ruleset. GenBank (gbk) and EMBL carry similar information but differ in the required formatting.
9+
10+
11+
*Structure and Parsing*
12+
13+
The top level is the Records type. There is one Records instance per genome file.
14+
Three types of macro provide getters and setters for the data: SourceAttributes, FeatureAttributes and SequenceAttributes.
15+
16+
The next level is the Record type. A Records may hold one or many Record entries (up to ~2000, but more usually ~50). Each Record has a DNA sequence, which is calculated on the fly by slicing the total sequence of Records with the start and stop coordinates. Each Record also has a SourceAttributes macro, which stores the ID and the total start and stop of the Record sequence (these differ from the start and stop of the CDS features described below). It also stores the Organism, among other database comments.
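As a rough illustration of the slicing described above, the sketch below cuts one record's DNA out of a longer stored sequence. The function name `record_sequence` is hypothetical, and the use of 1-based inclusive coordinates (the GenBank/EMBL convention) is an assumption about how the crate stores start and stop; it is not the crate's actual API.

```rust
/// Hypothetical helper (not the crate's API): returns the part of `total`
/// covered by one record. Assumes 1-based, inclusive coordinates and
/// ASCII sequence data.
pub fn record_sequence(total: &str, start: usize, stop: usize) -> Option<&str> {
    if start == 0 || start > stop {
        return None; // not a valid 1-based interval
    }
    // Convert to Rust's 0-based, half-open range; `get` returns None
    // when `stop` runs past the end of the stored sequence.
    total.get(start - 1..stop)
}
```

With these assumptions, `record_sequence("AAAACCCCGGGG", 5, 8)` yields the four `C` bases, and out-of-range coordinates yield `None` rather than a panic.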
17+
18+
![explanatory diagram for the file datatypes](assets/records_diagram.png){ loading=lazy }
19+
20+
The full structure of the SourceAttributes, FeatureAttributes and SequenceAttributes is:
21+
22+
![explanatory diagram for the Attribute macros](assets/attributes_diagram.png){ width=500 }
23+
24+
SourceAttributes stores the following in an enum:
25+
26+
```
pub enum SourceAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    Organism { value: String },
    MolType { value: String },
    Strain { value: String },
    CultureCollection { value: String },
    TypeMaterial { value: String },
    DbXref { value: String },
}
```
38+
39+
Where RangeValue can be any one of:
40+
41+
```
RangeValue::Exact(value)
RangeValue::LessThan(value)
RangeValue::GreaterThan(value)
```
46+
47+
Most RangeValues are Exact(value). The exceptions usually occur at the start and end of sequences, where they indicate truncation.
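To make the three variants concrete, here is a minimal sketch of how one end of a location could be mapped onto them. GenBank and EMBL feature locations mark partial ends with `<` and `>` (for example `<1..206`). The `u32` payload, the derive list, and the functions `parse_range_value` and `coordinate` are assumptions for illustration only; the crate may define and parse these differently.

```rust
/// Mirrors the three variants listed above; the `u32` payload is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RangeValue {
    Exact(u32),
    LessThan(u32),
    GreaterThan(u32),
}

/// Hypothetical parser for one end of a location such as `<1`, `>1500` or `42`.
pub fn parse_range_value(text: &str) -> Option<RangeValue> {
    let text = text.trim();
    if let Some(rest) = text.strip_prefix('<') {
        rest.parse::<u32>().ok().map(RangeValue::LessThan)
    } else if let Some(rest) = text.strip_prefix('>') {
        rest.parse::<u32>().ok().map(RangeValue::GreaterThan)
    } else {
        text.parse::<u32>().ok().map(RangeValue::Exact)
    }
}

/// Hypothetical accessor: the numeric coordinate regardless of variant,
/// which is what a caller needs when slicing a sequence.
pub fn coordinate(value: RangeValue) -> u32 {
    match value {
        RangeValue::Exact(v) | RangeValue::LessThan(v) | RangeValue::GreaterThan(v) => v,
    }
}
```

Keeping the truncation marker in the type, instead of discarding it at parse time, lets downstream code decide whether a partial feature should be translated or skipped.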
48+
49+
Note that the start and stop of SourceAttributes relate to the sequence of the whole record.
50+
51+
Each Record can have no coding sequences (CDS) or many hundreds of them; they are stored in FeatureAttributes.
52+
These are the predicted genes, and each carries annotation data such as locus_tag (id), gene (may be empty), start, stop, strand (-1 or +1), codon start (1, 2 or 3) and product.
53+
54+
FeatureAttributes stores the following in an enum:
55+
56+
```
pub enum FeatureAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    Gene { value: String },
    Product { value: String },
    CodonStart { value: u8 },
    Strand { value: i8 },
    // ec_number { value: String }
}
```
67+
68+
Currently EC_number is commented out, but it could be added back if there is demand.
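The sketch below shows one way the annotation values of a single CDS could be read back out of the FeatureAttributes enum by pattern matching. It assumes, purely for illustration, that the attributes of one CDS are held as a slice of enum values; `product_of` and `strand_of` are hypothetical helpers, not the getters the crate's macros generate, and the `RangeValue` payload type is likewise assumed.

```rust
// Simplified stand-ins for the types shown above; the `u32` payload of
// RangeValue is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum RangeValue {
    Exact(u32),
    LessThan(u32),
    GreaterThan(u32),
}

#[derive(Debug, Clone, PartialEq)]
pub enum FeatureAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    Gene { value: String },
    Product { value: String },
    CodonStart { value: u8 },
    Strand { value: i8 },
}

/// Hypothetical getter: the product annotation of one CDS, if present.
pub fn product_of(attrs: &[FeatureAttributes]) -> Option<&str> {
    attrs.iter().find_map(|attr| match attr {
        FeatureAttributes::Product { value } => Some(value.as_str()),
        _ => None,
    })
}

/// Hypothetical getter: the strand (-1 or +1) of one CDS, if present.
pub fn strand_of(attrs: &[FeatureAttributes]) -> Option<i8> {
    attrs.iter().find_map(|attr| match attr {
        FeatureAttributes::Strand { value } => Some(*value),
        _ => None,
    })
}
```

Returning `Option` reflects that some fields, such as gene, may be absent for a given CDS.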
69+
70+
Each CDS also has a DNA sequence .ffn (calculated on the fly from the start, stop and strand) and a protein sequence .faa (translated on the fly from the start, stop, strand and codon_start). The sequences are stored in the SequenceAttributes.
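The two on-the-fly calculations described above can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: the function names are hypothetical, coordinates are taken to be 1-based and inclusive, and translation uses the standard genetic code with no special handling of alternative start codons.

```rust
/// Reverse-complements a DNA string; bases other than A/C/G/T become `N`.
pub fn reverse_complement(dna: &str) -> String {
    dna.chars()
        .rev()
        .map(|base| match base.to_ascii_uppercase() {
            'A' => 'T',
            'T' => 'A',
            'C' => 'G',
            'G' => 'C',
            _ => 'N',
        })
        .collect()
}

/// Hypothetical helper: nucleotide sequence (.ffn) of one CDS.
/// Assumes 1-based inclusive coordinates; a negative strand is reverse-complemented.
pub fn cds_ffn(dna: &str, start: usize, stop: usize, strand: i8) -> Option<String> {
    if start == 0 || start > stop {
        return None;
    }
    let region = dna.get(start - 1..stop)?; // None if `stop` is out of range
    Some(if strand < 0 {
        reverse_complement(region)
    } else {
        region.to_ascii_uppercase()
    })
}

/// Hypothetical helper: protein sequence (.faa) from a CDS nucleotide sequence.
/// `codon_start` is 1, 2 or 3; unknown codons become `X`, stop codons `*`,
/// and a trailing partial codon is dropped.
pub fn translate(ffn: &str, codon_start: u8) -> String {
    // Standard genetic code, indexed with bases in the order T, C, A, G.
    const BASES: &str = "TCAG";
    const AMINO: &[u8] = b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
    let offset = usize::from(codon_start.saturating_sub(1));
    let bases: Vec<char> = ffn.chars().skip(offset).collect();
    bases
        .chunks_exact(3)
        .map(|codon| {
            let mut index = 0;
            for base in codon {
                match BASES.find(base.to_ascii_uppercase()) {
                    Some(i) => index = index * 4 + i,
                    None => return 'X', // ambiguous base such as N
                }
            }
            AMINO[index] as char
        })
        .collect()
}
```

For example, under these assumptions a forward-strand CDS at 3..11 of `CCATGAAATAAGG` gives `ATGAAATAA`, which translates with codon start 1 to `MK*`.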
71+
72+
SequenceAttributes stores the following in an enum:
73+
74+
```
pub enum SequenceAttributes {
    Start { value: RangeValue },
    Stop { value: RangeValue },
    SequenceFfn { value: String },
    SequenceFaa { value: String },
    CodonStart { value: u8 },
    Strand { value: i8 },
}
```
84+
85+
Note that the start, stop, strand and codon start of SequenceAttributes and FeatureAttributes are identical.
86+
87+
Sequences are stored separately in SequenceAttributes for efficiency. Although start, stop, locus_tag, and strand are duplicated in SequenceAttributes and FeatureAttributes, keeping them alongside the sequences may make it easier to slice the sequence and access the corresponding feature metadata at the same time.
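One way to picture this separation is two stores that share the locus_tag as their key, so a caller can fetch annotation alone or annotation plus sequence. Everything in the sketch below is hypothetical: `CdsStore`, `FeatureInfo` and `SequenceInfo` are simplified structs invented for illustration and do not reflect how the crate's macros actually lay out the data.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for the annotation of one CDS.
#[derive(Debug, Clone, PartialEq)]
pub struct FeatureInfo {
    pub product: String,
    pub strand: i8,
}

/// Simplified, hypothetical stand-in for the sequences of one CDS.
#[derive(Debug, Clone, PartialEq)]
pub struct SequenceInfo {
    pub ffn: String,
    pub faa: String,
}

/// Hypothetical container: two maps that share the locus_tag key.
#[derive(Default)]
pub struct CdsStore {
    features: HashMap<String, FeatureInfo>,
    sequences: HashMap<String, SequenceInfo>,
}

impl CdsStore {
    /// Stores annotation and sequences for one CDS under the same key.
    pub fn insert(&mut self, locus_tag: &str, feature: FeatureInfo, sequence: SequenceInfo) {
        self.features.insert(locus_tag.to_string(), feature);
        self.sequences.insert(locus_tag.to_string(), sequence);
    }

    /// Annotation-only lookup; the sequence map is never touched.
    pub fn product(&self, locus_tag: &str) -> Option<&str> {
        self.features.get(locus_tag).map(|f| f.product.as_str())
    }

    /// Combined lookup: annotation and sequences for one CDS.
    pub fn lookup(&self, locus_tag: &str) -> Option<(&FeatureInfo, &SequenceInfo)> {
        Some((self.features.get(locus_tag)?, self.sequences.get(locus_tag)?))
    }
}
```

The design trade-off is the one the text names: some fields are duplicated, but annotation queries do not have to carry the (much larger) sequence strings with them.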
88+
89+
90+
91+
92+
93+
94+
95+
96+
97+
98+

docs/formats_and_parsing.md~

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
**File types and Parsing behaviour**
2+
3+
**Genbank (.gbk) & Embl (.embl)**
4+
5+
6+
- We also provide the ability to convert these formats to gff3
7+
8+
Each genome file is a basic text file following a specific ruleset and genbank (gbk) and embl are similar but differ in the required formatting.
9+
10+
11+
*Structure and Parsing*
12+
13+
Top Level is the Records type. There is one Records type per genome file.
14+
Three types of macro have getters and setters for the data, these are SourceAttributes, FeatureAttributes, SequenceAttributes.
15+
16+
The next level is the Record type. There may be one or many Record in a Records (up to ~ 2000 but more usually ~ 50). Each Record has a DNA sequence which is calculated on the fly by slicing the total sequence of Records with the start and stop coordinates. Each Record also has a SourceAttributes macro which stores ID, total start and stop of the Record sequence (different to the CDS features start and stop below). It also stores the Organism among some other database comments.
17+
18+
![explanatory diagram for the file datatypes](assets/records_diagram.png){ loading=lazy }
19+
20+
SourceAttributes stores the following in an enum:
21+
22+
```
23+
pub enum SourceAttributes {
24+
Start { value: RangeValue },
25+
Stop { value: RangeValue },
26+
Organism { value: String },
27+
MolType { value: String},
28+
Strain { value: String},
29+
CultureCollection { value: String},
30+
TypeMaterial { value: String},
31+
DbXref { value:String}
32+
}
33+
```
34+
35+
Where RangeValue can be either of:
36+
37+
```
38+
RangeValue::Exact(value)
39+
RangeValue::LessThan(value)
40+
RangeValue::GreaterThan(value)
41+
```
42+
43+
Most RangeValues are Exact(value) with exceptions usually at the start and end of sequences, indicating that they are truncated.
44+
45+
Note that the start and stop of SourceAttributes relate to the sequence of the whole record
46+
47+
Each Record can have None or many hundreds of coding sequences, CDS (stored in the FeatureAttributes).
48+
These are the predicted genes and contain annotation data per gene such as locus_tag (id), gene (may be empty), start, stop, strand (-1 or +1), codon start (1,2 or 3), product.
49+
50+
FeatureAttributes stores the following in an enum:
51+
52+
```
53+
pub enum FeatureAttributes {
54+
Start { value: RangeValue },
55+
Stop { value: RangeValue },
56+
Gene { value: String },
57+
Product { value: String },
58+
CodonStart { value: u8 },
59+
Strand { value: i8 },
60+
// ec_number { value: String }
61+
}
62+
```
63+
64+
currently EC_number is commented out but could be added back if there is demand
65+
66+
Each CDS also has a DNA sequence .ffn (calculated on the fly from the start, stop and strand) and a protein sequence .faa (translated on the fly from the start, stop, strand and codon_start). The sequences are stored in the SequenceAttributes.
67+
68+
SequenceAttributes stores the following in an enum:
69+
70+
```
71+
pub enum SequenceAttributes {
72+
Start { value: RangeValue },
73+
Stop { value: RangeValue },
74+
SequenceFfn { value: String },
75+
SequenceFaa { value: String },
76+
CodonStart { value: u8 },
77+
Strand { value: i8 },
78+
}
79+
```
80+
81+
Note the start, stop, strand and codon start of SequenceAttributes and FeatureAttributes are identical
82+
83+
Sequences are stored separately in SequenceAttributes for efficiency. Although start, stop, locus_tag, and strand are duplicated in SequenceAttributes and FeatureAttributes, keeping them together may make it easier to slice the sequence and access its metadata at the same time.
84+
85+
86+
87+
88+
89+
90+
91+
92+
93+
94+

docs/installation.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ For Linux:
2626
For MacOSX:
2727
```export DYLD_LIBRARY_PATH=[directory where python is located]```
2828

29-
For Windows please see this StackOverflow issue for a fix: [fix here](https://stackoverflow.com/questions/79627918/cant-set-python-version-when-running-rust-analyzer-and-pyo3-on-wsl/79627921#79627921)
29+
For Windows please see this StackOverflow issue for a fix: [fix here](https://stackoverflow.com/questions/79627918/cant-set-python-version-when-running-rust-analyzer-and-pyo3-on-wsl/79627921#79627921) and please also see our specific windows_install page
3030

3131

3232

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ nav:
2424
- Installation: installation.md
2525
- Windows Install: windows_install.md
2626
- Usage: usage.md
27+
- Formats & Parsing: formats_and_parsing.md
2728
features:
2829
- navigation.tabs
2930
markdown_extensions:
