Improvements to normally distributed data

The simulated genome in this figure was generated using mode 2.
1.	A real A. thaliana genome and a curated TE library were used as input files. (Note: a TE library is a collection of TE consensus sequences, one representing each TE family. In a genome, each TE family can have multiple members, whose sequences may be identical to the consensus sequence or may have diverged from it to varying degrees.)
2.	TE sequences in the Arabidopsis genome were detected with RepeatMasker using the script mask_TE.py. The detected TE sequences were then removed from the genome.
3.	RepeatMasker also calculates the sequence divergence of each detected TE sequence with respect to its corresponding TE consensus sequence. Sequence divergence was converted to sequence identity (identity = 1 - divergence).
4.	The mean and standard deviation (sd) of sequence identity were then calculated for each TE family (line 176 of summarise_rm_out.py).
5.	The mean and sd from step 4 were compiled with other information into a table before the simulation (prep_sim_TE_lib_mode2.py), and were then used to generate a normal distribution for each TE family (line 309 of TE_sim_random_insertion.py).
6.	Because TE sequences with higher sequence identity are more likely to be detected, the final identity used for simulation is adjusted as follows: `identity_fix = identity + (100 - identity) * 0.5` (line 312 of TE_sim_random_insertion.py)
7.	The TE sequence identity of the real genome was obtained from step 3.
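
Steps 3 to 6 can be sketched roughly as follows. This is only an illustration, not the code in summarise_rm_out.py or TE_sim_random_insertion.py: the function names `family_stats`, `adjust_identity`, and `sample_identity` are hypothetical, identity is assumed to be on a percent scale (0 to 100, as the formula in step 6 suggests), the sample standard deviation is assumed, and clamping the sampled value to the valid range is an added assumption.

```python
import random
import statistics
from collections import defaultdict


def family_stats(hits):
    """Return {family: (mean, sd)} of percent identity.

    `hits` is an iterable of (family, percent_divergence) pairs, such as
    could be parsed from RepeatMasker output (parsing is not shown here).
    """
    by_family = defaultdict(list)
    for family, divergence in hits:
        # Assumption: percent scale, so identity = 100 - divergence.
        by_family[family].append(100.0 - divergence)
    stats = {}
    for family, identities in by_family.items():
        mean = statistics.mean(identities)
        # Sample sd; a family with a single member gets sd = 0 (assumption).
        sd = statistics.stdev(identities) if len(identities) > 1 else 0.0
        stats[family] = (mean, sd)
    return stats


def adjust_identity(identity):
    """Apply the adjustment quoted in step 6 (percent scale)."""
    return identity + (100 - identity) * 0.5


def sample_identity(mean, sd, rng):
    """Draw one identity from Normal(mean, sd), then adjust it."""
    raw = rng.gauss(mean, sd)
    # Assumption: clamp to [0, 100] so the result stays a valid percentage.
    raw = min(max(raw, 0.0), 100.0)
    return adjust_identity(raw)


if __name__ == "__main__":
    rng = random.Random(0)
    stats = family_stats([("famA", 10.0), ("famA", 20.0), ("famB", 5.0)])
    for family, (mean, sd) in stats.items():
        print(family, round(mean, 2), round(sd, 2), sample_identity(mean, sd, rng))
```

Note that the adjustment halves the distance between the sampled identity and 100, so every simulated identity ends up in the upper half of the range, which is consistent with the stated aim of compensating for the lower detectability of diverged copies.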


Improvements to normally distributed data #13
