The simulated genome in this figure was generated using mode 2.
- A real A. thaliana genome and a curated TE library were used as input files. (Note: a TE library is a collection of TE consensus sequences, one representing each TE family. In a genome, each TE family can have multiple members, whose sequences may be identical to the consensus sequence or may have diverged from it to varying degrees.)
- TE sequences in the Arabidopsis genome were detected with RepeatMasker using the script mask_TE.py. The detected TE sequences were then removed from the genome.
- RepeatMasker also calculates the sequence divergence of each detected TE sequence relative to the corresponding TE consensus sequence. Sequence divergence was converted to sequence identity (identity = 1 - divergence).
- The mean and standard deviation (sd) of sequence identity were then calculated for each TE family (line 176 of summarise_rm_out.py).
- The mean and sd from step 4 were compiled with other information into a table before the simulation (prep_sim_TE_lib_mode2.py), and were then used to define a normal distribution of sequence identity for each TE family (line 309 of TE_sim_random_insertion.py).
- Because TE sequences with higher sequence identity are more likely to be detected, the final identity for simulation is adjusted as follows:
identity_fix = identity + (100 - identity) * 0.5 (line 312 of TE_sim_random_insertion.py)
- The TE sequence identities for the real genome were taken from step 3.
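
The identity handling in the steps above can be sketched roughly as follows. This is a minimal illustration, not the code in summarise_rm_out.py or TE_sim_random_insertion.py: the function names are hypothetical, identity is assumed to be on a 0-100 percent scale (as the adjustment formula implies), the sample standard deviation is an assumption, and clamping the sampled value to the 0-100 range is an added safeguard that the original scripts may not apply.

```python
import random
import statistics
from collections import defaultdict


def family_identity_stats(hits):
    """Summarise identity per TE family.

    hits: iterable of (family, divergence_percent) pairs, e.g. parsed from
    RepeatMasker output (the parsing itself is not shown here).
    Returns {family: (mean_identity, sd_identity)} in percent.
    """
    by_family = defaultdict(list)
    for family, divergence in hits:
        # Assumption: divergence is a percentage, so identity = 100 - divergence.
        by_family[family].append(100.0 - divergence)

    stats = {}
    for family, identities in by_family.items():
        mean = statistics.mean(identities)
        # A family with a single member has no spread; use sd = 0 (assumption).
        sd = statistics.stdev(identities) if len(identities) > 1 else 0.0
        stats[family] = (mean, sd)
    return stats


def adjust_identity(identity):
    """Apply the adjustment: identity_fix = identity + (100 - identity) * 0.5."""
    return identity + (100.0 - identity) * 0.5


def sample_identity(mean, sd, rng):
    """Draw one identity from the family's normal distribution, then adjust it."""
    raw = rng.gauss(mean, sd)
    raw = min(max(raw, 0.0), 100.0)  # clamp to a valid percentage (assumption)
    return adjust_identity(raw)


# Made-up example values, for illustration only.
example_hits = [("FamA", 10.0), ("FamA", 20.0), ("FamB", 5.0)]
example_stats = family_identity_stats(example_hits)
rng = random.Random(0)
simulated = {fam: sample_identity(m, s, rng) for fam, (m, s) in example_stats.items()}
```

The adjustment halves the distance between the sampled identity and 100, so a sampled identity of 80 becomes 90.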