Improvements to normally distributed data #13

@JBris

The simulated genome in this figure was generated using mode 2.

  1. A real A. thaliana genome and a curated TE library were used as input files. (Note: a TE library is a collection of TE consensus sequences, one representing each TE family. In a genome, each TE family can have multiple members, whose sequences may be identical to the consensus sequence or may have diverged from it to varying degrees.)
  2. TE sequences in the Arabidopsis genome were detected with RepeatMasker using the script mask_TE.py. The detected TE sequences were then removed from the genome.
  3. RepeatMasker also calculates the sequence divergence of each detected TE sequence with respect to the corresponding TE consensus sequence. Sequence divergence was converted to sequence identity (identity = 1 - divergence).
  4. The mean and sd of sequence identity were then calculated for each TE family (line 176 of summarise_rm_out.py).
  5. The mean and sd from step 4 were compiled with other information into a table before the simulation (prep_sim_TE_lib_mode2.py), and were then used to generate a normal distribution for each TE family (line 309 of TE_sim_random_insertion.py).
  6. Because TE sequences with higher sequence identity are more likely to be detected, the final identity for simulation is adjusted as follows: identity_fix = identity + (100 - identity) * 0.5 (line 312 of TE_sim_random_insertion.py).
  7. The TE sequence identity of the real genome was taken from step 3.
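
Steps 3 to 6 can be sketched roughly as follows. This is only an illustration and not the code in summarise_rm_out.py or TE_sim_random_insertion.py. All function names here are hypothetical. It assumes divergence and identity are on a 0-100 percent scale (so that the step 6 formula applies directly), that sd is the sample standard deviation, and that sampled values are clamped to 0-100 before the adjustment. The actual scripts may handle any of these differently.

```python
import random
import statistics
from collections import defaultdict


def identity_from_divergence(divergence_pct):
    """Step 3: convert divergence to identity (percent scale assumed)."""
    return 100.0 - divergence_pct


def summarise_families(hits):
    """Step 4: per-family mean and sd of identity.

    `hits` is an iterable of (family, divergence_pct) pairs, a hypothetical
    stand-in for the parsed RepeatMasker output.
    """
    by_family = defaultdict(list)
    for family, divergence_pct in hits:
        by_family[family].append(identity_from_divergence(divergence_pct))
    stats = {}
    for family, identities in by_family.items():
        # Assumption: sample sd; a family with one member gets sd = 0.
        sd = statistics.stdev(identities) if len(identities) > 1 else 0.0
        stats[family] = (statistics.mean(identities), sd)
    return stats


def adjust_identity(identity_pct, factor=0.5):
    """Step 6: identity_fix = identity + (100 - identity) * 0.5."""
    return identity_pct + (100.0 - identity_pct) * factor


def sample_identity(mean, sd, rng):
    """Steps 5-6: draw from a normal distribution, then adjust.

    Assumption: the raw draw is clamped to [0, 100] so the adjusted
    value stays a valid percentage.
    """
    raw = rng.gauss(mean, sd)
    raw = min(max(raw, 0.0), 100.0)
    return adjust_identity(raw)


if __name__ == "__main__":
    rng = random.Random(0)
    hits = [("ATHILA", 10.0), ("ATHILA", 20.0), ("ATCOPIA", 5.0)]
    for family, (mean, sd) in summarise_families(hits).items():
        print(family, mean, sd, sample_identity(mean, sd, rng))
```

Under these assumptions the adjustment halves the distance to 100, so a sampled identity of 80 becomes 90, and every simulated identity falls between 50 and 100.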
