@@ -87,7 +87,34 @@ A more detailed example on how homologous instances of the same five-gene operon
 
 ## Algorithm for computing empirical P-value
 
-To calculate an empirical P-value, we gather codons for each gene across the genome. First the codon frequency distribution of the focal-region/BGC genes is compared to the codon frequency distribution of the background genome (all other genes; genes which have lengths not divisible by 3 are ignored). After, we perform 10,000 simulations where in each simulation we shuffle the full genome-wide set of genes and go through the first N genes until the same number of codons as are present in the focal region/BGC/GCF are observed. The cosine distance between the observed codon frequency is compared to the remainder of the genome-wide codon distribution and checked for whether it is higher than what was actually observed for the BGC; if so, then an empirical P-value counter is appended a count of 1. The final empirical P-value produced is simply this count plus a pseudocount of 1 over 10,001.
+codoff uses a Monte Carlo simulation approach to calculate an empirical P-value to assess codon usage differences between the focal region(s) of interest and the background genome. The algorithm works as follows:
+
+### 1. Data Preparation
+- Extract all CDS features from the genome (genes with lengths divisible by 3)
+- Calculate codon frequency distributions for:
+  - **Focal region**: Codons from genes in the BGC/focal region
+  - **Background genome**: All remaining genes in the genome
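
The data preparation step can be illustrated with a minimal Python sketch. It is not codoff's actual code: the function name `count_codons`, the `genes` dictionary layout, and the toy sequences are all hypothetical, chosen only to show codon counting with the divisible-by-3 filter and the focal/background split.

```python
from collections import Counter

def count_codons(cds_sequences):
    """Count codons over CDS sequences, skipping lengths not divisible by 3."""
    counts = Counter()
    for seq in cds_sequences:
        if len(seq) % 3 != 0:
            continue  # such genes are ignored, as described above
        seq = seq.upper()
        counts.update(seq[i:i + 3] for i in range(0, len(seq), 3))
    return counts

# Hypothetical toy data: gene ID -> CDS nucleotide sequence
genes = {
    "g1": "ATGAAATAA",
    "g2": "ATGCCCGGGTAA",
    "g3": "ATGAAACCCTGA",
    "g4": "ATGAA",  # length 5, so it is skipped
}
focal_ids = {"g1"}  # hypothetical focal region membership

focal = count_codons(s for g, s in genes.items() if g in focal_ids)
background = count_codons(s for g, s in genes.items() if g not in focal_ids)
print(focal)
print(background)
```

Here the background is built from every gene outside the focal set, matching the "all remaining genes" definition.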
+
+### 2. Statistical Comparison
+- Compute cosine distance between focal and background codon frequency distributions
+- Calculate Spearman correlation coefficient between the two distributions
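
The two statistics can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's implementation: the helper names and the four-codon toy vectors are hypothetical, and a real run would use one entry per codon. Both measures are unaffected by rescaling either vector, so raw counts and normalized frequencies give the same values.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(u, v):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ru, rv = average_ranks(u), average_ranks(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    var_u = sum((a - mu) ** 2 for a in ru)
    var_v = sum((b - mv) ** 2 for b in rv)
    return cov / math.sqrt(var_u * var_v)

# Hypothetical counts for four codons, listed in the same order in both vectors
focal = [10, 2, 5, 1]
background = [100, 80, 90, 70]
print(cosine_distance(focal, background))
print(spearman(focal, background))
```

In this toy case the codons are ranked identically in both vectors, so the Spearman coefficient is 1 even though the cosine distance is above zero; the two statistics capture different aspects of the comparison.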
+
+### 3. Monte Carlo Simulation
+For each of N simulations (default: 10,000, configurable with `--num-sims`):
+1. **Shuffle** the complete list of all genes in the genome
+2. **Select genes sequentially** from the shuffled list until accumulating the same number of codons as in the actual focal region
+3. **Calculate codon frequencies** for this simulated "focal" region
+5. **Compute cosine distance** between simulated focal and background frequencies
+6. **Count** how many simulated distances ≥ observed distance
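
The simulation loop above can be sketched as follows. This is a hedged illustration, not codoff's code. The function `simulate` and its input layout (gene ID mapped to a `Counter` of codons) are hypothetical. Two assumptions are made: whole genes are taken until the target codon count is reached or exceeded, and the background in each simulation is the genome-wide total minus the simulated focal set, as the earlier version of this section describes. The sketch also assumes the focal region is a small fraction of the genome, so the remainder is never empty.

```python
import math
import random
from collections import Counter

def cosine_distance(a, b, keys):
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in keys))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in keys))
    return 1.0 - dot / (norm_a * norm_b)

def simulate(gene_codons, focal_ids, num_sims=10000, seed=0):
    """Return (observed distance, number of simulations with distance >= observed)."""
    rng = random.Random(seed)
    total = Counter()
    for codons in gene_codons.values():
        total.update(codons)
    keys = sorted(total)

    focal = Counter()
    for g in focal_ids:
        focal.update(gene_codons[g])
    target = sum(focal.values())
    observed = cosine_distance(focal, total - focal, keys)

    gene_ids = list(gene_codons)
    count = 0
    for _ in range(num_sims):
        rng.shuffle(gene_ids)
        sim, n = Counter(), 0
        for g in gene_ids:
            if n >= target:  # assumption: whole genes, may overshoot the target
                break
            sim.update(gene_codons[g])
            n += sum(gene_codons[g].values())
        # assumption: background is the remainder of the genome
        if cosine_distance(sim, total - sim, keys) >= observed:
            count += 1
    return observed, count

# Hypothetical toy genome: 18 genes with even codon usage, 2 skewed focal genes
typical = Counter({"AAA": 10, "CCC": 10, "GGG": 10, "TTT": 10})
skewed = Counter({"AAA": 37, "CCC": 1, "GGG": 1, "TTT": 1})
gene_codons = {f"g{i}": typical for i in range(18)}
gene_codons.update({"f1": skewed, "f2": skewed})

observed, count = simulate(gene_codons, {"f1", "f2"}, num_sims=200)
print(observed, count)
```

In this toy genome a simulated distance can reach the observed one only when the shuffle happens to place both skewed genes first, so `count` stays small relative to the number of simulations.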
+
+### 4. P-value Calculation
+The empirical P-value is calculated as:
+```
+P-value = (count of simulations with distance ≥ observed distance + 1) / (total simulations + 1)
+```
+
+This approach tests whether the observed focal region's codon usage is significantly different from what would be expected if we randomly selected the same amount of coding sequence from anywhere in the genome.
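
As a worked illustration of the P-value formula, write $k$ for the number of simulations whose distance is at least the observed distance (the symbol $k$ is introduced here and is not the README's notation) and $N$ for the total number of simulations:

$$
P = \frac{k + 1}{N + 1}
$$

With the default $N = 10{,}000$, a run in which no simulation reaches the observed distance ($k = 0$) gives $P = 1/10{,}001 \approx 1.0 \times 10^{-4}$, which is the smallest value the default setting can report. A run with $k = 49$ gives $P = 50/10{,}001 \approx 0.005$. The pseudocount of 1 in the numerator and denominator keeps the estimate from ever being exactly zero.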