Commit 5addeea

update to v1.2.2 - README
1 parent b3f3839 commit 5addeea

1 file changed, +28 -1 lines changed

README.md

Lines changed: 28 additions & 1 deletion
@@ -87,7 +87,34 @@ A more detailed example on how homologous instances of the same five-gene operon
 
 ## Algorithm for computing empirical P-value
 
-To calculate an empirical P-value, we gather codons for each gene across the genome. First the codon frequency distribution of the focal-region/BGC genes is compared to the codon frequency distribution of the background genome (all other genes; genes which have lengths not divisible by 3 are ignored). After, we perform 10,000 simulations where in each simulation we shuffle the full genome-wide set of genes and go through the first N genes until the same number of codons as are present in the focal region/BGC/GCF are observed. The cosine distance between the observed codon frequency is compared to the remainder of the genome-wide codon distribution and checked for whether it is higher than what was actually observed for the BGC; if so, then an empirical P-value counter is appended a count of 1. The final empirical P-value produced is simply this count plus a pseudocount of 1 over 10,001.
+codoff uses a Monte Carlo simulation approach to calculate an empirical P-value to assess codon usage differences between the focal region(s) of interest and the background genome. The algorithm works as follows:
+
+### 1. Data Preparation
+- Extract all CDS features from the genome (genes with lengths divisible by 3)
+- Calculate codon frequency distributions for:
+  - **Focal region**: Codons from genes in the BGC/focal region
+  - **Background genome**: All remaining genes in the genome
+
+### 2. Statistical Comparison
+- Compute cosine distance between focal and background codon frequency distributions
+- Calculate Spearman correlation coefficient between the two distributions
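
The comparison in sections 1 and 2 can be sketched in plain Python. This is an illustration only, not codoff's own code: the function names (`codon_counts`, `cosine_distance`, `spearman`) are hypothetical, sequences are assumed to be uppercase DNA strings, and raw counts are used in place of frequencies, which is harmless here because both cosine distance and rank correlation are unaffected by rescaling a vector.

```python
import math
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons; a CDS whose length is not divisible by 3 is skipped."""
    if len(cds) % 3 != 0:
        return Counter()
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


def _vectors(focal, background):
    # Align both count tables on the same sorted list of codons.
    codons = sorted(set(focal) | set(background))
    return [focal.get(c, 0) for c in codons], [background.get(c, 0) for c in codons]


def cosine_distance(focal, background):
    a, b = _vectors(focal, background)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def _average_ranks(values):
    # Rank from 1, giving tied values the mean of the ranks they span.
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(focal, background):
    # Spearman's rho is the Pearson correlation of the rank vectors.
    a, b = (_average_ranks(v) for v in _vectors(focal, background))
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


focal = codon_counts("ATGAAAAAAGGCTAA")
background = codon_counts("ATGAAAGGCGGCGGCTAA")
print(cosine_distance(focal, background), spearman(focal, background))
```

The sketch does not guard against an empty or constant count vector, either of which would cause a division by zero.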
+
+### 3. Monte Carlo Simulation
+For each of N simulations (default: 10,000, configurable with `--num-sims`):
+1. **Shuffle** the complete list of all genes in the genome
+2. **Select genes sequentially** from the shuffled list until accumulating the same number of codons as in the actual focal region
+3. **Calculate codon frequencies** for this simulated "focal" region
+4. **Calculate background frequencies** as: `total_genome_counts - simulated_focal_counts`
+5. **Compute cosine distance** between simulated focal and background frequencies
+6. **Count** how many simulated distances ≥ observed distance
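
The six steps of section 3 can be sketched as a single loop. This is a hedged illustration, not codoff's implementation: `count_extreme_simulations` and its parameters are hypothetical names, each gene is assumed to be given as a `Counter` of its codons, and whole genes are added until the focal codon total is reached or exceeded (the README does not say whether the last gene is truncated). Treating an empty background as maximally distant is likewise an assumption of this sketch.

```python
import math
import random
from collections import Counter


def cosine_distance(a, b):
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    if norm == 0:
        return 1.0  # assumption: an empty vector counts as maximally distant
    return 1.0 - dot / norm


def count_extreme_simulations(gene_counts, focal_codon_total, observed_distance,
                              num_sims=10_000, seed=None):
    """Return how many simulated distances are >= the observed distance."""
    rng = random.Random(seed)
    genome_total = Counter()
    for counts in gene_counts:
        genome_total.update(counts)

    order = list(range(len(gene_counts)))
    extreme = 0
    for _ in range(num_sims):
        rng.shuffle(order)                        # step 1: shuffle all genes
        sim_focal = Counter()
        n_codons = 0
        for idx in order:                         # step 2: take genes in shuffled order
            if n_codons >= focal_codon_total:
                break
            sim_focal.update(gene_counts[idx])    # step 3: simulated focal counts
            n_codons += sum(gene_counts[idx].values())
        background = genome_total - sim_focal     # step 4: remainder of the genome
        distance = cosine_distance(sim_focal, background)  # step 5
        if distance >= observed_distance:         # step 6
            extreme += 1
    return extreme
```

Passing a fixed `seed` makes a run reproducible, which is convenient when comparing results across values of `num_sims`.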
+
+### 4. P-value Calculation
+The empirical P-value is calculated as:
+```
+P-value = (count of simulations with distance ≥ observed distance + 1) / (total simulations + 1)
+```
+
+This approach tests whether the observed focal region's codon usage is significantly different from what would be expected if we randomly selected the same amount of coding sequence from anywhere in the genome.
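
As a small worked example of the formula in section 4, the helper below (a hypothetical name, not taken from codoff) applies the pseudocount of 1 to both numerator and denominator.

```python
def empirical_p_value(extreme_count, num_sims):
    """(extreme_count + 1) / (num_sims + 1), as in the formula of section 4."""
    if not 0 <= extreme_count <= num_sims:
        raise ValueError("extreme_count must lie between 0 and num_sims")
    return (extreme_count + 1) / (num_sims + 1)


print(empirical_p_value(0, 10_000))    # no simulation reached the observed distance
print(empirical_p_value(249, 10_000))  # 249 simulations reached or exceeded it
```

Because of the pseudocount, the smallest P-value obtainable with the default 10,000 simulations is 1/10,001 (about 0.0001), so resolving smaller values requires more simulations.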
 
 <!---![figure](https://github.com/Kalan-Lab/codoff/blob/main/codoff_empirical_pvalue_image.png?raw=true) --->
 