Commit a1af4c5

realmarcin and claude committed
Merge bakta branch: Add Bakta, COG, and KEGG transforms
Merged bakta branch into master, adding three new genome annotation transforms:
- Bakta Transform: Genome annotation processing (510 lines, 14 tests)
- COG Transform: Functional groups classification (261 lines, 13 tests)
- KEGG Transform: KEGG Orthology mapping (255 lines, 11 tests)

Conflict resolutions:
- download.yaml: Kept both MicroMediaParam mappings (from master) and COG/KEGG downloads (from bakta)
- bacdive.py: Kept master version (organism_id parameter)
- ontologies_transform.py: Kept master version (includes UBERON and FOODON ontologies)

Total changes: 57 files, 67,295 insertions, 7,236 deletions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
2 parents e2861c4 + 27fdfbb commit a1af4c5

27 files changed: +66719 -1492 lines changed

data/transformed/cog/edges.tsv

Lines changed: 5369 additions & 0 deletions
Large diffs are not rendered by default.

data/transformed/cog/nodes.tsv

Lines changed: 5091 additions & 0 deletions
Large diffs are not rendered by default.

data/transformed/kegg/edges.tsv

Lines changed: 51915 additions & 0 deletions
Large diffs are not rendered by default.

download.yaml

Lines changed: 17 additions & 0 deletions
@@ -255,3 +255,20 @@
 -
   url: https://github.com/CultureBotAI/MicroMediaParam/raw/main/pipeline_output/merge_mappings/compound_mappings_strict_final_hydrate.tsv
   local_name: compound_mappings_strict_hydrate.tsv
+
+#
+# COG (Clusters of Orthologous Groups)
+#
+-
+  url: https://ftp.ncbi.nlm.nih.gov/pub/COG/COG2024/data/cog-24.def.tab
+  local_name: cog/cog-24.def.tab
+-
+  url: https://ftp.ncbi.nlm.nih.gov/pub/COG/COG2024/data/cog-24.fun.tab
+  local_name: cog/cog-24.fun.tab
+
+#
+# KEGG (Kyoto Encyclopedia of Genes and Genomes)
+#
+-
+  url: https://rest.kegg.jp/list/ko
+  local_name: kegg/ko_list.txt
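
As an illustration of how the downloaded `kegg/ko_list.txt` might be consumed, here is a minimal sketch. It assumes each line is tab-separated (a KO identifier followed by a description) and tolerates an optional `ko:` prefix on the identifier. The function name `parse_ko_list` and the sample lines are hypothetical and are not taken from the repository.

```python
from typing import Dict, Iterable


def parse_ko_list(lines: Iterable[str]) -> Dict[str, str]:
    """Map KO identifiers (e.g. 'K00001') to their description text.

    Assumption: one record per line, identifier and description separated by a tab.
    """
    ko_to_description: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        ko_id, _, description = line.partition("\t")
        ko_id = ko_id.strip()
        if ko_id.startswith("ko:"):
            ko_id = ko_id[len("ko:"):]  # normalise an optional database prefix
        ko_to_description[ko_id] = description.strip()
    return ko_to_description


# Illustrative sample lines only; real content comes from the downloaded file.
sample = [
    "K00001\tE1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]\n",
    "ko:K00002\tAKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]\n",
]
ko_map = parse_ko_list(sample)
```

With a real download, the same function could be called on an open file handle, since a text file object iterates line by line.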

kg_microbe/transform.py

Lines changed: 9 additions & 0 deletions
@@ -6,14 +6,20 @@
 
 from kg_microbe.transform_utils.bacdive.bacdive import BacDiveTransform
 from kg_microbe.transform_utils.bactotraits.bactotraits import BactoTraitsTransform
+from kg_microbe.transform_utils.bakta.bakta import BaktaTransform
+from kg_microbe.transform_utils.cog.cog import COGTransform
 from kg_microbe.transform_utils.constants import (
     BACDIVE,
     BACTOTRAITS,
+    BAKTA,
+    COG,
+    KEGG,
     MADIN_ETAL,
     MEDIADIVE,
     ONTOLOGIES,
     RHEAMAPPINGS,
 )
+from kg_microbe.transform_utils.kegg.kegg import KEGGTransform
 from kg_microbe.transform_utils.madin_etal.madin_etal import MadinEtAlTransform
 from kg_microbe.transform_utils.mediadive.mediadive import MediaDiveTransform
 from kg_microbe.transform_utils.ontologies.ontologies_transform import (
@@ -33,6 +39,9 @@
     # "STRINGTransform": STRINGTransform,
     ONTOLOGIES: OntologiesTransform,
     BACDIVE: BacDiveTransform,
+    BAKTA: BaktaTransform,
+    COG: COGTransform,
+    KEGG: KEGGTransform,
     MEDIADIVE: MediaDiveTransform,
     MADIN_ETAL: MadinEtAlTransform,
     RHEAMAPPINGS: RheaMappingsTransform,
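
The hunk above registers the three new transforms in a name-to-class mapping. The sketch below shows, under stated assumptions, how such a registry can be used to dispatch by source name. The stand-in classes, the constant values, the `DATA_SOURCES` name, and `run_transforms` are all hypothetical; they only mirror the pattern and are not the repository's code.

```python
from typing import Dict, List, Optional, Type


class Transform:
    """Stand-in base class; the real transforms live in kg_microbe.transform_utils."""

    def run(self) -> str:
        return f"ran {type(self).__name__}"


class BaktaTransform(Transform):
    pass


class COGTransform(Transform):
    pass


class KEGGTransform(Transform):
    pass


# Hypothetical constant values; the real ones are defined in constants.py.
BAKTA, COG, KEGG = "bakta", "cog", "kegg"

# Registry: source name -> transform class (same shape as the dict in the diff).
DATA_SOURCES: Dict[str, Type[Transform]] = {
    BAKTA: BaktaTransform,
    COG: COGTransform,
    KEGG: KEGGTransform,
}


def run_transforms(sources: Optional[List[str]] = None) -> List[str]:
    """Run the selected transforms, or all registered ones when none are named."""
    selected = sources if sources is not None else list(DATA_SOURCES)
    results = []
    for name in selected:
        if name not in DATA_SOURCES:
            raise KeyError(f"Unknown source: {name}")
        results.append(DATA_SOURCES[name]().run())
    return results
```

Keeping the mapping in one place means adding a source is a two-line change: one import and one dictionary entry, which is what this commit does.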
Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
# Bakta Genome Annotations Transform

This transform processes Bakta genome annotation files and converts them to KGX format for integration into KG-Microbe.

## Setup

### 1. SAMN to NCBITaxon Mapping

The transform requires a mapping file to convert SAMN (BioSample) IDs to NCBITaxon IDs:

```bash
# Generate the mapping (this queries NCBI and takes ~15-30 minutes for 145 genomes)
# Note: Biopython is already installed as a dependency
poetry run python kg_microbe/transform_utils/bakta/create_samn_mapping.py \
    --input data/raw/pfas_bakta/bakta \
    --output kg_microbe/transform_utils/bakta/samn_to_ncbitaxon.tsv \


# To resume if interrupted:
poetry run python kg_microbe/transform_utils/bakta/create_samn_mapping.py \
    --input data/raw/pfas_bakta/bakta \
    --output kg_microbe/transform_utils/bakta/samn_to_ncbitaxon.tsv \

    --resume
```

**Notes**:
- NCBI requires an email address for API access
- Set `NCBI_API_KEY` environment variable for higher rate limits (10 req/sec vs 3 req/sec)
- The script uses manual XML parsing to avoid Biopython DTD/Schema parsing issues
- Get an API key from: https://www.ncbi.nlm.nih.gov/account/settings/
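
To make the "manual XML parsing" note concrete, here is a minimal sketch using only the standard library. It assumes the BioSample record exposes the taxon as a `taxonomy_id` attribute on an `Organism` element; the sample document, its accession, and the function name `extract_taxon_curie` are illustrative assumptions, not the script's actual code.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, hand-written record shaped like a BioSample XML response.
SAMPLE_XML = """<BioSampleSet>
  <BioSample accession="SAMN00000001">
    <Description>
      <Organism taxonomy_id="562" taxonomy_name="Escherichia coli"/>
    </Description>
  </BioSample>
</BioSampleSet>"""


def extract_taxon_curie(xml_text: str) -> Optional[str]:
    """Return 'NCBITaxon:<id>' for the first Organism element, or None if absent."""
    root = ET.fromstring(xml_text)
    organism = root.find(".//Organism")  # search any depth below the root
    if organism is None:
        return None
    tax_id = organism.get("taxonomy_id")
    return f"NCBITaxon:{tax_id}" if tax_id else None


curie = extract_taxon_curie(SAMPLE_XML)
```

Parsing with `xml.etree.ElementTree` reads the document as plain XML and does not fetch or validate against a DTD, which is one way to sidestep the parsing issues mentioned in the notes.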

### 2. GO Ontology Setup (Optional but Recommended)

For accurate GO term aspect mapping (biological_process vs molecular_function vs cellular_component), set up the GO ontology:

#### Option A: Convert OWL to SQLite (Recommended for Performance)

```bash
# Install semsql if not available
pip install oaklib[semsql]

# Convert GO OWL to SQLite (run once, takes a few minutes)
runoak -i data/raw/go.owl dump -o data/raw/go.db -O sql
```

#### Option B: Use OBO Format

Add to `download.yaml`:
```yaml
-
  url: http://purl.obolibrary.org/obo/go.obo
  local_name: go.obo
```

Then update `constants.py`:
```python
GO_SOURCE = RAW_DATA_DIR / "go.obo"
```

#### Option C: Skip GO Aspect Mapping

The transform will work without the GO ontology - all GO terms will default to `biolink:MolecularActivity` and use the `enables` predicate. This is acceptable but less precise than proper aspect mapping.
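
The aspect mapping and its fallback can be sketched as a small lookup. This is an illustration only: the category and predicate names come from this README, while the dictionary, the default constant, and the function name `category_and_predicate` are hypothetical.

```python
from typing import Optional, Tuple

# GO aspect -> (Biolink category, Biolink predicate), as listed in this README.
ASPECT_TO_BIOLINK = {
    "biological_process": ("biolink:BiologicalProcess", "biolink:involved_in"),
    "molecular_function": ("biolink:MolecularActivity", "biolink:enables"),
    "cellular_component": ("biolink:CellularComponent", "biolink:located_in"),
}

# Fallback used when no GO ontology is available (Option C).
DEFAULT_MAPPING = ("biolink:MolecularActivity", "biolink:enables")


def category_and_predicate(aspect: Optional[str]) -> Tuple[str, str]:
    """Resolve a GO aspect to a category/predicate pair, defaulting when unknown."""
    if aspect is None:
        return DEFAULT_MAPPING
    return ASPECT_TO_BIOLINK.get(aspect, DEFAULT_MAPPING)
```

With the ontology loaded, the aspect would be looked up per GO term; without it, every term passes `None` and receives the default pair.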

## Usage

### Transform Bakta Annotations

```bash
# Transform all Bakta genomes
poetry run kg transform -s bakta

# Output will be in data/transformed/bakta/
# - nodes.tsv (~1.09M nodes)
# - edges.tsv (~4M edges)
```

### Run Tests

```bash
# Run Bakta-specific tests
poetry run pytest tests/test_bakta.py -v

# Run all tests with quality checks
poetry run tox
```

## Data Structure

### Input

Bakta genome annotations in `data/raw/pfas_bakta/bakta/`:
```
bakta/
├── SAMN00103324/
│   ├── SAMN00103324.bakta.tsv (main annotation file)
│   ├── SAMN00103324.bakta.gff3
│   ├── SAMN00103324.bakta.faa
│   └── ...
├── SAMN00117502/
└── ... (145 total genomes)
```
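
Given the layout above, the main annotation files can be located by pairing each sample directory with its `<SAMN>.bakta.tsv` file. The sketch below is one way to do that with `pathlib`; the function name `find_bakta_tables` is hypothetical and the transform may discover files differently.

```python
from pathlib import Path
from typing import Dict


def find_bakta_tables(bakta_dir: Path) -> Dict[str, Path]:
    """Map each sample directory name to its '<name>.bakta.tsv' file, if present."""
    tables: Dict[str, Path] = {}
    sample_dirs = sorted(p for p in bakta_dir.iterdir() if p.is_dir())
    for sample_dir in sample_dirs:
        tsv = sample_dir / f"{sample_dir.name}.bakta.tsv"
        if tsv.is_file():  # directories without the main table are skipped
            tables[sample_dir.name] = tsv
    return tables
```

Calling `find_bakta_tables(Path("data/raw/pfas_bakta/bakta"))` would return one entry per genome that has a main annotation file.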

### Output

KGX TSV files:
- **nodes.tsv**: Organism, Gene, Protein, GO, EC, COG, KEGG nodes
- **edges.tsv**: Relationships between entities

## Annotation Coverage

Based on analysis of 145 genomes with ~580,000 genes:

| Annotation Type | Coverage | Count | Biolink Category |
|----------------|----------|-------|------------------|
| UniRef/RefSeq | 86-100% | ~580K | Protein IDs |
| GO Terms | ~66% | ~385K | BiologicalProcess, MolecularActivity, CellularComponent |
| COG Groups | ~38% | ~220K | GeneFamily |
| EC Numbers | ~17% | ~99K | MolecularActivity |
| KEGG KOs | ~12% | ~70K | GeneFamily |
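
The percentages in the table follow from dividing each approximate count by the approximate gene total. A short sketch of that arithmetic, using the rounded figures from the table (so the results are themselves approximate):

```python
# Rounded figures taken from the coverage table above.
TOTAL_GENES = 580_000
ANNOTATED_COUNTS = {
    "GO Terms": 385_000,
    "COG Groups": 220_000,
    "EC Numbers": 99_000,
    "KEGG KOs": 70_000,
}


def coverage_percent(count: int, total: int = TOTAL_GENES) -> float:
    """Share of genes carrying an annotation, as a percentage."""
    return 100.0 * count / total


coverage = {name: round(coverage_percent(n), 1) for name, n in ANNOTATED_COUNTS.items()}
```

The computed values agree with the table's rounded percentages to within a point.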

## Node Types

- `biolink:OrganismTaxon` - Bacterial organisms (via SAMN → NCBITaxon mapping)
- `biolink:Gene` - Genes with composite IDs (e.g., `SAMN00139461:JEECHJ_00005`)
- `biolink:Protein` - Proteins (RefSeq preferred, UniRef50 fallback)
- `biolink:BiologicalProcess` - GO biological processes
- `biolink:MolecularActivity` - GO molecular functions and EC numbers
- `biolink:CellularComponent` - GO cellular components
- `biolink:GeneFamily` - COG functional groups and KEGG orthologs
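
Two of the rules above lend themselves to a short illustration: the composite gene ID and the RefSeq-before-UniRef50 preference. The helper names `make_gene_id` and `choose_protein_id` are hypothetical, and the protein identifier strings in the usage lines are placeholders rather than real accessions.

```python
from typing import Optional


def make_gene_id(samn_id: str, locus_tag: str) -> str:
    """Build a composite gene ID of the form '<SAMN>:<locus_tag>'."""
    return f"{samn_id}:{locus_tag}"


def choose_protein_id(refseq: Optional[str], uniref50: Optional[str]) -> Optional[str]:
    """Prefer the RefSeq identifier; fall back to UniRef50; None if neither exists."""
    return refseq or uniref50 or None


gene_id = make_gene_id("SAMN00139461", "JEECHJ_00005")
protein_id = choose_protein_id(None, "UniRef50_PLACEHOLDER")
```

Prefixing the locus tag with the sample ID keeps gene IDs unique across genomes even if two annotations reuse the same locus tag.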

## Edge Types

- Organism `biolink:has_gene` Gene (`RO:0002551`)
- Gene `biolink:has_gene_product` Protein (`RO:0002205`)
- Protein `biolink:enables` MolecularActivity (`RO:0002327`)
- Protein `biolink:involved_in` BiologicalProcess (`RO:0002331`)
- Protein `biolink:located_in` CellularComponent (`RO:0001025`)
- Gene `biolink:member_of` COG (`RO:0002350`)
- Gene `biolink:orthologous_to` KEGG (`RO:HOM0000017`)
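
The predicate and relation pairs listed above can be kept in a single table and used to emit edge rows. The sketch below is a simplified illustration: the four-column layout, `PREDICATE_TO_RELATION`, `make_edge`, and `edges_to_tsv` are assumptions for this example, and the real `edges.tsv` may carry additional columns.

```python
import csv
import io
from typing import Iterable, List

# Assumed minimal column set for illustration.
EDGE_COLUMNS = ["subject", "predicate", "object", "relation"]

# Predicate -> relation pairs exactly as listed under "Edge Types".
PREDICATE_TO_RELATION = {
    "biolink:has_gene": "RO:0002551",
    "biolink:has_gene_product": "RO:0002205",
    "biolink:enables": "RO:0002327",
    "biolink:involved_in": "RO:0002331",
    "biolink:located_in": "RO:0001025",
    "biolink:member_of": "RO:0002350",
    "biolink:orthologous_to": "RO:HOM0000017",
}


def make_edge(subject: str, predicate: str, obj: str) -> List[str]:
    """Build one edge row, looking up the relation for the given predicate."""
    return [subject, predicate, obj, PREDICATE_TO_RELATION[predicate]]


def edges_to_tsv(edges: Iterable[List[str]]) -> str:
    """Serialise edge rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(EDGE_COLUMNS)
    writer.writerows(edges)
    return buffer.getvalue()


# The taxon ID here is an arbitrary example, not a mapping from the dataset.
example_edges = [make_edge("NCBITaxon:562", "biolink:has_gene", "SAMN00139461:JEECHJ_00005")]
tsv_text = edges_to_tsv(example_edges)
```

Deriving the relation from the predicate in one place avoids the two drifting apart when edges are created in several code paths.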

## Troubleshooting

### "no such table: rdfs_label_statement" Error

This means the ontology SQLite database wasn't properly created. Solutions:

1. **Convert OWL to SQLite** (see Setup section above)
2. **Let OAK auto-detect format** - the transform will handle this automatically
3. **Use without ontology** - the transform will default all GO terms to molecular_function

The transform has been updated to handle missing GO ontology gracefully.

### NCBI API Rate Limits

When creating the SAMN mapping:
- Without API key: 3 requests/second (default)
- With API key: 10 requests/second

Set `NCBI_API_KEY` environment variable:
```bash
export NCBI_API_KEY="your_api_key_here"
```

Get an API key from: https://www.ncbi.nlm.nih.gov/account/settings/
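
One way a client can respect these limits is to pick the rate from the environment and space requests accordingly. The sketch below is a hypothetical illustration, not the mapping script's implementation; `requests_per_second` and `Throttle` are names invented here, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import os
import time
from typing import Callable, Mapping, Optional


def requests_per_second(env: Optional[Mapping[str, str]] = None) -> int:
    """Return 10 when NCBI_API_KEY is set and non-empty, otherwise 3."""
    env = os.environ if env is None else env
    return 10 if env.get("NCBI_API_KEY") else 3


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        rate_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / rate_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Sleep just long enough to keep requests min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would create `Throttle(requests_per_second())` once and call `wait()` before each request.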

### Memory Requirements

Processing 145 genomes with ~580K genes requires:
- Minimum: 8 GB RAM
- Recommended: 16 GB RAM

For very large datasets, consider processing in batches.
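
Batching can be as simple as slicing the list of genomes into fixed-size groups and processing one group at a time. The following is a minimal sketch; `in_batches`, the batch size of 50, and the placeholder genome names are assumptions for illustration, and the transform itself may not expose such an option.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# Placeholder names standing in for the 145 genome directories.
genome_ids = [f"genome_{i:03d}" for i in range(145)]
batch_sizes = [len(batch) for batch in in_batches(genome_ids, 50)]
```

Because each batch is materialised separately, peak memory depends on the batch size rather than on the total number of genomes.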

## Files

- `bakta.py` - Main BaktaTransform class
- `utils.py` - Helper functions for parsing and processing
- `create_samn_mapping.py` - Script to generate SAMN → NCBITaxon mappings
- `samn_to_ncbitaxon.tsv` - Mapping file (generated by script)
- `tmp/` - Temporary processing files

## Integration with KG-Microbe

The Bakta transform is registered in `kg_microbe/transform.py` and configured in `merge.yaml`. It integrates with:

- **BacDive** - May share overlapping SAMN/organism IDs
- **Ontologies** - Uses GO, EC terms from ontology transforms
- **MediaDive** - Complementary organism-level data
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
"""Bakta genome annotations transform."""

from .bakta import BaktaTransform

__all__ = ["BaktaTransform"]
