# Bakta Genome Annotations Transform

This transform processes Bakta genome annotation files and converts them to KGX format for integration into KG-Microbe.

## Setup

### 1. SAMN to NCBITaxon Mapping

The transform requires a mapping file to convert SAMN (BioSample) IDs to NCBITaxon IDs:

```bash
# Generate the mapping (this queries NCBI and takes ~15-30 minutes for 145 genomes)
# Note: Biopython is already installed as a dependency
poetry run python kg_microbe/transform_utils/bakta/create_samn_mapping.py \
  --input data/raw/pfas_bakta/bakta \
  --output kg_microbe/transform_utils/bakta/samn_to_ncbitaxon.tsv

# To resume if interrupted:
poetry run python kg_microbe/transform_utils/bakta/create_samn_mapping.py \
  --input data/raw/pfas_bakta/bakta \
  --output kg_microbe/transform_utils/bakta/samn_to_ncbitaxon.tsv \
  --resume
```

**Notes**:
- NCBI requires an email address for API access
- Set the `NCBI_API_KEY` environment variable for higher rate limits (10 req/sec vs 3 req/sec)
- The script uses manual XML parsing to avoid Biopython DTD/Schema parsing issues
- Get an API key from: https://www.ncbi.nlm.nih.gov/account/settings/
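
The note on manual XML parsing can be illustrated with a minimal sketch. It assumes the BioSample record carries the taxon as a `taxonomy_id` attribute on an `<Organism>` element; the function name `taxon_from_biosample_xml` and the trimmed sample record are hypothetical and are not taken from `create_samn_mapping.py`.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def taxon_from_biosample_xml(xml_text: str) -> Optional[str]:
    """Return an NCBITaxon CURIE for the first <Organism> element, or None."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of <Organism> does not matter.
    for organism in root.iter("Organism"):
        taxonomy_id = organism.get("taxonomy_id")
        if taxonomy_id:
            return f"NCBITaxon:{taxonomy_id}"
    return None


# Hypothetical, heavily trimmed BioSample record (assumption, for illustration only).
SAMPLE_XML = """
<BioSampleSet>
  <BioSample accession="SAMN00000001">
    <Description>
      <Organism taxonomy_id="562" taxonomy_name="Escherichia coli"/>
    </Description>
  </BioSample>
</BioSampleSet>
"""

print(taxon_from_biosample_xml(SAMPLE_XML))
```

Using the standard-library parser keeps the lookup independent of Biopython's DTD handling, which is the concern the note raises.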

### 2. GO Ontology Setup (Optional but Recommended)

For accurate GO term aspect mapping (biological_process vs molecular_function vs cellular_component), set up the GO ontology:

#### Option A: Convert OWL to SQLite (Recommended for Performance)

```bash
# Install semsql if not available
pip install oaklib[semsql]

# Convert GO OWL to SQLite (run once, takes a few minutes)
runoak -i data/raw/go.owl dump -o data/raw/go.db -O sql
```

#### Option B: Use OBO Format

Add to `download.yaml`:
```yaml
-
  url: http://purl.obolibrary.org/obo/go.obo
  local_name: go.obo
```

Then update `constants.py`:
```python
GO_SOURCE = RAW_DATA_DIR / "go.obo"
```

#### Option C: Skip GO Aspect Mapping

The transform will work without the GO ontology - all GO terms will default to `biolink:MolecularActivity` and use the `enables` predicate. This is acceptable but less precise than proper aspect mapping.
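
The aspect mapping and its fallback can be sketched as a small lookup. This is an illustration built from the categories and predicates listed in this README, not the transform's actual code; `ASPECT_MAP` and `category_and_predicate` are hypothetical names.

```python
from typing import Optional, Tuple

# GO aspect -> (Biolink category, predicate), as listed under Node Types and Edge Types.
ASPECT_MAP = {
    "biological_process": ("biolink:BiologicalProcess", "biolink:involved_in"),
    "molecular_function": ("biolink:MolecularActivity", "biolink:enables"),
    "cellular_component": ("biolink:CellularComponent", "biolink:located_in"),
}

# Fallback used when the GO ontology is unavailable or the aspect is unknown (Option C).
DEFAULT_MAPPING = ASPECT_MAP["molecular_function"]


def category_and_predicate(aspect: Optional[str]) -> Tuple[str, str]:
    """Return the (category, predicate) pair for a GO aspect, or the default."""
    return ASPECT_MAP.get(aspect, DEFAULT_MAPPING)


print(category_and_predicate("cellular_component"))
print(category_and_predicate(None))
```

With the ontology loaded, each GO term gets its real aspect; without it, every term falls through to the `molecular_function` default, which is why Option C is less precise.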

## Usage

### Transform Bakta Annotations

```bash
# Transform all Bakta genomes
poetry run kg transform -s bakta

# Output will be in data/transformed/bakta/
# - nodes.tsv (~1.09M nodes)
# - edges.tsv (~4M edges)
```

### Run Tests

```bash
# Run Bakta-specific tests
poetry run pytest tests/test_bakta.py -v

# Run all tests with quality checks
poetry run tox
```

## Data Structure

### Input

Bakta genome annotations in `data/raw/pfas_bakta/bakta/`:
```
bakta/
├── SAMN00103324/
│   ├── SAMN00103324.bakta.tsv (main annotation file)
│   ├── SAMN00103324.bakta.gff3
│   ├── SAMN00103324.bakta.faa
│   └── ...
├── SAMN00117502/
└── ... (145 total genomes)
```

### Output

KGX TSV files:
- **nodes.tsv**: Organism, Gene, Protein, GO, EC, COG, KEGG nodes
- **edges.tsv**: Relationships between entities
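
Given the input layout described under Input, locating each genome's main annotation file amounts to pairing a directory name with `<SAMN>.bakta.tsv`. The sketch below is an assumption about how that discovery could be written; `find_bakta_tsvs` is a hypothetical helper, and the demo builds a throwaway directory tree rather than reading real data.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def find_bakta_tsvs(bakta_dir: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (SAMN ID, path to <SAMN>.bakta.tsv) for each genome directory."""
    for genome_dir in sorted(p for p in bakta_dir.iterdir() if p.is_dir()):
        samn_id = genome_dir.name
        tsv_path = genome_dir / f"{samn_id}.bakta.tsv"
        # Skip directories without the main annotation file.
        if tsv_path.is_file():
            yield samn_id, tsv_path


# Demo on a temporary tree that mimics the layout (directory names from this README).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "bakta"
    for samn in ("SAMN00103324", "SAMN00117502"):
        (root / samn).mkdir(parents=True)
    (root / "SAMN00103324" / "SAMN00103324.bakta.tsv").write_text("#header\n")
    found = [samn_id for samn_id, _ in find_bakta_tsvs(root)]
    print(found)
```

In the demo only the first directory contains a `.bakta.tsv` file, so the second is skipped.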

## Annotation Coverage

Based on analysis of 145 genomes with ~580,000 genes:

| Annotation Type | Coverage | Count | Biolink Category |
|----------------|----------|-------|------------------|
| UniRef/RefSeq | 86-100% | ~580K | Protein IDs |
| GO Terms | ~66% | ~385K | BiologicalProcess, MolecularActivity, CellularComponent |
| COG Groups | ~38% | ~220K | GeneFamily |
| EC Numbers | ~17% | ~99K | MolecularActivity |
| KEGG KOs | ~12% | ~70K | GeneFamily |
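
As a quick consistency check on the table, coverage is the annotated gene count divided by the total gene count. The short calculation below uses only the rounded figures from the table, so the results are approximate.

```python
# Rounded counts taken from the table above; all values are approximate.
TOTAL_GENES = 580_000
annotated_counts = {
    "GO Terms": 385_000,
    "COG Groups": 220_000,
    "EC Numbers": 99_000,
    "KEGG KOs": 70_000,
}

# Coverage in whole percent: annotated genes / total genes.
coverage_percent = {
    name: round(100 * count / TOTAL_GENES)
    for name, count in annotated_counts.items()
}

for name, percent in coverage_percent.items():
    print(f"{name}: ~{percent}%")
```

The computed percentages (66, 38, 17 and 12) agree with the Coverage column.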

## Node Types

- `biolink:OrganismTaxon` - Bacterial organisms (via SAMN → NCBITaxon mapping)
- `biolink:Gene` - Genes with composite IDs (e.g., `SAMN00139461:JEECHJ_00005`)
- `biolink:Protein` - Proteins (RefSeq preferred, UniRef50 fallback)
- `biolink:BiologicalProcess` - GO biological processes
- `biolink:MolecularActivity` - GO molecular functions and EC numbers
- `biolink:CellularComponent` - GO cellular components
- `biolink:GeneFamily` - COG functional groups and KEGG orthologs

## Edge Types

- Organism `biolink:has_gene` Gene (`RO:0002551`)
- Gene `biolink:has_gene_product` Protein (`RO:0002205`)
- Protein `biolink:enables` MolecularActivity (`RO:0002327`)
- Protein `biolink:involved_in` BiologicalProcess (`RO:0002331`)
- Protein `biolink:located_in` CellularComponent (`RO:0001025`)
- Gene `biolink:member_of` COG (`RO:0002350`)
- Gene `biolink:orthologous_to` KEGG (`RO:HOM0000017`)
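
To make the composite gene ID and the first two edge types concrete, here is a minimal sketch that builds the Organism → Gene → Protein edges for one gene and serializes them as TSV. The column set (`subject`, `predicate`, `object`, `relation`), the helper names, and the example taxon and protein IDs are assumptions for illustration; the real transform may emit additional KGX columns.

```python
import csv
import io
from typing import List, Tuple

Edge = Tuple[str, str, str, str]  # subject, predicate, object, relation


def make_gene_id(samn_id: str, locus_tag: str) -> str:
    """Composite gene ID in the form '<SAMN>:<locus tag>'."""
    return f"{samn_id}:{locus_tag}"


def core_edges(taxon_id: str, samn_id: str, locus_tag: str, protein_id: str) -> List[Edge]:
    """Organism -> Gene and Gene -> Protein edges for a single gene."""
    gene_id = make_gene_id(samn_id, locus_tag)
    return [
        (taxon_id, "biolink:has_gene", gene_id, "RO:0002551"),
        (gene_id, "biolink:has_gene_product", protein_id, "RO:0002205"),
    ]


def to_tsv(edges: List[Edge]) -> str:
    """Serialize edges as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject", "predicate", "object", "relation"])
    writer.writerows(edges)
    return buffer.getvalue()


# Taxon and protein IDs below are placeholders, not values from the dataset.
edges = core_edges("NCBITaxon:562", "SAMN00139461", "JEECHJ_00005", "RefSeq:WP_000000001.1")
print(to_tsv(edges))
```

The remaining edge types follow the same pattern, with the subject switching between the gene and the protein as listed above.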

## Troubleshooting

### "no such table: rdfs_label_statement" Error

This means the ontology SQLite database wasn't properly created. Solutions:

1. **Convert OWL to SQLite** (see Setup section above)
2. **Let OAK auto-detect format** - the transform will handle this automatically
3. **Use without ontology** - the transform will default all GO terms to molecular_function

The transform has been updated to handle missing GO ontology gracefully.

### NCBI API Rate Limits

When creating the SAMN mapping:
- Without API key: 3 requests/second (default)
- With API key: 10 requests/second

Set `NCBI_API_KEY` environment variable:
```bash
export NCBI_API_KEY="your_api_key_here"
```

Get an API key from: https://www.ncbi.nlm.nih.gov/account/settings/
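
The two rate limits translate directly into a minimum delay between requests. The sketch below shows one way a client could derive that delay from the environment; `min_request_interval` is a hypothetical helper and is not necessarily how `create_samn_mapping.py` throttles itself.

```python
import os
from typing import Optional


def min_request_interval(api_key: Optional[str]) -> float:
    """Seconds to wait between requests: 10 req/sec with a key, 3 req/sec without."""
    requests_per_second = 10 if api_key else 3
    return 1.0 / requests_per_second


# An unset or empty NCBI_API_KEY is treated as "no key".
interval = min_request_interval(os.environ.get("NCBI_API_KEY"))
print(f"Waiting at least {interval:.3f} s between NCBI requests")
```

A client would then call `time.sleep(interval)` between consecutive requests to stay under the limit.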

### Memory Requirements

Processing 145 genomes with ~580K genes requires:
- Minimum: 8 GB RAM
- Recommended: 16 GB RAM

For very large datasets, consider processing in batches.
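
Batching can be as simple as splitting the list of genome directories into fixed-size chunks and handling one chunk at a time. The helper below is a generic sketch; `batched` and the batch size of 50 are illustrative assumptions, and this README does not say whether the transform exposes a batching option.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder IDs standing in for 145 genome directories.
genome_ids = [f"genome_{i:03d}" for i in range(145)]
batch_sizes = [len(batch) for batch in batched(genome_ids, 50)]
print(batch_sizes)
```

Only one batch of annotations needs to be held in memory at a time, at the cost of merging the per-batch outputs afterwards.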

## Files

- `bakta.py` - Main BaktaTransform class
- `utils.py` - Helper functions for parsing and processing
- `create_samn_mapping.py` - Script to generate SAMN → NCBITaxon mappings
- `samn_to_ncbitaxon.tsv` - Mapping file (generated by script)
- `tmp/` - Temporary processing files

## Integration with KG-Microbe

The Bakta transform is registered in `kg_microbe/transform.py` and configured in `merge.yaml`. It integrates with:

- **BacDive** - May share overlapping SAMN/organism IDs
- **Ontologies** - Uses GO, EC terms from ontology transforms
- **MediaDive** - Complementary organism-level data