KBase Genome Converter

A Python script that creates KBase Genome objects in several ways:

Fetches genome data from the BV-BRC (formerly PATRIC) API
Loads and validates local genome JSON files
Builds genomes from locally saved BV-BRC feature files and contig FASTA files
Creates synthetic genomes by merging multiple source genomes

Features

BV-BRC API Mode

Fetches genome metadata, contigs, and features from BV-BRC API
Converts to KBase Genome object format
Handles pagination for large feature sets
Retrieves DNA and protein sequences for all features
Captures ontology terms (GO, FIGFAM, PGFAM, PLFAM, RefSeq)
Creates CDS features automatically for protein-coding genes

Local Genome Mode

Loads genome from local JSON files
Checks the genome against the expected KBase Genome structure (this is not a validation against the KBase type specification)
Fills in missing required fields with sensible defaults

Synthetic Genome Mode

Merges multiple source genomes into a single synthetic genome
Collects unique functions across all source genomes
Calculates feature probabilities (frequency across source genomes)
Creates one contig per unique function
Computes average GC content
Generates genome MD5 from sorted protein MD5 hashes
Adapted from notebook code for ASV genome creation

General Features

Saves genome as JSON file locally
Optional FASTA file generation with feature sequences
Does not require KBase authentication or upload
Backward compatible with original command-line interface

Requirements

pip install requests numpy

Note: numpy is only required for synthetic genome mode (calculating average GC content).

Installation

For Command-Line Use

Simply download the script and run it directly.

For Notebook/Library Use

# Option 1: Add to Python path
import sys
sys.path.append('/path/to/GenomeImporter')
from bvbrc_to_kbase_genome import *

# Option 2: Run in notebook
%run bvbrc_to_kbase_genome.py

# Then use classes and functions directly
genome = load_genome_from_features('511145.183')

See NOTEBOOK_USAGE.md for detailed notebook examples.

Usage

The script supports four modes of operation:

Mode 1: Fetch from BV-BRC API

# Basic usage
python bvbrc_to_kbase_genome.py --api <genome_id>

# With custom output file
python bvbrc_to_kbase_genome.py --api <genome_id> --output my_genome.json

# Also generate FASTA file
python bvbrc_to_kbase_genome.py --api <genome_id> --fasta features.fasta

Examples:

# E. coli K-12 MG1655
python bvbrc_to_kbase_genome.py --api 511145.183

# Mycobacterium tuberculosis H37Rv with FASTA output
python bvbrc_to_kbase_genome.py --api 83332.133 --output mtb.json --fasta mtb.fasta

Mode 2: Load from Local Genome File

# Load and validate existing genome JSON
python bvbrc_to_kbase_genome.py --local genome.json

# With custom output
python bvbrc_to_kbase_genome.py --local genome.json --output validated_genome.json

Examples:

# Validate a genome file
python bvbrc_to_kbase_genome.py --local my_genome.json

# Validate and generate FASTA
python bvbrc_to_kbase_genome.py --local my_genome.json --fasta features.fasta
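
As a rough illustration of what "filling in missing required fields with sensible defaults" can look like, here is a minimal sketch. The `DEFAULTS` table and the `fill_defaults` name are assumptions made for this example: the keys are taken from the Output Format section of this README, but the defaults the script actually applies may differ.

```python
import copy

# Hypothetical defaults, modeled on the Output Format section of this README.
# The script's real default values are not documented here and may differ.
DEFAULTS = {
    "genetic_code": 11,
    "molecule_type": "DNA",
    "assembly_ref": "",
    "taxon_ref": "",
    "features": [],
    "non_coding_features": [],
    "cdss": [],
    "mrnas": [],
    "publications": [],
    "warnings": [],
}


def fill_defaults(genome):
    """Return a copy of `genome` with any missing keys set to a default."""
    filled = dict(genome)
    for key, default in DEFAULTS.items():
        if key not in filled:
            # Deep-copy so list defaults are never shared between genomes
            filled[key] = copy.deepcopy(default)
    return filled


partial = {"id": "my_genome", "genetic_code": 4}
complete = fill_defaults(partial)
print(sorted(complete))
```

Existing values are never overwritten; only absent keys receive a default.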

Mode 3: Load from Features Directory (BV-BRC Local Files)

Load genomes from local BV-BRC feature files without re-downloading from the API.

File Structure Expected:

features/
  └── {genome_id}.json    # Feature metadata from BV-BRC API
genomes/
  └── {genome_id}.fna     # Genome sequences in FASTA format

Usage:

# Basic usage (assumes features/ and genomes/ in current directory)
python bvbrc_to_kbase_genome.py --features <genome_id>

# With custom directories
python bvbrc_to_kbase_genome.py --features <genome_id> \
  --features-dir path/to/features \
  --genomes-dir path/to/genomes

# With taxonomy and scientific name
python bvbrc_to_kbase_genome.py --features 511145.183 \
  --taxonomy "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia" \
  --scientific-name "Escherichia coli K-12 MG1655" \
  --output ecoli_genome.json

Examples:

# Load genome from local features directory
python bvbrc_to_kbase_genome.py --features 511145.183

# With all options
python bvbrc_to_kbase_genome.py --features 1110693.3 \
  --features-dir /path/to/features \
  --genomes-dir /path/to/genomes \
  --taxonomy "Bacteria; Firmicutes" \
  --scientific-name "Bacillus subtilis" \
  --output bacillus.json \
  --fasta bacillus.fasta

What's Loaded:

Features: JSON array from features/{genome_id}.json containing:
- patric_id - PATRIC feature identifier
- product - Gene product description
- feature_type - CDS, rRNA, tRNA, etc.
- pgfam_id, plfam_id, figfam_id - Family identifiers
- annotation - Annotation source
Sequences: FASTA from genomes/{genome_id}.fna containing genome contigs
Calculated: GC content, genome MD5, DNA size, contig counts
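
The calculated values listed above can be derived from the contig FASTA alone. The following is a minimal sketch under stated assumptions: `parse_fasta` and `contig_stats` are hypothetical helper names, the contig ID is assumed to be the first token of each FASTA header, and the genome MD5 is left out because this README does not specify how Mode 3 computes it.

```python
def parse_fasta(text):
    """Parse FASTA text into {contig_id: sequence}, preserving file order."""
    contigs = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the contig ID is the first whitespace-delimited token
            current_id = line[1:].split()[0]
            contigs[current_id] = []
        elif current_id is not None:
            contigs[current_id].append(line.upper())
    return {cid: "".join(parts) for cid, parts in contigs.items()}


def contig_stats(contigs):
    """Compute DNA size, contig counts and lengths, and GC fraction."""
    total = sum(len(seq) for seq in contigs.values())
    gc = sum(seq.count("G") + seq.count("C") for seq in contigs.values())
    return {
        "dna_size": total,
        "num_contigs": len(contigs),
        "contig_ids": list(contigs),
        "contig_lengths": [len(seq) for seq in contigs.values()],
        "gc_content": gc / total if total else 0.0,
    }


example = ">contig1 test\nATGC\nGGCC\n>contig2\nATAT\n"
stats = contig_stats(parse_fasta(example))
print(stats)
```

In practice the FASTA text would be read from genomes/{genome_id}.fna rather than from an inline string.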

Note: Feature JSON files downloaded from the BV-BRC API with a limited set of select fields do not include sequence data or precise location coordinates, so in this mode the script creates features with metadata only. For complete feature sequences, use Mode 1 (API), which fetches the full data.

Mode 4: Create Synthetic Genome (from Multiple Genomes)

# Create synthetic genome from multiple source genomes
python bvbrc_to_kbase_genome.py --synthetic <asv_id> --genomes genome1.json,genome2.json,genome3.json

# With taxonomy and template
python bvbrc_to_kbase_genome.py --synthetic ASV_001 \
  --genomes g1.json,g2.json,g3.json \
  --taxonomy "Bacteria; Firmicutes; Bacilli" \
  --template template_genome.json \
  --output asv_001.json

Examples:

# Create synthetic genome from 3 source genomes
python bvbrc_to_kbase_genome.py --synthetic ASV_12345 \
  --genomes genome1.json,genome2.json,genome3.json

# With all options
python bvbrc_to_kbase_genome.py --synthetic genus.species \
  --genomes /path/to/g*.json \
  --taxonomy "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales" \
  --output synthetic_genome.json \
  --fasta synthetic_features.fasta

Legacy Mode (Backward Compatible)

The original command-line interface is still supported:

# Old style (automatically uses BV-BRC API)
python bvbrc_to_kbase_genome.py <genome_id>
python bvbrc_to_kbase_genome.py <genome_id> <output_file>

Output Format

The script generates a KBase Genome object with the following structure:

{
  "id": "genome_id",
  "scientific_name": "Organism name",
  "domain": "Bacteria|Archaea|Eukaryota",
  "taxonomy": "taxonomic lineage",
  "genetic_code": 11,
  "dna_size": 12345678,
  "num_contigs": 10,
  "contig_ids": ["contig1", "contig2", ...],
  "contig_lengths": [1000, 2000, ...],
  "gc_content": 0.52,
  "md5": "genome_md5_hash",
  "molecule_type": "DNA",
  "source": "PATRIC",
  "source_id": "original_genome_id",
  "assembly_ref": "",
  "features": [...],
  "non_coding_features": [...],
  "cdss": [...],
  "mrnas": [],
  "feature_counts": {...},
  "publications": [],
  "genome_tiers": ["ExternalDB", "User"],
  "warnings": [],
  "taxon_ref": ""
}

Feature Structure

Each feature contains:

{
  "id": "genome_id_0",
  "type": "CDS|gene|tRNA|rRNA|...",
  "location": [["contig_id", start, strand, length]],
  "functions": ["functional annotation"],
  "aliases": [["source", "value"]],
  "dna_sequence": "ATCG...",
  "dna_sequence_length": 1000,
  "md5": "sequence_md5",
  "protein_translation": "MKTL...",
  "protein_translation_length": 333,
  "protein_md5": "protein_md5"
}
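
Given features shaped like the structure above, a FASTA file of feature sequences can be produced with a few lines of Python. This is a minimal sketch, not the script's own writer: `features_to_fasta` is a hypothetical name, and the header layout (feature ID followed by its functions) and the line width are assumptions.

```python
def features_to_fasta(features, use_protein=False, width=60):
    """Render features as FASTA text; features lacking the sequence are skipped."""
    key = "protein_translation" if use_protein else "dna_sequence"
    lines = []
    for feature in features:
        seq = feature.get(key, "")
        if not seq:
            continue  # e.g. metadata-only features or non-coding features
        function = "; ".join(feature.get("functions", []))
        # Assumed header layout: ">" + feature ID + space + functions
        lines.append(f">{feature['id']} {function}".rstrip())
        # Wrap the sequence at `width` characters per line
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""


features = [
    {"id": "g_0", "functions": ["DNA polymerase"],
     "dna_sequence": "ATGAAA", "protein_translation": "MK"},
    {"id": "g_1", "functions": [], "dna_sequence": ""},
]
dna_fasta = features_to_fasta(features, width=4)
protein_fasta = features_to_fasta(features, use_protein=True)
print(dna_fasta)
```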

What's Included

BV-BRC API Mode

Genome Metadata: Name, taxonomy, GC content, genetic code
Contigs: All genome sequences with IDs and lengths
Features: All genes, CDS, RNA features with:
- DNA and protein sequences
- Functional annotations
- PATRIC and RefSeq identifiers
- Location information (contig, start, strand, length)
- Ontology terms (GO, FIGFAM, PGFAM, PLFAM)
CDS Features: Automatically generated for all protein-coding genes
Feature Statistics: Counts by feature type
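
Counting features by type is straightforward once the genome JSON is loaded. The sketch below is illustrative only: `count_feature_types` is a hypothetical name, it assumes each entry in `features` and `non_coding_features` carries a `type` field as shown under Feature Structure, and the keys the script actually writes into `feature_counts` may be different.

```python
from collections import Counter


def count_feature_types(genome):
    """Tally features by their `type` field across coding and non-coding lists."""
    counts = Counter()
    for key in ("features", "non_coding_features"):
        for feature in genome.get(key, []):
            counts[feature.get("type", "unknown")] += 1
    return dict(counts)


genome = {
    "features": [{"id": "g_0", "type": "gene"}, {"id": "g_1", "type": "gene"}],
    "non_coding_features": [{"id": "g_2", "type": "tRNA"}],
}
type_counts = count_feature_types(genome)
print(type_counts)
```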

Synthetic Genome Mode

Merged Features: Unique functions collected from all source genomes
Feature Probabilities: Each function's frequency across source genomes
Simplified Assembly: One contig per unique function
Aggregated Metadata: Average GC content, combined DNA size
Provenance: Source genome IDs stored in source_id field
Genome MD5: Calculated from sorted protein MD5 hashes

What's NOT Included

assembly_ref: Left as an empty string because no Assembly object is created
taxon_ref: Left blank (would require KBase taxonomy lookup)
No upload to KBase workspace
No Assembly object creation

Synthetic Genome Workflow

The synthetic genome mode (adapted from the BV_BRC-Copy1.ipynb notebook) works as follows:

Load Source Genomes: Reads multiple genome JSON files
Collect Unique Functions:
- Iterates through all features in all source genomes
- For each unique function annotation, creates one feature in the synthetic genome
- Tracks how many source genomes contain each function
Calculate Probabilities:
- Each function's probability = (# genomes with function) / (total genomes)
- Stored for downstream analysis (not in the genome object itself)
Create Simplified Assembly:
- One contig per unique feature
- Contig ID: {asv_id}_{index}.contig
- Feature location: entire contig (start=1, length=DNA sequence length)
Aggregate Metadata:
- Average GC content across all source genomes
- Sum of all feature DNA lengths for total DNA size
- Taxonomy from user input or first source genome
Generate MD5:
- Collect protein MD5 from each feature
- Sort MD5 list
- Calculate MD5 hash of sorted, semicolon-joined MD5 strings
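
The probability, contig ID, and MD5 steps above can be sketched as follows. The function names are hypothetical and this is not the script's code; it assumes each feature carries `functions` and `protein_md5` fields as described under Feature Structure, and the starting value of the contig index is a guess.

```python
import hashlib


def function_probabilities(genomes):
    """Return {function: fraction of source genomes that contain it}."""
    counts = {}
    for genome in genomes:
        # A set ensures a function is counted at most once per genome
        seen = set()
        for feature in genome.get("features", []):
            seen.update(feature.get("functions", []))
        for function in seen:
            counts[function] = counts.get(function, 0) + 1
    total = len(genomes)
    return {f: n / total for f, n in counts.items()} if total else {}


def contig_id(asv_id, index):
    """Build a contig ID in the {asv_id}_{index}.contig pattern."""
    return f"{asv_id}_{index}.contig"


def genome_md5(features):
    """MD5 of the sorted, semicolon-joined protein MD5 strings."""
    md5s = sorted(f["protein_md5"] for f in features if f.get("protein_md5"))
    return hashlib.md5(";".join(md5s).encode("utf-8")).hexdigest()


genomes = [
    {"features": [{"functions": ["A"]}, {"functions": ["B"]}]},
    {"features": [{"functions": ["A"]}, {"functions": ["A"]}]},
    {"features": [{"functions": ["A", "C"]}]},
]
probs = function_probabilities(genomes)
print(probs["A"], contig_id("ASV_001", 0))
```

Sorting the MD5 list before hashing makes the genome MD5 independent of the order in which features were collected.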

This approach creates a consensus genome representing the union of functions found across multiple related genomes, useful for:

ASV (Amplicon Sequence Variant) genome reconstruction
Pangenome or core genome representation
Genus-level representative genomes
Synthetic communities

Finding BV-BRC Genome IDs

Visit https://www.bv-brc.org/
Search for your organism
Click on the genome
The genome ID is shown in the overview (format: XXXXX.YYY)

API Endpoints Used

The script queries these BV-BRC API endpoints:

/genome/ - Genome metadata
/genome_sequence/ - Contig sequences
/genome_feature/ - Gene/feature annotations (paginated)
/feature_sequence/ - DNA and protein sequences (batched)

Notes

SSL verification is disabled (following the Perl implementation pattern)
Features are retrieved in batches of 10,000
Sequences are fetched in batches of 100 by MD5 hash
Large genomes may take several minutes to process
The output JSON file can be quite large (tens of MB for typical bacterial genomes)
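
The two batching patterns mentioned above (pages of 10,000 features, groups of 100 MD5 hashes) can be sketched without touching the network. Everything named here is hypothetical: `fetch_all` takes an injected `fetch_page(offset, limit)` callable standing in for an API request, and the real script's request parameters and stop condition are not described in this README.

```python
def fetch_all(fetch_page, page_size=10_000):
    """Collect all records from a paginated source.

    `fetch_page(offset, limit)` must return a list of at most `limit` records.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means nothing is left
        offset += page_size
    return records


def batches(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for an API call, so the sketch runs offline
fake_features = [{"id": i} for i in range(25)]


def fake_fetch(offset, limit):
    return fake_features[offset:offset + limit]


all_features = fetch_all(fake_fetch, page_size=10)
md5_batches = list(batches([f"md5_{i}" for i in range(250)], size=100))
print(len(all_features), [len(b) for b in md5_batches])
```

Passing the fetch function in as a parameter keeps the paging logic testable independently of the HTTP layer.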

Troubleshooting

Genome ID not found

Verify the genome ID exists at https://www.bv-brc.org/
Ensure the ID format is correct (e.g., 511145.183)
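
A quick format check can catch typos before any request is made. This is a minimal sketch with a hypothetical helper name; it only tests the digits-dot-digits shape of IDs such as 511145.183 and cannot tell whether the genome actually exists in BV-BRC.

```python
import re

# Shape check only: one or more digits, a dot, one or more digits
GENOME_ID_PATTERN = re.compile(r"\d+\.\d+")


def looks_like_genome_id(value):
    """Return True if `value` has the digits.digits shape of a genome ID."""
    return GENOME_ID_PATTERN.fullmatch(value.strip()) is not None


for candidate in ("511145.183", "511145", "ecoli.1"):
    print(candidate, looks_like_genome_id(candidate))
```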

SSL/Certificate errors

The script disables SSL verification by default, so certificate errors are unlikely to originate in the script itself
If you still encounter connection errors, check your network connection

Memory issues with large genomes

The script loads all data into memory
For very large genomes, you may need to increase available RAM

Integration with KBase

This script generates JSON that is compatible with KBase Genome objects but does NOT:

Upload to KBase workspace
Create Assembly objects
Validate against KBase type specs
Require KBase authentication

To upload the resulting JSON to KBase, you would need to:

Create an Assembly object separately
Update the assembly_ref field
Use KBase workspace client to save the object
Validate against the KBase.Genome type specification

License

This script is part of the GenomeImporter repository and follows the same license.
