A Python script that creates KBase Genome objects from multiple sources:
- Fetches genome data from the BV-BRC (formerly PATRIC) API
- Loads and validates local genome JSON files
- Creates synthetic genomes by merging multiple source genomes
- Fetches genome metadata, contigs, and features from BV-BRC API
- Converts to KBase Genome object format
- Handles pagination for large feature sets
- Retrieves DNA and protein sequences for all features
- Captures ontology terms (GO, FIGFAM, PGFAM, PLFAM, RefSeq)
- Creates CDS features automatically for protein-coding genes
- Loads genome from local JSON files
- Validates and ensures KBase format compliance
- Fills in missing required fields with sensible defaults
- Merges multiple source genomes into a single synthetic genome
- Collects unique functions across all source genomes
- Calculates feature probabilities (frequency across source genomes)
- Creates one contig per unique function
- Computes average GC content
- Generates genome MD5 from sorted protein sequences
- Adapted from notebook code for ASV genome creation
- Saves genome as JSON file locally
- Optional FASTA file generation with feature sequences
- Does not require KBase authentication or upload
- Backward compatible with original command-line interface
pip install requests numpy

Note: numpy is only required for synthetic genome mode (calculating average GC content).
Simply download the script and run it directly.
# Option 1: Add to Python path
import sys
sys.path.append('/path/to/GenomeImporter')
from bvbrc_to_kbase_genome import *
# Option 2: Run in notebook
%run bvbrc_to_kbase_genome.py
# Then use classes and functions directly
genome = load_genome_from_features('511145.183')

See NOTEBOOK_USAGE.md for detailed notebook examples.
The script supports four modes of operation:
# Basic usage
python bvbrc_to_kbase_genome.py --api <genome_id>
# With custom output file
python bvbrc_to_kbase_genome.py --api <genome_id> --output my_genome.json
# Also generate FASTA file
python bvbrc_to_kbase_genome.py --api <genome_id> --fasta features.fasta

Examples:
# E. coli K-12 MG1655
python bvbrc_to_kbase_genome.py --api 511145.183
# Mycobacterium tuberculosis H37Rv with FASTA output
python bvbrc_to_kbase_genome.py --api 83332.133 --output mtb.json --fasta mtb.fasta

# Load and validate existing genome JSON
python bvbrc_to_kbase_genome.py --local genome.json
# With custom output
python bvbrc_to_kbase_genome.py --local genome.json --output validated_genome.json

Examples:
# Validate a genome file
python bvbrc_to_kbase_genome.py --local my_genome.json
# Validate and generate FASTA
python bvbrc_to_kbase_genome.py --local my_genome.json --fasta features.fasta

Load genomes from local BV-BRC feature files without re-downloading from the API.
File Structure Expected:
features/
└── {genome_id}.json # Feature metadata from BV-BRC API
genomes/
└── {genome_id}.fna # Genome sequences in FASTA format
Usage:
# Basic usage (assumes features/ and genomes/ in current directory)
python bvbrc_to_kbase_genome.py --features <genome_id>
# With custom directories
python bvbrc_to_kbase_genome.py --features <genome_id> \
--features-dir path/to/features \
--genomes-dir path/to/genomes
# With taxonomy and scientific name
python bvbrc_to_kbase_genome.py --features 511145.183 \
--taxonomy "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia" \
--scientific-name "Escherichia coli K-12 MG1655" \
--output ecoli_genome.json

Examples:
# Load genome from local features directory
python bvbrc_to_kbase_genome.py --features 511145.183
# With all options
python bvbrc_to_kbase_genome.py --features 1110693.3 \
--features-dir /path/to/features \
--genomes-dir /path/to/genomes \
--taxonomy "Bacteria; Firmicutes" \
--scientific-name "Bacillus subtilis" \
--output bacillus.json \
--fasta bacillus.fasta

What's Loaded:
- Features: JSON array from features/{genome_id}.json containing:
  - patric_id - PATRIC feature identifier
  - product - Gene product description
  - feature_type - CDS, rRNA, tRNA, etc.
  - pgfam_id, plfam_id, figfam_id - Family identifiers
  - annotation - Annotation source
- Sequences: FASTA from genomes/{genome_id}.fna containing genome contigs
- Calculated: GC content, genome MD5, DNA size, contig counts
Note: The features JSON files from BV-BRC API (downloaded with limited select fields) don't include sequence data or precise location coordinates. The script creates features with metadata only. For complete feature sequences, use Mode 1 (API) which fetches full data.
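The calculated values listed above can be derived from the two local files alone. The following is a minimal sketch of that idea, not the script's actual code: the helper names `read_fasta` and `summarize_local_genome` are hypothetical, and it assumes the features file is a plain JSON array and the FASTA headers carry the contig ID as their first token.

```python
import json
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into an ordered {contig_id: sequence} dict."""
    contigs = {}
    current_id = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if current_id is not None:
                    contigs[current_id] = "".join(chunks)
                # Assumption: the contig ID is the first token of the header.
                tokens = line[1:].split()
                current_id = tokens[0] if tokens else f"contig_{len(contigs) + 1}"
                chunks = []
            else:
                chunks.append(line.upper())
    if current_id is not None:
        contigs[current_id] = "".join(chunks)
    return contigs


def summarize_local_genome(genome_id, features_dir="features", genomes_dir="genomes"):
    """Compute contig counts, DNA size and GC content from local files."""
    features_path = Path(features_dir) / f"{genome_id}.json"
    features = json.loads(features_path.read_text())
    contigs = read_fasta(Path(genomes_dir) / f"{genome_id}.fna")

    dna_size = sum(len(seq) for seq in contigs.values())
    gc_count = sum(seq.count("G") + seq.count("C") for seq in contigs.values())
    return {
        "num_features": len(features),
        "num_contigs": len(contigs),
        "contig_ids": list(contigs),
        "contig_lengths": [len(seq) for seq in contigs.values()],
        "dna_size": dna_size,
        # GC content as a fraction, matching the 0.52-style value in the output.
        "gc_content": gc_count / dna_size if dna_size else 0.0,
    }
```

Calling `summarize_local_genome("511145.183")` from a directory that contains `features/` and `genomes/` would return a dict whose keys mirror the corresponding fields of the genome object.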
# Create synthetic genome from multiple source genomes
python bvbrc_to_kbase_genome.py --synthetic <asv_id> --genomes genome1.json,genome2.json,genome3.json
# With taxonomy and template
python bvbrc_to_kbase_genome.py --synthetic ASV_001 \
--genomes g1.json,g2.json,g3.json \
--taxonomy "Bacteria; Firmicutes; Bacilli" \
--template template_genome.json \
--output asv_001.json

Examples:
# Create synthetic genome from 3 source genomes
python bvbrc_to_kbase_genome.py --synthetic ASV_12345 \
--genomes genome1.json,genome2.json,genome3.json
# With all options
python bvbrc_to_kbase_genome.py --synthetic genus.species \
--genomes /path/to/g*.json \
--taxonomy "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales" \
--output synthetic_genome.json \
--fasta synthetic_features.fasta

The original command-line interface is still supported:
# Old style (automatically uses BV-BRC API)
python bvbrc_to_kbase_genome.py <genome_id>
python bvbrc_to_kbase_genome.py <genome_id> <output_file>

The script generates a KBase Genome object with the following structure:
{
"id": "genome_id",
"scientific_name": "Organism name",
"domain": "Bacteria|Archaea|Eukaryota",
"taxonomy": "taxonomic lineage",
"genetic_code": 11,
"dna_size": 12345678,
"num_contigs": 10,
"contig_ids": ["contig1", "contig2", ...],
"contig_lengths": [1000, 2000, ...],
"gc_content": 0.52,
"md5": "genome_md5_hash",
"molecule_type": "DNA",
"source": "PATRIC",
"source_id": "original_genome_id",
"assembly_ref": "",
"features": [...],
"non_coding_features": [...],
"cdss": [...],
"mrnas": [],
"feature_counts": {...},
"publications": [],
"genome_tiers": ["ExternalDB", "User"],
"warnings": [],
"taxon_ref": ""
}

Each feature contains:
{
"id": "genome_id_0",
"type": "CDS|gene|tRNA|rRNA|...",
"location": [["contig_id", start, strand, length]],
"functions": ["functional annotation"],
"aliases": [["source", "value"]],
"dna_sequence": "ATCG...",
"dna_sequence_length": 1000,
"md5": "sequence_md5",
"protein_translation": "MKTL...",
"protein_translation_length": 333,
"protein_md5": "protein_md5"
}

- Genome Metadata: Name, taxonomy, GC content, genetic code
- Contigs: All genome sequences with IDs and lengths
- Features: All genes, CDS, RNA features with:
  - DNA and protein sequences
  - Functional annotations
  - PATRIC and RefSeq identifiers
  - Location information (contig, start, strand, length)
  - Ontology terms (GO, FIGFAM, PGFAM, PLFAM)
- CDS Features: Automatically generated for all protein-coding genes
- Feature Statistics: Counts by feature type
- Merged Features: Unique functions collected from all source genomes
- Feature Probabilities: Each function's frequency across source genomes
- Simplified Assembly: One contig per unique function
- Aggregated Metadata: Average GC content, combined DNA size
- Provenance: Source genome IDs stored in the source_id field
- Genome MD5: Calculated from sorted protein MD5 hashes

- assembly_ref: Left blank (set to empty string) as per requirements
- taxon_ref: Left blank (would require KBase taxonomy lookup)
- No upload to KBase workspace
- No Assembly object creation
The synthetic genome mode (adapted from the BV_BRC-Copy1.ipynb notebook) works as follows:
- Load Source Genomes: Reads multiple genome JSON files
- Collect Unique Functions:
  - Iterates through all features in all source genomes
  - For each unique function annotation, creates one feature in the synthetic genome
  - Tracks how many source genomes contain each function
- Calculate Probabilities:
  - Each function's probability = (# genomes with function) / (total genomes)
  - Stored for downstream analysis (not in the genome object itself)
- Create Simplified Assembly:
  - One contig per unique feature
  - Contig ID: {asv_id}_{index}.contig
  - Feature location: entire contig (start=1, length=DNA sequence length)
- Aggregate Metadata:
  - Average GC content across all source genomes
  - Sum of all feature DNA lengths for total DNA size
  - Taxonomy from user input or first source genome
- Generate MD5:
  - Collect protein MD5 from each feature
  - Sort MD5 list
  - Calculate MD5 hash of sorted, semicolon-joined MD5 strings
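The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script's implementation: the function name `build_synthetic_summary` is hypothetical, the first feature seen for a function is used as its representative, indices start at 0, and the strand is set to "+" because the source does not say how it is chosen.

```python
import hashlib
from collections import defaultdict


def build_synthetic_summary(asv_id, source_genomes):
    """Merge KBase-style genome dicts into one feature per unique function."""
    total = len(source_genomes)
    genomes_with_function = defaultdict(set)  # function -> indices of genomes that have it
    representative = {}                       # function -> first feature seen (assumption)

    for index, genome in enumerate(source_genomes):
        for feature in genome.get("features", []):
            for function in feature.get("functions", []):
                genomes_with_function[function].add(index)
                representative.setdefault(function, feature)

    # Probability = (# genomes with function) / (total genomes)
    probabilities = {
        function: len(indices) / total
        for function, indices in genomes_with_function.items()
    }

    features = []
    for i, (function, source) in enumerate(representative.items()):
        dna = source.get("dna_sequence", "")
        contig_id = f"{asv_id}_{i}.contig"
        features.append({
            "id": f"{asv_id}_{i}",
            "functions": [function],
            # The feature spans its whole contig: start=1, length=len(dna).
            "location": [[contig_id, 1, "+", len(dna)]],
            "dna_sequence": dna,
            "dna_sequence_length": len(dna),
            "protein_md5": source.get("protein_md5", ""),
        })

    # Genome MD5: hash of the sorted, semicolon-joined protein MD5 strings.
    protein_md5s = sorted(f["protein_md5"] for f in features if f["protein_md5"])
    genome_md5 = hashlib.md5(";".join(protein_md5s).encode("utf-8")).hexdigest()

    dna_size = sum(f["dna_sequence_length"] for f in features)
    return features, probabilities, genome_md5, dna_size
```

Sorting the protein MD5s before hashing makes the genome MD5 independent of the order in which source genomes are supplied.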
This approach creates a consensus genome representing the union of functions found across multiple related genomes, useful for:
- ASV (Amplicon Sequence Variant) genome reconstruction
- Pangenome or core genome representation
- Genus-level representative genomes
- Synthetic communities
- Visit https://www.bv-brc.org/
- Search for your organism
- Click on the genome
- The genome ID is shown in the overview (format: XXXXX.YYY)
The script queries these BV-BRC API endpoints:
- /genome/ - Genome metadata
- /genome_sequence/ - Contig sequences
- /genome_feature/ - Gene/feature annotations (paginated)
- /feature_sequence/ - DNA and protein sequences (batched)
- SSL verification is disabled (following the Perl implementation pattern)
- Features are retrieved in batches of 10,000
- Sequences are fetched in batches of 100 by MD5 hash
- Large genomes may take several minutes to process
- The output JSON file can be quite large (tens of MB for typical bacterial genomes)
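The paging and batching behaviour noted above follows a common pattern, sketched here without any network code. The callables `fetch_page` and `fetch_batch` are hypothetical stand-ins for the HTTP requests the script makes; only the page size of 10,000 and the batch size of 100 come from this document.

```python
def fetch_all_pages(fetch_page, page_size=10_000):
    """Collect records page by page until a short (or empty) page is returned.

    fetch_page(offset, limit) is a caller-supplied callable returning a list.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:
            break  # last page reached
        offset += page_size
    return records


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_sequences(md5s, fetch_batch, batch_size=100):
    """Look up sequences by MD5, one request per batch of unique hashes.

    fetch_batch(list_of_md5s) is a caller-supplied callable returning
    a dict that maps each MD5 to its sequence.
    """
    sequences = {}
    # De-duplicate first so identical sequences are requested only once.
    for batch in chunked(sorted(set(md5s)), batch_size):
        sequences.update(fetch_batch(batch))
    return sequences
```

Stopping on a short page avoids needing the total feature count up front, at the cost of one extra request when the count is an exact multiple of the page size.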
- Verify the genome ID exists at https://www.bv-brc.org/
- Ensure the ID format is correct (e.g., 511145.183)
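A quick local check can catch malformed IDs before any request is made. The sketch below assumes the XXXXX.YYY format means two groups of digits separated by a single dot; the helper name `is_valid_genome_id` is hypothetical and is not part of the script.

```python
import re

# Assumption: a genome ID is "<digits>.<digits>", e.g. 511145.183
GENOME_ID_PATTERN = re.compile(r"\d+\.\d+")


def is_valid_genome_id(genome_id):
    """Return True if the whole string matches the digits.digits format."""
    return GENOME_ID_PATTERN.fullmatch(genome_id) is not None
```

A well-formed ID can still be absent from BV-BRC, so this complements the lookup on the website rather than replacing it.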
- The script disables SSL verification by default
- If you encounter issues, check your network connection
- The script loads all data into memory
- For very large genomes, you may need to increase available RAM
This script generates JSON that is compatible with KBase Genome objects but does NOT:
- Upload to KBase workspace
- Create Assembly objects
- Validate against KBase type specs
- Require KBase authentication
To upload the resulting JSON to KBase, you would need to:
- Create an Assembly object separately
- Update the assembly_ref field
- Use KBase workspace client to save the object
- Validate against the KBase.Genome type specification
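The field update in the steps above is a plain dictionary edit on the saved JSON. The sketch below covers only that part and makes no KBase calls: both helper names are hypothetical, and the reference string passed in must come from an Assembly object you have created yourself.

```python
def blank_reference_fields(genome):
    """List the reference fields that this script leaves empty."""
    return [key for key in ("assembly_ref", "taxon_ref") if not genome.get(key)]


def with_assembly_ref(genome, assembly_ref):
    """Return a copy of the genome dict with assembly_ref filled in."""
    updated = dict(genome)  # shallow copy; the original dict is left untouched
    updated["assembly_ref"] = assembly_ref
    return updated
```

Running `blank_reference_fields` on the updated dict shows which references still need a value before the object is saved.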
This script is part of the GenomeImporter repository and follows the same license.