|
| 1 | +# CIRCpedia V3 Import Implementation |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +This document summarizes the implementation of CIRCpedia V3 circular RNA data import for the RNAcentral pipeline. |
| 6 | + |
| 7 | +## What is CIRCpedia V3? |
| 8 | + |
| 9 | +CIRCpedia V3 is a comprehensive circular RNA database published in Nucleic Acids Research (2025): |
| 10 | +- **2.6 million circular RNAs** from 20 species |
| 11 | +- Expression data from **2350 NGS and 63 TGS datasets** |
| 12 | +- Community-recommended nomenclature |
| 13 | +- Comprehensive genomic annotations |
| 14 | +- Available at: https://bits.fudan.edu.cn/circpediav3 |
| 15 | + |
| 16 | +**Reference**: Zhai SN, Zhang YY, Chen MH, Fu ZC, Chen LL, Ma XK, Yang L. DOI: 10.1093/nar/gkaf1039 |
| 17 | + |
| 18 | +## Implementation Overview |
| 19 | + |
| 20 | +A complete import pipeline has been implemented following RNAcentral's architecture patterns, including: |
| 21 | + |
| 22 | +### 1. Python Parser Module |
| 23 | +**Location**: `rnacentral_pipeline/databases/circpedia/` |
| 24 | + |
| 25 | +Files created: |
| 26 | +- `__init__.py` - Module initialization |
| 27 | +- `parser.py` - Main TSV/FASTA parsing logic using Polars |
| 28 | +- `helpers.py` - Helper functions for data processing |
| 29 | +- `README.md` - Comprehensive documentation |
| 30 | +- `example_annotation.txt` - Sample TSV annotation file |
| 31 | +- `example_sequences.fa` - Sample FASTA sequence file |
| 32 | + |
| 33 | +**Key Features**: |
| 34 | +- Parses TSV annotation files with circular RNA metadata |
| 35 | +- Loads sequences from FASTA files |
| 36 | +- Converts to RNAcentral Entry format |
| 37 | +- Handles taxonomy lookups with fallback mappings |
| 38 | +- Processes combined genomic location fields (chr:start-end(strand)) |
| 39 | +- Uses **Polars** for efficient dataframe operations (as requested) |
| 40 | +- Generates proper RNA type (SO:0000593 for circular RNA) |
| 41 | +- Uses CIRCpedia's own IDs with CIRCPEDIA: prefix for accessions |
| 42 | +- Creates direct links to circRNA detail pages |
| 43 | + |
| 44 | +### 2. CLI Integration |
| 45 | +**Location**: `rnacentral_pipeline/cli/circpedia.py` |
| 46 | + |
| 47 | +Command: |
| 48 | +```bash |
| 49 | +rnac circpedia parse <taxonomy> <annotation_file> <fasta_file> <output> [--assembly ASSEMBLY] |
| 50 | +``` |
| 51 | + |
| 52 | +Registered in: `rnacentral_pipeline/cli/__init__.py` |
| 53 | + |
| 54 | +### 3. Nextflow Workflow |
| 55 | +**Location**: `workflows/databases/circpedia.nf` |
| 56 | + |
| 57 | +Workflow processes: |
| 58 | +- `fetch_data`: Downloads CIRCpedia annotation and FASTA files from configured sources |
| 59 | +- `parse_data`: Runs parser with taxonomy context and both input files |
| 60 | +- Outputs standard CSV files for RNAcentral loading |
| 61 | + |
| 62 | +Integrated into: `workflows/parse-databases.nf` |
| 63 | + |
| 64 | +### 4. Configuration |
| 65 | +**Location**: `config/databases.config` |
| 66 | + |
| 67 | +Configuration added: |
| 68 | +```groovy |
| 69 | +circpedia { |
| 70 | + run = false // Disabled by default |
| 71 | + needs_taxonomy = true // Requires taxonomy database |
| 72 | + process.directives.memory = 8.GB |
| 73 | + remote { |
| 74 | + annotation = '/nfs/production/agb/rnacentral/provided-data/circpedia/circpedia_v3_annotation.txt' |
| 75 | + fasta = '/nfs/production/agb/rnacentral/provided-data/circpedia/circpedia_v3_sequences.fa' |
| 76 | + } |
| 77 | + assembly = 'GRCh38' |
| 78 | +} |
| 79 | +``` |
| 80 | + |
| 81 | +### 5. Unit Tests |
| 82 | +**Location**: `tests/databases/circpedia/` |
| 83 | + |
| 84 | +Files created: |
| 85 | +- `__init__.py` |
| 86 | +- `helpers_test.py` - Tests for helper functions (35 test cases) |
| 87 | +- `parser_test.py` - Tests for parser logic (13 test cases) |
| 88 | + |
| 89 | +**Test Design**: |
| 90 | +- Uses minimal mock data |
| 91 | +- No network dependencies |
| 92 | +- No RNAcentral database dependencies |
| 93 | +- No large test data files |
| 94 | +- All tests can run in isolation |
| 95 | + |
| 96 | +## Data Flow |
| 97 | + |
| 98 | +``` |
| 99 | +CIRCpedia Annotation (TSV) + Sequences (FASTA) |
| 100 | + ↓ |
| 101 | +Nextflow: fetch_data |
| 102 | + ↓ |
| 103 | +Nextflow: parse_data |
| 104 | + ↓ |
| 105 | +Python: parser.parse() |
| 106 | + ↓ |
| 107 | +1. Load sequences from FASTA into dictionary |
| 108 | +2. Read TSV with Polars |
| 109 | +3. For each row: |
| 110 | + - Parse combined location field (chr:start-end(strand)) |
| 111 | + - Lookup taxonomy ID |
| 112 | + - Get sequence from FASTA dictionary |
| 113 | + - Create Entry object with CIRCPEDIA: prefix |
| 114 | + ↓ |
| 115 | +EntryWriter |
| 116 | + ↓ |
| 117 | +Standard RNAcentral CSV files: |
| 118 | + - accessions.csv |
| 119 | + - seq_short.csv / seq_long.csv |
| 120 | + - regions.csv |
| 121 | + - references.csv |
| 122 | + - etc. |
| 123 | +``` |
| 124 | + |
| 125 | +## Expected Data Format |
| 126 | + |
| 127 | +### Annotation File (TSV) |
| 128 | + |
| 129 | +#### Required Columns |
| 130 | +- `circID` - CIRCpedia circular RNA ID |
| 131 | +- `species` - Species name (e.g., "Homo sapiens") |
| 132 | +- `Location` - Combined genomic location with strand (e.g., "V:15874634-15876408(-)") |
| 133 | + |
| 134 | +#### Optional Columns |
| 135 | +- `gene_Refseq` - RefSeq gene identifier |
| 136 | +- `gene_Ensembl` - Ensembl gene identifier |
| 137 | +- `circname` - Circular RNA name |
| 138 | +- `length` - Length of circular RNA |
| 139 | +- `subcell_location` - Subcellular localization |
| 140 | +- `editing_site` - RNA editing sites |
| 141 | +- `DIS3_signal` - DIS3 degradation signals |
| 142 | +- `Orthology` - Orthology information |
| 143 | +- `TGS` - Third-generation sequencing support |
| 144 | +- `transcript_Ensembl` - Ensembl transcript ID |
| 145 | +- `transcript_Refseq` - RefSeq transcript ID |
| 146 | + |
| 147 | +### Sequence File (FASTA) |
| 148 | + |
| 149 | +Standard FASTA format with headers matching circIDs: |
| 150 | +``` |
| 151 | +>circID_1 |
| 152 | +ATCGATCG... |
| 153 | +>circID_2 |
| 154 | +GCTAGCTA... |
| 155 | +``` |
| 156 | + |
| 157 | +## Technology Stack |
| 158 | + |
| 159 | +Following requirements: |
| 160 | +- ✅ **Polars** for dataframe operations (instead of pandas) |
| 161 | +- ✅ **psycopg** ready for database queries (if needed) |
| 162 | +- ✅ Standard RNAcentral patterns (Entry, EntryWriter, etc.) |
| 163 | +- ✅ Nextflow for pipeline orchestration |
| 164 | +- ✅ Click for CLI interface |
| 165 | + |
| 166 | +## Testing |
| 167 | + |
| 168 | +Run tests: |
| 169 | +```bash |
| 170 | +# All circpedia tests |
| 171 | +pytest tests/databases/circpedia/ |
| 172 | + |
| 173 | +# Specific test files |
| 174 | +pytest tests/databases/circpedia/helpers_test.py |
| 175 | +pytest tests/databases/circpedia/parser_test.py |
| 176 | + |
| 177 | +# With coverage |
| 178 | +pytest --cov=rnacentral_pipeline.databases.circpedia tests/databases/circpedia/ |
| 179 | +``` |
| 180 | + |
| 181 | +## Files Created/Modified |
| 182 | + |
| 183 | +### New Files (12) |
| 184 | +``` |
| 185 | +rnacentral_pipeline/databases/circpedia/ |
| 186 | + ├── __init__.py |
| 187 | + ├── parser.py |
| 188 | + ├── helpers.py |
| 189 | + ├── README.md |
| 190 | + ├── example_annotation.txt |
| 191 | + └── example_sequences.fa |
| 192 | +
|
| 193 | +rnacentral_pipeline/cli/ |
| 194 | + └── circpedia.py |
| 195 | +
|
| 196 | +workflows/databases/ |
| 197 | + └── circpedia.nf |
| 198 | +
|
| 199 | +tests/databases/circpedia/ |
| 200 | + ├── __init__.py |
| 201 | + ├── helpers_test.py |
| 202 | + └── parser_test.py |
| 203 | +
|
| 204 | +CIRCPEDIA_IMPLEMENTATION.md |
| 205 | +``` |
| 206 | + |
| 207 | +### Modified Files (3) |
| 208 | +``` |
| 209 | +rnacentral_pipeline/cli/__init__.py |
| 210 | + - Added circpedia import |
| 211 | + - Registered circpedia CLI command |
| 212 | +
|
| 213 | +workflows/parse-databases.nf |
| 214 | + - Added circpedia workflow include |
| 215 | + - Added circpedia to workflow mix |
| 216 | +
|
| 217 | +config/databases.config |
| 218 | + - Added circpedia configuration |
| 219 | +``` |
| 220 | + |
| 221 | +## Usage Instructions |
| 222 | + |
| 223 | +### For Development/Testing |
| 224 | + |
| 225 | +1. **Prepare test data**: |
| 226 | + ```bash |
| 227 | + # Use example data |
| 228 | + cp rnacentral_pipeline/databases/circpedia/example_annotation.txt test_annotation.txt |
| 229 | + cp rnacentral_pipeline/databases/circpedia/example_sequences.fa test_sequences.fa |
| 230 | + ``` |
| 231 | + |
| 232 | +2. **Build taxonomy context** (if not already available): |
| 233 | + ```bash |
| 234 | + # This is normally done by the pipeline |
| 235 | + wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz |
| 236 | + tar xzf new_taxdump.tar.gz |
| 237 | + mkdir taxdump |
| 238 | + mv *.dmp taxdump |
| 239 | + rnac context build taxdump context.db |
| 240 | + ``` |
| 241 | + |
| 242 | +3. **Parse data**: |
| 243 | + ```bash |
| 244 | + rnac circpedia parse context.db test_annotation.txt test_sequences.fa output/ --assembly GRCh38 |
| 245 | + ``` |
| 246 | + |
| 247 | +### For Production |
| 248 | + |
| 249 | +1. **Enable in config**: |
| 250 | + Edit `config/databases.config`: |
| 251 | + ```groovy |
| 252 | + circpedia { |
| 253 | + run = true // Enable |
| 254 | + remote { |
| 255 | + annotation = '/path/to/circpedia_v3_annotation.txt' |
| 256 | + fasta = '/path/to/circpedia_v3_sequences.fa' |
| 257 | + } |
| 258 | + } |
| 259 | + ``` |
| 260 | + |
| 261 | +2. **Run pipeline**: |
| 262 | + ```bash |
| 263 | + nextflow run main.nf -profile standard |
| 264 | + ``` |
| 265 | + |
| 266 | +## Implementation Notes |
| 267 | + |
| 268 | +### Circular RNA Specifics |
| 269 | +- **RNA Type**: SO:0000593 (circular RNA from Sequence Ontology) |
| 270 | +- **Database**: CIRCPEDIA (uppercase as per RNAcentral convention) |
| 271 | +- **Accessions**: Use CIRCpedia's own IDs with CIRCPEDIA: prefix (e.g., "CIRCPEDIA:hsa_circ_0001_1:100-200") |
| 272 | +- **URLs**: Direct links to circRNA detail pages (https://bits.fudan.edu.cn/circpediav3/circrna/{circ_id}) |
| 273 | +- **Coordinate System**: 1-based, fully-closed (same as GFF/GTF format) |
| 274 | + |
| 275 | +### Sequence Handling |
| 276 | +- Sequences loaded from separate FASTA file |
| 277 | +- Sequences keyed by circID |
| 278 | +- DNA sequences (T not U) - automatically converted if needed |
| 279 | +- If circID missing from FASTA: entry skipped with warning |
| 280 | +- Production may need special handling for back-splice junctions |
| 281 | + |
| 282 | +### Species Support |
| 283 | +Includes fallback taxonomy mapping for 20 common species: |
| 284 | +- Human, Mouse, Rat, Zebrafish, Fruit fly, C. elegans |
| 285 | +- Yeast, Arabidopsis, Chicken, Chimp, Macaque, Dog |
| 286 | +- Cow, Pig, Xenopus, Medaka, Fugu, Sea urchin |
| 287 | +- Sea squirt, Mosquito |
| 288 | + |
| 289 | +### Performance Considerations |
| 290 | +- Uses Polars for efficient CSV processing |
| 291 | +- Memory: 8GB configured (adjustable based on data size) |
| 292 | +- Batch processing with progress logging every 10,000 rows |
| 293 | +- Taxonomy database: SqliteDict for fast lookups |
| 294 | + |
| 295 | +## Future Enhancements |
| 296 | + |
| 297 | +Potential production improvements: |
| 298 | + |
| 299 | +1. **Sequence Extraction** |
| 300 | + - Integrate with genome assemblies |
| 301 | + - Extract sequences from coordinates |
| 302 | + - Handle back-splice junctions |
| 303 | + |
| 304 | +2. **Expression Data** |
| 305 | + - Expand expression profile support |
| 306 | + - Store TPM/FPKM values |
| 307 | + - Link to expression atlas |
| 308 | + |
| 309 | +3. **Alternative Splicing** |
| 310 | + - Track alternative back-splicing events |
| 311 | + - Multiple isoforms per circRNA |
| 312 | + |
| 313 | +4. **Conservation** |
| 314 | + - Import conservation scores |
| 315 | + - Cross-species mappings |
| 316 | + |
| 317 | +5. **Data Quality** |
| 318 | + - Validation checks |
| 319 | + - Duplicate detection |
| 320 | + - Coordinate verification |
| 321 | + |
| 322 | +6. **Performance** |
| 323 | + - Chunked processing for large files |
| 324 | + - Parallel parsing |
| 325 | + - Optimized memory usage |
| 326 | + |
| 327 | +## Sources |
| 328 | + |
| 329 | +Research sources used: |
| 330 | + |
| 331 | +- [CIRCpedia v3 Paper - NAR 2025](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaf1039/8296757) |
| 332 | +- [CIRCpedia v2 Documentation](https://pmc.ncbi.nlm.nih.gov/articles/PMC6203687/) |
| 333 | +- [circAtlas Database Information](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02018-y) |
| 334 | + |
| 335 | +## License |
| 336 | + |
| 337 | +Copyright [2009-2025] EMBL-European Bioinformatics Institute |
| 338 | + |
| 339 | +Licensed under the Apache License, Version 2.0 |
| 340 | + |
| 341 | +--- |
| 342 | + |
| 343 | +**Implementation Status**: ✅ Complete (Local only - not pushed to remote as requested) |
| 344 | + |
| 345 | +**Branch**: `claude/add-circpedia-v3-m3I0I` |
| 346 | + |
| 347 | +**Ready for**: Code review, testing with real data, integration into pipeline |
0 commit comments