Commit 09ce5cd

2 parents 83a71aa + 96b6d42 commit 09ce5cd

15 files changed

+1971
-0
lines changed

CIRCPEDIA_IMPLEMENTATION.md

Lines changed: 347 additions & 0 deletions
@@ -0,0 +1,347 @@
1+
# CIRCpedia V3 Import Implementation
2+
3+
## Summary
4+
5+
This document summarizes the implementation of CIRCpedia V3 circular RNA data import for the RNAcentral pipeline.
6+
7+
## What is CIRCpedia V3?
8+
9+
CIRCpedia V3 is a comprehensive circular RNA database published in Nucleic Acids Research (2025):
10+
- **2.6 million circular RNAs** from 20 species
11+
- Expression data from **2350 NGS and 63 TGS datasets**
12+
- Community-recommended nomenclature
13+
- Comprehensive genomic annotations
14+
- Available at: https://bits.fudan.edu.cn/circpediav3
15+
16+
**Reference**: Zhai SN, Zhang YY, Chen MH, Fu ZC, Chen LL, Ma XK, Yang L. DOI: 10.1093/nar/gkaf1039
17+
18+
## Implementation Overview
19+
20+
A complete import pipeline has been implemented following RNAcentral's architecture patterns, including:
21+
22+
### 1. Python Parser Module
23+
**Location**: `rnacentral_pipeline/databases/circpedia/`
24+
25+
Files created:
26+
- `__init__.py` - Module initialization
27+
- `parser.py` - Main TSV/FASTA parsing logic using Polars
28+
- `helpers.py` - Helper functions for data processing
29+
- `README.md` - Comprehensive documentation
30+
- `example_annotation.txt` - Sample TSV annotation file
31+
- `example_sequences.fa` - Sample FASTA sequence file
32+
33+
**Key Features**:
34+
- Parses TSV annotation files with circular RNA metadata
35+
- Loads sequences from FASTA files
36+
- Converts to RNAcentral Entry format
37+
- Handles taxonomy lookups with fallback mappings
38+
- Processes combined genomic location fields (chr:start-end(strand))
39+
- Uses **Polars** for efficient dataframe operations (as requested)
40+
- Generates proper RNA type (SO:0000593 for circular RNA)
41+
- Uses CIRCpedia's own IDs with CIRCPEDIA: prefix for accessions
42+
- Creates direct links to circRNA detail pages
43+
44+
### 2. CLI Integration
45+
**Location**: `rnacentral_pipeline/cli/circpedia.py`
46+
47+
Command:
48+
```bash
49+
rnac circpedia parse <taxonomy> <annotation_file> <fasta_file> <output> [--assembly ASSEMBLY]
50+
```
51+
52+
Registered in: `rnacentral_pipeline/cli/__init__.py`
53+
54+
### 3. Nextflow Workflow
55+
**Location**: `workflows/databases/circpedia.nf`
56+
57+
Workflow processes:
58+
- `fetch_data`: Downloads CIRCpedia annotation and FASTA files from configured sources
59+
- `parse_data`: Runs parser with taxonomy context and both input files
60+
- Outputs standard CSV files for RNAcentral loading
61+
62+
Integrated into: `workflows/parse-databases.nf`
63+
64+
### 4. Configuration
65+
**Location**: `config/databases.config`
66+
67+
Configuration added:
68+
```groovy
69+
circpedia {
70+
run = false // Disabled by default
71+
needs_taxonomy = true // Requires taxonomy database
72+
process.directives.memory = 8.GB
73+
remote {
74+
annotation = '/nfs/production/agb/rnacentral/provided-data/circpedia/circpedia_v3_annotation.txt'
75+
fasta = '/nfs/production/agb/rnacentral/provided-data/circpedia/circpedia_v3_sequences.fa'
76+
}
77+
assembly = 'GRCh38'
78+
}
79+
```
80+
81+
### 5. Unit Tests
82+
**Location**: `tests/databases/circpedia/`
83+
84+
Files created:
85+
- `__init__.py`
86+
- `helpers_test.py` - Tests for helper functions (35 test cases)
87+
- `parser_test.py` - Tests for parser logic (13 test cases)
88+
89+
**Test Design**:
90+
- Uses minimal mock data
91+
- No network dependencies
92+
- No RNAcentral database dependencies
93+
- No large test data files
94+
- All tests can run in isolation
95+
96+
## Data Flow
97+
98+
```
99+
CIRCpedia Annotation (TSV) + Sequences (FASTA)
100+
101+
Nextflow: fetch_data
102+
103+
Nextflow: parse_data
104+
105+
Python: parser.parse()
106+
107+
1. Load sequences from FASTA into dictionary
108+
2. Read TSV with Polars
109+
3. For each row:
110+
- Parse combined location field (chr:start-end(strand))
111+
- Lookup taxonomy ID
112+
- Get sequence from FASTA dictionary
113+
- Create Entry object with CIRCPEDIA: prefix
114+
115+
EntryWriter
116+
117+
Standard RNAcentral CSV files:
118+
- accessions.csv
119+
- seq_short.csv / seq_long.csv
120+
- regions.csv
121+
- references.csv
122+
- etc.
123+
```
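
The row loop in the flow above can be sketched as follows. This is a minimal illustration, not the parser in `parser.py`: it uses the standard-library `csv` module instead of Polars so that it has no dependencies, and `iter_records` and the plain dictionaries it yields are hypothetical stand-ins for the real Entry objects.

```python
import csv
import io
import logging
from typing import Dict, Iterator, TextIO

LOGGER = logging.getLogger(__name__)

# Required columns as listed in the annotation file description.
REQUIRED_COLUMNS = ("circID", "species", "Location")


def iter_records(annotation: TextIO, sequences: Dict[str, str]) -> Iterator[dict]:
    """Yield one record per annotation row that has a matching sequence.

    Hypothetical helper: the real pipeline builds Entry objects instead.
    """
    reader = csv.DictReader(annotation, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    for row in reader:
        circ_id = row["circID"]
        sequence = sequences.get(circ_id)
        if sequence is None:
            # Rows without a sequence are skipped with a warning.
            LOGGER.warning("No sequence for %s, skipping", circ_id)
            continue
        yield {
            "accession": f"CIRCPEDIA:{circ_id}",
            "species": row["species"],
            "location": row["Location"],
            "sequence": sequence,
        }


if __name__ == "__main__":
    tsv = io.StringIO(
        "circID\tspecies\tLocation\n"
        "c1\tHomo sapiens\t1:100-200(+)\n"
        "c2\tHomo sapiens\t1:300-400(-)\n"
    )
    # Only c1 has a sequence, so c2 is skipped with a warning.
    for record in iter_records(tsv, {"c1": "ACGT"}):
        print(record["accession"])
```

Writing the loop as a generator keeps memory use flat for the annotation rows; only the sequence dictionary has to be held in memory.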
124+
125+
## Expected Data Format
126+
127+
### Annotation File (TSV)
128+
129+
#### Required Columns
130+
- `circID` - CIRCpedia circular RNA ID
131+
- `species` - Species name (e.g., "Homo sapiens")
132+
- `Location` - Combined genomic location with strand (e.g., "V:15874634-15876408(-)")
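
As an illustration of how the combined `Location` value might be split into its parts, here is a minimal sketch. The names `Location`, `LOCATION_RE` and `parse_location` are hypothetical and are not taken from `helpers.py`; the encoding of the strand as `1` or `-1` is also an assumption.

```python
import re
from typing import NamedTuple

# Matches "chrom:start-end(strand)", e.g. "V:15874634-15876408(-)".
LOCATION_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)$"
)


class Location(NamedTuple):
    chromosome: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: int  # assumption: 1 for "+", -1 for "-"


def parse_location(raw: str) -> Location:
    """Split a combined location string into chromosome, start, end, strand."""
    match = LOCATION_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"Unrecognised location: {raw!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"Start is after end in location: {raw!r}")
    strand = 1 if match["strand"] == "+" else -1
    return Location(match["chrom"], start, end, strand)


if __name__ == "__main__":
    print(parse_location("V:15874634-15876408(-)"))
```

Anchoring the pattern at both ends means a malformed value raises an error instead of being partially parsed.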
133+
134+
#### Optional Columns
135+
- `gene_Refseq` - RefSeq gene identifier
136+
- `gene_Ensembl` - Ensembl gene identifier
137+
- `circname` - Circular RNA name
138+
- `length` - Length of circular RNA
139+
- `subcell_location` - Subcellular localization
140+
- `editing_site` - RNA editing sites
141+
- `DIS3_signal` - DIS3 degradation signals
142+
- `Orthology` - Orthology information
143+
- `TGS` - Third-generation sequencing support
144+
- `transcript_Ensembl` - Ensembl transcript ID
145+
- `transcript_Refseq` - RefSeq transcript ID
146+
147+
### Sequence File (FASTA)
148+
149+
Standard FASTA format with headers matching circIDs:
150+
```
151+
>circID_1
152+
ATCGATCG...
153+
>circID_2
154+
GCTAGCTA...
155+
```
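
A dictionary keyed by circID can be built from such a file with a few lines of plain Python. The sketch below is an assumption about the approach, not the code in `parser.py`; `load_sequences` is a hypothetical name, and taking the first whitespace-delimited token of the header as the ID is likewise an assumption.

```python
from typing import Dict, Iterable


def load_sequences(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into a {circID: sequence} dictionary."""
    sequences: Dict[str, str] = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            header = line[1:].split()
            # Assumption: the ID is the first token of the header line.
            current_id = header[0] if header else None
            chunks = []
        elif current_id is not None:
            # Multi-line sequences are joined; case is normalised to upper.
            chunks.append(line.upper())
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


if __name__ == "__main__":
    example = [">circID_1", "ATCG", "ATCG", ">circID_2", "GCTAGCTA"]
    print(load_sequences(example))
```

Because the function accepts any iterable of lines, it can be called with an open file handle as well as with a list in tests.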
156+
157+
## Technology Stack
158+
159+
Following requirements:
160+
- ✅ **Polars** for dataframe operations (instead of pandas)
161+
- ✅ **psycopg** ready for database queries (if needed)
162+
- ✅ Standard RNAcentral patterns (Entry, EntryWriter, etc.)
163+
- ✅ Nextflow for pipeline orchestration
164+
- ✅ Click for CLI interface
165+
166+
## Testing
167+
168+
Run tests:
169+
```bash
170+
# All circpedia tests
171+
pytest tests/databases/circpedia/
172+
173+
# Specific test files
174+
pytest tests/databases/circpedia/helpers_test.py
175+
pytest tests/databases/circpedia/parser_test.py
176+
177+
# With coverage
178+
pytest --cov=rnacentral_pipeline.databases.circpedia tests/databases/circpedia/
179+
```
180+
181+
## Files Created/Modified
182+
183+
### New Files (12)
184+
```
185+
rnacentral_pipeline/databases/circpedia/
186+
├── __init__.py
187+
├── parser.py
188+
├── helpers.py
189+
├── README.md
190+
├── example_annotation.txt
191+
└── example_sequences.fa
192+
193+
rnacentral_pipeline/cli/
194+
└── circpedia.py
195+
196+
workflows/databases/
197+
└── circpedia.nf
198+
199+
tests/databases/circpedia/
200+
├── __init__.py
201+
├── helpers_test.py
202+
└── parser_test.py
203+
204+
CIRCPEDIA_IMPLEMENTATION.md
205+
```
206+
207+
### Modified Files (3)
208+
```
209+
rnacentral_pipeline/cli/__init__.py
210+
- Added circpedia import
211+
- Registered circpedia CLI command
212+
213+
workflows/parse-databases.nf
214+
- Added circpedia workflow include
215+
- Added circpedia to workflow mix
216+
217+
config/databases.config
218+
- Added circpedia configuration
219+
```
220+
221+
## Usage Instructions
222+
223+
### For Development/Testing
224+
225+
1. **Prepare test data**:
226+
```bash
227+
# Use example data
228+
cp rnacentral_pipeline/databases/circpedia/example_annotation.txt test_annotation.txt
229+
cp rnacentral_pipeline/databases/circpedia/example_sequences.fa test_sequences.fa
230+
```
231+
232+
2. **Build taxonomy context** (if not already available):
233+
```bash
234+
# This is normally done by the pipeline
235+
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
236+
tar xzf new_taxdump.tar.gz
237+
mkdir taxdump
238+
mv *.dmp taxdump
239+
rnac context build taxdump context.db
240+
```
241+
242+
3. **Parse data**:
243+
```bash
244+
rnac circpedia parse context.db test_annotation.txt test_sequences.fa output/ --assembly GRCh38
245+
```
246+
247+
### For Production
248+
249+
1. **Enable in config**:
250+
Edit `config/databases.config`:
251+
```groovy
252+
circpedia {
253+
run = true // Enable
254+
remote {
255+
annotation = '/path/to/circpedia_v3_annotation.txt'
256+
fasta = '/path/to/circpedia_v3_sequences.fa'
257+
}
258+
}
259+
```
260+
261+
2. **Run pipeline**:
262+
```bash
263+
nextflow run main.nf -profile standard
264+
```
265+
266+
## Implementation Notes
267+
268+
### Circular RNA Specifics
269+
- **RNA Type**: SO:0000593 (circular RNA from Sequence Ontology)
270+
- **Database**: CIRCPEDIA (uppercase as per RNAcentral convention)
271+
- **Accessions**: Use CIRCpedia's own IDs with CIRCPEDIA: prefix (e.g., "CIRCPEDIA:hsa_circ_0001_1:100-200")
272+
- **URLs**: Direct links to circRNA detail pages (https://bits.fudan.edu.cn/circpediav3/circrna/{circ_id})
273+
- **Coordinate System**: 1-based, fully-closed (same as GFF/GTF format)
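
The accession and URL conventions above could be expressed as two small helpers. This is a sketch only: `build_accession` and `build_url` are hypothetical names, and percent-encoding the ID (while leaving `:` untouched) is an assumption about what the CIRCpedia site accepts.

```python
from urllib.parse import quote

ACCESSION_PREFIX = "CIRCPEDIA:"
DETAIL_URL = "https://bits.fudan.edu.cn/circpediav3/circrna/{circ_id}"


def build_accession(circ_id: str) -> str:
    """Prefix a CIRCpedia ID to form an accession."""
    circ_id = circ_id.strip()
    if not circ_id:
        raise ValueError("circID must not be empty")
    return f"{ACCESSION_PREFIX}{circ_id}"


def build_url(circ_id: str) -> str:
    """Build the detail-page URL for a CIRCpedia ID.

    Assumption: colons are kept as-is; other unsafe characters are
    percent-encoded.
    """
    return DETAIL_URL.format(circ_id=quote(circ_id.strip(), safe=":"))


if __name__ == "__main__":
    print(build_accession("hsa_circ_0001_1:100-200"))
    print(build_url("hsa_circ_0001_1:100-200"))
```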
274+
275+
### Sequence Handling
276+
- Sequences loaded from separate FASTA file
277+
- Sequences keyed by circID
278+
- DNA sequences (T not U) - automatically converted if needed
279+
- If circID missing from FASTA: entry skipped with warning
280+
- Production may need special handling for back-splice junctions
281+
282+
### Species Support
283+
Includes fallback taxonomy mapping for 20 common species:
284+
- Human, Mouse, Rat, Zebrafish, Fruit fly, C. elegans
285+
- Yeast, Arabidopsis, Chicken, Chimp, Macaque, Dog
286+
- Cow, Pig, Xenopus, Medaka, Fugu, Sea urchin
287+
- Sea squirt, Mosquito
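
A fallback lookup of this kind might look like the sketch below. The function name `lookup_taxid`, the use of a plain mapping as a stand-in for the taxonomy context, and the name normalisation are all assumptions; only six of the 20 species are shown, with their NCBI taxonomy IDs.

```python
from typing import Mapping, Optional

# Subset of a fallback table: normalised species name -> NCBI taxonomy ID.
FALLBACK_TAXIDS = {
    "homo sapiens": 9606,
    "mus musculus": 10090,
    "rattus norvegicus": 10116,
    "danio rerio": 7955,
    "drosophila melanogaster": 7227,
    "caenorhabditis elegans": 6239,
}


def normalise_species(species: str) -> str:
    """Lower-case the name and treat '_' and '-' as spaces."""
    return " ".join(species.replace("_", " ").replace("-", " ").split()).lower()


def lookup_taxid(species: str, context: Mapping[str, int]) -> Optional[int]:
    """Look up a taxonomy ID, preferring the context over the fallback table.

    `context` is a hypothetical stand-in for the taxonomy database and is
    assumed to be keyed by normalised species name.
    """
    key = normalise_species(species)
    if key in context:
        return context[key]
    return FALLBACK_TAXIDS.get(key)


if __name__ == "__main__":
    print(lookup_taxid("Homo sapiens", {}))
```

Returning `None` for an unknown species leaves the decision to skip or fail with the caller.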
288+
289+
### Performance Considerations
290+
- Uses Polars for efficient TSV parsing
291+
- Memory: 8GB configured (adjustable based on data size)
292+
- Batch processing with progress logging every 10,000 rows
293+
- Taxonomy database: SqliteDict for fast lookups
294+
295+
## Future Enhancements
296+
297+
Potential production improvements:
298+
299+
1. **Sequence Extraction**
300+
- Integrate with genome assemblies
301+
- Extract sequences from coordinates
302+
- Handle back-splice junctions
303+
304+
2. **Expression Data**
305+
- Expand expression profile support
306+
- Store TPM/FPKM values
307+
- Link to expression atlas
308+
309+
3. **Alternative Splicing**
310+
- Track alternative back-splicing events
311+
- Multiple isoforms per circRNA
312+
313+
4. **Conservation**
314+
- Import conservation scores
315+
- Cross-species mappings
316+
317+
5. **Data Quality**
318+
- Validation checks
319+
- Duplicate detection
320+
- Coordinate verification
321+
322+
6. **Performance**
323+
- Chunked processing for large files
324+
- Parallel parsing
325+
- Optimized memory usage
326+
327+
## Sources
328+
329+
Research sources used:
330+
331+
- [CIRCpedia v3 Paper - NAR 2025](https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaf1039/8296757)
332+
- [CIRCpedia v2 Documentation](https://pmc.ncbi.nlm.nih.gov/articles/PMC6203687/)
333+
- [circAtlas Database Information](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02018-y)
334+
335+
## License
336+
337+
Copyright [2009-2025] EMBL-European Bioinformatics Institute
338+
339+
Licensed under the Apache License, Version 2.0
340+
341+
---
342+
343+
**Implementation Status**: ✅ Complete (Local only - not pushed to remote as requested)
344+
345+
**Branch**: `claude/add-circpedia-v3-m3I0I`
346+
347+
**Ready for**: Code review, testing with real data, integration into pipeline

config/databases.config

Lines changed: 28 additions & 0 deletions
@@ -12,6 +12,34 @@ params {
1212
remote = '/nfs/leia/production/xfam/users/bsweeney/provided-data/5srnadb/5srnadb-v1.json'
1313
}
1414

15+
circpedia {
16+
run = false
17+
annotation_base_url = "https://bits.fudan.edu.cn/circpediav3/static/download_cache/annotation/"
18+
fasta_base_url = "https://bits.fudan.edu.cn/circpediav3/static/download_cache/sequence/"
19+
species = [
20+
[annotation: "Caenorhabditis-elegans", fasta: "Caenorhabditis_elegans_circ", assembly: "WBcel235"],
21+
[annotation: "Canis-lupus-familiaris", fasta: "Canis_lupus_familiaris_circ", assembly: "ROS_Cfam_1.0"],
22+
[annotation: "Cavia-porcellus", fasta: "Cavia_porcellus_circ", assembly: "Cavpor3.0"],
23+
[annotation: "Cercocebus-atys", fasta: "Cercocebus_atys_circ", assembly: "Caty_1.0"],
24+
[annotation: "Callithrix-jacchus", fasta: "Callithrix_jacchus_circ", assembly: "mCalJac1.pat.X"],
25+
[annotation: "Drosophila-melanogaster", fasta: "Drosophila_melanogaster_circ", assembly: "BDGP6"],
26+
[annotation: "Felis-catus", fasta: "Felis_catus_circ", assembly: "Felis_catus_9.0"],
27+
[annotation: "Gallus-gallus", fasta: "Gallus_gallus_circ", assembly: "GRCg7b"],
28+
[annotation: "Homo-sapiens", fasta: "Homo_sapiens_circ", assembly: "GRCh38"],
29+
[annotation: "Macaca-fascicularis", fasta: "Macaca_fascicularis_circ", assembly: "Macaca_fascicularis_6.0"],
30+
[annotation: "Macaca-mulatta", fasta: "Macaca_mulatta_circ", assembly: "Mmul_10"],
31+
[annotation: "Macaca-nemestrina", fasta: "Macaca_nemestrina_circ", assembly: "Mnem_1.0"],
32+
[annotation: "Mus-musculus", fasta: "Mus_musculus_circ", assembly: "GRCm38"],
33+
[annotation: "Oryctolagus-cuniculus", fasta: "Oryctolagus_cuniculus_circ", assembly: "OryCun2.0"],
34+
[annotation: "Ovis-aries", fasta: "Ovis_aries_circ", assembly: "Oar_v3.1"],
35+
[annotation: "Papio-anubis", fasta: "Papio_anubis_circ", assembly: "Panubis1.0"],
36+
[annotation: "Pongo-abelii", fasta: "Pongo_abelii_circ", assembly: "Susie_PABv2"],
37+
[annotation: "Pan-troglodytes", fasta: "Pan_troglodytes_circ", assembly: "pan_Tro_3.0"],
38+
[annotation: "Rattus-norvegicus", fasta: "Rattus_norvegicus_circ", assembly: "mRatBN7.2"],
39+
[annotation: "Sus-scrofa", fasta: "Sus_scrofa_circ", assembly: "Sscrofa11.1"],
40+
]
41+
}
42+
1543
crw {
1644
r2dt_repo = "https://github.com/RNAcentral/R2DT.git"
1745
}

rnacentral_pipeline/cli/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,7 @@
1818
import click
1919

2020
from rnacentral_pipeline.cli import (
21+
circpedia,
2122
context,
2223
cpat,
2324
crw,
@@ -97,6 +98,7 @@ def cli(log_level):
9798
pass
9899

99100

101+
cli.add_command(circpedia.cli)
100102
cli.add_command(context.cli)
101103
cli.add_command(cpat.cli)
102104
cli.add_command(crw.cli)
