This pipeline automates the extraction and processing of RNA structures from PDB, including:
- Fetching RNA structures within a specified date range
- Extracting metadata and ligand information (SMILES)
- Grouping and clustering sequences to remove redundancy
- Calculating structural metrics (structuredness)
- Filtering based on quality criteria
- Splitting datasets into training and testing sets
- Converting to Kaggle competition format
Requirements:
- Pixi package manager
- Linux environment (tested on Linux-64)
If you don't have Pixi installed, follow the instructions at https://pixi.sh/:
curl -fsSL https://pixi.sh/install.sh | bash
Clone this repository and navigate to the project directory:
cd jrc-rna-structure-pipeline
Install dependencies using Pixi:
pixi install
This will create a conda environment with all required dependencies including:
- Python packages: pandas, biopython, duckdb
- Bioinformatics tools: mmseqs2
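After installation, a quick sanity check can confirm that the dependencies listed above are visible from the environment. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it assumes the usual import names (`Bio` for biopython) and that MMseqs2 is exposed as the `mmseqs` executable.

```python
import importlib.util
import shutil


def check_environment(modules, executables):
    """Return {label: bool} saying whether each module is importable
    and each executable is found on PATH."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns None when the executable is not on PATH.
        status[f"executable:{name}"] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    # Assumed import/executable names for the dependencies named in this README.
    report = check_environment(["pandas", "Bio", "duckdb"], ["mmseqs"])
    for label, ok in report.items():
        print(f"{label}: {'ok' if ok else 'MISSING'}")
```

Saved as, say, `check_env.py` (a hypothetical file name), it could be run inside the Pixi environment with `pixi run python check_env.py`.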
pixi shell
The easiest way to run the pipeline is to use one of the provided workflow scripts. Here is an example using the workflow that compiled the data for the Stanford RNA 3D Folding Part 2 competition. Note: the workflow script is set up to be run from within the output directory.
cd /path/to/working/directory
bash /path/to/jrc-rna-structure-pipeline/workflows/kaggle2026_1978-01-01_2025-12-17.sh
This workflow:
- Fetches RNA structures from PDB (1978-01-01 to 2025-12-17)
- Processes and filters the data
- Creates train/test split at 2025-05-29
- Generates Kaggle-formatted output in kaggle_jrc_v330/
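To illustrate what a train/test split at 2025-05-29 means, here is a minimal sketch of a temporal split by release date. It is not the pipeline's implementation: the record layout, the `release_date` field name, and the choice to place structures released exactly on the cutoff date in the test set are all assumptions made for the example.

```python
from datetime import date

# Cutoff date taken from the workflow description above.
CUTOFF = date(2025, 5, 29)


def temporal_split(records, cutoff=CUTOFF):
    """Split records into (train, test) by ISO-formatted release date.

    Assumption: records released strictly before the cutoff go to train,
    everything released on or after it goes to test.
    """
    train, test = [], []
    for rec in records:
        released = date.fromisoformat(rec["release_date"])
        (train if released < cutoff else test).append(rec)
    return train, test


# Placeholder records; the ids and dates are invented for illustration.
records = [
    {"id": "example_a", "release_date": "2019-03-14"},
    {"id": "example_b", "release_date": "2025-05-29"},
    {"id": "example_c", "release_date": "2025-11-02"},
]
train, test = temporal_split(records)
print([r["id"] for r in train], [r["id"] for r in test])
```

Splitting on time rather than at random keeps structures released after the cutoff out of the training set, which mirrors how a model would be evaluated on future structures.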
The workflow script supports various options for controlling execution and selecting which steps run:
# Show help and available options
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --help
# Dry run to see which steps will execute
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --dry-run
# Skip specific steps
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --skip-fetch --skip-cluster
# Run only specific groups
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --only-kaggle
The complete pipeline consists of these major steps:
- Fetch Data: Download RNA structures from PDB within date range
- Extract Metadata: Parse structure files and extract metadata
- Extract Bioassembly: Process biological assemblies
- Extract SMILES: Extract ligand information
- Group Sequences: Remove 100% identical sequences
- Cluster with MMseqs2: Cluster at various identity thresholds (30%, 50%, 70%, 90%)
- Calculate Structuredness: Compute structural metrics for each chain
- Add Structuredness: Merge structural metrics into metadata
- Filter Metadata: Apply quality filters
- Select Representatives: Choose representative structures from clusters
- Extract Coordinates: Extract 3D coordinates for selected structures
- Split Train/Test: Temporal split based on release date
- Convert to Kaggle Format: Generate final output files
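The grouping and representative-selection steps above can be pictured with a small sketch. This is an illustration only, not the pipeline's code: the chain records, the `sequence` and `resolution` field names, and the rule of keeping the best-resolution member of each group are assumptions. Clustering at lower identity thresholds is delegated to MMseqs2 in the pipeline and is not reproduced here.

```python
from collections import defaultdict


def group_identical(chains):
    """Group chain records whose sequences are 100% identical."""
    groups = defaultdict(list)
    for chain in chains:
        groups[chain["sequence"]].append(chain)
    return dict(groups)


def pick_representatives(groups, key):
    """Keep one member per group: the one with the smallest key value."""
    return [min(members, key=key) for members in groups.values()]


# Placeholder chains; ids, sequences, and resolutions are invented.
chains = [
    {"id": "chain_1", "sequence": "GGCUAGCC", "resolution": 2.8},
    {"id": "chain_2", "sequence": "GGCUAGCC", "resolution": 1.9},
    {"id": "chain_3", "sequence": "AUGCAUGC", "resolution": 3.1},
]
groups = group_identical(chains)
# Hypothetical criterion: lower resolution value means a better structure.
reps = pick_representatives(groups, key=lambda c: c["resolution"])
print(sorted(r["id"] for r in reps))
```

Passing the selection criterion as a `key` function keeps the grouping logic independent of whichever quality measure is actually used.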
The pipeline generates several output files:
- jrc_v330_metadata.csv: Metadata with structuredness metrics
- jrc_v330_metadata_filtered.csv: Filtered high-quality structures
- train_jrc_v330_bioassembly.csv: Training set metadata
- test_jrc_v330_bioassembly.csv: Test set metadata
- kaggle_jrc_v330/: Kaggle competition format files
  - train_sequences.csv: Training sequences
  - test_sequences.csv: Test sequences
  - validation_sequences.csv: Validation sequences
  - train_labels.csv: Training coordinates
  - test_labels.csv: Test coordinates
  - validation_labels.csv: Validation coordinates
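After a run, it can be useful to confirm that every expected output exists. The sketch below uses the file names listed above; the helper `missing_outputs` is hypothetical, and it assumes that the metadata files sit directly in the working directory while the sequence and label files sit inside `kaggle_jrc_v330/`.

```python
from pathlib import Path

# File names taken from the output list in this README.
TOP_LEVEL_FILES = [
    "jrc_v330_metadata.csv",
    "jrc_v330_metadata_filtered.csv",
    "train_jrc_v330_bioassembly.csv",
    "test_jrc_v330_bioassembly.csv",
]
KAGGLE_DIR = "kaggle_jrc_v330"
KAGGLE_FILES = [
    "train_sequences.csv",
    "test_sequences.csv",
    "validation_sequences.csv",
    "train_labels.csv",
    "test_labels.csv",
    "validation_labels.csv",
]


def missing_outputs(workdir):
    """Return relative paths of expected output files absent from workdir.

    Assumption: the directory layout described in the lead-in above.
    """
    root = Path(workdir)
    expected = [root / name for name in TOP_LEVEL_FILES]
    expected += [root / KAGGLE_DIR / name for name in KAGGLE_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    for path in missing_outputs("."):
        print(f"missing: {path}")
```

An empty result means all ten files were found; anything printed points to a step that was skipped or did not finish.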
Currently, we do not maintain backwards compatibility between versions. That is, a workflow set up for the current version of the repository may not execute correctly with a newer version. We use versioned releases and tags to track compatible versions.