GitHub - JaneliaSciComp/jrc-rna-structure-pipeline: Workflow to extract, process, and curate RNA 3D structural data from PDB for structure prediction tasks

Overview

This pipeline automates the extraction and processing of RNA structures from PDB, including:

Fetching RNA structures within a specified date range
Extracting metadata and ligand information (SMILES)
Grouping and clustering sequences to remove redundancy
Calculating structural metrics (structuredness)
Filtering based on quality criteria
Splitting datasets into training and testing sets
Converting to Kaggle competition format

Prerequisites

Pixi package manager
Linux environment (tested on Linux-64)

Installation

1. Install Pixi

If you don't have Pixi installed, follow the instructions at https://pixi.sh/:

curl -fsSL https://pixi.sh/install.sh | bash

2. Set up the environment

Clone this repository and navigate to the project directory:

git clone https://github.com/JaneliaSciComp/jrc-rna-structure-pipeline.git
cd jrc-rna-structure-pipeline

Install dependencies using Pixi:

pixi install

This will create a conda environment with all required dependencies including:

Python packages: pandas, biopython, duckdb
Bioinformatics tools: mmseqs2

3. Activate the environment

pixi shell

Usage

Running the Kaggle Workflow

The easiest way to run the pipeline is to use one of the provided workflow scripts. Here's an example using the workflow that compiled the data for the Stanford RNA 3D Folding Part 2 competition. Note: the workflow script is set up to be run from the output directory.

cd /path/to/working/directory
bash /path/to/jrc-rna-structure-pipeline/workflows/kaggle2026_1978-01-01_2025-12-17.sh

This workflow:

Fetches RNA structures from PDB (1978-01-01 to 2025-12-17)
Processes and filters the data
Creates train/test split at 2025-05-29
Generates Kaggle-formatted output in kaggle_jrc_v330/
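
The temporal split described above can be illustrated with a minimal sketch. It is not the pipeline's own code: the record layout, the `release_date` field name, and the target IDs are hypothetical, and whether a structure released exactly on the cutoff date lands in the training or test set is an assumption made here.

```python
from datetime import date

# Cutoff taken from the workflow description above.
CUTOFF = date(2025, 5, 29)


def split_by_release_date(records, cutoff=CUTOFF):
    """Return (train, test) lists split on release date.

    Assumption: records released strictly before the cutoff go to train,
    and records released on or after it go to test.
    """
    train, test = [], []
    for rec in records:
        released = date.fromisoformat(rec["release_date"])
        (train if released < cutoff else test).append(rec)
    return train, test


# Hypothetical records, for illustration only.
records = [
    {"target_id": "example_1", "release_date": "2019-03-14"},
    {"target_id": "example_2", "release_date": "2025-05-29"},
    {"target_id": "example_3", "release_date": "2025-11-02"},
]

train, test = split_by_release_date(records)
print([r["target_id"] for r in train])
print([r["target_id"] for r in test])
```

Splitting on release date rather than at random keeps structures published after the cutoff out of the training set, which is the usual motivation for a temporal split in structure prediction benchmarks.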

Workflow Options

The workflow script supports various options for controlling which steps execute:

# Show help and available options
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --help

# Dry run to see which steps will execute
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --dry-run

# Skip specific steps
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --skip-fetch --skip-cluster

# Run only specific groups
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --only-kaggle
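
If the workflow needs to be launched from Python, for example from a notebook, the sketch below shows one way to do it. The helper names `workflow_command` and `run_workflow` are hypothetical and not part of this repository. Only the flags shown in the examples above are accepted; the script may support others, which `--help` will list.

```python
import subprocess
from pathlib import Path

# Flags that appear in the examples above; the script may accept more.
KNOWN_FLAGS = {"--help", "--dry-run", "--skip-fetch", "--skip-cluster", "--only-kaggle"}


def workflow_command(script, *flags):
    """Build the argument list for a workflow run, rejecting unknown flags."""
    unknown = [flag for flag in flags if flag not in KNOWN_FLAGS]
    if unknown:
        raise ValueError(f"flags not shown in the README: {unknown}")
    return ["bash", str(script), *flags]


def run_workflow(script, workdir, *flags):
    """Run the workflow with `workdir` as the current directory.

    The script is expected to run from the output directory, so `script`
    should be an absolute path.
    """
    return subprocess.run(workflow_command(script, *flags), cwd=Path(workdir), check=True)


print(workflow_command("/path/to/workflow.sh", "--dry-run"))
```

Calling `run_workflow(script, workdir, "--dry-run")` first is a cheap way to confirm which steps would execute before starting a full run.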

Pipeline Steps

The complete pipeline consists of these major steps:

Fetch Data: Download RNA structures from PDB within date range
Extract Metadata: Parse structure files and extract metadata
Extract Bioassembly: Process biological assemblies
Extract SMILES: Extract ligand information
Group Sequences: Remove 100% identical sequences
Cluster with MMseqs2: Cluster at various identity thresholds (30%, 50%, 70%, 90%)
Calculate Structuredness: Compute structural metrics for each chain
Add Structuredness: Merge structural metrics into metadata
Filter Metadata: Apply quality filters
Select Representatives: Choose representative structures from clusters
Extract Coordinates: Extract 3D coordinates for selected structures
Split Train/Test: Temporal split based on release date
Convert to Kaggle Format: Generate final output files
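
To make the "Group Sequences" and "Select Representatives" steps concrete, here is a minimal sketch of grouping chains with 100% identical sequences and keeping one chain per group. It is an illustration, not the pipeline's implementation: the chain IDs and sequences are invented, and picking the first chain seen as the representative is an assumption. The actual selection criteria may differ.

```python
from collections import defaultdict


def group_identical(chains):
    """Map each distinct sequence to the IDs of all chains that carry it."""
    groups = defaultdict(list)
    for chain_id, sequence in chains:
        # Normalize case so "ggcu" and "GGCU" count as identical.
        groups[sequence.upper()].append(chain_id)
    return dict(groups)


# Hypothetical (chain ID, sequence) pairs, for illustration only.
chains = [
    ("chain_1", "GGCUAGCC"),
    ("chain_2", "ggcuagcc"),
    ("chain_3", "AUGGCUA"),
]

groups = group_identical(chains)

# Assumption: the first chain seen represents its group.
representatives = {seq: ids[0] for seq, ids in groups.items()}

print(groups)
print(representatives)
```

Exact-match grouping only removes verbatim duplicates. Near-duplicates are handled separately by the MMseqs2 clustering step at the identity thresholds listed above.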

Output Files

The pipeline generates several output files:

jrc_v330_metadata.csv: Metadata with structuredness metrics
jrc_v330_metadata_filtered.csv: Filtered high-quality structures
train_jrc_v330_bioassembly.csv: Training set metadata
test_jrc_v330_bioassembly.csv: Test set metadata
kaggle_jrc_v330/: Kaggle competition format files
- train_sequences.csv: Training sequences
- test_sequences.csv: Test sequences
- validation_sequences.csv: Validation sequences
- train_labels.csv: Training coordinates
- test_labels.csv: Test coordinates
- validation_labels.csv: Validation coordinates
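
After a run, a quick check that every file listed above exists can catch a step that was skipped or failed. The sketch below assumes all outputs are written directly under the working directory, which this README does not state explicitly. The helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# File names copied from the list above.
TOP_LEVEL_FILES = [
    "jrc_v330_metadata.csv",
    "jrc_v330_metadata_filtered.csv",
    "train_jrc_v330_bioassembly.csv",
    "test_jrc_v330_bioassembly.csv",
]
KAGGLE_DIR = "kaggle_jrc_v330"
KAGGLE_FILES = [
    "train_sequences.csv",
    "test_sequences.csv",
    "validation_sequences.csv",
    "train_labels.csv",
    "test_labels.csv",
    "validation_labels.csv",
]


def missing_outputs(workdir):
    """Return the expected output files that are absent, relative to workdir."""
    workdir = Path(workdir)
    expected = [workdir / name for name in TOP_LEVEL_FILES]
    expected += [workdir / KAGGLE_DIR / name for name in KAGGLE_FILES]
    return [p.relative_to(workdir).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    for name in missing_outputs("."):
        print("missing:", name)
```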

Workflow backwards compatibility

Currently, we do not maintain backwards compatibility between versions; i.e., a workflow set up for the current version of the repository may not execute correctly with a newer version. We use versioned releases and tags to track compatible versions.

