## Overview

This pipeline automates the extraction and processing of RNA structures from PDB, including:

- Fetching RNA structures within a specified date range
- Extracting metadata and ligand information (SMILES)
- Grouping and clustering sequences to remove redundancy
- Calculating structural metrics (structuredness)
- Filtering based on quality criteria
- Splitting datasets into training and testing sets
- Converting to Kaggle competition format

## Prerequisites

- [Pixi](https://pixi.sh/) package manager
- Linux environment (tested on Linux-64)

## Installation

### 1. Install Pixi

If you don't have Pixi installed, follow the instructions at [https://pixi.sh/](https://pixi.sh/):

```bash
curl -fsSL https://pixi.sh/install.sh | bash
```

### 2. Set up the environment

Clone this repository and navigate to the project directory:

```bash
cd jrc-rna-structure-pipeline
```

Install dependencies using Pixi:

```bash
pixi install
```

This will create a conda environment with all required dependencies, including:

- Python packages: pandas, biopython, duckdb
- Bioinformatics tools: mmseqs2

### 3. Activate the environment

```bash
pixi shell
```

## Usage

### Running the Kaggle Workflow

The easiest way to run the pipeline is to use one of the provided workflow scripts. Here's an example using the workflow that compiled the data for the
[Stanford RNA 3D Folding Part 2](https://www.kaggle.com/competitions/stanford-rna-3d-folding-2) competition. Note: the workflow script is set up to be run from the output directory.

```bash
cd /path/to/working/directory
bash /path/to/jrc-rna-structure-pipeline/workflows/kaggle2026_1978-01-01_2025-12-17.sh
```

This workflow:

- Fetches RNA structures from PDB (1978-01-01 to 2025-12-17)
- Processes and filters the data
- Creates train/test split at 2025-05-29
- Generates Kaggle-formatted output in `kaggle_jrc_v330/`
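
For context, date-range fetches like this can be expressed as queries against the public RCSB Search API. Below is a minimal sketch of such a payload; the attribute names follow the public RCSB search schema, but the pipeline's actual query is an internal detail and may add further filters:

```python
import json


def build_date_range_query(start: str, end: str) -> dict:
    """Sketch of an RCSB Search API payload for RNA-containing PDB
    entries released within [start, end]. Illustrative only -- the
    pipeline's real query may differ."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                {
                    # Restrict to the initial release date range
                    "type": "terminal",
                    "service": "text",
                    "parameters": {
                        "attribute": "rcsb_accession_info.initial_release_date",
                        "operator": "range",
                        "value": {"from": start, "to": end},
                    },
                },
                {
                    # Require at least one RNA polymer entity
                    "type": "terminal",
                    "service": "text",
                    "parameters": {
                        "attribute": "rcsb_entry_info.polymer_entity_count_RNA",
                        "operator": "greater",
                        "value": 0,
                    },
                },
            ],
        },
        "return_type": "entry",
        "request_options": {"return_all_hits": True},
    }


print(json.dumps(build_date_range_query("1978-01-01", "2025-12-17"))[:40])
```

The payload would be POSTed to the RCSB search endpoint; here it is only constructed, not sent.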

### Workflow Options

The workflow script supports several options for controlling execution:

```bash
# Show help and available options
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --help

# Dry run to see which steps will execute
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --dry-run

# Skip specific steps
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --skip-fetch --skip-cluster

# Run only specific groups
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --only-kaggle
```

### Pipeline Steps

The complete pipeline consists of these major steps:

1. **Fetch Data**: Download RNA structures from PDB within date range
2. **Extract Metadata**: Parse structure files and extract metadata
3. **Extract Bioassembly**: Process biological assemblies
4. **Extract SMILES**: Extract ligand information
5. **Group Sequences**: Remove 100% identical sequences
6. **Cluster with MMseqs2**: Cluster at various identity thresholds (30%, 50%, 70%, 90%)
7. **Calculate Structuredness**: Compute structural metrics for each chain
8. **Add Structuredness**: Merge structural metrics into metadata
9. **Filter Metadata**: Apply quality filters
10. **Select Representatives**: Choose representative structures from clusters
11. **Extract Coordinates**: Extract 3D coordinates for selected structures
12. **Split Train/Test**: Temporal split based on release date
13. **Convert to Kaggle Format**: Generate final output files
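
As an illustration of the temporal split in step 12, the logic amounts to partitioning rows on a release-date cutoff. The sketch below assumes hypothetical column names (`target_id`, `release_date`); the pipeline's metadata may use different headers:

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, cutoff: str, date_col: str = "release_date"):
    """Split rows by release date: structures released on or before the
    cutoff go to train, later ones to test."""
    dates = pd.to_datetime(df[date_col])
    cut = pd.Timestamp(cutoff)
    train = df[dates <= cut].reset_index(drop=True)
    test = df[dates > cut].reset_index(drop=True)
    return train, test


# Toy example with hypothetical metadata rows
meta = pd.DataFrame({
    "target_id": ["1ABC_A", "2DEF_B", "9XYZ_C"],
    "release_date": ["2001-03-14", "2025-05-29", "2025-11-02"],
})
train, test = temporal_split(meta, "2025-05-29")
print(len(train), len(test))  # 2 train, 1 test
```

A temporal split (rather than a random one) keeps test structures strictly newer than the training data, which mirrors how the Kaggle competition evaluates predictions on post-cutoff releases.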

### Output Files

The pipeline generates several output files:

- `jrc_v330_metadata.csv`: Metadata with structuredness metrics
- `jrc_v330_metadata_filtered.csv`: Filtered high-quality structures
- `train_jrc_v330_bioassembly.csv`: Training set metadata
- `test_jrc_v330_bioassembly.csv`: Test set metadata
- `kaggle_jrc_v330/`: Kaggle competition format files
  - `train_sequences.csv`: Training sequences
  - `test_sequences.csv`: Test sequences
  - `validation_sequences.csv`: Validation sequences
  - `train_labels.csv`: Training coordinates
  - `test_labels.csv`: Test coordinates
  - `validation_labels.csv`: Validation coordinates

## Workflow backwards compatibility

Currently we do not maintain backwards compatibility between different versions, i.e. a workflow set up for the current version of the repository may not execute correctly
with a newer version of the repository. We use versioned releases to track compatible versions.