Commit 9916033

Added README and LICENSE
1 parent 3907aa8 commit 9916033

3 files changed: 155 additions and 12 deletions

LICENSE

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
BSD 3-Clause License

Copyright (c) 2026, Howard Hughes Medical Institute

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
## Overview

This pipeline automates the extraction and processing of RNA structures from PDB, including:

- Fetching RNA structures within a specified date range
- Extracting metadata and ligand information (SMILES)
- Grouping and clustering sequences to remove redundancy
- Calculating structural metrics (structuredness)
- Filtering based on quality criteria
- Splitting datasets into training and testing sets
- Converting to Kaggle competition format

## Prerequisites

- [Pixi](https://pixi.sh/) package manager
- Linux environment (tested on Linux-64)

## Installation

### 1. Install Pixi

If you don't have Pixi installed, follow the instructions at [https://pixi.sh/](https://pixi.sh/):

```bash
curl -fsSL https://pixi.sh/install.sh | bash
```

### 2. Set up the environment

Clone this repository and navigate to the project directory:

```bash
cd jrc-rna-structure-pipeline
```

Install dependencies using Pixi:

```bash
pixi install
```

This will create a conda environment with all required dependencies including:

- Python packages: pandas, biopython, duckdb
- Bioinformatics tools: mmseqs2
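
A quick way to confirm that the environment contains the dependencies listed above is a small check like the following. This is only a sketch and not part of the repository: the import names are assumptions (biopython is imported as `Bio`, and MMseqs2 installs an executable named `mmseqs`), and `check_environment` is a hypothetical helper.

```python
import importlib.util
import shutil

# Assumed names: biopython is imported as "Bio"; MMseqs2 provides "mmseqs".
PYTHON_MODULES = ["pandas", "Bio", "duckdb"]
COMMAND_LINE_TOOLS = ["mmseqs"]


def check_environment(modules=PYTHON_MODULES, tools=COMMAND_LINE_TOOLS):
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


for name, found in check_environment().items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it with the Pixi environment active so that it inspects the environment Pixi created rather than the system Python.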

### 3. Activate the environment

```bash
pixi shell
```

## Usage

### Running the Kaggle Workflow

The easiest way to run the pipeline is to use one of the provided workflow scripts. The example below uses the workflow that compiled the data for the
[Stanford RNA 3D Folding Part 2](https://www.kaggle.com/competitions/stanford-rna-3d-folding-2) competition. Note: the workflow script is set up to be run from the output directory.

```bash
cd /path/to/working/directory
bash /path/to/jrc-rna-structure-pipeline/workflows/kaggle2026_1978-01-01_2025-12-17.sh
```
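
If the workflow needs to be launched from another script, the same "run from the output directory" convention can be reproduced with `subprocess`. The sketch below is an illustration only: `build_workflow_command` and `run_workflow` are hypothetical helpers, not part of the repository, and the only assumption about the script is that it lives under `workflows/` as in the command above.

```python
import subprocess
from pathlib import Path


def build_workflow_command(repo_dir, script_name, extra_args=()):
    """Build the argument list for running a workflow script with bash."""
    script = Path(repo_dir) / "workflows" / script_name
    return ["bash", str(script), *extra_args]


def run_workflow(repo_dir, script_name, output_dir, extra_args=()):
    """Run a workflow script with output_dir as the working directory."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    command = build_workflow_command(repo_dir, script_name, extra_args)
    # cwd=output_dir mirrors the `cd /path/to/working/directory` step;
    # check=True raises CalledProcessError if the script exits non-zero.
    return subprocess.run(command, cwd=output_dir, check=True)


# Example call (not executed here):
# run_workflow(
#     "/path/to/jrc-rna-structure-pipeline",
#     "kaggle2026_1978-01-01_2025-12-17.sh",
#     "/path/to/working/directory",
# )
```

Passing the working directory through `cwd` avoids changing the calling process's own directory, which keeps the helper safe to call more than once.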

This workflow:

- Fetches RNA structures from PDB (1978-01-01 to 2025-12-17)
- Processes and filters the data
- Creates train/test split at 2025-05-29
- Generates Kaggle-formatted output in `kaggle_jrc_v330/`
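
The train/test split at 2025-05-29 is a temporal split on release date. A minimal sketch of that idea follows. The field names `id` and `release_date` and the toy records are made up for illustration, and the boundary rule (entries released before the cutoff go to train, entries on or after it go to test) is an assumption, since it is not stated which side the cutoff date itself falls on.

```python
from datetime import date

# Cutoff date taken from the workflow description above.
CUTOFF = date(2025, 5, 29)


def temporal_split(records, cutoff=CUTOFF):
    """Split records into (train, test) lists by release date.

    Assumption: released before the cutoff -> train; on or after -> test.
    """
    train, test = [], []
    for record in records:
        released = date.fromisoformat(record["release_date"])
        (train if released < cutoff else test).append(record)
    return train, test


# Toy records with hypothetical IDs and dates, for illustration only.
records = [
    {"id": "example_a", "release_date": "2019-03-14"},
    {"id": "example_b", "release_date": "2025-05-28"},
    {"id": "example_c", "release_date": "2025-05-29"},
    {"id": "example_d", "release_date": "2025-11-02"},
]
train, test = temporal_split(records)
print("train:", [r["id"] for r in train])
print("test:", [r["id"] for r in test])
```

Splitting by date rather than at random means the test set contains only structures released after everything in the training set.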

### Workflow Options

The workflow script supports various options for controlling execution; individual steps can be skipped, or only specific groups of steps can be run:

```bash
# Show help and available options
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --help

# Dry run to see which steps will execute
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --dry-run

# Skip specific steps
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --skip-fetch --skip-cluster

# Run only specific groups
bash workflows/kaggle2026_1978-01-01_2025-12-17.sh --only-kaggle
```

### Pipeline Steps

The complete pipeline consists of these major steps:

1. **Fetch Data**: Download RNA structures from PDB within date range
2. **Extract Metadata**: Parse structure files and extract metadata
3. **Extract Bioassembly**: Process biological assemblies
4. **Extract SMILES**: Extract ligand information
5. **Group Sequences**: Remove 100% identical sequences
6. **Cluster with MMseqs2**: Cluster at various identity thresholds (30%, 50%, 70%, 90%)
7. **Calculate Structuredness**: Compute structural metrics for each chain
8. **Add Structuredness**: Merge structural metrics into metadata
9. **Filter Metadata**: Apply quality filters
10. **Select Representatives**: Choose representative structures from clusters
11. **Extract Coordinates**: Extract 3D coordinates for selected structures
12. **Split Train/Test**: Temporal split based on release date
13. **Convert to Kaggle Format**: Generate final output files
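
To make steps 5 and 10 more concrete, the sketch below groups chains that share an identical sequence and then keeps one chain per group. It illustrates the general idea only and is not the pipeline's implementation: the function names and chain IDs are hypothetical, and choosing the lexicographically smallest ID as the representative is an arbitrary stand-in for whatever selection criteria the pipeline applies.

```python
from collections import defaultdict


def group_identical_sequences(chains):
    """Map each distinct sequence to the list of chain IDs that carry it."""
    groups = defaultdict(list)
    for chain_id, sequence in chains:
        groups[sequence].append(chain_id)
    return dict(groups)


def pick_representatives(groups):
    """Keep one chain ID per group of identical sequences.

    Assumption: the smallest ID wins; the real selection rule may differ.
    """
    return {min(chain_ids): sequence for sequence, chain_ids in groups.items()}


# Toy input with made-up chain IDs and sequences.
chains = [
    ("toy_2_A", "GGCAU"),
    ("toy_1_A", "GGCAU"),
    ("toy_3_B", "ACGU"),
]
groups = group_identical_sequences(chains)
representatives = pick_representatives(groups)
print(representatives)
```

Exact-match grouping like this only removes 100% identical sequences; near-duplicates are what the MMseqs2 clustering step at lower identity thresholds is for.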

### Output Files

The pipeline generates several output files:

- `jrc_v330_metadata.csv`: Metadata with structuredness metrics
- `jrc_v330_metadata_filtered.csv`: Filtered high-quality structures
- `train_jrc_v330_bioassembly.csv`: Training set metadata
- `test_jrc_v330_bioassembly.csv`: Test set metadata
- `kaggle_jrc_v330/`: Kaggle competition format files
  - `train_sequences.csv`: Training sequences
  - `test_sequences.csv`: Test sequences
  - `validation_sequences.csv`: Validation sequences
  - `train_labels.csv`: Training coordinates
  - `test_labels.csv`: Test coordinates
  - `validation_labels.csv`: Validation coordinates
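
After a run, a short check such as the one below can report which of the files listed above are missing from the output directory. It is a sketch under one assumption: that the six sequence and label files sit inside `kaggle_jrc_v330/` while the other four sit at the top level. `missing_outputs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# File names are taken from the list above.
TOP_LEVEL_FILES = [
    "jrc_v330_metadata.csv",
    "jrc_v330_metadata_filtered.csv",
    "train_jrc_v330_bioassembly.csv",
    "test_jrc_v330_bioassembly.csv",
]
KAGGLE_DIR = "kaggle_jrc_v330"
# Assumed to be located inside KAGGLE_DIR.
KAGGLE_FILES = [
    "train_sequences.csv",
    "test_sequences.csv",
    "validation_sequences.csv",
    "train_labels.csv",
    "test_labels.csv",
    "validation_labels.csv",
]


def missing_outputs(output_dir):
    """Return the relative paths of expected output files that do not exist."""
    root = Path(output_dir)
    expected = [root / name for name in TOP_LEVEL_FILES]
    expected += [root / KAGGLE_DIR / name for name in KAGGLE_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


# Example call (not executed here):
# print(missing_outputs("/path/to/working/directory"))
```

An empty list means every expected file is present; it does not say anything about whether the contents are valid.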

## Workflow backwards compatibility

Currently, we do not maintain backwards compatibility between versions; i.e., a workflow set up for the current version of the repository may not execute correctly
with a newer version of the repository. We use versioned releases to track compatible versions.

readme.md

Lines changed: 0 additions & 12 deletions
This file was deleted.
