gRNAde is trained on RNA structures from the PDB at ≤4 Å resolution, downloaded via RNASolo with a date cutoff of 31 October 2023.
All the data is available on HuggingFace: https://huggingface.co/chaitjo/gRNAde_datasets
The processed dataset can be downloaded from HuggingFace:
# Ensure you are in the base directory
cd ~/geometric-rna-design/
# Download processed data using HuggingFace CLI (or manually)
hf download chaitjo/gRNAde_datasets RNASolo_31102023_processed.pt --local-dir data/ --repo-type dataset
Each unique RNA sequence is available in the following format (most of the metadata is optional for using gRNAde):
{
'sequence' # RNA sequence as a string
'id_list' # list of PDB IDs
'coords_list' # list of structures, i.e. 3D coordinates of shape ``(length, 27, 3)``
'sec_struct_list' # list of secondary structure strings in dotbracket notation
'sasa_list' # list of per-nucleotide SASA values
'rfam_list' # list of RFAM family IDs
'eq_class_list' # list of non-redundant equivalence class IDs
'type_list' # list of structure types (RNA-only, RNA-protein complex, etc.)
'rmsds_list' # dictionary of pairwise C4' RMSD values between structures
'cluster_seqid0.8' # cluster ID of sequence identity clustering at 80%
'cluster_structsim0.45' # cluster ID of structure similarity clustering at 45%
}
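As a minimal sketch of how a record in this format might be inspected, the helpers below load the processed file and summarize one entry. The function names `load_processed` and `summarize_entry` are hypothetical and not part of gRNAde. The sketch assumes the file holds either a dict of records or a list of records, and that `id_list` and `coords_list` are aligned one-to-one; check both against your copy of the data.

```python
from collections import Counter


def load_processed(path="data/RNASolo_31102023_processed.pt"):
    """Load the processed dataset and return a list of per-sequence records."""
    import torch  # imported lazily so summarize_entry works without PyTorch

    # weights_only=False because the file stores general Python objects,
    # not only tensors. Only use this on files you trust.
    data = torch.load(path, weights_only=False)
    # Assumption: the top-level object is a dict of records or a list of records.
    return list(data.values()) if isinstance(data, dict) else list(data)


def summarize_entry(entry):
    """Return basic statistics for one record after checking its lists line up."""
    seq = entry["sequence"]
    coords_list = entry["coords_list"]
    # Assumption: one PDB ID per structure.
    if len(entry["id_list"]) != len(coords_list):
        raise ValueError("id_list and coords_list differ in length")
    for coords in coords_list:
        # The first dimension of each (length, 27, 3) array is the sequence length.
        if len(coords) != len(seq):
            raise ValueError("structure length does not match sequence length")
    return {
        "length": len(seq),
        "num_structures": len(coords_list),
        "composition": dict(Counter(seq)),
    }
```

With the processed file in place, `summarize_entry(load_processed()[0])` would report the length, number of structures, and nucleotide composition of the first record. Entries with more than one structure are the ones relevant to multi-state design.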
To download all the 3D structures from RNASolo with the 31 October 2023 cutoff as raw PDB files using HuggingFace:
# Ensure you are in the base directory
cd ~/geometric-rna-design/
# Download raw data using HuggingFace CLI (or manually)
hf download chaitjo/gRNAde_datasets RNASolo_31102023_raw.tar.gz --local-dir data/ --repo-type dataset
tar -zxvf RNASolo_31102023_raw.tar.gz
This data was downloaded from the RNASolo website on 31 October 2023: 3D (PDB) + all molecules + all members + res. ≤4.0 - https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_4_0__3_300.zip (currently not working)
To process the raw PDB files into an ML-ready format (same as RNASolo_31102023_processed.pt), install the optional dependencies (US-align, CD-HIT) and run the script data/process_data.py.
The current codebase houses code for the stable, experimentally validated version of gRNAde (v1.0.0) used in "Generative inverse design of RNA structure and function with gRNAde".
The code used for the ICLR 2025 Spotlight paper is available as release v0.3.2: https://github.com/chaitjo/geometric-rna-design/releases/tag/v0.3.2
We continue to provide the splits used in our ICLR 2025 experiments in the data/ directory:
- Single-state split from Das et al., 2010: data/das_split.pt (called the Das split for compatibility with older code)
- Multi-state split of structurally flexible RNAs: data/structsim_split_v2.pt (Note that we have deprecated an older version of the multi-state split)
The exact PDB IDs used for each of the splits are also available in the data/split_ids/ directory, in case you are using a different version of RNAsolo after the 31 October 2023 cutoff.
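If you do work from a newer RNAsolo release, one way to rebuild a split is to match the published IDs against each record's `id_list`. The sketch below is illustrative only: `indices_for_pdb_ids` is a hypothetical helper, the IDs are passed in as an in-memory collection because the file layout inside data/split_ids/ is not described here, and case-insensitive exact matching of IDs is an assumption.

```python
def indices_for_pdb_ids(records, pdb_ids):
    """Return indices of records with at least one structure ID in pdb_ids.

    records: list of dicts in the per-sequence format, each with an 'id_list'.
    pdb_ids: iterable of ID strings taken from a split ID file.
    """
    # Assumption: IDs match exactly once case is ignored.
    wanted = {str(pid).lower() for pid in pdb_ids}
    return [
        i
        for i, rec in enumerate(records)
        if any(str(pid).lower() in wanted for pid in rec["id_list"])
    ]


def check_disjoint(*index_lists):
    """Raise ValueError if any record index appears in more than one split."""
    seen = set()
    for indices in index_lists:
        overlap = seen.intersection(indices)
        if overlap:
            raise ValueError(f"records shared between splits: {sorted(overlap)}")
        seen.update(indices)
```

Because one record can carry several PDB IDs, a record may match the ID lists of two splits. Running `check_disjoint(train_idx, val_idx, test_idx)` after matching catches that case before training.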