ssiddhantsharma/RFdiffusion_to_Genie2_Convertor

RFDiffusion to Genie2 Converter

A Python tool to convert RFDiffusion-style motif specifications to Genie2 format for protein design. This tool is particularly useful for preparing input files for both Genie2's motif scaffolding and SALAD's multi-motif scaffolding.

Installation

git clone https://github.com/ssiddhantsharma/RFdiffusion_to_Genie2_Convertor.git
cd RFdiffusion_to_Genie2_Convertor
chmod +x rfd2genie.py

Basic Usage

Converting a Single PDB File

python rfd2genie.py --pdb_file your_protein.pdb --input "A1-80[M1]/30/[M2]B81-100" --output output_dir --verbose

RFDiffusion Format Syntax

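The input format follows this pattern: [N-term extension]/[Chain][Start]-[End][Motif Tag]/[Linker]/[Chain][Start]-[End][Motif Tag]/[C-term extension]. The terminal extensions and linkers are optional, and the examples below write the motif tag on either side of its segment (A1-80[M1] or [M2]B81-100).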

Examples

  • Multiple chains with linker: A1-80[M1]/30/[M2]B81-100
  • Multiple motifs on same chain: A1-29[M1]/30/[M2]A39-50/40/[M3]A2-49
  • Heterodimer (no linker): A1-30[M1]/B89-95[M2]
  • Heterotrimer: A1-30[M1]/B10-40[M2]/C5-25[M3]
  • With N-terminal extension: 20/A1-80[M1]/30/[M2]B81-100
  • With C-terminal extension: A1-80[M1]/30/[M2]B81-100/40
  • With both N and C-terminal extensions: 15/A1-80[M1]/30/[M2]B81-100/25
  • Variable length extensions/linkers: 10-20/A1-80[M1]/25-35/[M2]B81-100/15-25

Format Components

  • Chain: Single letter (e.g., A, B)
  • Residue Range: Start-End (e.g., 1-80)
  • Motif Tag (optional): [M1], [M2], etc.
  • Linker/Extension: Numeric value representing amino acid length or a range (e.g., 30 or 30-40)
  • N-terminal Extension: Number at beginning of format string followed by "/"
  • C-terminal Extension: Number at end of format string preceded by "/"
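
To make the components above concrete, here is a minimal parsing sketch. It is not the parser used by rfd2genie.py; the names `parse_spec`, `SEGMENT` and `LENGTH` are hypothetical, and the sketch assumes a single number means minimum = maximum. How the script itself maps a single number onto the minimum and maximum columns is not reproduced here.

```python
import re

# A motif segment: optional tag before or after, one chain letter, start-end.
SEGMENT = re.compile(
    r"^(?:\[(?P<pre>M\d+)\])?"
    r"(?P<chain>[A-Za-z])(?P<start>\d+)-(?P<end>\d+)"
    r"(?:\[(?P<post>M\d+)\])?$"
)
# A linker or terminal extension: a single length or a min-max range.
LENGTH = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_spec(spec):
    """Split a format string on "/" and classify each token."""
    parts = []
    for token in spec.split("/"):
        m = LENGTH.match(token)
        if m:
            lo = int(m.group(1))
            # Assumption: a bare number is treated as a fixed length.
            hi = int(m.group(2)) if m.group(2) else lo
            parts.append(("scaffold", lo, hi))
            continue
        m = SEGMENT.match(token)
        if m is None:
            raise ValueError(f"unrecognised token: {token!r}")
        tag = m.group("pre") or m.group("post")  # tag is optional
        parts.append(
            ("motif", m.group("chain"), int(m.group("start")), int(m.group("end")), tag)
        )
    return parts


if __name__ == "__main__":
    for part in parse_spec("10-20/A1-80[M1]/25-35/[M2]B81-100/15-25"):
        print(part)
```

A leading or trailing "scaffold" entry corresponds to an N-terminal or C-terminal extension; one between two motifs is a linker.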

N-terminal Extensions

Add a number at the beginning of your format string:

python rfd2genie.py --pdb_file your_protein.pdb --input "20/A1-80[M1]/30/[M2]B81-100" --output output_dir --verbose

This will generate a 20-residue N-terminal extension before the first motif.

C-terminal Extensions

Add a number at the end of your format string:

python rfd2genie.py --pdb_file your_protein.pdb --input "A1-80[M1]/30/[M2]B81-100/40" --output output_dir --verbose

This will generate a 40-residue C-terminal extension after the last motif.

Combined Terminal Extensions

Use both N and C terminal extensions:

python rfd2genie.py --pdb_file your_protein.pdb --input "15/A1-80[M1]/30/[M2]B81-100/25" --output output_dir --verbose

This creates a 15-residue N-terminal extension and a 25-residue C-terminal extension.

Variable Length Extensions

For more flexible designs, specify ranges:

python rfd2genie.py --pdb_file your_protein.pdb --input "10-20/A1-80[M1]/30/[M2]B81-100/15-25" --output output_dir --verbose

This allows the N-terminal extension to be 10-20 residues and the C-terminal extension to be 15-25 residues.

Format Conversion: RFDiffusion to Genie2

The Genie2 format requires exact column positioning with proper justification (right/left):

| Specification | Column | Data | Justification | Data Type |
|---|---|---|---|---|
| Motif segment | 1-16 | "REMARK 999 INPUT" | - | string |
| | 19 | Chain index of motif segment | - | string |
| | 20-23 | Starting residue index | right | int |
| | 24-27 | Ending residue index | right | int |
| | 29 | Motif group | - | string |
| Scaffold segment | 1-16 | "REMARK 999 INPUT" | - | string |
| | 20-23 | Minimum length | right | int |
| | 24-27 | Maximum length | right | int |
| Minimum sequence length | 1-31 | "REMARK 999 MINIMUM TOTAL LENGTH" | - | string |
| | 38-40 | Minimum sequence length | left | int |
| Maximum sequence length | 1-31 | "REMARK 999 MAXIMUM TOTAL LENGTH" | - | string |
| | 38-40 | Maximum sequence length | left | int |

For example, the RFDiffusion format A1-80[M1]/30/[M2]B81-100 converts to:

REMARK 999 NAME   protein_name_motifs
REMARK 999 PDB    protein_name
REMARK 999 INPUT  A   1  80 A
REMARK 999 INPUT     30  35
REMARK 999 INPUT  B  81 100 B
REMARK 999 MINIMUM TOTAL LENGTH      116
REMARK 999 MAXIMUM TOTAL LENGTH      215

With N and C terminal extensions (15/A1-80[M1]/30/[M2]B81-100/25):

REMARK 999 NAME   protein_name_motifs
REMARK 999 PDB    protein_name
REMARK 999 INPUT     15  20
REMARK 999 INPUT  A   1  80 A
REMARK 999 INPUT     30  35
REMARK 999 INPUT  B  81 100 B
REMARK 999 INPUT     25  30
REMARK 999 MINIMUM TOTAL LENGTH      156
REMARK 999 MAXIMUM TOTAL LENGTH      265
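
The column layout in the table can be reproduced with fixed-width string formatting. The sketch below is illustrative only: the function names are hypothetical and this is not the code in rfd2genie.py. It covers only how values are placed in columns, not how the total lengths are computed.

```python
def motif_line(chain, start, end, group):
    """Motif segment: chain in col 19, start in cols 20-23, end in 24-27, group in col 29."""
    # Pad the record name to 18 characters so the chain lands in column 19.
    return f"{'REMARK 999 INPUT':<18}{chain}{start:>4}{end:>4} {group}"


def scaffold_line(min_len, max_len):
    """Scaffold segment: column 19 stays blank, min in cols 20-23, max in 24-27."""
    return f"{'REMARK 999 INPUT':<19}{min_len:>4}{max_len:>4}"


def total_length_line(kind, length):
    """kind is "MINIMUM" or "MAXIMUM"; the value is left-justified in cols 38-40."""
    prefix = f"REMARK 999 {kind} TOTAL LENGTH"
    # Pad to 37 characters so the value starts in column 38.
    return f"{prefix:<37}{length:<3}"


if __name__ == "__main__":
    print(motif_line("A", 1, 80, "A"))
    print(scaffold_line(30, 35))
    print(motif_line("B", 81, 100, "B"))
    print(total_length_line("MINIMUM", 116))
    print(total_length_line("MAXIMUM", 215))
```

Padding the record name to a fixed width, rather than typing literal spaces, keeps the column positions easy to check against the table.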

Important Genie2 Format Considerations

The converter addresses several aspects of the Genie2 format:

  1. Column Precision: Generates output that follows Genie2's column-specific formatting, including right or left justification of each field. SALAD parses these lines by column position, so exact alignment is required for SALAD compatibility.

  2. Group Assignment for Multi-Motif Scaffolding: Assigns groups based on chain identifiers to maintain proper spatial relationships. This is particularly useful for heterodimer, heterotrimer, or other multi-chain complex designs where you want to maintain separate entities:

  • Motifs from the same chain are assigned to the same group
  • Motifs from different chains are assigned to different groups
  • First chain typically gets group "A", second chain gets "B", and so on
  3. PDB Name Integration: Places the PDB name only in the REMARK 999 PDB line, not in motif segment definitions.

  4. Residue Validation: Verifies that specified residues exist in the PDB structure.

  5. Residue Reordering: Automatically reorders residues in the PDB file to match the order specified in the RFDiffusion format string, ensuring compatibility with Genie2's requirements.
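
The group assignment rule from item 2 (same chain, same group; new chain, next letter) can be sketched as follows. The function name `assign_groups` is hypothetical and the script's own logic may differ in detail, for example in how it handles more than 26 chains.

```python
from string import ascii_uppercase


def assign_groups(chains):
    """Map each motif's chain ID to a group letter in order of first appearance."""
    groups = {}
    for chain in chains:
        if chain not in groups:
            if len(groups) >= len(ascii_uppercase):
                raise ValueError("more than 26 distinct chains")
            # First chain seen gets "A", the second "B", and so on.
            groups[chain] = ascii_uppercase[len(groups)]
    return [groups[chain] for chain in chains]


if __name__ == "__main__":
    # Chains of the motifs in A1-29[M1]/30/[M2]A39-50 and A1-30[M1]/B10-40[M2]/C5-25[M3]
    print(assign_groups(["A", "A"]))
    print(assign_groups(["A", "B", "C"]))
```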

Scaffold Lengths

Scaffold lengths are directly controlled through the input format:

  • A single number (e.g., 30) will generate a scaffold of exactly that length
  • A range (e.g., 20-40) will allow SALAD/Genie2 to select a length within that range
  • The scaffold lengths are determined by your input format, not automatically calculated

Chain Identity Preservation

When working with multiple chains, the script ensures:

  • Each chain is assigned a unique group identifier
  • Chains are kept separate in the output structure
  • No unwanted linkers are added between different chains unless specified

Processing Multiple Files with Different Specifications

Create a CSV file (e.g., specs.csv):

pdb_file,rf_format
protein1.pdb,A1-80[M1]/30/[M2]B81-100
protein2.pdb,A1-29[M1]/30/[M2]A39-50
protein3.pdb,15/A1-30[M1]/25/B10-40[M2]/20

Then run:

python rfd2genie.py --pdb_dir pdb_directory --csv specs.csv --output output_dir --verbose
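
Before launching a batch run it can help to check that the CSV has the expected header and to see which file each row resolves to. The sketch below is a standalone pre-check, not part of rfd2genie.py; `load_specs` is a hypothetical helper, and it assumes that the pdb_file column holds names relative to the directory passed as --pdb_dir.

```python
import csv
import io
from pathlib import Path


def load_specs(csv_text, pdb_dir):
    """Return (pdb_path, rf_format) pairs from CSV text with the two-column header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"pdb_file", "rf_format"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    jobs = []
    for row in reader:
        # Assumption: file names are relative to pdb_dir.
        jobs.append((Path(pdb_dir) / row["pdb_file"].strip(), row["rf_format"].strip()))
    return jobs


if __name__ == "__main__":
    text = (
        "pdb_file,rf_format\n"
        "protein1.pdb,A1-80[M1]/30/[M2]B81-100\n"
        "protein3.pdb,15/A1-30[M1]/25/B10-40[M2]/20\n"
    )
    for path, spec in load_specs(text, "pdb_directory"):
        print(path, spec)
```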

Running with Genie2

After converting your files, you can directly use them with Genie2 for motif scaffolding:

  1. Place your converted PDB files in the appropriate Genie2 data directory (e.g., data/multimotifs for multi-motif scaffolding)

  2. Run Genie2's scaffolding algorithm:

    python genie/sample_scaffold.py --name base --epoch 30 --scale 0.4 --outdir results/my_designs --datadir my_converted_pdbs --num_samples 100 

    Key parameters:

    • --name: Model name (default: "base" for Genie2's base model)
    • --epoch: Model epoch (e.g., 30)
    • --scale: Sampling noise scale (between 0 and 1)
    • --outdir: Output directory for results
    • --datadir: Directory containing your converted PDB files
    • --num_samples: Number of designs to generate per motif (default: 100)
  3. Your results will be stored in the specified output directory.

Multi-Motif Scaffolding Example

For multi-motif scaffolding with Genie2, run:

# First convert your PDB files
python rfd2genie.py --pdb_dir my_motifs --input "A1-80[M1]/30/[M2]B81-100" --output genie2_pdbs --verbose

# Then run Genie2 on the converted files
python genie/sample_scaffold.py --name base --epoch 30 --scale 0.4 --outdir results/my_designs --datadir genie2_pdbs --num_samples 1000

Running with SALAD

The converter applies several fixes aimed at SALAD compatibility:

  1. Exact Formatting: The script strictly adheres to the column-specific formatting requirements of Genie2, with precise spacing, justification, and positioning. This is crucial because SALAD uses exact column positions for parsing, and even a small formatting deviation can cause hours-long processing or failure.

  2. Full Chain Coverage: Ensures motif definitions cover entire chains rather than just segments, which resolves the common "boolean index did not match shape of indexed array" error in SALAD. This fix prevents SALAD from running indefinitely or for extremely long periods.

  3. Consistent Residue Numbering: Residues are renumbered to be continuous from 1 to N for each chain, which is what SALAD expects. This eliminates array size mismatches.

  4. Chain and Group Consistency: Ensures chain IDs in motif definitions match those in the PDB structure, and that motif groups are assigned correctly.
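
As an illustration of item 3, the sketch below renumbers residues to run continuously from 1 within each chain. It relies on the fixed columns of PDB ATOM/HETATM records (chain ID in column 22, residue number in columns 23-26, insertion code in column 27). It is a simplified stand-in rather than the script's implementation: the name `renumber` is hypothetical, and the sketch does not update the REMARK 999 INPUT lines to match.

```python
def renumber(pdb_lines):
    """Renumber residues 1..N per chain, in order of first appearance."""
    out = []
    last_number = {}  # chain -> highest new residue number assigned so far
    new_number = {}   # (chain, old residue number + insertion code) -> new number
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            chain = line[21]
            key = (chain, line[22:27])
            if key not in new_number:
                last_number[chain] = last_number.get(chain, 0) + 1
                new_number[key] = last_number[chain]
            # Write the new number right-justified in cols 23-26 and clear
            # the insertion code in col 27.
            line = f"{line[:22]}{new_number[key]:>4} {line[27:]}"
        out.append(line)
    return out


if __name__ == "__main__":
    def atom(chain, resseq):
        # Minimal ATOM record prefix, enough to show the residue-number columns.
        return f"ATOM  {1:>5}  CA  ALA {chain}{resseq:>4}    "

    for line in renumber([atom("A", 10), atom("A", 12), atom("B", 81)]):
        print(line)
```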

SALAD Multi-Motif Scaffolding Example:

  1. Convert your PDB file to Genie2 format:

    python rfd2genie.py --pdb_file your_protein.pdb --input "15/A1-80[M1]/30/[M2]B81-100/25" --output genie2_pdbs --verbose
  2. Use the converted file with SALAD:

    python salad/training/eval_motif_benchmark.py \
        --config multimotif_vp \
        --params params/multimotif_vp-200k.jax \
        --out_path designed_proteins/ \
        --num_steps 500 --out_steps 400 --prev_threshold 0.8 \
        --num_designs 10 --timescale_pos "cosine(t)" \
        --template genie2_pdbs/your_protein_genie2.pdb
