Commit 46050b7

Merge pull request #50 from seqeralabs/seqera-ai/20251122-040714-fix-proteinmpnn-execution-count
Fix ProteinMPNN execution count and clarify target sequence extraction
2 parents 8167676 + 8d1c71f commit 46050b7

2 files changed

+168
-11
lines changed

FIXES_SUMMARY.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
# Fixes Summary: ProteinMPNN Execution Count & EXTRACT_TARGET_SEQUENCES

## Date: 2025-11-22

---

## Issue 1: ProteinMPNN Running 5 Times Instead of 2

### Problem
ProteinMPNN was executing 5 times even though `params.budget = 2` (meaning only 2 budget designs should exist).

### Root Cause
The input channel for `CONVERT_CIF_TO_PDB` was incorrectly configured:

**BEFORE (INCORRECT):**
```groovy
ch_structures_for_conversion = BOLTZGEN_RUN.out.results
    .map { meta, results_dir ->
        def budget_designs_dir = file("${results_dir}/intermediate_designs_inverse_folded")
        [meta, budget_designs_dir]
    }
```

This channel was derived from the **entire results directory**, which contains:
- `final_1_designs/` (1 structure)
- `intermediate_ranked_10_designs/` (10 structures)
- `intermediate_designs_inverse_folded/` (2 structures, the budget designs)
- `intermediate_designs/` (10 structures)
- etc.

As a result, it was converting structures from **multiple subdirectories**, not just the budget designs.
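
The general failure mode can be sketched in a few lines of Python. This is a toy illustration only: the directory names and per-directory counts are taken from the list above, the `design_N.cif` file names are hypothetical, and how `CONVERT_CIF_TO_PDB` actually enumerates its inputs is an assumption. The totals are not meant to reproduce the 5 executions observed.

```python
import tempfile
from pathlib import Path

# Subdirectory names and structure counts copied from the list above.
LAYOUT = {
    "final_1_designs": 1,
    "intermediate_ranked_10_designs": 10,
    "intermediate_designs_inverse_folded": 2,
    "intermediate_designs": 10,
}


def build_results_dir(root: Path) -> Path:
    """Create empty placeholder .cif files (hypothetical names) per subdirectory."""
    for subdir, count in LAYOUT.items():
        (root / subdir).mkdir(parents=True)
        for i in range(count):
            (root / subdir / f"design_{i}.cif").touch()
    return root


def count_cifs_recursive(results_dir: Path) -> int:
    """Count every .cif file anywhere under the results directory."""
    return sum(1 for _ in results_dir.rglob("*.cif"))


def count_cifs_in(results_dir: Path, subdir: str) -> int:
    """Count only the .cif files directly inside one named subdirectory."""
    return sum(1 for _ in (results_dir / subdir).glob("*.cif"))


with tempfile.TemporaryDirectory() as tmp:
    results = build_results_dir(Path(tmp))
    all_count = count_cifs_recursive(results)
    budget_count = count_cifs_in(results, "intermediate_designs_inverse_folded")
    print(all_count, budget_count)
```

A recursive search picks up every structure under the results directory, while a lookup restricted to the budget directory sees only the two budget designs.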

### Solution
Changed `CONVERT_CIF_TO_PDB` to use the dedicated `budget_design_cifs` output from `BOLTZGEN_RUN`:

**AFTER (CORRECT):**
```groovy
CONVERT_CIF_TO_PDB(BOLTZGEN_RUN.out.budget_design_cifs)
```

The `budget_design_cifs` output contains **only** the budget designs (2 structures when `budget = 2`).

### Expected Behavior Now
- ProteinMPNN runs **exactly 2 times** (once per budget design)
- Each execution processes one PDB structure
- Downstream Protenix refolding inherits the same parallelization

---

## Issue 2: EXTRACT_TARGET_SEQUENCES - Purpose and Naming

### What Does EXTRACT_TARGET_SEQUENCES Do?

This process extracts the **target protein sequence** (binding partner) from the original Boltzgen-designed structures.

### Why Do We Need It?

**Context:** When ProteinMPNN generates new sequences for the binder protein, we want to refold those sequences with Protenix to verify they maintain the correct structure.

**Problem:** Protenix needs to know which chain is the **target** (the protein you're designing a binder against) so it can:
1. Keep the target chain in its correct position during refolding
2. Properly model the binder-target interaction
3. Generate accurate confidence scores for the complex

**Solution:** Extract the target sequence from the original Boltzgen structures and pass it to Protenix along with the new ProteinMPNN sequences.

### What Is the Target Sequence?

In a binder design workflow:
- **Binder chain**: The small protein you're designing (gets optimized by ProteinMPNN)
- **Target chain**: The larger protein you want to bind to (stays fixed, extracted by this process)
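
The binder/target distinction above can be illustrated with a small Python sketch. It assumes a two-chain complex and picks the longer chain as the target, following the "small binder, larger target" description. This heuristic and the function name `split_binder_and_target` are hypothetical; the selection logic in `assets/extract_target_sequence.py` is not shown in this commit and may differ.

```python
def split_binder_and_target(chains: dict[str, str]) -> tuple[str, str]:
    """Return (binder_chain_id, target_chain_id) for a two-chain complex.

    Assumption (not confirmed by the pipeline): the longer chain is the target.
    """
    if len(chains) != 2:
        raise ValueError("expected exactly two chains")
    (id_a, seq_a), (id_b, seq_b) = chains.items()
    if len(seq_a) == len(seq_b):
        raise ValueError("chains have equal length; cannot infer the target by size")
    # The shorter chain is treated as the binder, the longer one as the target.
    return (id_a, id_b) if len(seq_a) < len(seq_b) else (id_b, id_a)


# Hypothetical toy sequences, for illustration only.
binder_id, target_id = split_binder_and_target(
    {"A": "ACDEFG", "B": "ACDEFGHIKLMNPQRS"}
)
print(binder_id, target_id)
```

A length-based rule fails for equal-length chains, so the sketch raises instead of guessing; a real pipeline would more likely rely on explicit chain annotations from the design specification.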

### Process Flow

```
Boltzgen Structure (CIF)
        ↓
EXTRACT_TARGET_SEQUENCES
        ↓
Target Sequence (TXT file)
        ↓
        ├→ Protenix Input 1: ProteinMPNN optimized sequence (binder)
        └→ Protenix Input 2: Target sequence (from this extraction)
        ↓
Protenix Refolds Complex
```
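
The pairing step in this flow can be sketched as follows: each ProteinMPNN-optimized binder sequence is combined with the same extracted target sequence to form one refolding job. The function name and the dictionary keys (`name`, `binder`, `target`) are hypothetical and are not Protenix's actual input schema.

```python
def build_refold_inputs(
    design_id: str, binder_sequences: list[str], target_sequence: str
) -> list[dict[str, str]]:
    """Pair every optimized binder sequence with the single fixed target sequence."""
    return [
        {
            "name": f"{design_id}_seq{i}",  # hypothetical job naming scheme
            "binder": binder,
            "target": target_sequence,
        }
        for i, binder in enumerate(binder_sequences, start=1)
    ]


# Toy sequences for illustration only.
jobs = build_refold_inputs("design_1", ["ACDE", "ACDF", "ACDG"], "MKLVWT")
for job in jobs:
    print(job["name"], job["binder"], job["target"])
```

The target sequence is extracted once per design and reused across all binder variants, which is why the extraction runs separately from the per-sequence refolding.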

### Naming and No Collisions

- **Process name**: `EXTRACT_TARGET_SEQUENCES` (unique, no collision)
- **Module file**: `modules/local/extract_target_sequences.nf`
- **Script file**: `assets/extract_target_sequence.py`
- **Output files**: `${meta.id}_target_sequences.txt` (unique per design)

No naming collisions exist. This process has a distinct name and purpose, separate from:
- `PROTEINMPNN_OPTIMIZE` (optimizes binder sequences)
- `PROTENIX_REFOLD` (refolds optimized binders with target)
- Other `EXTRACT_*` modules (none exist)

### Example Output

For design `insulin_binder`:
```
insulin_binder_target_sequences.txt:
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
```

This is the insulin sequence (target) that will be passed to Protenix along with the ProteinMPNN-optimized binder sequences.
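
A quick sanity check on such an output file might look like the sketch below. It assumes the file holds one plain sequence per line with no header, which the example above suggests but this commit does not confirm; the function name `parse_target_sequences` is hypothetical.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_target_sequences(text: str) -> list[str]:
    """Return the non-empty lines as uppercase sequences.

    Raises ValueError if a line contains anything other than the
    20 standard one-letter amino acid codes.
    """
    sequences = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        sequence = line.strip().upper()
        if not sequence:
            continue  # skip blank lines
        invalid = set(sequence) - STANDARD_AMINO_ACIDS
        if invalid:
            raise ValueError(
                f"line {line_number}: unexpected characters {sorted(invalid)}"
            )
        sequences.append(sequence)
    return sequences


# Short toy input rather than a real target sequence.
parsed = parse_target_sequences("MALWMRLLPL\n\nfvnqhlcgsh\n")
print(parsed)
```

Running a check like this before refolding catches truncated or mis-parsed extractions early, before they reach Protenix.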

---

## Summary of Changes

### Files Modified
1. `workflows/protein_design.nf`
   - Fixed the `CONVERT_CIF_TO_PDB` input (which feeds ProteinMPNN) to use `budget_design_cifs`
   - Added comprehensive documentation for the `EXTRACT_TARGET_SEQUENCES` step

### Expected Impact
- ProteinMPNN executions: ~~5~~ → **2** (correct)
- Protenix executions: proportional reduction (2 × `mpnn_num_seq_per_target`)
- Clearer understanding of target sequence extraction purpose
- No naming collisions or process conflicts

### Testing Recommendation
Run with minimal parameters to verify:
```bash
nextflow run main.nf \
    --designs test_designs/ \
    --budget 2 \
    --run_proteinmpnn true \
    --run_protenix_refold true \
    --mpnn_num_seq_per_target 3
```

Expected process counts:
- BOLTZGEN_RUN: 1 (per design YAML)
- CONVERT_CIF_TO_PDB: 1 (processes 2 budget CIFs)
- PROTEINMPNN_OPTIMIZE: 2 (once per budget design)
- EXTRACT_TARGET_SEQUENCES: 1 (once per design)
- PROTENIX_REFOLD: 6 (2 budget designs × 3 sequences each)
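
The arithmetic behind these counts can be written down as a small helper. The function name `expected_task_counts` is hypothetical, and scaling every count linearly with the number of design YAMLs is an assumption drawn from the "per design" wording above.

```python
def expected_task_counts(
    num_designs: int, budget: int, mpnn_num_seq_per_target: int
) -> dict[str, int]:
    """Expected number of task executions per process.

    Assumes every count scales linearly with the number of design YAMLs.
    """
    return {
        "BOLTZGEN_RUN": num_designs,
        "CONVERT_CIF_TO_PDB": num_designs,
        "PROTEINMPNN_OPTIMIZE": num_designs * budget,
        "EXTRACT_TARGET_SEQUENCES": num_designs,
        "PROTENIX_REFOLD": num_designs * budget * mpnn_num_seq_per_target,
    }


# Parameters from the test command above: 1 design, budget 2, 3 sequences each.
counts = expected_task_counts(num_designs=1, budget=2, mpnn_num_seq_per_target=3)
for process, count in counts.items():
    print(f"{process}: {count}")
```

Comparing these numbers against the process counts reported in the Nextflow run summary gives a quick regression check for the fix.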

---

## Documentation Added

Enhanced inline documentation in the workflow to explain:
1. Why we use `budget_design_cifs` rather than the `results` directory
2. The purpose of target sequence extraction
3. How the target sequence is used by Protenix
4. The expected parallelization pattern

This should prevent future confusion about:
- Which structures feed into ProteinMPNN
- Why we extract target sequences separately
- How the binder-target complex is modeled

workflows/protein_design.nf

Lines changed: 12 additions & 11 deletions
@@ -37,15 +37,9 @@ workflow PROTEIN_DESIGN {
     // ========================================================================
     if (params.run_proteinmpnn) {
         // Step 1: Convert CIF structures to PDB format (ProteinMPNN requires PDB)
-        // Prepare input channel with structures from Boltzgen budget designs (intermediate_designs_inverse_folded)
-        // These are the same structures that IPSAE and PRODIGY analyze
-        ch_structures_for_conversion = BOLTZGEN_RUN.out.results
-            .map { meta, results_dir ->
-                def budget_designs_dir = file("${results_dir}/intermediate_designs_inverse_folded")
-                [meta, budget_designs_dir]
-            }
-
-        CONVERT_CIF_TO_PDB(ch_structures_for_conversion)
+        // Use budget_design_cifs which contains ONLY the budget designs (e.g., 2 structures if budget=2)
+        // NOT all designs from results directory
+        CONVERT_CIF_TO_PDB(BOLTZGEN_RUN.out.budget_design_cifs)
 
         // Step 2: Parallelize ProteinMPNN - run separately for each budget design
         // Use flatMap to create individual tasks per PDB file (one per budget iteration)
@@ -72,13 +66,20 @@ workflow PROTEIN_DESIGN {
         ch_final_designs_for_analysis = PROTEINMPNN_OPTIMIZE.out.optimized_designs
 
         // ====================================================================
-        // Step 3: Protenix refolding if enabled
+        // Step 3: Extract target sequences for Protenix refolding
+        // ====================================================================
+        // PURPOSE: Extract the TARGET sequence (binding partner) from Boltzgen structures
+        // WHY: Protenix needs to know which chain is the target (to keep fixed) when
+        //      refolding ProteinMPNN-optimized binder sequences
+        // WHAT: Reads original Boltzgen CIF files and extracts the target chain sequence
+        // OUTPUT: Plain text file with target sequence (one per design)
         // ====================================================================
         if (params.run_protenix_refold) {
             // Prepare extraction script as a channel
             ch_extract_script = Channel.fromPath("${projectDir}/assets/extract_target_sequence.py", checkIfExists: true)
 
-            // Extract target sequences from Boltzgen structures
+            // Extract target sequences from Boltzgen final structures
+            // Use final_cifs which contains one representative structure per design
             ch_boltzgen_structures = BOLTZGEN_RUN.out.final_cifs
             EXTRACT_TARGET_SEQUENCES(ch_boltzgen_structures, ch_extract_script)
 