Commit 4140962

Merge pull request #62 from seqeralabs/seqera-ai/20251127-201522-fix-boltz2-parallelization
Fix Boltz2 parallelization to process all budget designs
2 parents f267e39 + 4685b64 commit 4140962

5 files changed

+534
-4
lines changed


CHANGES_SUMMARY.md

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Summary of Parallelization Fixes

## Overview
Fixed the parallelization issue where Boltz2 refolding processed sequences from only one budget design instead of from all budget designs in parallel.

## Changes Made

### 1. workflows/protein_design.nf (lines 108-120)
**Changed**: `join()` → `combine(by: 0)`

**Before:**
```groovy
.join(
    EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq ->
        [meta.id, seq]
    }
)
```

**After:**
```groovy
.combine(
    EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq ->
        [meta.id, seq]
    },
    by: 0 // Combine by parent_id (index 0)
)
```

**Why**: `join()` matches items one-to-one per key, so when several items share a key only the first is paired and the rest are dropped. `combine(by: 0)` pairs ALL items that share a key, so every ProteinMPNN sequence from every budget design is paired with the target sequence.

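
The difference between the two pairing behaviours can be illustrated outside Nextflow. The following is a minimal plain-Python model of the behaviour described above, not the Nextflow implementation; `join_by_key`, `combine_by_key`, and the sample tuples are hypothetical names chosen for illustration.

```python
from collections import defaultdict, deque

def join_by_key(left, right):
    """Model of one-to-one pairing: each right item is used at most once per key."""
    pending = defaultdict(deque)
    for key, *rest in right:
        pending[key].append(rest)
    out = []
    for key, *rest in left:
        if pending[key]:
            # Consume the matching right item; later left items with the
            # same key find nothing left and are dropped.
            out.append([key, *rest, *pending[key].popleft()])
    return out

def combine_by_key(left, right):
    """Model of keyed combine: every left item is paired with every right item sharing its key."""
    by_key = defaultdict(list)
    for key, *rest in right:
        by_key[key].append(rest)
    return [[key, *rest, *r] for key, *rest in left for r in by_key[key]]

# Two budget designs share one parent_id; there is a single target sequence.
left = [
    ["2vsm_protein_binder", "rank_1", "rank_1.fa"],
    ["2vsm_protein_binder", "rank_2", "rank_2.fa"],
]
right = [["2vsm_protein_binder", "target_seq.txt"]]

joined = join_by_key(left, right)
combined = combine_by_key(left, right)
print(len(joined), len(combined))
```

In this model the one-to-one pairing keeps only the `rank_1` entry, while the keyed combine keeps both, which mirrors the before/after behaviour of the workflow change.
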
### 2. modules/local/boltz2_refold.nf (around line 117)
**Added**: Skip first sequence in ProteinMPNN FASTA files

**Before:**
```python
print(f"Found {len(sequences)} sequences in {fasta_file}")

# Create Boltz-2 YAML for each sequence
for idx, (header, binder_seq) in enumerate(sequences):
```

**After:**
```python
print(f"Found {len(sequences)} sequences in {fasta_file}")

# Skip the first sequence (it's always the original sequence from Boltzgen)
# We only want to refold the NEW sequences generated by ProteinMPNN
sequences_to_process = sequences[1:] if len(sequences) > 1 else []

if not sequences_to_process:
    print(f"⚠ Warning: Only found 1 sequence (original), no new MPNN sequences to refold")
    continue

print(f"Processing {len(sequences_to_process)} new MPNN sequences (skipping original)")

# Create Boltz-2 YAML for each NEW sequence (skip first one)
for idx, (header, binder_seq) in enumerate(sequences_to_process):
```

**Why**: ProteinMPNN always includes the original Boltzgen sequence as the first entry in each FASTA file. Since we already have the Boltzgen structure, we don't need to refold it again.

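
As a self-contained illustration of the skip-first rule, here is a minimal sketch. It assumes a simple FASTA layout; `read_fasta`, `new_mpnn_sequences`, and the example headers are hypothetical and are not the parser used in `boltz2_refold.nf`.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def new_mpnn_sequences(records):
    """Drop the first record, assumed to be the original Boltzgen sequence."""
    return records[1:]

# Made-up example: one original entry followed by two redesigned sequences.
example = ">original\nMKT\n>sample=1\nMRT\n>sample=2\nMKS\n"
records = read_fasta(example)
to_refold = new_mpnn_sequences(records)
print(f"{len(records)} sequences found, {len(to_refold)} to refold")
```

A file that contains only the original entry yields an empty list, which corresponds to the warning branch in the module code above.
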
## Documentation Added

1. **PARALLELIZATION_FIX.md**: Comprehensive explanation of the problem, root cause, solution, and testing recommendations

2. **WORKFLOW_FLOW_DIAGRAM.md**: Visual diagram showing the complete pipeline flow with all parallelization points

3. **CHANGES_SUMMARY.md**: This file - quick reference for what changed

## Expected Behavior After Fix

### Example: budget=2, mpnn_num_seq_per_target=8

**Step 1: Boltzgen**
- Generates 5 designs, filters to top 2
- Output: `rank_1.cif`, `rank_2.cif`

**Step 2: ProteinMPNN** (2 parallel executions)
- Process `rank_1.pdb` → 8 sequences (1 original + 7 new)
- Process `rank_2.pdb` → 8 sequences (1 original + 7 new)
- Total: 16 FASTA files

**Step 3: Extract Target Sequences** (1 execution)
- Extracts target chain from first Boltzgen CIF
- Output: `target_seq.txt`

**Step 4: Boltz2 Refolding** (14 parallel executions) ← FIXED!
- From rank_1: 7 new sequences (skip original)
- From rank_2: 7 new sequences (skip original)
- Total: 14 Boltz2 predictions

**Step 5: Analysis** (16 parallel executions each)
- ipSAE: 2 Boltzgen + 14 Boltz2 = 16 scores
- Prodigy: 2 Boltzgen + 14 Boltz2 = 16 scores
- Foldseek: 2 Boltzgen + 14 Boltz2 = 16 searches

**Step 6: Consolidation** (1 execution)
- Combines all metrics into one table
- Output: 16 rows (2 Boltzgen + 14 Boltz2)

## Impact

### Before Fix:
- ❌ Only sequences from first budget design were refolded
- ❌ Example: 2 Boltzgen + 7 Boltz2 = 9 total structures
- ❌ Missing half the expected data

### After Fix:
- ✅ ALL sequences from ALL budget designs are refolded
- ✅ Example: 2 Boltzgen + 14 Boltz2 = 16 total structures
- ✅ Complete parallelization as intended

## Testing Checklist

- [ ] Test with budget=1 (verify ProteinMPNN + Boltz2 work)
- [ ] Test with budget=2 (verify all designs processed)
- [ ] Test with budget=5 (verify scales correctly)
- [ ] Check logs for "Processing X new MPNN sequences (skipping original)"
- [ ] Verify final metrics table has expected row count:
  - Expected rows = budget + [budget × (mpnn_num_seq_per_target - 1)]
  - Example: budget=2, mpnn_num_seq_per_target=8 → 2 + (2 × 7) = 16 rows
- [ ] Verify Boltz2 skips first sequence in each FASTA
- [ ] Verify all budget designs contribute to final results

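
The row-count formula from the checklist can be evaluated for the three budgets listed above. This is a small sketch for convenience; the function name `expected_rows` is hypothetical, and `mpnn_num_seq_per_target=8` is taken from the example in this file.

```python
def expected_rows(budget, mpnn_num_seq_per_target):
    """Rows in the final table: one per Boltzgen design plus one per new MPNN sequence."""
    return budget + budget * (mpnn_num_seq_per_target - 1)

# Budgets from the testing checklist, with 8 sequences per target.
rows_by_budget = {budget: expected_rows(budget, 8) for budget in (1, 2, 5)}
for budget, rows in rows_by_budget.items():
    print(f"budget={budget}: expect {rows} rows")
```
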
## Files Modified

1. `workflows/protein_design.nf` - Channel operations fix
2. `modules/local/boltz2_refold.nf` - Skip first sequence logic

## Files Added

1. `PARALLELIZATION_FIX.md` - Detailed technical explanation
2. `WORKFLOW_FLOW_DIAGRAM.md` - Visual flow diagram
3. `CHANGES_SUMMARY.md` - This file

## Questions?

See `PARALLELIZATION_FIX.md` for detailed explanations and `WORKFLOW_FLOW_DIAGRAM.md` for visual representation of the complete workflow.

PARALLELIZATION_FIX.md

Lines changed: 173 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,173 @@
# Parallelization Fix for nf-proteindesign

## Problem Summary

The pipeline was not correctly parallelizing Boltz2 refolding across all ProteinMPNN-generated sequences from all budget designs.

### Original Behavior:
1. ✅ Boltzgen runs once per samplesheet row
2. ✅ Takes N best designs based on budget parameter (e.g., budget=2 → 2 designs)
3. ✅ ProteinMPNN runs once for EACH budget design (parallel execution)
4. ❌ **Boltz2 only refolded sequences from ONE budget design** (not all)
5. ✅ ipSAE and Prodigy run on structures (but missing some due to Boltz2 issue)

### Desired Behavior:
1. ✅ Run Boltzgen on each row of samplesheet in parallel
2. ✅ Take the N best designs based on budget number (high quality, filtered designs)
3. ✅ Run ProteinMPNN on these sequences to generate X new sequences per design
4. **Run Boltz2 in parallel to fold each ProteinMPNN sequence from ALL budget designs**
5. ✅ Skip the first sequence in each ProteinMPNN FASTA (it's the original Boltzgen sequence)
6. ✅ Calculate ipSAE and Prodigy on original Boltzgen + all ProteinMPNN refolded structures
7. ✅ Combine all metrics into one comprehensive table

## Root Cause

The issue was in the channel joining logic in `workflows/protein_design.nf`:

### Original Code (Lines 108-120):
```groovy
ch_boltz2_per_sequence = PROTEINMPNN_OPTIMIZE.out.sequences
    .flatMap { ... }
    .map { meta, fasta ->
        [meta.parent_id, meta, fasta]
    }
    .join( // ❌ PROBLEM: join only matches FIRST item with same key!
        EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq ->
            [meta.id, seq]
        }
    )
    .map { parent_id, meta, fasta, target_seq ->
        [meta, fasta, target_seq]
    }
```

### Why This Failed:
- **Multiple ProteinMPNN outputs** from different budget designs:
  - `2vsm_protein_binder_rank_1` (parent_id: "2vsm_protein_binder")
  - `2vsm_protein_binder_rank_2` (parent_id: "2vsm_protein_binder")
- **One EXTRACT_TARGET_SEQUENCES output**:
  - `2vsm_protein_binder` (id: "2vsm_protein_binder")
- **join() behavior**: When multiple items have the same key, join only matches the FIRST one and drops the rest!

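
The key mismatch described above can be spotted by counting keys on each side before pairing. The snippet below is a hypothetical debugging aid in plain Python, not part of the pipeline; `keys_losing_items` is an illustrative name, and the key lists reproduce the `2vsm_protein_binder` example.

```python
from collections import Counter

def keys_losing_items(left_keys, right_keys):
    """Return, per key, how many left items a one-to-one pairing would leave unmatched."""
    left, right = Counter(left_keys), Counter(right_keys)
    return {key: left[key] - right[key] for key in left if left[key] > right[key]}

# parent_id values of the ProteinMPNN outputs (rank_1 and rank_2) ...
mpnn_keys = ["2vsm_protein_binder", "2vsm_protein_binder"]
# ... versus the single id emitted by target-sequence extraction.
target_keys = ["2vsm_protein_binder"]

lost = keys_losing_items(mpnn_keys, target_keys)
print(lost)
```

A non-empty result means a one-to-one pairing would lose items for those keys, which is the situation that caused the `rank_2` sequences to go missing.
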
## Solution

### Fix #1: Use `combine` Instead of `join`

Changed from `join()` to `combine(by: 0)` to ensure ALL ProteinMPNN sequences are paired with the target sequence:

```groovy
ch_boltz2_per_sequence = PROTEINMPNN_OPTIMIZE.out.sequences
    .flatMap { ... }
    .map { meta, fasta ->
        [meta.parent_id, meta, fasta]
    }
    .combine( // ✅ FIXED: combine pairs ALL items with same key!
        EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq ->
            [meta.id, seq]
        },
        by: 0 // Combine by parent_id (index 0)
    )
    .map { parent_id, meta, fasta, target_seq ->
        [meta, fasta, target_seq]
    }
```

**Key Difference:**
- `join()`: One-to-one matching (first match only)
- `combine(by: key)`: All-to-all matching for items with same key

### Fix #2: Skip First Sequence in ProteinMPNN FASTA

Added logic in `modules/local/boltz2_refold.nf` to skip the first sequence:

```python
# Skip the first sequence (it's always the original sequence from Boltzgen)
# We only want to refold the NEW sequences generated by ProteinMPNN
sequences_to_process = sequences[1:] if len(sequences) > 1 else []

if not sequences_to_process:
    print(f"⚠ Warning: Only found 1 sequence (original), no new MPNN sequences to refold")
    continue

print(f"Processing {len(sequences_to_process)} new MPNN sequences (skipping original)")
```

**Why This Matters:**
- ProteinMPNN FASTA files always include the original sequence as the first entry
- We already have the Boltzgen structure for this sequence
- No need to refold it again with Boltz2
- Only refold the NEW ProteinMPNN-optimized sequences

## Expected Flow After Fix

### Example with budget=2, mpnn_num_seq_per_target=8:

1. **Boltzgen**: Generates 2 designs (rank_1, rank_2)
   - `2vsm_protein_binder_output/intermediate_designs_inverse_folded/rank_1.cif`
   - `2vsm_protein_binder_output/intermediate_designs_inverse_folded/rank_2.cif`

2. **ProteinMPNN**: Runs on EACH design (2 parallel executions)
   - Processes `rank_1.pdb` → generates 8 sequences (1 original + 7 new)
   - Processes `rank_2.pdb` → generates 8 sequences (1 original + 7 new)
   - Total: 16 FASTA files (8 per design)

3. **Extract Target Sequences**: Runs once
   - Extracts target sequence from first Boltzgen CIF
   - Output: One target sequence file (same for all designs)

4. **Boltz2 Refolding**: Runs on ALL new sequences (14 parallel executions)
   - From rank_1: 7 new sequences × 1 = 7 Boltz2 runs
   - From rank_2: 7 new sequences × 1 = 7 Boltz2 runs
   - Total: 14 Boltz2 predictions (skipping 2 original sequences)

5. **ipSAE**: Runs on all structures
   - 2 original Boltzgen structures (rank_1, rank_2)
   - 14 Boltz2 refolded structures
   - Total: 16 ipSAE calculations

6. **Prodigy**: Runs on all structures
   - 2 original Boltzgen structures
   - 14 Boltz2 refolded structures
   - Total: 16 Prodigy predictions

7. **Consolidation**: Combines all metrics into one table
   - 16 rows (2 Boltzgen + 14 Boltz2)
   - Columns: design_name, source, pLDDT, ipSAE, prodigy_affinity, etc.

## Testing Recommendations

1. **Test with budget=1**: Verify ProteinMPNN + Boltz2 work with single design
2. **Test with budget=2**: Verify all sequences from both designs are refolded
3. **Check logs**: Ensure "Processing X new MPNN sequences (skipping original)" appears
4. **Verify counts**:
   - ProteinMPNN sequences = budget × mpnn_num_seq_per_target
   - Boltz2 predictions = budget × (mpnn_num_seq_per_target - 1)
   - Total structures = budget + [budget × (mpnn_num_seq_per_target - 1)]

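
The count check can be automated against the consolidated table. The sketch below rests on several assumptions not confirmed by this document: that the table is a CSV, that its `source` column takes the values `boltzgen` and `boltz2`, and that the helper names `check_metrics_table` and `make_table` are free to use. The design names in the sample data are made up.

```python
import csv
import io
from collections import Counter

def check_metrics_table(csv_text, budget, mpnn_num_seq_per_target):
    """Compare per-source row counts with the expected counts; return a list of mismatches."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    found = Counter(row["source"] for row in rows)
    expected = {
        "boltzgen": budget,  # assumed label for original designs
        "boltz2": budget * (mpnn_num_seq_per_target - 1),  # assumed label for refolds
    }
    return [
        f"{source}: expected {count}, found {found.get(source, 0)}"
        for source, count in expected.items()
        if found.get(source, 0) != count
    ]

def make_table(n_boltzgen, n_boltz2):
    """Build a tiny example table with made-up design names."""
    lines = ["design_name,source"]
    lines += [f"rank_{i + 1},boltzgen" for i in range(n_boltzgen)]
    lines += [f"mpnn_{i + 1},boltz2" for i in range(n_boltz2)]
    return "\n".join(lines) + "\n"

before_fix = check_metrics_table(make_table(2, 7), budget=2, mpnn_num_seq_per_target=8)
after_fix = check_metrics_table(make_table(2, 14), budget=2, mpnn_num_seq_per_target=8)
print(before_fix, after_fix)
```

With the pre-fix counts (2 Boltzgen + 7 Boltz2) the check reports a shortfall for the refolded structures; with 14 refolds it reports nothing.
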
## Files Modified

1. `workflows/protein_design.nf` (lines 108-120)
   - Changed `join()` to `combine(by: 0)`

2. `modules/local/boltz2_refold.nf` (lines ~114-120)
   - Added logic to skip first sequence in FASTA files

## Channel Operation Comparison

### `join()` behavior:
```
Channel A: [key1, dataA1], [key1, dataA2]
Channel B: [key1, dataB]
Result:    [key1, dataA1, dataB] // Only FIRST match!
           [dataA2 is DROPPED]
```

### `combine(by: 0)` behavior:
```
Channel A: [key1, dataA1], [key1, dataA2]
Channel B: [key1, dataB]
Result:    [key1, dataA1, dataB] // ALL matches!
           [key1, dataA2, dataB]
```

This is exactly what we need - every ProteinMPNN sequence paired with the target sequence!
