seqeralabs
diff --git a/CHANGES_SUMMARY.md b/CHANGES_SUMMARY.md
Lines changed: 139 additions & 0 deletions
diff --git a/PARALLELIZATION_FIX.md b/PARALLELIZATION_FIX.md
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,139 @@
+# Summary of Parallelization Fixes
+
+## Overview
+Fixed the parallelization issue where Boltz2 refolding was only processing sequences from one budget design instead of all budget designs in parallel.
+
+## Changes Made
+
+### 1. workflows/protein_design.nf (lines 108-120)
+**Changed**: `join()` → `combine(by: 0)`
+
+**Before:**
+```groovy
+.join(
+    EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq -> 
+        [meta.id, seq]
+    }
+)
+```
+
+**After:**
+```groovy
+.combine(
+    EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq -> 
+        [meta.id, seq]
+    },
+    by: 0  // Combine by parent_id (index 0)
+)
+```
+
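+**Why**: `join()` matches items one-to-one per key. Because there is only one target-sequence item per `parent_id`, only the first ProteinMPNN item with that key was paired and every later item with the same key was dropped. `combine(by: 0)` pairs every item that shares the key, so the ProteinMPNN sequences from all budget designs are paired with the target sequence.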
+
+### 2. modules/local/boltz2_refold.nf (around line 117)
+**Added**: Skip first sequence in ProteinMPNN FASTA files
+
+**Before:**
+```python
+print(f"Found {len(sequences)} sequences in {fasta_file}")
+
+# Create Boltz-2 YAML for each sequence
+for idx, (header, binder_seq) in enumerate(sequences):
+```
+
+**After:**
+```python
+print(f"Found {len(sequences)} sequences in {fasta_file}")
+
+# Skip the first sequence (it's always the original sequence from Boltzgen)
+# We only want to refold the NEW sequences generated by ProteinMPNN
+sequences_to_process = sequences[1:] if len(sequences) > 1 else []
+
+if not sequences_to_process:
+    print(f"⚠  Warning: Only found 1 sequence (original), no new MPNN sequences to refold")
+    continue
+
+print(f"Processing {len(sequences_to_process)} new MPNN sequences (skipping original)")
+
+# Create Boltz-2 YAML for each NEW sequence (skip first one)
+for idx, (header, binder_seq) in enumerate(sequences_to_process):
+```
+
+**Why**: ProteinMPNN always includes the original Boltzgen sequence as the first entry in each FASTA file. Since we already have the Boltzgen structure, we don't need to refold it again.
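The skip logic above can be tried in isolation with a minimal sketch. The `read_fasta` and `new_mpnn_sequences` helpers and the three-record input are hypothetical and made up for illustration; the actual parser in `boltz2_refold.nf` may differ.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Hypothetical helper for illustration; the real module's parser may differ.
    """
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def new_mpnn_sequences(records):
    """Drop the first record, assumed to be the original input sequence."""
    return records[1:]


# Made-up input: one original record followed by two designed ones.
example = StringIO(">original\nMKTAYIAK\n>sample_1\nMKSAYLAK\n>sample_2\nMRTAYIGK\n")
records = read_fasta(example)
to_refold = new_mpnn_sequences(records)
print(f"Found {len(records)} sequences, refolding {len(to_refold)}")
```

Slicing with `[1:]` returns an empty list for an empty or single-record input, so the sketch needs no separate length guard before the "nothing to refold" check.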
+
+## Documentation Added
+
+1. **PARALLELIZATION_FIX.md**: Comprehensive explanation of the problem, root cause, solution, and testing recommendations
+
+2. **WORKFLOW_FLOW_DIAGRAM.md**: Visual diagram showing the complete pipeline flow with all parallelization points
+
+3. **CHANGES_SUMMARY.md**: This file - quick reference for what changed
+
+## Expected Behavior After Fix
+
+### Example: budget=2, mpnn_num_seq_per_target=8
+
+**Step 1: Boltzgen**
+- Generates 5 designs, filters to top 2
+- Output: `rank_1.cif`, `rank_2.cif`
+
+**Step 2: ProteinMPNN** (2 parallel executions)
+- Process `rank_1.pdb` → 8 sequences (1 original + 7 new)
+- Process `rank_2.pdb` → 8 sequences (1 original + 7 new)
+- Total: 16 sequences (8 per design)
+
+**Step 3: Extract Target Sequences** (1 execution)
+- Extracts target chain from first Boltzgen CIF
+- Output: `target_seq.txt`
+
+**Step 4: Boltz2 Refolding** (14 parallel executions) ← FIXED!
+- From rank_1: 7 new sequences (skip original)
+- From rank_2: 7 new sequences (skip original)
+- Total: 14 Boltz2 predictions
+
+**Step 5: Analysis** (16 parallel executions each)
+- ipSAE: 2 Boltzgen + 14 Boltz2 = 16 scores
+- Prodigy: 2 Boltzgen + 14 Boltz2 = 16 scores
+- Foldseek: 2 Boltzgen + 14 Boltz2 = 16 searches
+
+**Step 6: Consolidation** (1 execution)
+- Combines all metrics into one table
+- Output: 16 rows (2 Boltzgen + 14 Boltz2)
+
+## Impact
+
+### Before Fix:
+- ❌ Only sequences from first budget design were refolded
+- ❌ Example: 2 Boltzgen + 7 Boltz2 = 9 total structures
+- ❌ Missing half the expected data
+
+### After Fix:
+- ✅ ALL sequences from ALL budget designs are refolded
+- ✅ Example: 2 Boltzgen + 14 Boltz2 = 16 total structures
+- ✅ Complete parallelization as intended
+
+## Testing Checklist
+
+- [ ] Test with budget=1 (verify ProteinMPNN + Boltz2 work)
+- [ ] Test with budget=2 (verify all designs processed)
+- [ ] Test with budget=5 (verify scales correctly)
+- [ ] Check logs for "Processing X new MPNN sequences (skipping original)"
+- [ ] Verify final metrics table has expected row count:
+  - Expected rows = budget + [budget × (mpnn_num_seq_per_target - 1)]
+  - Example: budget=2, mpnn_num_seq_per_target=8 → 2 + (2 × 7) = 16 rows
+- [ ] Verify Boltz2 skips first sequence in each FASTA
+- [ ] Verify all budget designs contribute to final results
+
+## Files Modified
+
+1. `workflows/protein_design.nf` - Channel operations fix
+2. `modules/local/boltz2_refold.nf` - Skip first sequence logic
+
+## Files Added
+
+1. `PARALLELIZATION_FIX.md` - Detailed technical explanation
+2. `WORKFLOW_FLOW_DIAGRAM.md` - Visual flow diagram
+3. `CHANGES_SUMMARY.md` - This file
+
+## Questions?
+
+See `PARALLELIZATION_FIX.md` for detailed explanations and `WORKFLOW_FLOW_DIAGRAM.md` for visual representation of the complete workflow.
@@ -0,0 +1,173 @@
+# Parallelization Fix for nf-proteindesign
+
+## Problem Summary
+
+The pipeline was not correctly parallelizing Boltz2 refolding across all ProteinMPNN-generated sequences from all budget designs. 
+
+### Original Behavior:
+1. ✅ Boltzgen runs once per samplesheet row
+2. ✅ Takes N best designs based on budget parameter (e.g., budget=2 → 2 designs)
+3. ✅ ProteinMPNN runs once for EACH budget design (parallel execution)
+4. ❌ **Boltz2 only refolded sequences from ONE budget design** (not all)
+5. ✅ ipSAE and Prodigy run on structures (but missing some due to Boltz2 issue)
+
+### Desired Behavior:
+1. ✅ Run Boltzgen on each row of samplesheet in parallel
+2. ✅ Take the N best designs based on budget number (high quality, filtered designs)
+3. ✅ Run ProteinMPNN on these designs to generate X new sequences per design
+4. ✅ **Run Boltz2 in parallel to fold each ProteinMPNN sequence from ALL budget designs**
+5. ✅ Skip the first sequence in each ProteinMPNN FASTA (it's the original Boltzgen sequence)
+6. ✅ Calculate ipSAE and Prodigy on original Boltzgen + all ProteinMPNN refolded structures
+7. ✅ Combine all metrics into one comprehensive table
+
+## Root Cause
+
+The issue was in the channel joining logic in `workflows/protein_design.nf`:
+
+### Original Code (Lines 108-120):
+```groovy
+ch_boltz2_per_sequence = PROTEINMPNN_OPTIMIZE.out.sequences
+    .flatMap { ... }
+    .map { meta, fasta -> 
+        [meta.parent_id, meta, fasta]
+    }
+    .join(  // ❌ PROBLEM: join only matches FIRST item with same key!
+        EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq -> 
+            [meta.id, seq]
+        }
+    )
+    .map { parent_id, meta, fasta, target_seq ->
+        [meta, fasta, target_seq]
+    }
+```
+
+### Why This Failed:
+- **Multiple ProteinMPNN outputs** from different budget designs:
+  - `2vsm_protein_binder_rank_1` (parent_id: "2vsm_protein_binder")
+  - `2vsm_protein_binder_rank_2` (parent_id: "2vsm_protein_binder")
+- **One EXTRACT_TARGET_SEQUENCES output**:
+  - `2vsm_protein_binder` (id: "2vsm_protein_binder")
+- **join() behavior**: `join()` matches items one-to-one per key. With a single target-sequence item for `2vsm_protein_binder`, only the first ProteinMPNN item is paired and the other is dropped.
+
+## Solution
+
+### Fix #1: Use `combine` Instead of `join`
+
+Changed from `join()` to `combine(by: 0)` to ensure ALL ProteinMPNN sequences are paired with the target sequence:
+
+```groovy
+ch_boltz2_per_sequence = PROTEINMPNN_OPTIMIZE.out.sequences
+    .flatMap { ... }
+    .map { meta, fasta -> 
+        [meta.parent_id, meta, fasta]
+    }
+    .combine(  // ✅ FIXED: combine pairs ALL items with same key!
+        EXTRACT_TARGET_SEQUENCES.out.target_sequences.map { meta, seq -> 
+            [meta.id, seq]
+        },
+        by: 0  // Combine by parent_id (index 0)
+    )
+    .map { parent_id, meta, fasta, target_seq ->
+        [meta, fasta, target_seq]
+    }
+```
+
+**Key Difference:**
+- `join()`: One-to-one matching (first match only)
+- `combine(by: key)`: All-to-all matching for items with same key
+
+### Fix #2: Skip First Sequence in ProteinMPNN FASTA
+
+Added logic in `modules/local/boltz2_refold.nf` to skip the first sequence:
+
+```python
+# Skip the first sequence (it's always the original sequence from Boltzgen)
+# We only want to refold the NEW sequences generated by ProteinMPNN
+sequences_to_process = sequences[1:] if len(sequences) > 1 else []
+
+if not sequences_to_process:
+    print(f"⚠  Warning: Only found 1 sequence (original), no new MPNN sequences to refold")
+    continue
+
+print(f"Processing {len(sequences_to_process)} new MPNN sequences (skipping original)")
+```
+
+**Why This Matters:**
+- ProteinMPNN FASTA files always include the original sequence as the first entry
+- We already have the Boltzgen structure for this sequence
+- No need to refold it again with Boltz2
+- Only refold the NEW ProteinMPNN-optimized sequences
+
+## Expected Flow After Fix
+
+### Example with budget=2, mpnn_num_seq_per_target=8:
+
+1. **Boltzgen**: Generates 2 designs (rank_1, rank_2)
+   - `2vsm_protein_binder_output/intermediate_designs_inverse_folded/rank_1.cif`
+   - `2vsm_protein_binder_output/intermediate_designs_inverse_folded/rank_2.cif`
+
+2. **ProteinMPNN**: Runs on EACH design (2 parallel executions)
+   - Processes `rank_1.pdb` → generates 8 sequences (1 original + 7 new)
+   - Processes `rank_2.pdb` → generates 8 sequences (1 original + 7 new)
+   - Total: 16 sequences (8 per design)
+
+3. **Extract Target Sequences**: Runs once
+   - Extracts target sequence from first Boltzgen CIF
+   - Output: One target sequence file (same for all designs)
+
+4. **Boltz2 Refolding**: Runs on ALL new sequences (14 parallel executions)
+   - From rank_1: 7 new sequences → 7 Boltz2 runs
+   - From rank_2: 7 new sequences → 7 Boltz2 runs
+   - Total: 14 Boltz2 predictions (skipping 2 original sequences)
+
+5. **ipSAE**: Runs on all structures
+   - 2 original Boltzgen structures (rank_1, rank_2)
+   - 14 Boltz2 refolded structures
+   - Total: 16 ipSAE calculations
+
+6. **Prodigy**: Runs on all structures
+   - 2 original Boltzgen structures
+   - 14 Boltz2 refolded structures
+   - Total: 16 Prodigy predictions
+
+7. **Consolidation**: Combines all metrics into one table
+   - 16 rows (2 Boltzgen + 14 Boltz2)
+   - Columns: design_name, source, pLDDT, ipSAE, prodigy_affinity, etc.
+
+## Testing Recommendations
+
+1. **Test with budget=1**: Verify ProteinMPNN + Boltz2 work with single design
+2. **Test with budget=2**: Verify all sequences from both designs are refolded
+3. **Check logs**: Ensure "Processing X new MPNN sequences (skipping original)" appears
+4. **Verify counts**: 
+   - ProteinMPNN sequences = budget × mpnn_num_seq_per_target
+   - Boltz2 predictions = budget × (mpnn_num_seq_per_target - 1)
+   - Total structures = budget + [budget × (mpnn_num_seq_per_target - 1)]
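The count formulas above can be sketched as a small helper for checking a run. `expected_counts` is a hypothetical name, not part of the pipeline, and it assumes every ProteinMPNN FASTA contains exactly one original sequence as its first entry.

```python
def expected_counts(budget, mpnn_num_seq_per_target):
    """Return the expected item counts for one samplesheet row.

    Assumes the first of the mpnn_num_seq_per_target sequences per design
    is the original and is therefore not refolded.
    """
    if budget < 1 or mpnn_num_seq_per_target < 1:
        raise ValueError("budget and mpnn_num_seq_per_target must be >= 1")
    mpnn_sequences = budget * mpnn_num_seq_per_target
    boltz2_predictions = budget * (mpnn_num_seq_per_target - 1)
    total_structures = budget + boltz2_predictions
    return {
        "mpnn_sequences": mpnn_sequences,
        "boltz2_predictions": boltz2_predictions,
        "total_structures": total_structures,
    }


counts = expected_counts(budget=2, mpnn_num_seq_per_target=8)
print(counts)
```

The total simplifies algebraically to `budget × mpnn_num_seq_per_target`, which is why the example yields 16 for both the ProteinMPNN sequence count and the final structure count.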
+
+## Files Modified
+
+1. `workflows/protein_design.nf` (lines 108-120)
+   - Changed `join()` to `combine(by: 0)`
+   
+2. `modules/local/boltz2_refold.nf` (lines ~114-120)
+   - Added logic to skip first sequence in FASTA files
+
+## Channel Operation Comparison
+
+### `join()` behavior:
+```
+Channel A: [key1, dataA1], [key1, dataA2]
+Channel B: [key1, dataB]
+Result:    [key1, dataA1, dataB]  // Only FIRST match!
+           [dataA2 is DROPPED]
+```
+
+### `combine(by: 0)` behavior:
+```
+Channel A: [key1, dataA1], [key1, dataA2]
+Channel B: [key1, dataB]
+Result:    [key1, dataA1, dataB]  // ALL matches!
+           [key1, dataA2, dataB]
+```
+
+This is exactly what we need - every ProteinMPNN sequence paired with the target sequence!
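The two behaviors compared above can be modeled in plain Python. This is a simplified, synchronous sketch and not Nextflow code: `join_model` and `combine_by_key_model` are hypothetical names, the file names are illustrative, and real channels are asynchronous, so the sketch says nothing about which item arrives first in an actual run.

```python
from collections import defaultdict, deque


def join_model(left, right):
    """Simplified model of one-to-one key matching.

    Each right-hand item is consumed by at most one left-hand item with the
    same key (index 0); left-hand items that find no partner are dropped.
    """
    pending = defaultdict(deque)
    for key, *rest in right:
        pending[key].append(rest)
    out = []
    for key, *rest in left:
        if pending[key]:
            out.append([key, *rest, *pending[key].popleft()])
    return out


def combine_by_key_model(left, right):
    """Simplified model of a per-key cross product, as with combine(by: 0)."""
    by_key = defaultdict(list)
    for key, *rest in right:
        by_key[key].append(rest)
    return [[key, *rest, *r] for key, *rest in left for r in by_key[key]]


# Two ProteinMPNN outputs share one parent_id; there is one target sequence.
left = [
    ["2vsm_protein_binder", "rank_1.fa"],
    ["2vsm_protein_binder", "rank_2.fa"],
]
right = [["2vsm_protein_binder", "target_seq.txt"]]

joined = join_model(left, right)
combined = combine_by_key_model(left, right)
print(len(joined), len(combined))
```

In this model the one-to-one match yields a single tuple while the per-key cross product yields one tuple per ProteinMPNN output, mirroring the before and after behavior described in this document.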