| 1 | +# Fixes Summary: ProteinMPNN Execution Count & EXTRACT_TARGET_SEQUENCES |
| 2 | + |
| 3 | +## Date: 2025-11-22 |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Issue 1: ProteinMPNN Running 5 Times Instead of 2 |
| 8 | + |
| 9 | +### Problem |
| 10 | +ProteinMPNN was executing 5 times even though `params.budget = 2` (meaning only 2 budget designs should exist). |
| 11 | + |
| 12 | +### Root Cause |
| 13 | +The input channel for `CONVERT_CIF_TO_PDB` was incorrectly configured: |
| 14 | + |
| 15 | +**BEFORE (INCORRECT):** |
| 16 | +```groovy |
| 17 | +ch_structures_for_conversion = BOLTZGEN_RUN.out.results |
| 18 | +    .map { meta, results_dir -> |
| 19 | +        def budget_designs_dir = file("${results_dir}/intermediate_designs_inverse_folded") |
| 20 | +        [meta, budget_designs_dir] |
| 21 | +    } |
| 22 | +``` |
| 23 | + |
| 24 | +This was passing the **entire results directory** which contains: |
| 25 | +- `final_1_designs/` (1 structure) |
| 26 | +- `intermediate_ranked_10_designs/` (10 structures) |
| 27 | +- `intermediate_designs_inverse_folded/` (2 structures - budget designs) |
| 28 | +- `intermediate_designs/` (10 structures) |
| 29 | +- etc. |
| 30 | + |
| 31 | +So it was converting **ALL structures** from multiple subdirectories, not just the budget designs. |
| 32 | + |
| 33 | +### Solution |
| 34 | +Changed to use the dedicated `budget_design_cifs` output from BOLTZGEN_RUN: |
| 35 | + |
| 36 | +**AFTER (CORRECT):** |
| 37 | +```groovy |
| 38 | +CONVERT_CIF_TO_PDB(BOLTZGEN_RUN.out.budget_design_cifs) |
| 39 | +``` |
| 40 | + |
| 41 | +The `budget_design_cifs` output is specifically curated to contain **ONLY** the budget designs (2 structures when budget=2). |
| 42 | + |
| 43 | +### Expected Behavior Now |
| 44 | +- ProteinMPNN will run **exactly 2 times** (once per budget design) |
| 45 | +- Each execution processes one PDB structure |
| 46 | +- Downstream Protenix refolding inherits the same parallelization |
| 47 | + |
| 48 | +--- |
| 49 | + |
| 50 | +## Issue 2: EXTRACT_TARGET_SEQUENCES - Purpose and Naming |
| 51 | + |
| 52 | +### What Does EXTRACT_TARGET_SEQUENCES Do? |
| 53 | + |
| 54 | +This process extracts the **target protein sequence** (binding partner) from the original Boltzgen-designed structures. |
| 55 | + |
| 56 | +### Why Do We Need It? |
| 57 | + |
| 58 | +**Context:** When ProteinMPNN generates new sequences for the binder protein, we want to refold those sequences with Protenix to verify they maintain the correct structure. |
| 59 | + |
| 60 | +**Problem:** Protenix needs to know which chain is the **target** (the protein you're designing a binder against) so it can: |
| 61 | +1. Keep the target chain in its correct position during refolding |
| 62 | +2. Properly model the binder-target interaction |
| 63 | +3. Generate accurate confidence scores for the complex |
| 64 | + |
| 65 | +**Solution:** Extract the target sequence from the original Boltzgen structures and pass it to Protenix along with the new ProteinMPNN sequences. |
| 66 | + |
| 67 | +### What Is the Target Sequence? |
| 68 | + |
| 69 | +In a binder design workflow: |
| 70 | +- **Binder chain**: The small protein you're designing (gets optimized by ProteinMPNN) |
| 71 | +- **Target chain**: The larger protein you want to bind to (stays fixed, extracted by this process) |
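As a rough illustration of the binder/target distinction above, the sketch below picks the target chain from a mapping of chain IDs to sequences. It is not the logic of `assets/extract_target_sequence.py`, which is not shown in this summary. The function name `pick_target_chain`, the `binder_chain` argument, and the "longest remaining chain is the target" rule are assumptions made for illustration only.

```python
from typing import Dict, Optional, Tuple


def pick_target_chain(
    chains: Dict[str, str], binder_chain: Optional[str] = None
) -> Tuple[str, str]:
    """Return (chain_id, sequence) for the chain assumed to be the target.

    Hypothetical helper: `chains` maps chain IDs to one-letter sequences
    that were already parsed from a design structure.
    """
    if not chains:
        raise ValueError("no chains provided")

    # Exclude the binder when its chain ID is known.
    candidates = {cid: seq for cid, seq in chains.items() if cid != binder_chain}
    if not candidates:
        raise ValueError("no non-binder chain found")

    # Assumption: the target is the larger protein, so take the longest chain.
    target_id = max(candidates, key=lambda cid: len(candidates[cid]))
    return target_id, candidates[target_id]


if __name__ == "__main__":
    example = {"A": "MKTAYIAKQR", "B": "GSHM"}
    print(pick_target_chain(example, binder_chain="B"))
```

If the real script identifies the target by an explicit chain ID rather than by length, the length-based fallback here would simply not apply.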
| 72 | + |
| 73 | +### Process Flow |
| 74 | + |
| 75 | +``` |
| 76 | +Boltzgen Structure (CIF) |
| 77 | +         ↓ |
| 78 | +EXTRACT_TARGET_SEQUENCES |
| 79 | +         ↓ |
| 80 | +Target Sequence (TXT file) |
| 81 | +         ↓ |
| 82 | +    ├→ Protenix Input 1: ProteinMPNN optimized sequence (binder) |
| 83 | +    └→ Protenix Input 2: Target sequence (from this extraction) |
| 84 | +         ↓ |
| 85 | +Protenix Refolds Complex |
| 86 | +``` |
| 87 | + |
| 88 | +### Naming and No Collisions |
| 89 | + |
| 90 | +✅ **Process name**: `EXTRACT_TARGET_SEQUENCES` (unique, no collision) |
| 91 | +✅ **Module file**: `modules/local/extract_target_sequences.nf` |
| 92 | +✅ **Script file**: `assets/extract_target_sequence.py` |
| 93 | +✅ **Output files**: `${meta.id}_target_sequences.txt` (unique per design) |
| 94 | + |
| 95 | +No naming collisions exist - this process has a distinct name and purpose separate from: |
| 96 | +- `PROTEINMPNN_OPTIMIZE` (optimizes binder sequences) |
| 97 | +- `PROTENIX_REFOLD` (refolds optimized binders with target) |
| 98 | +- Other `EXTRACT_*` modules (none exist) |
| 99 | + |
| 100 | +### Example Output |
| 101 | + |
| 102 | +For design `insulin_binder`: |
| 103 | +``` |
| 104 | +insulin_binder_target_sequences.txt: |
| 105 | +MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN |
| 106 | +``` |
| 107 | + |
| 108 | +This is the insulin precursor (preproinsulin) sequence, the target, which will be passed to Protenix along with the ProteinMPNN-optimized binder sequences. |
| 109 | + |
| 110 | +--- |
| 111 | + |
| 112 | +## Summary of Changes |
| 113 | + |
| 114 | +### Files Modified |
| 115 | +1. `workflows/protein_design.nf` |
| 116 | + - Fixed the `CONVERT_CIF_TO_PDB` input channel (which feeds ProteinMPNN) to use `budget_design_cifs` |
| 117 | + - Added comprehensive documentation for the `EXTRACT_TARGET_SEQUENCES` step |
| 118 | + |
| 119 | +### Expected Impact |
| 120 | +- ProteinMPNN executions: ~~5~~ → **2** (correct) |
| 121 | +- Protenix executions: reduced proportionally, to 2 × `mpnn_num_seq_per_target` |
| 122 | +- Clearer understanding of target sequence extraction purpose |
| 123 | +- No naming collisions or process conflicts |
| 124 | + |
| 125 | +### Testing Recommendation |
| 126 | +Run with minimal parameters to verify: |
| 127 | +```bash |
| 128 | +nextflow run main.nf \ |
| 129 | + --designs test_designs/ \ |
| 130 | + --budget 2 \ |
| 131 | + --run_proteinmpnn true \ |
| 132 | + --run_protenix_refold true \ |
| 133 | + --mpnn_num_seq_per_target 3 |
| 134 | +``` |
| 135 | + |
| 136 | +Expected process counts: |
| 137 | +- BOLTZGEN_RUN: 1 (per design YAML) |
| 138 | +- CONVERT_CIF_TO_PDB: 1 (processes 2 budget CIFs) |
| 139 | +- PROTEINMPNN_OPTIMIZE: 2 (once per budget design) |
| 140 | +- EXTRACT_TARGET_SEQUENCES: 1 (once per design) |
| 141 | +- PROTENIX_REFOLD: 6 (2 budget designs × 3 sequences each) |
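The expected counts above follow from simple arithmetic on the run parameters, sketched below. The function name `expected_process_counts` is hypothetical, and the assumption that every count scales linearly with the number of design YAMLs is an extrapolation from the single-design case listed here, not something the workflow is confirmed to guarantee.

```python
from typing import Dict


def expected_process_counts(
    num_designs: int, budget: int, num_seq_per_target: int
) -> Dict[str, int]:
    """Estimate how many times each process should run.

    Assumption: one BOLTZGEN_RUN, one CONVERT_CIF_TO_PDB and one
    EXTRACT_TARGET_SEQUENCES task per design YAML, one ProteinMPNN task
    per budget design, and one Protenix task per ProteinMPNN sequence.
    """
    mpnn_runs = num_designs * budget
    return {
        "BOLTZGEN_RUN": num_designs,
        "CONVERT_CIF_TO_PDB": num_designs,
        "PROTEINMPNN_OPTIMIZE": mpnn_runs,
        "EXTRACT_TARGET_SEQUENCES": num_designs,
        "PROTENIX_REFOLD": mpnn_runs * num_seq_per_target,
    }


if __name__ == "__main__":
    # Mirrors the test command: 1 design, --budget 2, --mpnn_num_seq_per_target 3
    for name, count in expected_process_counts(1, 2, 3).items():
        print(f"{name}: {count}")
```

Comparing these numbers against the task counts in the Nextflow run summary is a quick way to confirm that only budget designs reach ProteinMPNN.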
| 142 | + |
| 143 | +--- |
| 144 | + |
| 145 | +## Documentation Added |
| 146 | + |
| 147 | +Enhanced inline documentation in the workflow to explain: |
| 148 | +1. Why we use `budget_design_cifs` rather than the `results` directory |
| 149 | +2. Purpose of target sequence extraction |
| 150 | +3. How the target sequence is used by Protenix |
| 151 | +4. Expected parallelization pattern |
| 152 | + |
| 153 | +This should prevent future confusion about: |
| 154 | +- Which structures feed into ProteinMPNN |
| 155 | +- Why we extract target sequences separately |
| 156 | +- How the binder-target complex is modeled |