|
| 1 | +# Channel Grouping and Output Restructuring Fixes |
| 2 | + |
| 3 | +## Summary of Changes |
| 4 | + |
| 5 | +This document describes two fixes: |
| 6 | +1. **All budget designs from Boltzgen are now processed by ipSAE and Prodigy** |
| 7 | +2. **Output directories are restructured** for clearer organization |
| 8 | + |
| 9 | +## Problems Identified |
| 10 | + |
| 11 | +### 1. Channel Grouping Issue |
| 12 | +**Problem**: Not all budget designs from Boltzgen were being processed by ipSAE and Prodigy. |
| 13 | + |
| 14 | +**Root Cause**: The `budget_design_cifs` output in `boltzgen_run.nf` was using the wrong glob pattern: |
| 15 | +- Old: `${meta.id}_output/final_ranked_designs/final_*_designs/*.cif` |
| 16 | +- This pattern matched files in nested, wildcard-named subdirectories and did not reliably capture every budget design |
| 17 | + |
| 18 | +**Solution**: Changed to use the correct directory where Boltzgen places ALL budget designs: |
| 19 | +- New: `${meta.id}_output/intermediate_designs_inverse_folded/*.cif` |
| 20 | +- This directory contains exactly one CIF file per budget design (e.g., budget=2 yields 2 CIF files) |
| 21 | +- Similarly updated NPZ files: `${meta.id}_output/intermediate_designs_inverse_folded/*.npz` |
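
To make the difference between the two patterns concrete, here is a minimal sketch that counts what each glob matches against a mock output tree. The directory names come from the patterns above; the mock file names and the contents of `final_1_designs` are assumptions chosen for illustration, not actual Boltzgen output.

```bash
set -eu

# Build a mock Boltzgen output tree (file names are hypothetical).
workdir=$(mktemp -d)
out="$workdir/sample1_output"
mkdir -p "$out/intermediate_designs_inverse_folded" \
         "$out/final_ranked_designs/final_1_designs"
touch "$out/intermediate_designs_inverse_folded/design_1.cif" \
      "$out/intermediate_designs_inverse_folded/design_2.cif" \
      "$out/final_ranked_designs/final_1_designs/design_1.cif"

# Print how many existing paths a glob pattern expands to.
count_matches() {
    # Unquoted $1 lets the shell expand the glob. When nothing matches,
    # the pattern stays literal, so the -e test fails and we report 0.
    set -- $1
    if [ -e "$1" ]; then echo "$#"; else echo 0; fi
}

old_count=$(count_matches "$out/final_ranked_designs/final_*_designs/*.cif")
new_count=$(count_matches "$out/intermediate_designs_inverse_folded/*.cif")
echo "old pattern: $old_count file(s), new pattern: $new_count file(s)"

rm -rf "$workdir"
```

Running the same `count_matches` calls against a real `${meta.id}_output` directory is a quick way to confirm that the new pattern returns as many CIF files as the budget.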
| 22 | + |
| 23 | +### 2. Output Structure Issue |
| 24 | +**Problem**: Output directories were inconsistent and unclear: |
| 25 | +- Boltzgen outputs went to: `{sample_id}/` |
| 26 | +- ipSAE outputs went to: `{sample_id}/ipsae_scores/` |
| 27 | +- Prodigy outputs went to: `{parent_id}/prodigy/` |
| 28 | +- Boltz2 outputs went to: `{parent_id}/boltz2/` |
| 29 | + |
| 30 | +This made it hard to see which results belonged to which design row. |
| 31 | + |
| 32 | +**Solution**: Restructured all outputs to use a consistent parent folder structure: |
| 33 | +``` |
| 34 | +outdir/ |
| 35 | +└── {sample_id}/ # Parent folder for each design row from samplesheet |
| 36 | + ├── boltzgen/ # Boltzgen results |
| 37 | + ├── ipsae/ # ipSAE scores |
| 38 | + ├── prodigy/ # Prodigy results |
| 39 | + ├── proteinmpnn/ # ProteinMPNN results (if enabled) |
| 40 | + ├── boltz2/ # Boltz2 results (if enabled) |
| 41 | + └── foldseek/ # Foldseek results (if enabled) |
| 42 | +``` |
| 43 | + |
| 44 | +## Files Modified |
| 45 | + |
| 46 | +### 1. `modules/local/boltzgen_run.nf` |
| 47 | +**Changes**: |
| 48 | +- Fixed `budget_design_cifs` output glob pattern to use `intermediate_designs_inverse_folded/*.cif` |
| 49 | +- Fixed `budget_design_npz` output glob pattern to use `intermediate_designs_inverse_folded/*.npz` |
| 50 | +- Changed publishDir from `${params.outdir}/${meta.id}` to `${params.outdir}/${meta.id}/boltzgen` |
| 51 | + |
| 52 | +**Impact**: |
| 53 | +- Ensures ALL budget designs are captured and passed to downstream processes |
| 54 | +- Organizes Boltzgen outputs into a dedicated subfolder |
| 55 | + |
| 56 | +### 2. `modules/local/ipsae_calculate.nf` |
| 57 | +**Changes**: |
| 58 | +- Changed publishDir from `${params.outdir}/${meta.id}/ipsae_scores` to `${params.outdir}/${meta.parent_id ?: meta.id}/ipsae` |
| 59 | +- Added comment explaining parent_id usage |
| 60 | + |
| 61 | +**Impact**: |
| 62 | +- ipSAE results now go into the parent design folder |
| 63 | +- Consistent naming with other tools (ipsae instead of ipsae_scores) |
| 64 | + |
| 65 | +### 3. `modules/local/prodigy_predict.nf` |
| 66 | +**Changes**: |
| 67 | +- Already using `${params.outdir}/${meta.parent_id ?: meta.id}/prodigy` ✓ |
| 68 | +- No changes needed |
| 69 | + |
| 70 | +### 4. `modules/local/proteinmpnn_optimize.nf` |
| 71 | +**Changes**: |
| 72 | +- Changed publishDir from `${params.outdir}/${meta.id}/proteinmpnn` to `${params.outdir}/${meta.parent_id ?: meta.id}/proteinmpnn` |
| 73 | +- Added comment explaining parent_id usage |
| 74 | + |
| 75 | +**Impact**: |
| 76 | +- ProteinMPNN results now go into the parent design folder |
| 77 | + |
| 78 | +### 5. `modules/local/boltz2_refold.nf` |
| 79 | +**Changes**: |
| 80 | +- Updated comment for clarity |
| 81 | +- Already using `${params.outdir}/${meta.parent_id ?: meta.id}/boltz2` ✓ |
| 82 | +- Added fallback to meta.id if parent_id is not set |
| 83 | + |
| 84 | +### 6. `modules/local/foldseek_search.nf` |
| 85 | +**Changes**: |
| 86 | +- Already using `${params.outdir}/${meta.parent_id ?: meta.id}/foldseek` ✓ |
| 87 | +- No changes needed |
| 88 | + |
| 89 | +### 7. `modules/local/consolidate_metrics.nf` |
| 90 | +**Changes**: |
| 91 | +- Updated `ipsae_pattern` from `**/ipsae_scores/*` to `**/ipsae/*` |
| 92 | +- This ensures the consolidation script finds ipSAE results in the new location |
| 93 | + |
| 94 | +## How the Parallelization Works |
| 95 | + |
| 96 | +### Budget Designs Flow |
| 97 | +1. **Boltzgen** generates N designs based on `budget` parameter (e.g., budget=2 → 2 designs) |
| 98 | +2. **Output channel** `budget_design_cifs` emits: `[meta, [design_1.cif, design_2.cif]]` |
| 99 | +3. **flatMap** in workflow creates individual tasks: |
| 100 | + - Task 1: `[meta1, design_1.cif]` → ipSAE |
| 101 | + - Task 2: `[meta2, design_1.cif]` → Prodigy |
| 102 | + - Task 3: `[meta1, design_2.cif]` → ipSAE |
| 103 | + - Task 4: `[meta2, design_2.cif]` → Prodigy |
| 104 | + |
| 105 | +### ProteinMPNN + Boltz2 Flow |
| 106 | +1. **ProteinMPNN** generates M sequences per budget design (e.g., 8 sequences × 2 designs = 16 sequences) |
| 107 | +2. **Split sequences** creates individual FASTA files (16 files) |
| 108 | +3. **Boltz2** refolds each sequence (16 parallel tasks) |
| 109 | +4. **ipSAE and Prodigy** run on each Boltz2 output (32 parallel tasks) |
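
The task counts in the two flows above follow from simple multiplication. The sketch below works through the example numbers for a single samplesheet row; the variable names are illustrative and do not correspond to pipeline parameters.

```bash
budget=2            # budget designs per samplesheet row
seqs_per_design=8   # ProteinMPNN sequences per budget design (M)
scorers=2           # ipSAE and Prodigy

# Each budget design is scored once by each scorer.
budget_scoring_tasks=$((budget * scorers))

# Each ProteinMPNN sequence is refolded once by Boltz2,
# then each refolded structure is scored once by each scorer.
refold_tasks=$((budget * seqs_per_design))
refold_scoring_tasks=$((refold_tasks * scorers))

echo "budget-design scoring tasks: $budget_scoring_tasks"
echo "Boltz2 refold tasks:         $refold_tasks"
echo "refold scoring tasks:        $refold_scoring_tasks"
```

With the example values this gives 4, 16, and 32 tasks, matching the counts listed above.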
| 110 | + |
| 111 | +## Testing Recommendations |
| 112 | + |
| 113 | +To verify these changes work correctly: |
| 114 | + |
| 115 | +1. **Test with budget=2**: |
| 116 | + ```bash |
| 117 | + nextflow run main.nf --input samplesheet.csv --budget 2 --run_ipsae --run_prodigy |
| 118 | + ``` |
| 119 | + |
| 120 | + **Expected results**: |
| 121 | +   - 2 ipSAE tasks per samplesheet row (2 × N tasks in total for N rows) |
| 122 | +   - 2 Prodigy tasks per samplesheet row (2 × N tasks in total for N rows) |
| 123 | + |
| 124 | +2. **Check output structure**: |
| 125 | + ```bash |
| 126 | + tree results/ |
| 127 | + ``` |
| 128 | + |
| 129 | + **Expected structure**: |
| 130 | + ``` |
| 131 | + results/ |
| 132 | + ├── sample1/ |
| 133 | + │ ├── boltzgen/ |
| 134 | + │ │ └── sample1_output/ |
| 135 | + │ ├── ipsae/ |
| 136 | + │ │ ├── sample1_design1_10_10.txt |
| 137 | + │ │ └── sample1_design2_10_10.txt |
| 138 | + │ └── prodigy/ |
| 139 | + │ ├── sample1_design1_prodigy_summary.csv |
| 140 | + │ └── sample1_design2_prodigy_summary.csv |
| 141 | + └── sample2/ |
| 142 | + └── ... |
| 143 | + ``` |
| 144 | + |
| 145 | +3. **Verify ipSAE/Prodigy counts**: |
| 146 | + - Count files in each sample's ipsae folder: should equal budget value |
| 147 | + - Count files in each sample's prodigy folder: should equal budget value |
| 148 | + - If ProteinMPNN+Boltz2 enabled: should also have results for each refolded sequence |
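
The count check in step 3 can be scripted. This is a minimal sketch, assuming ProteinMPNN and Boltz2 are disabled, that each tool writes exactly one file per budget design directly into its folder (as in the expected structure above), and that every subdirectory of the results folder is a sample. The function names are hypothetical helpers, not part of the pipeline.

```bash
# Count regular files directly inside a directory (0 if it does not exist).
count_files() {
    if [ -d "$1" ]; then
        find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Report every sample whose ipsae/ or prodigy/ file count differs from the
# budget. Usage: check_counts RESULTS_DIR BUDGET
# Returns non-zero if any mismatch is found.
check_counts() {
    results_dir=$1
    budget=$2
    status=0
    for sample in "$results_dir"/*/; do
        [ -d "$sample" ] || continue
        for tool in ipsae prodigy; do
            n=$(count_files "${sample}${tool}")
            if [ "$n" -ne "$budget" ]; then
                echo "MISMATCH ${sample}${tool}: found $n, expected $budget"
                status=1
            fi
        done
    done
    return $status
}
```

For example, `check_counts results 2` prints one line per folder whose file count is not 2. Non-sample directories under `results/` would also be flagged, so exclude them or point the function at a directory that holds only sample folders.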
| 149 | + |
| 150 | +## Benefits |
| 151 | + |
| 152 | +### 1. Complete Analysis Coverage |
| 153 | +- Every budget design now gets scored by ipSAE and Prodigy |
| 154 | +- No designs are skipped or missed |
| 155 | +- Each design is scored as an independent task, so scoring runs in parallel |
| 156 | + |
| 157 | +### 2. Clear Organization |
| 158 | +- Each sample from the samplesheet has its own parent folder |
| 159 | +- Easy to see all results for a specific design |
| 160 | +- Tool-specific subfolders make it clear which analysis generated which files |
| 161 | + |
| 162 | +### 3. Scalability |
| 163 | +- Works with any budget value (1, 2, 10, etc.) |
| 164 | +- Handles variable numbers of ProteinMPNN sequences |
| 165 | +- Properly parallelizes across all designs and sequences |
| 166 | + |
| 167 | +## Migration Notes |
| 168 | + |
| 169 | +If you have existing results with the old structure, you can reorganize them: |
| 170 | + |
| 171 | +```bash |
| 172 | +# Example script to reorganize old results |
| 173 | +cd results/ |
| 174 | +for sample in */; do |
| 175 | + sample_name=${sample%/} |
| 176 | + |
| 177 | + # Move Boltzgen outputs |
| 178 | + if [ -d "$sample_name/${sample_name}_output" ]; then |
| 179 | + mkdir -p "$sample_name/boltzgen" |
| 180 | + mv "$sample_name/${sample_name}_output" "$sample_name/boltzgen/" |
| 181 | + fi |
| 182 | + |
| 183 | + # Rename ipsae_scores to ipsae |
| 184 | + if [ -d "$sample_name/ipsae_scores" ]; then |
| 185 | + mv "$sample_name/ipsae_scores" "$sample_name/ipsae" |
| 186 | + fi |
| 187 | +done |
| 188 | +``` |
| 189 | + |
| 190 | +## Additional Notes |
| 191 | + |
| 192 | +- The `meta.parent_id` field tracks the original sample_id from the samplesheet |
| 193 | +- Modules use `${meta.parent_id ?: meta.id}` so that `meta.id` is used when `parent_id` is not set |
| 194 | +- The consolidation module was updated to find files in the new ipsae path |
| 195 | +- No changes were needed to the actual workflow logic in `workflows/protein_design.nf` |
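
As an illustration of the `${meta.parent_id ?: meta.id}` fallback, the sketch below resolves a publish directory in shell, where `${parent_id:-$id}` plays the role of Groovy's Elvis operator. The function `publish_dir` and the ID `sample1_design1` are hypothetical and only mirror the path pattern described in this document; the modules themselves do this in Nextflow.

```bash
# Resolve the publish directory for a tool, mirroring the pattern
# ${params.outdir}/${meta.parent_id ?: meta.id}/<tool>.
# Arguments: OUTDIR ID TOOL [PARENT_ID]
publish_dir() {
    outdir=$1
    id=$2
    tool=$3
    parent_id=${4:-}
    # ${parent_id:-$id} falls back to $id when parent_id is unset or empty.
    echo "$outdir/${parent_id:-$id}/$tool"
}

# No parent_id: the sample's own ID names the parent folder.
publish_dir results sample1 boltzgen

# With parent_id: per-design results land in the parent sample's folder.
publish_dir results sample1_design1 ipsae sample1
```

The first call prints `results/sample1/boltzgen` and the second prints `results/sample1/ipsae`, which is how per-design outputs end up grouped under one folder per samplesheet row.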
| 196 | + |
| 197 | +--- |
| 198 | + |
| 199 | +**Date**: 2025-11-28 |
| 200 | +**Author**: Seqera AI |
| 201 | +**Status**: Implemented and Ready for Testing |