Merge pull request #70 from seqeralabs/seqera-ai/20251201-151008-add-boltzgen-output-reuse

FloWuenne · web-flow · commit a02d5d128351 · 2025-12-01T10:16:44.000-05:00
Add boltzgen_output_dir option to reuse expensive Boltzgen results
diff --git a/BOLTZGEN_REUSE_FEATURE.md b/BOLTZGEN_REUSE_FEATURE.md
@@ -0,0 +1,148 @@
+# Boltzgen Output Reuse Feature
+
+## Overview
+
+The `boltzgen_output_dir` feature allows you to skip the computationally expensive Boltzgen step and start the pipeline directly from ProteinMPNN using pre-computed Boltzgen results. This is particularly useful when:
+
+1. **The Nextflow cache is invalidated** - Nextflow sometimes invalidates the cached Boltzgen task even when only parameters changed
+2. **Testing downstream analyses** - You want to experiment with ProteinMPNN, Boltz-2, or other analysis parameters without re-running Boltzgen
+3. **Iterative refinement** - You're satisfied with Boltzgen designs and want to focus on sequence optimization and refolding
+
+## How It Works
+
+When you provide a `boltzgen_output_dir` in your samplesheet, the pipeline will:
+- **Skip running Boltzgen** for that sample
+- **Use the existing Boltzgen output directory** as if it had just been computed
+- **Continue with ProteinMPNN and downstream analyses** using the pre-computed structures
+
+## Samplesheet Configuration
+
+### Without Boltzgen Reuse (Normal Mode)
+
+```csv
+sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
+my_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,
+```
+
+### With Boltzgen Reuse (Skip Boltzgen)
+
+```csv
+sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
+my_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,results/my_design/boltzgen/my_design_output
+```
+
+**Key points:**
+- The `boltzgen_output_dir` should point to the Boltzgen output directory (typically `{sample_id}_output`)
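+- The path can be **absolute** (a leading `/` or a URI containing `://`) or **relative**; relative paths are resolved against the launch directory first, then the project directory
+- Even when reusing Boltzgen output, you must still provide `design_yaml` and `structure_files`; Boltzgen does not consume them in this mode, but the samplesheet parser still validates that the design YAML exists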
+- The directory must contain the standard Boltzgen output structure
+
+## Expected Boltzgen Output Directory Structure
+
+The `boltzgen_output_dir` should have this structure:
+
+```
+my_design_output/
+├── final_ranked_designs/
+│   ├── final_1_designs/
+│   │   ├── rank1_*.cif
+│   │   └── rank2_*.cif
+│   └── final_2_designs/
+│       ├── rank1_*.cif
+│       └── rank2_*.cif
+├── intermediate_designs/
+│   ├── design_*.cif
+│   └── design_*.npz
+├── intermediate_designs_inverse_folded/
+│   └── *.npz
+├── aggregate_metrics_analyze.csv
+└── per_target_metrics_analyze.csv
+```
+
+The pipeline specifically requires:
+- `final_ranked_designs/final_*_designs/*.cif` - Budget design CIF files for ProteinMPNN
+
+## Example Use Case
+
+### Step 1: Initial Run with Boltzgen
+
+**samplesheet.csv:**
+```csv
+sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
+2vsm_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,
+```
+
+**Run pipeline:**
+```bash
+nextflow run main.nf \
+  -profile docker \
+  --input samplesheet.csv \
+  --outdir results \
+  --run_proteinmpnn \
+  --run_boltz2_refold
+```
+
+This generates results in: `results/2vsm_design/boltzgen/2vsm_design_output/`
+
+### Step 2: Re-run with Different ProteinMPNN Parameters
+
+Now you want to test different ProteinMPNN parameters but don't want to re-run Boltzgen:
+
+**samplesheet_reuse.csv:**
+```csv
+sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
+2vsm_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,results/2vsm_design/boltzgen/2vsm_design_output
+```
+
+**Run pipeline with new parameters:**
+```bash
+nextflow run main.nf \
+  -profile docker \
+  --input samplesheet_reuse.csv \
+  --outdir results_retest \
+  --run_proteinmpnn \
+  --run_boltz2_refold \
+  --mpnn_num_seqs 16  # Try more sequences
+```
+
+### Step 3: Compare Results
+
+You can now compare the outputs from different downstream analyses while using the same Boltzgen designs.
+
+## Benefits
+
+1. **💰 Cost Savings** - Boltzgen is GPU-intensive; skipping it saves compute costs
+2. **⏱️ Time Savings** - A typical Boltzgen run takes 30-60 minutes; with reuse, the downstream steps start without that wait
+3. **🔬 Experimentation** - Test multiple downstream parameter combinations efficiently
+4. **🛡️ Cache Safety** - Preserve expensive results even when the Nextflow cache is invalidated
+5. **📊 Reproducibility** - Use the exact same Boltzgen designs across multiple analysis runs
+
+## Important Notes
+
+- The `boltzgen_output_dir` field is **optional** - leave it blank for normal Boltzgen execution
+- When provided, Boltzgen will be completely skipped for that sample
+- You can mix samples with and without `boltzgen_output_dir` in the same samplesheet
+- The directory structure must match what Boltzgen produces
+- ProteinMPNN and downstream analyses will work identically whether Boltzgen was just run or reused
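+
+As a sketch of a mixed samplesheet, the first row below reuses existing Boltzgen output while the second runs Boltzgen normally. The sample IDs `reused_design` and `fresh_design` and the output path are illustrative assumptions, not files shipped with the pipeline:
+
+```csv
+sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
+reused_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,results/reused_design/boltzgen/reused_design_output
+fresh_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,
+```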
+
+## Troubleshooting
+
+### Error: "Cannot find directory"
+- Check that the path to `boltzgen_output_dir` is correct
+- Use an absolute path if the relative path isn't working
+- Verify the directory exists and has proper permissions
+
+### Error: "No CIF files found"
+- Ensure the directory structure matches expected Boltzgen output
+- Check that `final_ranked_designs/final_*_designs/*.cif` files exist
+
+### Unexpected behavior
+- Verify that the pre-computed Boltzgen results match the expected design
+- Check that the `sample_id` matches between runs (for consistent output naming)
+
+## Future Enhancements
+
+Potential future improvements:
+- Automatic detection of Boltzgen output directories
+- Validation of directory structure before starting
+- Support for partial reuse (e.g., reuse intermediate but not final designs)
diff --git a/assets/schema_input_design.json b/assets/schema_input_design.json
@@ -57,6 +57,10 @@
         "type": "string",
         "pattern": "^\\S+\\.cif$",
         "errorMessage": "Target template must be a valid file path to a CIF file (e.g., 'target_structure.cif')"
+      },
+      "boltzgen_output_dir": {
+        "type": "string",
+        "errorMessage": "Boltzgen output directory must be a valid directory path to pre-computed Boltzgen results (e.g., 'results/sample1/boltzgen/sample1_output')"
       }
     },
     "required": [
diff --git a/main.nf b/main.nf
@@ -99,7 +99,7 @@ workflow NFPROTEINDESIGN {
         .fromList(design_samplesheet)
         .map { tuple ->
             // samplesheetToList returns list of values in schema order
-            // Order: sample_id, design_yaml, structure_files, protocol, num_designs, budget, reuse, target_msa, target_sequence, target_template
+            // Order: sample_id, design_yaml, structure_files, protocol, num_designs, budget, reuse, target_msa, target_sequence, target_template, boltzgen_output_dir
             def sample_id = tuple[0]
             def design_yaml_path = tuple[1]
             def structure_files_str = tuple[2]
@@ -110,6 +110,7 @@ workflow NFPROTEINDESIGN {
             def target_msa_path = tuple.size() > 7 ? tuple[7] : null
             def target_sequence_path = tuple.size() > 8 ? tuple[8] : null
             def target_template_path = tuple.size() > 9 ? tuple[9] : null
+            def boltzgen_output_dir_path = tuple.size() > 10 ? tuple[10] : null
             
             // Convert design YAML to file object and validate existence
             // Smart path resolution: try launchDir first (for local runs), then projectDir (for Platform)
@@ -191,14 +192,29 @@ workflow NFPROTEINDESIGN {
                 }
             }
 
+            // Resolve boltzgen_output_dir if provided: absolute paths and URIs as-is; relative paths against launchDir first, then projectDir
+            def boltzgen_output_dir = null
+            if (boltzgen_output_dir_path) {
+                if (boltzgen_output_dir_path.startsWith('/') || boltzgen_output_dir_path.contains('://')) {
+                    boltzgen_output_dir = file(boltzgen_output_dir_path, type: 'dir', checkIfExists: true)
+                } else {
+                    def launchDir_path = file(boltzgen_output_dir_path, type: 'dir')
+                    if (launchDir_path.exists()) {
+                        boltzgen_output_dir = launchDir_path
+                    } else {
+                        boltzgen_output_dir = file("${project_dir}/${boltzgen_output_dir_path}", type: 'dir', checkIfExists: true)
+                    }
+                }
+            }
+
             def meta = [:]
             meta.id = sample_id
             meta.protocol = protocol
             meta.num_designs = num_designs
             meta.budget = budget
             meta.reuse = reuse ?: false
 
-            [meta, design_yaml, structure_files, target_msa, target_sequence, target_template]
+            [meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir]
         }
 
     // ========================================================================
diff --git a/workflows/protein_design.nf b/workflows/protein_design.nf
@@ -19,24 +19,52 @@ include { CONSOLIDATE_METRICS } from '../modules/local/consolidate_metrics'
 workflow PROTEIN_DESIGN {
 
     take:
-    ch_input         // channel: [meta, design_yaml, structure_files, target_msa, target_sequence, target_template]
+    ch_input         // channel: [meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir]
     ch_cache         // channel: path to cache directory or EMPTY_CACHE placeholder
     ch_boltz2_cache  // channel: path to Boltz-2 cache directory or EMPTY_BOLTZ2_CACHE placeholder
 
     main:
 
     // ========================================================================
-    // Run Boltzgen on design YAMLs
+    // Run Boltzgen on design YAMLs OR use pre-computed results
     // ========================================================================
 
-    // Prepare Boltzgen input by removing target_msa, target_sequence, and target_template (not needed for Boltzgen)
-    ch_boltzgen_input = ch_input
-        .map { meta, design_yaml, structure_files, target_msa, target_sequence, target_template ->
-            [meta, design_yaml, structure_files]
+    // Split input channel into two branches: with and without pre-computed Boltzgen results
+    ch_input
+        .branch { meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir ->
+            with_precomputed: boltzgen_output_dir != null
+                return [meta, boltzgen_output_dir]
+            needs_boltzgen: boltzgen_output_dir == null
+                return [meta, design_yaml, structure_files]
+        }
+        .set { ch_branched }
+
+    // Run Boltzgen only for samples without pre-computed results
+    BOLTZGEN_RUN(ch_branched.needs_boltzgen, ch_cache)
+    
+    // Create channel from pre-computed Boltzgen output directories
+    ch_precomputed_boltzgen = ch_branched.with_precomputed
+        .map { meta, boltzgen_dir ->
+            // Pass the tuple through unchanged so it can be mixed with BOLTZGEN_RUN.out.results
+            [meta, boltzgen_dir]
+        }
+    
+    // Combine Boltzgen results from both sources (newly run + pre-computed)
+    ch_boltzgen_results = BOLTZGEN_RUN.out.results
+        .mix(ch_precomputed_boltzgen)
+    
+    // Extract budget_design_cifs from both sources for downstream processing
+    ch_budget_cifs_new = BOLTZGEN_RUN.out.budget_design_cifs
+    
+    ch_budget_cifs_precomputed = ch_branched.with_precomputed
+        .map { meta, boltzgen_dir ->
+            // Collect budget design CIF files; file() with a glob returns a list of matches (empty if none)
+            def budget_cifs = file("${boltzgen_dir}/final_ranked_designs/final_*_designs/*.cif")
+            [meta, budget_cifs]
         }
     
-    // Run Boltzgen for each design in parallel
-    BOLTZGEN_RUN(ch_boltzgen_input, ch_cache)
+    ch_budget_design_cifs = ch_budget_cifs_new
+        .mix(ch_budget_cifs_precomputed)
     
     // ========================================================================
     // ProteinMPNN: Optimize sequences for designed structures
@@ -45,7 +73,8 @@ workflow PROTEIN_DESIGN {
         // Step 1: Convert CIF structures to PDB format (ProteinMPNN requires PDB)
         // Use budget_design_cifs which contains ONLY the budget designs (e.g., 2 structures if budget=2)
         // NOT all designs from results directory
-        CONVERT_CIF_TO_PDB(BOLTZGEN_RUN.out.budget_design_cifs)
+        // Use the combined channel that includes both newly computed and pre-computed Boltzgen results
+        CONVERT_CIF_TO_PDB(ch_budget_design_cifs)
         
         // Step 2: Parallelize ProteinMPNN - run separately for each budget design
         // Use flatMap to create individual tasks per PDB file (one per budget iteration)
@@ -179,7 +208,8 @@ workflow PROTEIN_DESIGN {
         }
     } else {
         // Use Boltzgen outputs directly if ProteinMPNN is disabled
-        ch_final_designs_for_analysis = BOLTZGEN_RUN.out.results
+        // Use the combined channel that includes both newly computed and pre-computed results
+        ch_final_designs_for_analysis = ch_boltzgen_results
     }
     
     // ========================================================================
@@ -398,9 +428,9 @@ workflow PROTEIN_DESIGN {
     }
 
     emit:
-    // Boltzgen outputs
-    boltzgen_results = BOLTZGEN_RUN.out.results
-    final_designs = BOLTZGEN_RUN.out.final_designs
+    // Boltzgen outputs (combined from both newly computed and pre-computed sources)
+    boltzgen_results = ch_boltzgen_results
+    final_designs = ch_budget_design_cifs
     
     // ProteinMPNN outputs (will be empty if not run)
     mpnn_optimized = params.run_proteinmpnn ? PROTEINMPNN_OPTIMIZE.out.optimized_designs : Channel.empty()