| 1 | +# Boltzgen Output Reuse Feature |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The `boltzgen_output_dir` feature allows you to skip the computationally expensive Boltzgen step and start the pipeline directly from ProteinMPNN using pre-computed Boltzgen results. This is particularly useful when: |
| 6 | + |
| 7 | +1. **Nextflow cache is invalidated** - Nextflow sometimes invalidates the Boltzgen cache even when only parameters have changed |
| 8 | +2. **Testing downstream analyses** - You want to experiment with ProteinMPNN, Boltz-2, or other analysis parameters without re-running Boltzgen |
| 9 | +3. **Iterative refinement** - You're satisfied with Boltzgen designs and want to focus on sequence optimization and refolding |
| 10 | + |
| 11 | +## How It Works |
| 12 | + |
| 13 | +When you provide a `boltzgen_output_dir` in your samplesheet, the pipeline will: |
| 14 | +- **Skip running Boltzgen** for that sample |
| 15 | +- **Use the existing Boltzgen output directory** as if it had just been computed |
| 16 | +- **Continue with ProteinMPNN and downstream analyses** using the pre-computed structures |
| 17 | + |
| 18 | +## Samplesheet Configuration |
| 19 | + |
| 20 | +### Without Boltzgen Reuse (Normal Mode) |
| 21 | + |
| 22 | +```csv |
| 23 | +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir |
| 24 | +my_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,, |
| 25 | +``` |
| 26 | + |
| 27 | +### With Boltzgen Reuse (Skip Boltzgen) |
| 28 | + |
| 29 | +```csv |
| 30 | +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir |
| 31 | +my_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,results/my_design/boltzgen/my_design_output |
| 32 | +``` |
| 33 | + |
| 34 | +**Key points:** |
| 35 | +- The `boltzgen_output_dir` should point to the Boltzgen output directory (typically `{sample_id}_output`) |
| 36 | +- The path can be **relative** (to the launch directory or the project directory) or **absolute** |
| 37 | +- Even when reusing, you must still provide `design_yaml` and `structure_files` (for consistency, though they won't be used) |
| 38 | +- The directory must contain the standard Boltzgen output structure |
| 39 | + |
| 40 | +## Expected Boltzgen Output Directory Structure |
| 41 | + |
| 42 | +The `boltzgen_output_dir` should have this structure: |
| 43 | + |
| 44 | +``` |
| 45 | +my_design_output/ |
| 46 | +├── final_ranked_designs/ |
| 47 | +│ ├── final_1_designs/ |
| 48 | +│ │ ├── rank1_*.cif |
| 49 | +│ │ └── rank2_*.cif |
| 50 | +│ └── final_2_designs/ |
| 51 | +│ ├── rank1_*.cif |
| 52 | +│ └── rank2_*.cif |
| 53 | +├── intermediate_designs/ |
| 54 | +│ ├── design_*.cif |
| 55 | +│ └── design_*.npz |
| 56 | +├── intermediate_designs_inverse_folded/ |
| 57 | +│ └── *.npz |
| 58 | +├── aggregate_metrics_analyze.csv |
| 59 | +└── per_target_metrics_analyze.csv |
| 60 | +``` |
| 61 | + |
| 62 | +The pipeline specifically requires: |
| 63 | +- `final_ranked_designs/final_*_designs/*.cif` - Budget design CIF files for ProteinMPNN |
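Before pointing a samplesheet at a directory, it can help to confirm that the required files are actually there. The following is a minimal sketch, not part of the pipeline: `check_boltzgen_dir` is a hypothetical helper name, and the only thing it checks is the glob pattern documented above.

```bash
# Hypothetical helper (not part of the pipeline): succeed only if the
# directory holds at least one CIF file matching the documented pattern
# final_ranked_designs/final_*_designs/*.cif
check_boltzgen_dir() {
    local dir="$1" f
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    # If the glob matches nothing it stays literal, so the -f test fails
    for f in "$dir"/final_ranked_designs/final_*_designs/*.cif; do
        if [ -f "$f" ]; then
            return 0
        fi
    done
    echo "no CIF files under $dir/final_ranked_designs/final_*_designs/" >&2
    return 1
}
```

For example, `check_boltzgen_dir results/my_design/boltzgen/my_design_output && echo "ready for reuse"` prints the message only when at least one matching CIF file is found.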
| 64 | + |
| 65 | +## Example Use Case |
| 66 | + |
| 67 | +### Step 1: Initial Run with Boltzgen |
| 68 | + |
| 69 | +**samplesheet.csv:** |
| 70 | +```csv |
| 71 | +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir |
| 72 | +2vsm_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,, |
| 73 | +``` |
| 74 | + |
| 75 | +**Run pipeline:** |
| 76 | +```bash |
| 77 | +nextflow run main.nf \ |
| 78 | + -profile docker \ |
| 79 | + --input samplesheet.csv \ |
| 80 | + --outdir results \ |
| 81 | + --run_proteinmpnn \ |
| 82 | + --run_boltz2_refold |
| 83 | +``` |
| 84 | + |
| 85 | +This generates results in: `results/2vsm_design/boltzgen/2vsm_design_output/` |
| 86 | + |
| 87 | +### Step 2: Re-run with Different ProteinMPNN Parameters |
| 88 | + |
| 89 | +Now you want to test different ProteinMPNN parameters but don't want to re-run Boltzgen: |
| 90 | + |
| 91 | +**samplesheet_reuse.csv:** |
| 92 | +```csv |
| 93 | +sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir |
| 94 | +2vsm_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,results/2vsm_design/boltzgen/2vsm_design_output |
| 95 | +``` |
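The reuse samplesheet above differs from the original only in its last column, so it can be derived rather than edited by hand. The sketch below assumes the `results/<sample_id>/boltzgen/<sample_id>_output` layout shown in Step 1 and a simple CSV without quoted fields or embedded commas; `make_reuse_samplesheet` is a hypothetical helper name, not a pipeline command.

```bash
# Hypothetical helper: fill boltzgen_output_dir for every sample row.
# Assumes the <results_dir>/<sample_id>/boltzgen/<sample_id>_output layout
# and a plain CSV (no quoted fields, no embedded commas).
make_reuse_samplesheet() {
    local input="$1" results_dir="$2"
    awk -F',' -v OFS=',' -v root="$results_dir" '
        NR == 1 {
            # Locate the two columns by header name
            for (i = 1; i <= NF; i++) {
                if ($i == "sample_id") id_col = i
                if ($i == "boltzgen_output_dir") dir_col = i
            }
            if (!id_col || !dir_col) {
                print "required columns not found in header" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        NF == 0 { next }   # skip blank lines
        {
            $dir_col = root "/" $id_col "/boltzgen/" $id_col "_output"
            print
        }
    ' "$input"
}
```

Usage: `make_reuse_samplesheet samplesheet.csv results > samplesheet_reuse.csv`. Note that this fills the column for every row; to reuse Boltzgen output for only some samples, clear the column for the rest afterwards.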
| 96 | + |
| 97 | +**Run pipeline with new parameters:** |
| 98 | +```bash |
| 99 | +nextflow run main.nf \ |
| 100 | + -profile docker \ |
| 101 | + --input samplesheet_reuse.csv \ |
| 102 | + --outdir results_retest \ |
| 103 | + --run_proteinmpnn \ |
| 104 | + --run_boltz2_refold \ |
| 105 | + --mpnn_num_seqs 16 # Try more sequences |
| 106 | +``` |
| 107 | + |
| 108 | +### Step 3: Compare Results |
| 109 | + |
| 110 | +You can now compare the outputs from different downstream analyses while using the same Boltzgen designs. |
| 111 | + |
| 112 | +## Benefits |
| 113 | + |
| 114 | +1. **💰 Cost Savings** - Boltzgen is GPU-intensive; skipping it saves compute costs |
| 115 | +2. **⏱️ Time Savings** - Typical Boltzgen run: 30-60 minutes; this feature: instant start |
| 116 | +3. **🔬 Experimentation** - Test multiple downstream parameter combinations efficiently |
| 117 | +4. **🛡️ Cache Safety** - Preserve expensive results even when Nextflow cache is invalidated |
| 118 | +5. **📊 Reproducibility** - Use exact same Boltzgen designs across multiple analysis runs |
| 119 | + |
| 120 | +## Important Notes |
| 121 | + |
| 122 | +- The `boltzgen_output_dir` field is **optional** - leave it blank for normal Boltzgen execution |
| 123 | +- When provided, Boltzgen will be completely skipped for that sample |
| 124 | +- You can mix samples with and without `boltzgen_output_dir` in the same samplesheet |
| 125 | +- The directory structure must match what Boltzgen produces |
| 126 | +- ProteinMPNN and downstream analyses will work identically whether Boltzgen was just run or reused |
| 127 | + |
| 128 | +## Troubleshooting |
| 129 | + |
| 130 | +### Error: "Cannot find directory" |
| 131 | +- Check that the path to `boltzgen_output_dir` is correct |
| 132 | +- Use an absolute path if the relative path does not resolve |
| 133 | +- Verify that the directory exists and is readable by the user running the pipeline |
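The checks above can be run over a whole samplesheet before launching. This is a rough sketch under stated assumptions: `check_samplesheet_dirs` is a hypothetical helper, it assumes a plain CSV without quoted fields, and it resolves relative paths against the current directory only (the pipeline may also resolve them against the project directory).

```bash
# Hypothetical pre-flight check: report every non-empty boltzgen_output_dir
# that does not exist as a directory. Relative paths are tested against the
# current working directory only.
check_samplesheet_dirs() {
    local sheet="$1" status=0 sample dir rows
    # Extract "sample_id,boltzgen_output_dir" pairs, locating columns by name
    rows=$(awk -F',' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "sample_id") s = i
                if ($i == "boltzgen_output_dir") d = i
            }
            next
        }
        s && d { print $s "," $d }
    ' "$sheet")
    while IFS=',' read -r sample dir; do
        # Blank column means normal Boltzgen execution; nothing to check
        [ -z "$dir" ] && continue
        if [ ! -d "$dir" ]; then
            echo "$sample: cannot find directory $dir" >&2
            status=1
        fi
    done <<EOF
$rows
EOF
    return "$status"
}
```

Running `check_samplesheet_dirs samplesheet_reuse.csv` from the launch directory returns a non-zero status and names each sample whose directory is missing.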
| 134 | + |
| 135 | +### Error: "No CIF files found" |
| 136 | +- Ensure the directory structure matches expected Boltzgen output |
| 137 | +- Check that `final_ranked_designs/final_*_designs/*.cif` files exist |
| 138 | + |
| 139 | +### Unexpected behavior |
| 140 | +- Verify that the pre-computed Boltzgen results were generated for the design you intend to analyze |
| 141 | +- Check that the `sample_id` matches between runs (for consistent output naming) |
| 142 | + |
| 143 | +## Future Enhancements |
| 144 | + |
| 145 | +Potential future improvements: |
| 146 | +- Automatic detection of Boltzgen output directories |
| 147 | +- Validation of directory structure before starting |
| 148 | +- Support for partial reuse (e.g., reuse intermediate but not final designs) |