Commit a02d5d1

authored
Merge pull request #70 from seqeralabs/seqera-ai/20251201-151008-add-boltzgen-output-reuse
Add boltzgen_output_dir option to reuse expensive Boltzgen results
2 parents 6d7fcf8 + b6f1ac8 commit a02d5d1

4 files changed

+213
-15
lines changed

BOLTZGEN_REUSE_FEATURE.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
1+
# Boltzgen Output Reuse Feature
2+
3+
## Overview
4+
5+
The `boltzgen_output_dir` feature allows you to skip the computationally expensive Boltzgen step and start the pipeline directly from ProteinMPNN using pre-computed Boltzgen results. This is particularly useful when:
6+
7+
1. **Nextflow cache is invalidated** - Nextflow can invalidate the cached Boltzgen task even when only downstream parameters have changed
8+
2. **Testing downstream analyses** - You want to experiment with ProteinMPNN, Boltz-2, or other analysis parameters without re-running Boltzgen
9+
3. **Iterative refinement** - You're satisfied with Boltzgen designs and want to focus on sequence optimization and refolding
10+
11+
## How It Works
12+
13+
When you provide a `boltzgen_output_dir` in your samplesheet, the pipeline will:
14+
- **Skip running Boltzgen** for that sample
15+
- **Use the existing Boltzgen output directory** as if it had just been computed
16+
- **Continue with ProteinMPNN and downstream analyses** using the pre-computed structures
17+
18+
## Samplesheet Configuration
19+
20+
### Without Boltzgen Reuse (Normal Mode)
21+
22+
```csv
23+
sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
24+
my_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,
25+
```
26+
27+
### With Boltzgen Reuse (Skip Boltzgen)
28+
29+
```csv
30+
sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
31+
my_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,results/my_design/boltzgen/my_design_output
32+
```
33+
34+
**Key points:**
35+
- The `boltzgen_output_dir` should point to the Boltzgen output directory (typically `{sample_id}_output`)
36+
- The path can be **relative** (resolved against the launch directory first, then the project directory) or **absolute**
37+
- Even when reusing, you must still provide `design_yaml` and `structure_files`: the samplesheet parser still reads these columns (and checks that the design YAML exists), although Boltzgen itself will not use them
38+
- The directory must contain the standard Boltzgen output structure
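
The lookup order for relative paths can be sketched in shell. This is only an illustration of the order described above, not the pipeline's own code: the function name, its arguments, and the error wording are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mimic how a boltzgen_output_dir value is resolved.
# Arguments: <path from samplesheet> <launch directory> <project directory>
resolve_boltzgen_dir() {
    local path="$1" launch_dir="$2" project_dir="$3"
    case "$path" in
        /*)
            # Absolute local path: it must already exist.
            if [ -d "$path" ]; then
                printf '%s\n' "$path"
            else
                echo "Cannot find directory: $path" >&2  # illustrative wording
                return 1
            fi
            ;;
        *://*)
            # Remote URI (for example s3://...): passed through unchecked here.
            printf '%s\n' "$path"
            ;;
        *)
            # Relative path: launch directory first, then project directory.
            if [ -d "$launch_dir/$path" ]; then
                printf '%s\n' "$launch_dir/$path"
            elif [ -d "$project_dir/$path" ]; then
                printf '%s\n' "$project_dir/$path"
            else
                echo "Cannot find directory: $path" >&2  # illustrative wording
                return 1
            fi
            ;;
    esac
}
```

For instance, `resolve_boltzgen_dir results/my_design/boltzgen/my_design_output "$PWD" /path/to/pipeline` prints the directory that would be picked up, or fails if neither location contains it.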
39+
40+
## Expected Boltzgen Output Directory Structure
41+
42+
The `boltzgen_output_dir` should have this structure:
43+
44+
```
45+
my_design_output/
46+
├── final_ranked_designs/
47+
│ ├── final_1_designs/
48+
│ │ ├── rank1_*.cif
49+
│ │ └── rank2_*.cif
50+
│ └── final_2_designs/
51+
│ ├── rank1_*.cif
52+
│ └── rank2_*.cif
53+
├── intermediate_designs/
54+
│ ├── design_*.cif
55+
│ └── design_*.npz
56+
├── intermediate_designs_inverse_folded/
57+
│ └── *.npz
58+
├── aggregate_metrics_analyze.csv
59+
└── per_target_metrics_analyze.csv
60+
```
61+
62+
The pipeline specifically requires:
63+
- `final_ranked_designs/final_*_designs/*.cif` - Budget design CIF files for ProteinMPNN
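
Before launching a reuse run, it can help to confirm that the required files are present. The following is a minimal sketch using the glob above; the function names and the messages are hypothetical and are not part of the pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count budget-design CIF files in a Boltzgen output dir.
count_budget_cifs() {
    local dir="$1" n=0 f
    for f in "$dir"/final_ranked_designs/final_*_designs/*.cif; do
        # An unmatched glob stays literal, so confirm each entry is a file.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Hypothetical helper: fail when no budget-design CIF files are present.
check_boltzgen_dir() {
    local dir="$1" n
    n=$(count_budget_cifs "$dir")
    if [ "$n" -eq 0 ]; then
        echo "No CIF files found under $dir/final_ranked_designs" >&2
        return 1
    fi
    echo "$n budget design CIF file(s) found"
}
```

Running `check_boltzgen_dir results/my_design/boltzgen/my_design_output` returns a non-zero status when the directory lacks the expected files.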
64+
65+
## Example Use Case
66+
67+
### Step 1: Initial Run with Boltzgen
68+
69+
**samplesheet.csv:**
70+
```csv
71+
sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
72+
2vsm_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,
73+
```
74+
75+
**Run pipeline:**
76+
```bash
77+
nextflow run main.nf \
78+
-profile docker \
79+
--input samplesheet.csv \
80+
--outdir results \
81+
--run_proteinmpnn \
82+
--run_boltz2_refold
83+
```
84+
85+
This generates results in: `results/2vsm_design/boltzgen/2vsm_design_output/`
86+
87+
### Step 2: Re-run with Different ProteinMPNN Parameters
88+
89+
Now you want to test different ProteinMPNN parameters but don't want to re-run Boltzgen:
90+
91+
**samplesheet_reuse.csv:**
92+
```csv
93+
sample_id,design_yaml,structure_files,protocol,num_designs,budget,reuse,target_msa,target_sequence,target_template,boltzgen_output_dir
94+
2vsm_design,designs/2vsm.yaml,structures/2VSM.cif,protein-anything,5,2,false,,targets/target.fasta,,results/2vsm_design/boltzgen/2vsm_design_output
95+
```
96+
97+
**Run pipeline with new parameters:**
98+
```bash
99+
nextflow run main.nf \
100+
-profile docker \
101+
--input samplesheet_reuse.csv \
102+
--outdir results_retest \
103+
--run_proteinmpnn \
104+
--run_boltz2_refold \
105+
--mpnn_num_seqs 16 # Try more sequences
106+
```
107+
108+
### Step 3: Compare Results
109+
110+
You can now compare the outputs from different downstream analyses while using the same Boltzgen designs.
111+
112+
## Benefits
113+
114+
1. **💰 Cost Savings** - Boltzgen is GPU-intensive; skipping it saves compute costs
115+
2. **⏱️ Time Savings** - Typical Boltzgen run: 30-60 minutes; this feature: instant start
116+
3. **🔬 Experimentation** - Test multiple downstream parameter combinations efficiently
117+
4. **🛡️ Cache Safety** - Preserve expensive results even when Nextflow cache is invalidated
118+
5. **📊 Reproducibility** - Use exact same Boltzgen designs across multiple analysis runs
119+
120+
## Important Notes
121+
122+
- The `boltzgen_output_dir` field is **optional** - leave it blank for normal Boltzgen execution
123+
- When provided, Boltzgen will be completely skipped for that sample
124+
- You can mix samples with and without `boltzgen_output_dir` in the same samplesheet
125+
- The directory structure must match what Boltzgen produces
126+
- ProteinMPNN and downstream analyses will work identically whether Boltzgen was just run or reused
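
When a samplesheet mixes both kinds of rows, a quick listing shows which samples will skip Boltzgen. The sketch below assumes a simple CSV with no quoted commas; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<sample_id> reuse" or "<sample_id> run" per row.
list_boltzgen_modes() {
    awk -F',' '
        { sub(/\r$/, "") }   # tolerate Windows line endings
        NR == 1 {
            # Locate the columns by header name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "sample_id") id = i
                if ($i == "boltzgen_output_dir") dir = i
            }
            next
        }
        NF > 0 {
            # A non-empty boltzgen_output_dir means Boltzgen is skipped.
            mode = (dir && $dir != "") ? "reuse" : "run"
            print $id, mode
        }
    ' "$1"
}
```

Calling `list_boltzgen_modes samplesheet.csv` prints one line per sample, so rows that unexpectedly show `run` are easy to spot before the pipeline starts.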
127+
128+
## Troubleshooting
129+
130+
### Error: "Cannot find directory"
131+
- Check that the path to `boltzgen_output_dir` is correct
132+
- Use an absolute path if the relative path isn't resolving
133+
- Verify the directory exists and has proper permissions
134+
135+
### Error: "No CIF files found"
136+
- Ensure the directory structure matches expected Boltzgen output
137+
- Check that `final_ranked_designs/final_*_designs/*.cif` files exist
138+
139+
### Unexpected behavior
140+
- Verify that the pre-computed Boltzgen results match the expected design
141+
- Check that the `sample_id` matches between runs (for consistent output naming)
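
One quick heuristic for the `sample_id` check relies on the typical `{sample_id}_output` directory name. It is only a sketch: the function name is hypothetical, and a directory that was renamed on purpose will legitimately fail it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when the directory is named <sample_id>_output.
matches_sample_id() {
    local sample_id="$1" dir="$2"
    # basename ignores a trailing slash and returns the last path component.
    [ "$(basename "$dir")" = "${sample_id}_output" ]
}
```

For example, `matches_sample_id 2vsm_design results/2vsm_design/boltzgen/2vsm_design_output` succeeds, while a mismatched `sample_id` returns a non-zero status.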
142+
143+
## Future Enhancements
144+
145+
Potential future improvements:
146+
- Automatic detection of Boltzgen output directories
147+
- Validation of directory structure before starting
148+
- Support for partial reuse (e.g., reuse intermediate but not final designs)

assets/schema_input_design.json

Lines changed: 4 additions & 0 deletions
@@ -57,6 +57,10 @@
5757
"type": "string",
5858
"pattern": "^\\S+\\.cif$",
5959
"errorMessage": "Target template must be a valid file path to a CIF file (e.g., 'target_structure.cif')"
60+
},
61+
"boltzgen_output_dir": {
62+
"type": "string",
63+
"errorMessage": "Boltzgen output directory must be a valid directory path to pre-computed Boltzgen results (e.g., 'results/sample1/boltzgen/sample1_output')"
6064
}
6165
},
6266
"required": [

main.nf

Lines changed: 18 additions & 2 deletions
@@ -99,7 +99,7 @@ workflow NFPROTEINDESIGN {
9999
.fromList(design_samplesheet)
100100
.map { tuple ->
101101
// samplesheetToList returns list of values in schema order
102-
// Order: sample_id, design_yaml, structure_files, protocol, num_designs, budget, reuse, target_msa, target_sequence, target_template
102+
// Order: sample_id, design_yaml, structure_files, protocol, num_designs, budget, reuse, target_msa, target_sequence, target_template, boltzgen_output_dir
103103
def sample_id = tuple[0]
104104
def design_yaml_path = tuple[1]
105105
def structure_files_str = tuple[2]
@@ -110,6 +110,7 @@ workflow NFPROTEINDESIGN {
110110
def target_msa_path = tuple.size() > 7 ? tuple[7] : null
111111
def target_sequence_path = tuple.size() > 8 ? tuple[8] : null
112112
def target_template_path = tuple.size() > 9 ? tuple[9] : null
113+
def boltzgen_output_dir_path = tuple.size() > 10 ? tuple[10] : null
113114

114115
// Convert design YAML to file object and validate existence
115116
// Smart path resolution: try launchDir first (for local runs), then projectDir (for Platform)
@@ -191,14 +192,29 @@ workflow NFPROTEINDESIGN {
191192
}
192193
}
193194

195+
// Parse boltzgen_output_dir if provided
196+
def boltzgen_output_dir = null
197+
if (boltzgen_output_dir_path) {
198+
if (boltzgen_output_dir_path.startsWith('/') || boltzgen_output_dir_path.contains('://')) {
199+
boltzgen_output_dir = file(boltzgen_output_dir_path, type: 'dir', checkIfExists: true)
200+
} else {
201+
def launchDir_path = file(boltzgen_output_dir_path, type: 'dir')
202+
if (launchDir_path.exists()) {
203+
boltzgen_output_dir = launchDir_path
204+
} else {
205+
boltzgen_output_dir = file("${project_dir}/${boltzgen_output_dir_path}", type: 'dir', checkIfExists: true)
206+
}
207+
}
208+
}
209+
194210
def meta = [:]
195211
meta.id = sample_id
196212
meta.protocol = protocol
197213
meta.num_designs = num_designs
198214
meta.budget = budget
199215
meta.reuse = reuse ?: false
200216

201-
[meta, design_yaml, structure_files, target_msa, target_sequence, target_template]
217+
[meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir]
202218
}
203219

204220
// ========================================================================

workflows/protein_design.nf

Lines changed: 43 additions & 13 deletions
@@ -19,24 +19,52 @@ include { CONSOLIDATE_METRICS } from '../modules/local/consolidate_metrics'
1919
workflow PROTEIN_DESIGN {
2020

2121
take:
22-
ch_input // channel: [meta, design_yaml, structure_files, target_msa, target_sequence, target_template]
22+
ch_input // channel: [meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir]
2323
ch_cache // channel: path to cache directory or EMPTY_CACHE placeholder
2424
ch_boltz2_cache // channel: path to Boltz-2 cache directory or EMPTY_BOLTZ2_CACHE placeholder
2525

2626
main:
2727

2828
// ========================================================================
29-
// Run Boltzgen on design YAMLs
29+
// Run Boltzgen on design YAMLs OR use pre-computed results
3030
// ========================================================================
3131

32-
// Prepare Boltzgen input by removing target_msa, target_sequence, and target_template (not needed for Boltzgen)
33-
ch_boltzgen_input = ch_input
34-
.map { meta, design_yaml, structure_files, target_msa, target_sequence, target_template ->
35-
[meta, design_yaml, structure_files]
32+
// Split input channel into two branches: with and without pre-computed Boltzgen results
33+
ch_input
34+
.branch { meta, design_yaml, structure_files, target_msa, target_sequence, target_template, boltzgen_output_dir ->
35+
with_precomputed: boltzgen_output_dir != null
36+
return [meta, boltzgen_output_dir]
37+
needs_boltzgen: boltzgen_output_dir == null
38+
return [meta, design_yaml, structure_files]
39+
}
40+
.set { ch_branched }
41+
42+
// Run Boltzgen only for samples without pre-computed results
43+
BOLTZGEN_RUN(ch_branched.needs_boltzgen, ch_cache)
44+
45+
// Create channel from pre-computed Boltzgen output directories
46+
ch_precomputed_boltzgen = ch_branched.with_precomputed
47+
.map { meta, boltzgen_dir ->
48+
// Stage the pre-computed directory as if it came from BOLTZGEN_RUN
49+
[meta, boltzgen_dir]
50+
}
51+
52+
// Combine Boltzgen results from both sources (newly run + pre-computed)
53+
ch_boltzgen_results = BOLTZGEN_RUN.out.results
54+
.mix(ch_precomputed_boltzgen)
55+
56+
// Extract budget_design_cifs from both sources for downstream processing
57+
ch_budget_cifs_new = BOLTZGEN_RUN.out.budget_design_cifs
58+
59+
ch_budget_cifs_precomputed = ch_branched.with_precomputed
60+
.map { meta, boltzgen_dir ->
61+
// Extract budget design CIF files from pre-computed directory
62+
def budget_cifs = file("${boltzgen_dir}/final_ranked_designs/final_*_designs/*.cif")
63+
[meta, budget_cifs]
3664
}
3765

38-
// Run Boltzgen for each design in parallel
39-
BOLTZGEN_RUN(ch_boltzgen_input, ch_cache)
66+
ch_budget_design_cifs = ch_budget_cifs_new
67+
.mix(ch_budget_cifs_precomputed)
4068

4169
// ========================================================================
4270
// ProteinMPNN: Optimize sequences for designed structures
@@ -45,7 +73,8 @@ workflow PROTEIN_DESIGN {
4573
// Step 1: Convert CIF structures to PDB format (ProteinMPNN requires PDB)
4674
// Use budget_design_cifs which contains ONLY the budget designs (e.g., 2 structures if budget=2)
4775
// NOT all designs from results directory
48-
CONVERT_CIF_TO_PDB(BOLTZGEN_RUN.out.budget_design_cifs)
76+
// Use the combined channel that includes both newly computed and pre-computed Boltzgen results
77+
CONVERT_CIF_TO_PDB(ch_budget_design_cifs)
4978

5079
// Step 2: Parallelize ProteinMPNN - run separately for each budget design
5180
// Use flatMap to create individual tasks per PDB file (one per budget iteration)
@@ -179,7 +208,8 @@ workflow PROTEIN_DESIGN {
179208
}
180209
} else {
181210
// Use Boltzgen outputs directly if ProteinMPNN is disabled
182-
ch_final_designs_for_analysis = BOLTZGEN_RUN.out.results
211+
// Use the combined channel that includes both newly computed and pre-computed results
212+
ch_final_designs_for_analysis = ch_boltzgen_results
183213
}
184214

185215
// ========================================================================
@@ -398,9 +428,9 @@ workflow PROTEIN_DESIGN {
398428
}
399429

400430
emit:
401-
// Boltzgen outputs
402-
boltzgen_results = BOLTZGEN_RUN.out.results
403-
final_designs = BOLTZGEN_RUN.out.final_designs
431+
// Boltzgen outputs (combined from both newly computed and pre-computed sources)
432+
boltzgen_results = ch_boltzgen_results
433+
final_designs = ch_budget_design_cifs
404434

405435
// ProteinMPNN outputs (will be empty if not run)
406436
mpnn_optimized = params.run_proteinmpnn ? PROTEINMPNN_OPTIMIZE.out.optimized_designs : Channel.empty()
