Commit 8a1ace3

Merge pull request #40 from seqeralabs/seqera-ai/20251121-144100-comprehensive-metrics-consolidation
feat: Comprehensive metrics consolidation with all pipeline outputs
2 parents a9dd64a + 2c83840 commit 8a1ace3

2 files changed

+816
-182
lines changed

METRICS_CONSOLIDATION_UPDATE.md

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
1+
# Metrics Consolidation Update
2+
3+
## Overview
4+
5+
Updated the `CONSOLIDATE_METRICS` process and `consolidate_design_metrics.py` script to comprehensively collect and rank all pipeline outputs. The updated consolidation now provides a complete view of design quality across all analysis stages.
6+
7+
## What Was Changed
8+
9+
### Enhanced Metric Collection
10+
11+
The consolidation script now collects metrics from **all pipeline stages**:
12+
13+
#### 1. **Boltzgen Original Design Quality**
14+
- `aggregate_plddt` - Per-residue confidence (0-100)
15+
- `aggregate_ptm` - Predicted TM-score (0-1)
16+
- `aggregate_iptm` - Interface predicted TM-score (0-1)
17+
- `aggregate_pae_interaction` - Interface PAE score
18+
- All fields from `aggregate_metrics_analyze.csv`
19+
- All fields from `per_target_metrics_analyze.csv`
20+
21+
#### 2. **ProteinMPNN Sequence Optimization**
22+
- `mpnn_score` - Negative log probability (lower is better)
23+
- `mpnn_global_score` - Overall sequence likelihood
24+
- `mpnn_seq_recovery` - Fraction of original residues kept (0-1)
25+
- `mpnn_num_sequences` - Number of optimized sequences
26+
- Parsed from `*_scores.fa` FASTA files
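
A minimal sketch of how such a FASTA file could be parsed, assuming the headers follow the usual ProteinMPNN `key=value` layout (for example `>T=0.1, sample=1, score=0.9, global_score=1.1, seq_recovery=0.45`). The function name `parse_mpnn_fasta` is hypothetical and is not taken from `consolidate_design_metrics.py`.

```python
import re
from typing import Dict, List

# Matches numeric key=value pairs such as "score=0.9123" in a FASTA header.
_FIELD_RE = re.compile(r"(\w+)=([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_mpnn_fasta(text: str) -> List[Dict[str, float]]:
    """Return one dict of numeric header fields per sampled sequence.

    Assumption: only sampled (designed) sequences carry a seq_recovery
    field, so the header of the input sequence is skipped.
    """
    records = []
    for line in text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines carry no metrics
        fields = {key: float(value) for key, value in _FIELD_RE.findall(line)}
        if "seq_recovery" in fields:
            records.append(fields)
    return records
```

Under these assumptions, `len(records)` would correspond to `mpnn_num_sequences`; how the script reduces the per-sequence scores to a single `mpnn_score` is not stated here.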
27+
28+
#### 3. **Protenix Refolding Validation**
29+
- `protenix_plddt` - Confidence after refolding (0-100)
30+
- `protenix_ptm` - Predicted TM-score after refolding (0-1)
31+
- `protenix_iptm` - Interface quality after refolding (0-1)
32+
- `protenix_ranking_score` - Overall model ranking
33+
- Parsed from confidence JSON files
34+
35+
#### 4. **IPSAE Interface Quality**
36+
- `ipsae_score` - Interface PAE score (lower is better, <5 excellent)
37+
- Runs on ALL budget designs (before filtering)
38+
39+
#### 5. **PRODIGY Binding Affinity**
40+
- `predicted_binding_affinity` - ΔG in kcal/mol (more negative = stronger)
41+
- `predicted_kd` - Dissociation constant in M
42+
- `buried_surface_area` - Interface size in Ų
43+
- `num_interface_contacts` - Number of residue contacts
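
The two affinity columns are linked by the standard thermodynamic relation ΔG = RT ln(Kd). A minimal sketch of that conversion, assuming a temperature of 25 °C (the helper name `kd_from_delta_g` is hypothetical):

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_delta_g(delta_g_kcal_per_mol: float, temperature_c: float = 25.0) -> float:
    """Convert a binding free energy (kcal/mol) to a dissociation constant (M)."""
    temperature_k = temperature_c + 273.15
    return math.exp(delta_g_kcal_per_mol / (R_KCAL * temperature_k))
```

With this relation, a ΔG of -10 kcal/mol corresponds to a Kd in the tens-of-nanomolar range, and every additional -1.4 kcal/mol or so lowers Kd by about a factor of ten.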
44+
45+
#### 6. **Foldseek Structural Similarity**
46+
- `foldseek_top_hit` - Most similar structure in database
47+
- `foldseek_top_evalue` - Statistical significance
48+
- `foldseek_top_bits` - Alignment score
49+
- `foldseek_num_hits` - Total number of hits
50+
51+
## New Composite Scoring System
52+
53+
The composite score is now a weighted combination of the metrics below; a negative weight marks a metric for which lower values are better:
54+
```python
weights = {
    # Boltzgen structure quality
    'aggregate_plddt': 0.15,
    'aggregate_ptm': 1.0,
    'aggregate_iptm': 1.0,

    # Interface quality
    'ipsae_score': -2.0,  # Lower is better

    # Binding affinity
    'predicted_binding_affinity': -0.5,  # More negative is better
    'buried_surface_area': 0.001,
    'num_interface_contacts': 0.05,

    # ProteinMPNN optimization
    'mpnn_score': -0.5,  # Lower is better
    'mpnn_seq_recovery': 0.5,

    # Protenix refolding validation
    'protenix_plddt': 0.01,
    'protenix_ptm': 0.5,
    'protenix_iptm': 0.5,
}
```
80+
81+
The score is normalized by the number of available metrics, so designs are fairly ranked even if some analyses weren't run.
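
The normalization can be sketched as follows. This is an illustration only: it assumes the score is the weighted sum divided by the count of metrics that were present, which may differ in detail from the actual script. `WEIGHTS` is a subset of the table above and `composite_score` is a hypothetical name.

```python
from typing import Mapping, Optional, Tuple

# Subset of the weights listed above, copied for illustration.
WEIGHTS = {
    "aggregate_ptm": 1.0,
    "aggregate_iptm": 1.0,
    "ipsae_score": -2.0,
    "predicted_binding_affinity": -0.5,
}


def composite_score(
    metrics: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = WEIGHTS,
) -> Tuple[Optional[float], int]:
    """Return (score, metrics_used); score is None if no weighted metric is present."""
    total = 0.0
    used = 0
    for name, weight in weights.items():
        value = metrics.get(name)
        if value is None:
            continue  # analysis not run for this design
        total += weight * float(value)
        used += 1
    if used == 0:
        return None, 0
    return total / used, used
```

The second return value plays the role of the `_metrics_used` column: it makes visible when two scores were computed from different numbers of metrics.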
82+
83+
## Output Structure
84+
85+
### CSV Output (`design_metrics_summary.csv`)
86+
87+
Columns are prioritized for easy analysis:
88+
1. **Identification**: design_id, model_id, rank
89+
2. **Overall Score**: composite_score, _metrics_used
90+
3. **Boltzgen Quality**: aggregate_plddt, aggregate_ptm, aggregate_iptm, etc.
91+
4. **ProteinMPNN**: mpnn_score, mpnn_seq_recovery, etc.
92+
5. **Protenix**: protenix_plddt, protenix_ptm, protenix_iptm
93+
6. **Interface**: ipsae_score
94+
7. **Binding**: predicted_binding_affinity, predicted_kd, buried_surface_area, num_interface_contacts
95+
8. **Similarity**: foldseek_top_hit, foldseek_top_evalue, etc.
96+
9. **Additional**: All other metrics from Boltzgen CSVs
97+
98+
### Markdown Report (`design_metrics_report.md`)
99+
100+
Enhanced report includes:
101+
102+
1. **Summary Statistics** - Distribution of metrics across all designs
103+
2. **Top N Designs Table** - Key metrics at a glance
104+
3. **Interpretation Guide** - Detailed explanation of each metric category:
105+
- Boltzgen quality metrics
106+
- ProteinMPNN optimization
107+
- Protenix refolding validation
108+
- Interface quality (IPSAE)
109+
- Binding affinity (PRODIGY)
110+
- Structural similarity (Foldseek)
111+
4. **Recommendations** - Detailed analysis of top design:
112+
- Quality assessment with thresholds
113+
- Strengths and considerations
114+
- Actionable next steps
115+
116+
## Technical Implementation
117+
118+
### Hierarchical Data Collection
119+
120+
The script now uses a hierarchical structure to organize metrics:
121+
```
all_metrics = {
    'design_id': {
        'boltzgen': {...},  # Base design metrics
        'model_id_1': {...},  # Metrics for specific model
        'model_id_2': {...},
        'protenix_seq1_model1': {...},  # Protenix refolded structures
    }
}
```
132+
133+
This is then flattened for ranking:
134+
```
flattened_metrics = {
    'design_id_model_id_1': {
        # Boltzgen base metrics
        # + Model-specific metrics (IPSAE, PRODIGY, Foldseek)
    },
    'design_id_protenix_seq1_model1': {
        # Boltzgen base metrics
        # + ProteinMPNN metrics
        # + Protenix metrics
        # + IPSAE, PRODIGY, Foldseek (if run)
    }
}
```
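
The flattening step could look roughly like the sketch below. It assumes that every non-`boltzgen` entry inherits the base Boltzgen metrics and that entry-specific values win on key collisions; `flatten_metrics` is a hypothetical name, and the merging of ProteinMPNN metrics into Protenix rows is left out.

```python
from typing import Any, Dict


def flatten_metrics(
    all_metrics: Dict[str, Dict[str, Dict[str, Any]]]
) -> Dict[str, Dict[str, Any]]:
    """Merge base Boltzgen metrics into one flat row per model or refold."""
    flattened: Dict[str, Dict[str, Any]] = {}
    for design_id, entries in all_metrics.items():
        base = entries.get("boltzgen", {})
        for entry_id, entry_metrics in entries.items():
            if entry_id == "boltzgen":
                continue  # base metrics are merged into the other rows
            row = dict(base)           # copy so rows do not share state
            row.update(entry_metrics)  # entry-specific values take precedence
            flattened[f"{design_id}_{entry_id}"] = row
    return flattened
```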
149+
150+
### Path-Based Metric Association
151+
152+
Each metric is associated with its source structure by parsing the file path:
153+
154+
- **Boltzgen designs**: `{design_id}/intermediate_designs_inverse_folded/{model_id}.cif`
155+
- **IPSAE scores**: `{design_id}/ipsae_scores/{model_id}_10_10.txt`
156+
- **PRODIGY**: `{design_id}/prodigy/{model_id}_prodigy_summary.csv`
157+
- **Foldseek**: `{design_id}/foldseek/{model_id}_foldseek_summary.tsv`
158+
- **ProteinMPNN**: `{design_id}_mpnn_optimized/{model_id}_scores.fa`
159+
- **Protenix**: `{design_id}_mpnn_{seq_num}/protenix/{model_id}_confidence.json`
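
As an illustration of this path parsing, the sketch below recovers `design_id` and `model_id` for the three templates of the form `{design_id}/{tool}/{model_id}{suffix}`. The helper name `ids_from_path` and the suffix table are assumptions built from the templates above, not code from the script; the ProteinMPNN and Protenix layouts would need their own rules.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Filename suffixes taken from the path templates above.
_SUFFIXES = {
    "ipsae_scores": "_10_10.txt",
    "prodigy": "_prodigy_summary.csv",
    "foldseek": "_foldseek_summary.tsv",
}


def ids_from_path(path: str) -> Optional[Tuple[str, str, str]]:
    """Return (design_id, tool, model_id), or None if the path fits no template."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        return None
    design_id, tool, filename = parts[-3], parts[-2], parts[-1]
    suffix = _SUFFIXES.get(tool)
    if suffix is None or not filename.endswith(suffix):
        return None
    # Strip the tool-specific suffix to recover the model identifier.
    return design_id, tool, filename[: -len(suffix)]
```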
160+
161+
## Benefits
162+
163+
### For Users
164+
165+
1. **Complete Picture**: All pipeline metrics in one table
166+
2. **Smart Ranking**: Composite score considers all available data
167+
3. **Easy Filtering**: CSV format allows custom sorting/filtering
168+
4. **Clear Guidance**: Markdown report explains what each metric means
169+
170+
### For Pipeline Development
171+
172+
1. **Validates All Tools**: Ensures every analysis contributes to final ranking
173+
2. **Tracks Provenance**: Clear association between structures and metrics
174+
3. **Extensible**: Easy to add new metrics in the future
175+
4. **Debuggable**: Verbose output shows what was found at each step
176+
177+
## Example Workflow
178+
179+
After running the pipeline with all modules enabled:
180+
```bash
nextflow run main.nf \
  --input samplesheet.csv \
  --run_proteinmpnn \
  --run_protenix_refold \
  --run_ipsae \
  --run_prodigy \
  --run_foldseek \
  --run_consolidation
```
191+
192+
You'll get:
193+
194+
1. **`design_metrics_summary.csv`** - Comprehensive table for custom analysis
195+
2. **`design_metrics_report.md`** - Human-readable report with recommendations
196+
197+
The top-ranked designs will be those that:
198+
- Have high Boltzgen quality (pLDDT, pTM, ipTM)
199+
- Show good ProteinMPNN scores (optimized sequences)
200+
- Refold well with Protenix (validates MPNN sequences)
201+
- Have low IPSAE scores (confident interface)
202+
- Show strong predicted binding (PRODIGY ΔG)
203+
- Have large, well-packed interfaces (BSA, contacts)
204+
205+
## Files Modified
206+
207+
- `assets/consolidate_design_metrics.py` - Complete rewrite of metric collection and ranking logic
208+
209+
## Next Steps
210+
211+
To use the updated consolidation:
212+
213+
1. Run the pipeline with `--run_consolidation` enabled
214+
2. Review `design_metrics_report.md` for quick insights
215+
3. Open `design_metrics_summary.csv` for detailed analysis
216+
4. Sort/filter the CSV by specific metrics of interest
217+
5. Examine structures for top-ranked designs
218+
6. Compare Boltzgen vs Protenix structures to validate MPNN sequences
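
For step 4, a small standard-library sketch of sorting the summary by one metric. It assumes only that the CSV has a header row containing the chosen column; `top_designs` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, List


def top_designs(
    csv_text: str, metric: str, n: int = 5, descending: bool = True
) -> List[Dict[str, str]]:
    """Return the n rows with the best value of `metric`, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if row.get(metric) not in (None, "")]
    rows.sort(key=lambda row: float(row[metric]), reverse=descending)
    return rows[:n]
```

To use it on the real output, read the file first, for example `top_designs(open("design_metrics_summary.csv").read(), "composite_score")`, and pass `descending=False` for metrics where lower is better.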
219+
220+
## Notes
221+
222+
- The consolidation runs **after** all analyses complete (triggered by `collect()` on all outputs)
223+
- If a metric is not available (e.g., Protenix not run), designs are still ranked, because the composite score is normalized by the number of metrics that are present
224+
- The `_metrics_used` column shows how many metrics contributed to each score
225+
- All original Boltzgen CSV fields are preserved in the output
