| 1 | +# Metrics Consolidation Update |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The `CONSOLIDATE_METRICS` process and the `consolidate_design_metrics.py` script now collect metrics from every pipeline stage and rank designs on them. The consolidated output gives a single view of design quality across all analysis stages. |
| 6 | + |
| 7 | +## What Was Changed |
| 8 | + |
| 9 | +### Enhanced Metric Collection |
| 10 | + |
| 11 | +The consolidation script now collects metrics from **all pipeline stages**: |
| 12 | + |
| 13 | +#### 1. **Boltzgen Original Design Quality** |
| 14 | +- `aggregate_plddt` - Per-residue confidence (0-100) |
| 15 | +- `aggregate_ptm` - Predicted TM-score (0-1) |
| 16 | +- `aggregate_iptm` - Interface predicted TM-score (0-1) |
| 17 | +- `aggregate_pae_interaction` - Interface PAE score |
| 18 | +- All fields from `aggregate_metrics_analyze.csv` |
| 19 | +- All fields from `per_target_metrics_analyze.csv` |
| 20 | + |
| 21 | +#### 2. **ProteinMPNN Sequence Optimization** |
| 22 | +- `mpnn_score` - Average negative log probability over the designed residues (lower is better) |
| 23 | +- `mpnn_global_score` - Average negative log probability over all residues (lower is better) |
| 24 | +- `mpnn_seq_recovery` - Fraction of original residues kept (0-1) |
| 25 | +- `mpnn_num_sequences` - Number of optimized sequences |
| 26 | +- Parsed from `*_scores.fa` FASTA files |
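
As a rough illustration of the FASTA parsing step, the sketch below pulls the numeric `key=value` fields out of ProteinMPNN-style header lines. The function name `parse_mpnn_headers` is hypothetical, and how the real script aggregates per-sequence values into `mpnn_score` and related columns is not shown in this document, so this only returns one record per sampled sequence.

```python
import re

# Matches numeric key=value pairs such as "score=0.7291" in a FASTA header.
_FIELD = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")


def parse_mpnn_headers(fasta_text):
    """Return one dict of numeric header fields per sampled sequence.

    Assumption: sampled sequences carry a "sample=N" field in the header,
    while the first (native) record does not and is skipped.
    """
    records = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines carry no metrics
        fields = {key: float(value) for key, value in _FIELD.findall(line)}
        if "sample" in fields:
            records.append(fields)
    return records


example = """>design_1, score=1.52, global_score=1.60, seed=37
MKTAYIAK
>T=0.1, sample=1, score=0.7291, global_score=0.9330, seq_recovery=0.5736
MKSAYLAK
"""
records = parse_mpnn_headers(example)
```

Each record keeps the raw header values, so any later aggregation (best score, mean recovery, sequence count) can be computed from the list.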
| 27 | + |
| 28 | +#### 3. **Protenix Refolding Validation** |
| 29 | +- `protenix_plddt` - Confidence after refolding (0-100) |
| 30 | +- `protenix_ptm` - Predicted TM-score after refolding (0-1) |
| 31 | +- `protenix_iptm` - Interface quality after refolding (0-1) |
| 32 | +- `protenix_ranking_score` - Overall model ranking |
| 33 | +- Parsed from confidence JSON files |
| 34 | + |
| 35 | +#### 4. **IPSAE Interface Quality** |
| 36 | +- `ipsae_score` - Interface PAE score (lower is better, <5 excellent) |
| 37 | +- Runs on ALL budget designs (before filtering) |
| 38 | + |
| 39 | +#### 5. **PRODIGY Binding Affinity** |
| 40 | +- `predicted_binding_affinity` - ΔG in kcal/mol (more negative = stronger) |
| 41 | +- `predicted_kd` - Dissociation constant in M |
| 42 | +- `buried_surface_area` - Interface size in Ų |
| 43 | +- `num_interface_contacts` - Number of residue contacts |
| 44 | + |
| 45 | +#### 6. **Foldseek Structural Similarity** |
| 46 | +- `foldseek_top_hit` - Most similar structure in database |
| 47 | +- `foldseek_top_evalue` - Statistical significance |
| 48 | +- `foldseek_top_bits` - Alignment score |
| 49 | +- `foldseek_num_hits` - Total number of hits |
| 50 | + |
| 51 | +## New Composite Scoring System |
| 52 | + |
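| 53 | +The composite score is a weighted sum over the metrics listed below. Negative weights mark metrics where lower values are better; metrics not listed (for example the Foldseek hits) are reported but do not affect the score: |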
| 54 | + |
| 55 | +```python |
| 56 | +weights = { |
| 57 | + # Boltzgen structure quality |
| 58 | + 'aggregate_plddt': 0.15, |
| 59 | + 'aggregate_ptm': 1.0, |
| 60 | + 'aggregate_iptm': 1.0, |
| 61 | + |
| 62 | + # Interface quality |
| 63 | + 'ipsae_score': -2.0, # Lower is better |
| 64 | + |
| 65 | + # Binding affinity |
| 66 | + 'predicted_binding_affinity': -0.5, # More negative is better |
| 67 | + 'buried_surface_area': 0.001, |
| 68 | + 'num_interface_contacts': 0.05, |
| 69 | + |
| 70 | + # ProteinMPNN optimization |
| 71 | + 'mpnn_score': -0.5, # Lower is better |
| 72 | + 'mpnn_seq_recovery': 0.5, |
| 73 | + |
| 74 | + # Protenix refolding validation |
| 75 | + 'protenix_plddt': 0.01, |
| 76 | + 'protenix_ptm': 0.5, |
| 77 | + 'protenix_iptm': 0.5, |
| 78 | +} |
| 79 | +``` |
| 80 | + |
| 81 | +The score is normalized by the number of metrics available for a design (reported in `_metrics_used`), so designs remain comparable even if some analyses weren't run. |
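
The normalization can be sketched as follows. This is a minimal illustration, assuming the score is the weighted sum of the metrics present divided by their count; the exact formula in `consolidate_design_metrics.py` is not shown here, and `composite_score` and the shortened `WEIGHTS` table are illustrative names only.

```python
import math

# Shortened weight table for illustration; values copied from the dict above.
WEIGHTS = {
    "aggregate_ptm": 1.0,
    "aggregate_iptm": 1.0,
    "ipsae_score": -2.0,
    "predicted_binding_affinity": -0.5,
}


def composite_score(metrics, weights=WEIGHTS):
    """Return (score, metrics_used) for one design.

    Missing, non-numeric, or NaN values are skipped, and the weighted sum
    is divided by the number of metrics that actually contributed.
    """
    total = 0.0
    used = 0
    for name, weight in weights.items():
        value = metrics.get(name)
        if value is None:
            continue
        try:
            value = float(value)
        except (TypeError, ValueError):
            continue
        if math.isnan(value):
            continue
        total += weight * value
        used += 1
    if used == 0:
        return None, 0
    return total / used, used


score, used = composite_score({"ipsae_score": 4.0, "predicted_binding_affinity": -10.0})
```

Dividing by the count keeps a design with only two metrics on a scale comparable to one with all of them, although designs scored on different metric subsets are still not strictly like-for-like.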
| 82 | + |
| 83 | +## Output Structure |
| 84 | + |
| 85 | +### CSV Output (`design_metrics_summary.csv`) |
| 86 | + |
| 87 | +Columns are prioritized for easy analysis: |
| 88 | +1. **Identification**: design_id, model_id, rank |
| 89 | +2. **Overall Score**: composite_score, _metrics_used |
| 90 | +3. **Boltzgen Quality**: aggregate_plddt, aggregate_ptm, aggregate_iptm, etc. |
| 91 | +4. **ProteinMPNN**: mpnn_score, mpnn_seq_recovery, etc. |
| 92 | +5. **Protenix**: protenix_plddt, protenix_ptm, protenix_iptm |
| 93 | +6. **Interface**: ipsae_score |
| 94 | +7. **Binding**: predicted_binding_affinity, predicted_kd, buried_surface_area, num_interface_contacts |
| 95 | +8. **Similarity**: foldseek_top_hit, foldseek_top_evalue, etc. |
| 96 | +9. **Additional**: All other metrics from Boltzgen CSVs |
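
One way to produce this ordering is to place a fixed priority list first and append every remaining column afterwards. The sketch below assumes exactly that; `order_columns` and the shortened `PRIORITY` list are illustrative, and the ordering of the "Additional" columns (kept here in their original order) is an assumption.

```python
# Shortened priority list for illustration; the real list would cover all groups above.
PRIORITY = [
    "design_id", "model_id", "rank",
    "composite_score", "_metrics_used",
    "aggregate_plddt", "aggregate_ptm", "aggregate_iptm",
    "mpnn_score", "mpnn_seq_recovery",
    "protenix_plddt", "protenix_ptm", "protenix_iptm",
    "ipsae_score",
]


def order_columns(columns, priority=PRIORITY):
    """Put known columns first in priority order, then all others as found."""
    present = set(columns)
    head = [name for name in priority if name in present]
    seen = set(head)
    tail = [name for name in columns if name not in seen]
    return head + tail


ordered = order_columns(["extra_metric", "ipsae_score", "rank", "design_id"])
```

Columns missing from a run are simply left out, so the same function works whether or not every analysis was enabled.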
| 97 | + |
| 98 | +### Markdown Report (`design_metrics_report.md`) |
| 99 | + |
| 100 | +Enhanced report includes: |
| 101 | + |
| 102 | +1. **Summary Statistics** - Distribution of metrics across all designs |
| 103 | +2. **Top N Designs Table** - Key metrics at a glance |
| 104 | +3. **Interpretation Guide** - Detailed explanation of each metric category: |
| 105 | + - Boltzgen quality metrics |
| 106 | + - ProteinMPNN optimization |
| 107 | + - Protenix refolding validation |
| 108 | + - Interface quality (IPSAE) |
| 109 | + - Binding affinity (PRODIGY) |
| 110 | + - Structural similarity (Foldseek) |
| 111 | +4. **Recommendations** - Detailed analysis of top design: |
| 112 | + - Quality assessment with thresholds |
| 113 | + - Strengths and considerations |
| 114 | + - Actionable next steps |
| 115 | + |
| 116 | +## Technical Implementation |
| 117 | + |
| 118 | +### Hierarchical Data Collection |
| 119 | + |
| 120 | +The script now uses a hierarchical structure to organize metrics: |
| 121 | + |
| 122 | +``` |
| 123 | +all_metrics = { |
| 124 | + 'design_id': { |
| 125 | + 'boltzgen': {...}, # Base design metrics |
| 126 | + 'model_id_1': {...}, # Metrics for specific model |
| 127 | + 'model_id_2': {...}, |
| 128 | + 'protenix_seq1_model1': {...}, # Protenix refolded structures |
| 129 | + } |
| 130 | +} |
| 131 | +``` |
| 132 | + |
| 133 | +This is then flattened for ranking: |
| 134 | + |
| 135 | +``` |
| 136 | +flattened_metrics = { |
| 137 | + 'design_id_model_id_1': { |
| 138 | + # Boltzgen base metrics |
| 139 | + # + Model-specific metrics (IPSAE, PRODIGY, Foldseek) |
| 140 | + }, |
| 141 | + 'design_id_protenix_seq1_model1': { |
| 142 | + # Boltzgen base metrics |
| 143 | + # + ProteinMPNN metrics |
| 144 | + # + Protenix metrics |
| 145 | + # + IPSAE, PRODIGY, Foldseek (if run) |
| 146 | + } |
| 147 | +} |
| 148 | +``` |
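
The flattening step might look roughly like the sketch below. It assumes the `'boltzgen'` entry holds the base metrics and every other key under a design is a model-level entry; `flatten_metrics` is a hypothetical name, and merging ProteinMPNN metrics into Protenix entries is left out because the document does not show where those are stored in the hierarchy.

```python
def flatten_metrics(all_metrics):
    """Merge each design's base metrics into every model-level entry.

    Returns a dict keyed by "<design_id>_<model_id>". Model-level values
    override base values when both define the same key. Designs with no
    model-level entries produce no rows in this sketch.
    """
    flattened = {}
    for design_id, entries in all_metrics.items():
        base = entries.get("boltzgen", {})
        for model_id, model_metrics in entries.items():
            if model_id == "boltzgen":
                continue
            row = {"design_id": design_id, "model_id": model_id}
            row.update(base)
            row.update(model_metrics)
            flattened[f"{design_id}_{model_id}"] = row
    return flattened


all_metrics = {
    "d1": {
        "boltzgen": {"aggregate_ptm": 0.8},
        "m1": {"ipsae_score": 3.0},
        "protenix_seq1_model1": {"protenix_ptm": 0.7},
    }
}
flat = flatten_metrics(all_metrics)
```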
| 149 | + |
| 150 | +### Path-Based Metric Association |
| 151 | + |
| 152 | +Each metric file is associated with its source structure by parsing the design and model identifiers out of its path: |
| 153 | + |
| 154 | +- **Boltzgen designs**: `{design_id}/intermediate_designs_inverse_folded/{model_id}.cif` |
| 155 | +- **IPSAE scores**: `{design_id}/ipsae_scores/{model_id}_10_10.txt` |
| 156 | +- **PRODIGY**: `{design_id}/prodigy/{model_id}_prodigy_summary.csv` |
| 157 | +- **Foldseek**: `{design_id}/foldseek/{model_id}_foldseek_summary.tsv` |
| 158 | +- **ProteinMPNN**: `{design_id}_mpnn_optimized/{model_id}_scores.fa` |
| 159 | +- **Protenix**: `{design_id}_mpnn_{seq_num}/protenix/{model_id}_confidence.json` |
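
For the three per-model layouts that share the `{design_id}/<tool>/{model_id}<suffix>` shape (IPSAE, PRODIGY, Foldseek), the association could be sketched like this. The helper name `parse_metric_path` and the `SUFFIXES` table are illustrative assumptions, and the ProteinMPNN and Protenix layouts would need their own rules.

```python
from pathlib import PurePosixPath

# Directory name -> filename suffix, taken from the path patterns listed above.
SUFFIXES = {
    "ipsae_scores": "_10_10.txt",
    "prodigy": "_prodigy_summary.csv",
    "foldseek": "_foldseek_summary.tsv",
}


def parse_metric_path(path):
    """Return (design_id, tool, model_id), or None if the path does not match."""
    p = PurePosixPath(path)
    tool = p.parent.name
    suffix = SUFFIXES.get(tool)
    if suffix is None or not p.name.endswith(suffix):
        return None
    design_id = p.parent.parent.name
    model_id = p.name[: -len(suffix)]
    return design_id, tool, model_id


parsed = parse_metric_path("results/designA/ipsae_scores/designA_model_0_10_10.txt")
```

Stripping a known suffix rather than splitting on underscores keeps model identifiers that themselves contain underscores intact.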
| 160 | + |
| 161 | +## Benefits |
| 162 | + |
| 163 | +### For Users |
| 164 | + |
| 165 | +1. **Complete Picture**: All pipeline metrics in one table |
| 166 | +2. **Smart Ranking**: Composite score considers all available data |
| 167 | +3. **Easy Filtering**: CSV format allows custom sorting/filtering |
| 168 | +4. **Clear Guidance**: Markdown report explains what each metric means |
| 169 | + |
| 170 | +### For Pipeline Development |
| 171 | + |
| 172 | +1. **Validates All Tools**: Ensures every analysis feeds into the consolidated table, and most into the final ranking |
| 173 | +2. **Tracks Provenance**: Clear association between structures and metrics |
| 174 | +3. **Extensible**: Easy to add new metrics in the future |
| 175 | +4. **Debuggable**: Verbose output shows what was found at each step |
| 176 | + |
| 177 | +## Example Workflow |
| 178 | + |
| 179 | +After running the pipeline with all modules enabled: |
| 180 | + |
| 181 | +```bash |
| 182 | +nextflow run main.nf \ |
| 183 | + --input samplesheet.csv \ |
| 184 | + --run_proteinmpnn \ |
| 185 | + --run_protenix_refold \ |
| 186 | + --run_ipsae \ |
| 187 | + --run_prodigy \ |
| 188 | + --run_foldseek \ |
| 189 | + --run_consolidation |
| 190 | +``` |
| 191 | + |
| 192 | +You'll get: |
| 193 | + |
| 194 | +1. **`design_metrics_summary.csv`** - Comprehensive table for custom analysis |
| 195 | +2. **`design_metrics_report.md`** - Human-readable report with recommendations |
| 196 | + |
| 197 | +The top-ranked designs will be those that: |
| 198 | +- Have high Boltzgen quality (pLDDT, pTM, ipTM) |
| 199 | +- Show good ProteinMPNN scores (optimized sequences) |
| 200 | +- Refold well with Protenix (validates MPNN sequences) |
| 201 | +- Have low IPSAE scores (confident interface) |
| 202 | +- Show strong predicted binding (PRODIGY ΔG) |
| 203 | +- Have large, well-packed interfaces (BSA, contacts) |
| 204 | + |
| 205 | +## Files Modified |
| 206 | + |
| 207 | +- `assets/consolidate_design_metrics.py` - Complete rewrite of metric collection and ranking logic |
| 208 | + |
| 209 | +## Next Steps |
| 210 | + |
| 211 | +To use the updated consolidation: |
| 212 | + |
| 213 | +1. Run the pipeline with `--run_consolidation` enabled |
| 214 | +2. Review `design_metrics_report.md` for quick insights |
| 215 | +3. Open `design_metrics_summary.csv` for detailed analysis |
| 216 | +4. Sort/filter the CSV by specific metrics of interest |
| 217 | +5. Examine structures for top-ranked designs |
| 218 | +6. Compare Boltzgen vs Protenix structures to validate MPNN sequences |
| 219 | + |
| 220 | +## Notes |
| 221 | + |
| 222 | +- The consolidation runs **after** all analyses complete (triggered by `collect()` on all outputs) |
| 223 | +- If a metric is not available (e.g., Protenix not run), designs are still ranked, because the score is normalized by the number of metrics present |
| 224 | +- The `_metrics_used` column shows how many metrics contributed to each score |
| 225 | +- All original Boltzgen CSV fields are preserved in the output |