Adaptive ensemble solver with LOO-CV validated selection #4
New experiment: experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/

This solver addresses Issue #1 by combining three refinement paths (candidate-conditioned, family-switching, enhanced joint pose-marginalized) with leave-one-trial-out cross-validated strategy selection.

Key design decisions:
- Runs all three refinement paths per observation rather than hard routing
- LOO-CV selects the best combination strategy per condition
- Every threshold or routing parameter is calibrated on training folds only
- Out-of-sample results are reported alongside in-sample results for comparison

Enhanced joint solver improvements:
- Wider temperature anneal: [2.0, 1.0, 0.5]
- Final random perturbation stage (8 Gaussian samples)
- Solver context cache to avoid redundant forward evaluations

Results:
- OOS LOO-CV overall mean alpha error: 0.1325
- Same-trial support-gated baseline: 0.1192
- Oracle best-of-three: 0.1056
- Beats the issue's reference baseline (0.1714) and the joint solver (0.1835)

LOO-CV confirms:
- The conditioned path is robustly best for sparse_partial_high_noise
- score_competitive works for sparse_full_noisy
- The enhanced joint solver does not improve standalone over the existing paths
- Real complementarity exists (oracle 0.1056), but the current selectors cannot capture it out-of-sample with the available trial count

BGP assessment: strengthens the BGP read by providing out-of-sample validation that the bottleneck is a policy-selection problem, not a theory failure.

Deliverables: run.py, README.md, CSV/JSON/figure artifacts
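The leave-one-trial-out selection loop described above can be sketched roughly as follows. This is a minimal illustration with hypothetical data structures (`rows` as dicts of per-strategy alpha errors), not the actual `run.py` implementation:

```python
import statistics

def loocv_select(rows, strategies):
    """Leave-one-trial-out: for each held-out trial, pick the strategy
    with the lowest mean alpha error on the remaining (training) trials,
    then record that strategy's error on the held-out trial."""
    held_out_errors = []
    for i, test_row in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        # Calibrate the choice on the training fold only.
        winner = min(
            strategies,
            key=lambda s: statistics.mean(r[s] for r in train),
        )
        held_out_errors.append(test_row[winner])
    return statistics.mean(held_out_errors)

rows = [
    {"conditioned": 0.12, "family": 0.20},
    {"conditioned": 0.15, "family": 0.10},
    {"conditioned": 0.11, "family": 0.25},
]
result = loocv_select(rows, ["conditioned", "family"])  # conditioned wins every fold here
```

The key property is that the selected strategy never sees the trial it is evaluated on, which is what makes the reported mean an out-of-sample estimate.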
Pull request overview
Adds a new experiment that ensembles three existing refinement paths and uses leave-one-trial-out cross-validation (LOO-CV) to select a strategy per condition, aiming to provide out-of-sample validation for the anisotropic inverse bottleneck.
Changes:
- Introduces `adaptive-ensemble-solver/run.py` to run conditioned, family-switch, and enhanced joint refinement paths and evaluate multiple selection policies under LOO-CV.
- Adds an experiment README describing methodology, results, and interpretation.
- Commits trial-level and aggregated CSV/JSON outputs plus overview/cell-level figures.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/run.py | Implements the ensemble evaluation, enhanced joint solver variant, and LOO-CV strategy selection/aggregation/plots. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/README.md | Documents the experiment design, LOO-CV procedure, and summarizes results. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_trials.csv | Trial-level artifact produced by the new experiment. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_summary.json | Machine-readable summary artifact (overall + per-condition + per-cell). |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_summary.csv | Per-condition summary artifact. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_cells.csv | Per-cell summary artifact. |
```python
scores = [
    (r.conditioned_score, r.conditioned_alpha_error),
    (r.family_score, r.family_alpha_error),
    (r.joint_score, r.joint_alpha_error),
]
scores.sort(key=lambda x: x[0])
# Use best scoring, but penalize family in sparse_partial
if r.condition == "sparse_partial_high_noise" and scores[0][0] > r.conditioned_score + threshold:
    margin_errors.append(r.conditioned_alpha_error)
else:
    margin_errors.append(scores[0][1])
```
The margin_gated strategy implementation doesn’t match the comment (“If family_score is much lower than conditioned_score, use family; else conditioned”). The current check uses scores[0][0] > r.conditioned_score + threshold where scores[0][0] is the minimum score across all paths, so it doesn’t actually test the conditioned-vs-family margin and can flip behavior in unintuitive ways (especially for negative thresholds). Recommend rewriting this gating to explicitly compare family_score vs conditioned_score (and optionally joint) using a well-defined margin rule, and apply the same rule in the held-out branch.
Suggested change (replacing the block above):

```python
# First choose between conditioned and family based on a margin rule,
# then optionally override with joint if it scores even better.
if r.condition == "sparse_partial_high_noise":
    # Apply margin gating: require family to beat conditioned by more than `threshold`
    if r.family_score + threshold < r.conditioned_score:
        best_score = r.family_score
        best_error = r.family_alpha_error
    else:
        best_score = r.conditioned_score
        best_error = r.conditioned_alpha_error
else:
    # Outside sparse_partial_high_noise, just pick the better of conditioned vs family
    if r.family_score < r.conditioned_score:
        best_score = r.family_score
        best_error = r.family_alpha_error
    else:
        best_score = r.conditioned_score
        best_error = r.conditioned_alpha_error
# Unless joint is even better by score, in which case use joint
if r.joint_score < best_score:
    best_score = r.joint_score
    best_error = r.joint_alpha_error
margin_errors.append(best_error)
```
```python
def summarize_by_cell(rows: list[TrialRow]) -> list[dict[str, float | str]]:
    summary: list[dict[str, float | str]] = []
    for condition in FOCUS_CONDITIONS:
        for skew_bin in GEOMETRY_SKEW_BIN_LABELS:
            subset = [row for row in rows if row.condition == condition and row.geometry_skew_bin == skew_bin]
            if not subset:
                continue

            def mean(attr: str) -> float:
                return float(np.mean([getattr(row, attr) for row in subset]))

            oos_errors, oos_mean, oos_strategy = loocv_select(subset)

            summary.append(
                {
                    "condition": condition,
                    "alpha_strength_bin": FOCUS_ALPHA_BIN,
                    "geometry_skew_bin": skew_bin,
                    "count": len(subset),
                    "support_gated_alpha_error_mean": mean("support_gated_alpha_error"),
                    "conditioned_alpha_error_mean": mean("conditioned_alpha_error"),
                    "family_alpha_error_mean": mean("family_alpha_error"),
                    "joint_alpha_error_mean": mean("joint_alpha_error"),
                    "score_competitive_alpha_error_mean": mean("score_competitive_alpha_error"),
                    "oracle_three_alpha_error_mean": mean("oracle_three_alpha_error"),
                    "oracle_pose_alpha_error_mean": mean("oracle_pose_alpha_error"),
                    "oos_loocv_alpha_error_mean": oos_mean,
                    "oos_loocv_strategy": oos_strategy,
                    "joint_pose_entropy_mean": mean("joint_pose_entropy"),
                }
            )
    return summary
```
Per the PR description, strategy selection is “per condition”, but summarize_by_cell calls loocv_select on a per-cell subset. That recalibrates the policy using only trials from that single cell, which changes the validation procedure and can produce misleading per-cell OOS values (especially with only 3 trials). If you want per-cell OOS breakdown for a per-condition LOO policy, compute the per-trial held-out predictions once per condition (with training including all other trials in that condition) and then aggregate those held-out errors by geometry_skew_bin.
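The suggested restructuring (held-out predictions computed once per condition, then grouped by cell for reporting) could be sketched as follows. Field names and the dict-based row structure are hypothetical, chosen only to make the shape of the computation concrete:

```python
from collections import defaultdict
from statistics import mean

def per_cell_oos(rows, strategies):
    """Run LOO-CV once over all trials in a condition; only the
    reporting of held-out errors is broken down by skew cell."""
    held_out = []  # (skew_bin, held-out error) pairs
    for i, test in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        # Policy calibrated on every other trial in the condition,
        # not just the trials sharing this trial's cell.
        winner = min(strategies, key=lambda s: mean(r[s] for r in train))
        held_out.append((test["skew_bin"], test[winner]))
    by_cell = defaultdict(list)
    for skew_bin, err in held_out:
        by_cell[skew_bin].append(err)
    return {cell: mean(errs) for cell, errs in by_cell.items()}

rows = [
    {"skew_bin": "low_skew", "conditioned": 0.1, "family": 0.3},
    {"skew_bin": "low_skew", "conditioned": 0.2, "family": 0.1},
    {"skew_bin": "high_skew", "conditioned": 0.3, "family": 0.2},
]
per_cell = per_cell_oos(rows, ["conditioned", "family"])
```

This keeps the validation unit (the condition) fixed while still producing per-cell numbers, avoiding the 3-trial recalibration problem the comment describes.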
```python
else:
    oos_errors.append(test_row.support_gated_alpha_error)

most_common = max(strategy_wins, key=lambda k: strategy_wins[k]) if strategy_wins else "support_gated"
```
oos_loocv_strategy is reported as the most common fold winner, but with small n (e.g., per-cell n=3) ties are likely and max(...) will pick an arbitrary first-inserted key. This makes the reported strategy label potentially misleading vs the actual fold-by-fold selections. Consider returning/reporting the full fold win counts (or the per-trial chosen strategy) instead of a single label.
Suggested change:

```python
if strategy_wins:
    # Compute the maximum win count and collect all strategies that achieve it.
    max_wins = max(strategy_wins.values())
    best_strategies = sorted([name for name, count in strategy_wins.items() if count == max_wins])
    if len(best_strategies) == 1:
        # Single clear winner.
        most_common = best_strategies[0]
    else:
        # Explicitly report a tie rather than arbitrarily picking one strategy.
        most_common = "tie:" + ",".join(best_strategies)
else:
    # Fall back to a sensible default if no strategies were recorded.
    most_common = "support_gated"
```
## Per-cell out-of-sample breakdown

| Condition | Skew | OOS LOO-CV | Support-gated | Strategy |
|-----------|------|------------|---------------|----------|
| sparse_full_noisy | low_skew | 0.2068 | 0.1271 | joint |
| sparse_full_noisy | mid_skew | 0.2982 | 0.2210 | family |
| sparse_full_noisy | high_skew | 0.0202 | 0.0178 | joint |
| sparse_partial_high_noise | low_skew | 0.1761 | 0.1761 | conditioned |
| sparse_partial_high_noise | mid_skew | 0.1557 | 0.1557 | conditioned |
| sparse_partial_high_noise | high_skew | 0.2010 | 0.0174 | conditioned |
The per-cell “OOS LOO-CV” table is currently based on running LOO-CV separately within each cell (3 trials), which is not the same as a per-condition LOO policy and can yield unstable / misleading cell-level OOS numbers and strategy labels. If the intended validation is per-condition LOO-CV, it would be clearer to compute held-out predictions per condition once, then break those held-out errors down by cell for reporting.
```python
evaluate_params, family_switching_refine = load_symbols(
    "run_family_switching_refinement_experiment",
    ROOT / "experiments/pose-anisotropy-interventions/family-switching-refinement/run.py",
    "evaluate_params",
    "family_switching_refine",
)
```
evaluate_params is loaded here but never referenced in this script. Since load_symbols executes the source module, dropping unused symbol loads will reduce startup overhead and make dependencies clearer.
Suggested change:

```python
family_switching_refine, = load_symbols(
    "run_family_switching_refinement_experiment",
    ROOT / "experiments/pose-anisotropy-interventions/family-switching-refinement/run.py",
    "family_switching_refine",
)
```
```python
OBSERVATION_REGIMES, SIGNATURE_ANGLE_COUNT, write_csv = load_symbols(
    "run_weighted_multisource_inverse_experiment",
    ROOT / "experiments/multisource-control-objects/weighted-multisource-inverse/run.py",
    "OBSERVATION_REGIMES",
    "SIGNATURE_ANGLE_COUNT",
```
SIGNATURE_ANGLE_COUNT is loaded here but never used. Since load_symbols executes the referenced module, consider removing unused symbol loads to reduce startup time and avoid hidden side effects.
Suggested change:

```python
OBSERVATION_REGIMES, write_csv = load_symbols(
    "run_weighted_multisource_inverse_experiment",
    ROOT / "experiments/multisource-control-objects/weighted-multisource-inverse/run.py",
    "OBSERVATION_REGIMES",
```
```python
import json
import math
import os
from dataclasses import dataclass, field
```
field is imported from dataclasses but never used. Dropping unused imports helps keep this experiment script easier to audit and avoids confusing readers about intended dataclass behavior.
Suggested change:

```python
from dataclasses import dataclass
```
```python
state, _ = improve_over_candidates(context, final_temperature, state, shotgun_candidates)

final_params = state_to_params(state)
final_score, final_signature, final_shift, final_entropy = context.score_params(final_params, base_temperature)
```
In the shotgun stage you select improvements using final_temperature, but then compute/return final_score using base_temperature. This means the perturbation acceptance criterion can optimize a different objective than the score used to choose the best seed/run (and the score you later compare across seeds). Consider evaluating shotgun candidates with the same temperature you use for the returned score (or return the score at final_temperature consistently) to keep selection consistent.
Suggested change:

```python
final_score, final_signature, final_shift, final_entropy = context.score_params(final_params, final_temperature)
```
```
- more anneal stages (5 vs 3) with a wider temperature sweep
- larger initial search radii (1.5x)
```
The module docstring claims the enhanced joint solver adds “more anneal stages (5 vs 3)” and “larger initial search radii (1.5x)”, but the code uses 3 anneal factors and radii increases of ~1.2–1.25× (vs joint-pose-marginalized-solver). Please update the docstring to match the actual parameters so readers aren’t misled about what changed.
Suggested change:

```
- a slightly wider 3-stage anneal schedule (broader temperature sweep than the baseline)
- moderately larger initial search radii (~1.2–1.25× vs the joint pose-marginalized solver)
```
```python
ALPHA_STRENGTH_BIN_LABELS, GEOMETRY_SKEW_BIN_LABELS, candidate_conditioned_search, evaluate_candidate_alpha, sample_conditioned_parameters, top_k_indices = load_symbols(
    "run_candidate_conditioned_alignment_experiment",
    ROOT / "experiments/pose-anisotropy-interventions/candidate-conditioned-alignment/run.py",
    "ALPHA_STRENGTH_BIN_LABELS",
    "GEOMETRY_SKEW_BIN_LABELS",
    "candidate_conditioned_search",
    "evaluate_candidate_alpha",
```
This load_symbols call imports several names that are never used in this script (e.g., ALPHA_STRENGTH_BIN_LABELS, evaluate_candidate_alpha). Because load_symbols executes the target module, pulling unused symbols adds avoidable import/runtime overhead. Consider only requesting/assigning the symbols that are actually used here.
Suggested change:

```python
GEOMETRY_SKEW_BIN_LABELS, candidate_conditioned_search, sample_conditioned_parameters, top_k_indices = load_symbols(
    "run_candidate_conditioned_alignment_experiment",
    ROOT / "experiments/pose-anisotropy-interventions/candidate-conditioned-alignment/run.py",
    "GEOMETRY_SKEW_BIN_LABELS",
    "candidate_conditioned_search",
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f580b76b6
```python
else:
    oos_errors.append(test_row.support_gated_alpha_error)

most_common = max(strategy_wins, key=lambda k: strategy_wins[k]) if strategy_wins else "support_gated"
```
Represent tied LOO winners as mixed strategy
oos_loocv_strategy is derived with max(strategy_wins, ...), which returns the first inserted key when multiple strategies have the same win count. With small fold counts (especially the 3-trial cell summaries), ties are common, so this can report a single strategy like conditioned even when fold winners are split across different strategies. That makes the exported strategy label and downstream interpretation inaccurate; tied winners should be surfaced explicitly (e.g., mixed or a list of tied strategies).
Summary
Resolves #1 by combining three existing refinement paths with leave-one-trial-out cross-validated (LOO-CV) strategy selection to address the anisotropic inverse bottleneck.
What Changed
Instead of hard-coded routing or a single from-scratch solver, this experiment:
- Runs all three refinement paths per observation
- Uses LOO-CV to select the best strategy per condition, with every threshold calibrated only on training folds
- Reports both in-sample and out-of-sample metrics
Out-of-Sample Results
Key Findings
LOO-CV confirms conditioned is robustly best for sparse_partial_high_noise - This provides out-of-sample validation that the support-gated baseline's design was correct, not just an in-sample fit.
Enhanced joint solver is not a standalone improvement - Even with wider anneal and shotgun perturbation, the joint solver typically produces worse alpha errors than conditioned or family-switch paths.
Real complementarity exists but is hard to exploit - Oracle best-of-three achieves 0.1056, but LOO-CV cannot reliably capture this with the current trial count.
Remaining error attribution:
BGP Assessment
Strengthens the BGP read. Provides the first out-of-sample validation that:
Deliverables
- experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/run.py
- experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/README.md

Acceptance Criteria Check