
Adaptive ensemble solver with LOO-CV validated selection #4

Open
zfifteen wants to merge 1 commit into main from
solver/adaptive-ensemble-loocv

Conversation

@zfifteen
Owner

Summary

Resolves #1 by combining three existing refinement paths with leave-one-trial-out cross-validated (LOO-CV) strategy selection to address the anisotropic inverse bottleneck.

What Changed

Instead of hard-coded routing or a single from-scratch solver, this experiment:

  1. Runs all three refinement paths per observation:

    • Fixed-family candidate-conditioned shift/alpha search
    • Geometry-plus-alpha family switching
    • Enhanced joint pose-marginalized local search (with wider anneal schedule and random perturbation stage)
  2. Uses LOO-CV to select the best strategy per condition, with every threshold calibrated only on training folds

  3. Reports both in-sample and out-of-sample metrics
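
The per-condition LOO-CV selection in step 2 can be sketched as follows. This is a minimal illustration under assumed names, not the experiment's actual code: the strategy labels and per-trial error fields are hypothetical stand-ins.

```python
# Sketch of leave-one-trial-out strategy selection within one condition.
# For each held-out trial, the strategy with the lowest mean error on the
# remaining (training) trials is chosen, then scored on the held-out trial.
from statistics import mean

def loocv_select(trials: list[dict[str, float]], strategies: list[str]) -> tuple[float, list[str]]:
    held_out_errors, chosen = [], []
    for i, test in enumerate(trials):
        train = trials[:i] + trials[i + 1:]
        # Calibrate the choice on training folds only.
        best = min(strategies, key=lambda s: mean(t[s] for t in train))
        held_out_errors.append(test[best])
        chosen.append(best)
    return mean(held_out_errors), chosen

# Toy trials: per-strategy alpha errors (illustrative numbers only).
trials = [
    {"conditioned": 0.12, "family": 0.20, "joint": 0.30},
    {"conditioned": 0.11, "family": 0.25, "joint": 0.28},
    {"conditioned": 0.13, "family": 0.22, "joint": 0.35},
]
oos_mean, picks = loocv_select(trials, ["conditioned", "family", "joint"])
```

The out-of-sample mean is the average of the held-out errors, so every number it aggregates was produced by a strategy chosen without seeing that trial.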

Out-of-Sample Results

| Condition | OOS LOO-CV | Same-trial support-gated | Issue ref baseline |
|-----------|-----------|--------------------------|--------------------|
| sparse_full_noisy | 0.1486 | 0.1220 | 0.1233 |
| sparse_partial_high_noise | 0.1164 | 0.1164 | 0.2195 |
| Overall | 0.1325 | 0.1192 | 0.1714 |
  • Beats issue reference baseline (0.1714) by 22.7%
  • Beats issue reference joint solver (0.1835) by 27.8%
  • Does not beat same-trial support-gated (0.1192), as expected with LOO-CV holdout penalty

Key Findings

  1. LOO-CV confirms the conditioned path is robustly best for sparse_partial_high_noise - This provides out-of-sample validation that the support-gated baseline's design was correct, not just an in-sample fit.

  2. Enhanced joint solver is not a standalone improvement - Even with wider anneal and shotgun perturbation, the joint solver typically produces worse alpha errors than conditioned or family-switch paths.

  3. Real complementarity exists but is hard to exploit - Oracle best-of-three achieves 0.1056, but LOO-CV cannot reliably capture this with the current trial count.

  4. Remaining error attribution:

    • sparse_full, mid_skew: pose uncertainty dominates (high entropy ~0.55)
    • sparse_partial: support-driven, not pose-driven
    • The bottleneck is narrowed to a policy-selection problem in sparse_full

BGP Assessment

Strengthens the BGP read. Provides the first out-of-sample validation that:

  • The bottleneck is a solver-design / policy-selection problem, not a theory failure
  • The conditioned path is robustly optimal for partial-support conditions
  • Complementary signal exists but requires more trials to exploit reliably

Deliverables

  • experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/run.py
  • experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/README.md
  • Output CSV/JSON artifacts
  • Overview and cell-level figures

Acceptance Criteria Check

  • Minimum: beat joint solver overall mean (0.1835) -- OOS 0.1325 < 0.1835
  • Primary: beat support-aware baseline overall mean (0.1714) -- OOS 0.1325 < 0.1714
  • Stretch: close gap toward oracle best-of-two (0.1281) -- OOS 0.1325 is close but does not fully reach oracle level
  • Out-of-sample validation via LOO-CV
  • Per-condition and per-cell alpha error reporting
  • Error attribution (pose uncertainty vs support limitation)
  • Plain-language BGP assessment

New experiment: experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/

This solver addresses Issue #1 by combining three refinement paths
(candidate-conditioned, family-switching, enhanced joint pose-marginalized)
with leave-one-trial-out cross-validated strategy selection.

Key design decisions:
- Runs all three refinement paths per observation rather than hard routing
- LOO-CV selects the best combination strategy per condition
- Every threshold or routing parameter is calibrated on training folds only
- Out-of-sample results are reported alongside in-sample for comparison

Enhanced joint solver improvements:
- Wider temperature anneal: [2.0, 1.0, 0.5]
- Final random perturbation stage (8 Gaussian samples)
- Solver context cache to avoid redundant forward evaluations
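
The anneal-plus-perturbation idea above can be illustrated roughly as follows. This is a hedged sketch with a toy objective and made-up parameter names, not the actual solver:

```python
import random

def anneal_with_shotgun(score, x0, temps=(2.0, 1.0, 0.5), n_shotgun=8, sigma=0.1, steps=20, seed=0):
    """Local random search at each temperature in a widening-to-narrowing
    schedule, followed by a final stage of Gaussian 'shotgun' perturbations
    around the incumbent best point."""
    rng = random.Random(seed)
    best_x, best_s = list(x0), score(x0)
    for t in temps:
        for _ in range(steps):
            # Proposal radius scales with the current temperature.
            cand = [xi + rng.gauss(0.0, t * sigma) for xi in best_x]
            s = score(cand)
            if s < best_s:
                best_x, best_s = cand, s
    # Final random perturbation stage: a handful of Gaussian samples.
    for _ in range(n_shotgun):
        cand = [xi + rng.gauss(0.0, sigma) for xi in best_x]
        s = score(cand)
        if s < best_s:
            best_x, best_s = cand, s
    return best_x, best_s

# Toy quadratic objective with minimum at (1, -2).
obj = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
x, s = anneal_with_shotgun(obj, [0.0, 0.0])
```

Because candidates are only accepted on improvement, the returned score is never worse than the starting point; the caching mentioned above would sit inside `score` in the real solver.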

Results:
- OOS LOO-CV overall mean alpha error: 0.1325
- Same-trial support-gated baseline: 0.1192
- Oracle best-of-three: 0.1056
- Beats issue reference baseline (0.1714) and joint solver (0.1835)

LOO-CV confirms:
- conditioned path is robustly best for sparse_partial_high_noise
- score_competitive works for sparse_full_noisy
- enhanced joint does not improve standalone over existing paths
- real complementarity exists (oracle 0.1056) but current selectors
  cannot capture it out-of-sample with available trial count

BGP assessment: strengthens the BGP read by providing out-of-sample
validation that the bottleneck is a policy-selection problem, not a
theory failure.

Deliverables: run.py, README.md, CSV/JSON/figure artifacts
Copilot AI review requested due to automatic review settings March 25, 2026 01:42

Copilot AI left a comment


Pull request overview

Adds a new experiment that ensembles three existing refinement paths and uses leave-one-trial-out cross-validation (LOO-CV) to select a strategy per condition, aiming to provide out-of-sample validation for the anisotropic inverse bottleneck.

Changes:

  • Introduces adaptive-ensemble-solver/run.py to run conditioned, family-switch, and enhanced joint refinement paths and evaluate multiple selection policies under LOO-CV.
  • Adds an experiment README describing methodology, results, and interpretation.
  • Commits trial-level and aggregated CSV/JSON outputs plus overview/cell-level figures.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 11 comments.

| File | Description |
|------|-------------|
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/run.py | Implements the ensemble evaluation, enhanced joint solver variant, and LOO-CV strategy selection/aggregation/plots. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/README.md | Documents the experiment design, LOO-CV procedure, and summarizes results. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_trials.csv | Trial-level artifact produced by the new experiment. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_summary.json | Machine-readable summary artifact (overall + per-condition + per-cell). |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_summary.csv | Per-condition summary artifact. |
| experiments/pose-anisotropy-interventions/adaptive-ensemble-solver/outputs/adaptive_ensemble_solver_cells.csv | Per-cell summary artifact. |


Comment on lines +558 to +568
```python
scores = [
    (r.conditioned_score, r.conditioned_alpha_error),
    (r.family_score, r.family_alpha_error),
    (r.joint_score, r.joint_alpha_error),
]
scores.sort(key=lambda x: x[0])
# Use best scoring, but penalize family in sparse_partial
if r.condition == "sparse_partial_high_noise" and scores[0][0] > r.conditioned_score + threshold:
    margin_errors.append(r.conditioned_alpha_error)
else:
    margin_errors.append(scores[0][1])
```

Copilot AI Mar 25, 2026


The margin_gated strategy implementation doesn’t match the comment (“If family_score is much lower than conditioned_score, use family; else conditioned”). The current check uses scores[0][0] > r.conditioned_score + threshold where scores[0][0] is the minimum score across all paths, so it doesn’t actually test the conditioned-vs-family margin and can flip behavior in unintuitive ways (especially for negative thresholds). Recommend rewriting this gating to explicitly compare family_score vs conditioned_score (and optionally joint) using a well-defined margin rule, and apply the same rule in the held-out branch.

Suggested change

```diff
-scores = [
-    (r.conditioned_score, r.conditioned_alpha_error),
-    (r.family_score, r.family_alpha_error),
-    (r.joint_score, r.joint_alpha_error),
-]
-scores.sort(key=lambda x: x[0])
-# Use best scoring, but penalize family in sparse_partial
-if r.condition == "sparse_partial_high_noise" and scores[0][0] > r.conditioned_score + threshold:
-    margin_errors.append(r.conditioned_alpha_error)
-else:
-    margin_errors.append(scores[0][1])
+# First choose between conditioned and family based on a margin rule,
+# then optionally override with joint if it scores even better.
+if r.condition == "sparse_partial_high_noise":
+    # Apply margin gating: require family to beat conditioned by more than `threshold`
+    if r.family_score + threshold < r.conditioned_score:
+        best_score = r.family_score
+        best_error = r.family_alpha_error
+    else:
+        best_score = r.conditioned_score
+        best_error = r.conditioned_alpha_error
+else:
+    # Outside sparse_partial_high_noise, just pick the better of conditioned vs family
+    if r.family_score < r.conditioned_score:
+        best_score = r.family_score
+        best_error = r.family_alpha_error
+    else:
+        best_score = r.conditioned_score
+        best_error = r.conditioned_alpha_error
+# Unless joint is even better by score, in which case use joint
+if r.joint_score < best_score:
+    best_score = r.joint_score
+    best_error = r.joint_alpha_error
+margin_errors.append(best_error)
```

Comment on lines +652 to +683
```python
def summarize_by_cell(rows: list[TrialRow]) -> list[dict[str, float | str]]:
    summary: list[dict[str, float | str]] = []
    for condition in FOCUS_CONDITIONS:
        for skew_bin in GEOMETRY_SKEW_BIN_LABELS:
            subset = [row for row in rows if row.condition == condition and row.geometry_skew_bin == skew_bin]
            if not subset:
                continue

            def mean(attr: str) -> float:
                return float(np.mean([getattr(row, attr) for row in subset]))

            oos_errors, oos_mean, oos_strategy = loocv_select(subset)

            summary.append(
                {
                    "condition": condition,
                    "alpha_strength_bin": FOCUS_ALPHA_BIN,
                    "geometry_skew_bin": skew_bin,
                    "count": len(subset),
                    "support_gated_alpha_error_mean": mean("support_gated_alpha_error"),
                    "conditioned_alpha_error_mean": mean("conditioned_alpha_error"),
                    "family_alpha_error_mean": mean("family_alpha_error"),
                    "joint_alpha_error_mean": mean("joint_alpha_error"),
                    "score_competitive_alpha_error_mean": mean("score_competitive_alpha_error"),
                    "oracle_three_alpha_error_mean": mean("oracle_three_alpha_error"),
                    "oracle_pose_alpha_error_mean": mean("oracle_pose_alpha_error"),
                    "oos_loocv_alpha_error_mean": oos_mean,
                    "oos_loocv_strategy": oos_strategy,
                    "joint_pose_entropy_mean": mean("joint_pose_entropy"),
                }
            )
    return summary
```

Copilot AI Mar 25, 2026


Per the PR description, strategy selection is “per condition”, but summarize_by_cell calls loocv_select on a per-cell subset. That recalibrates the policy using only trials from that single cell, which changes the validation procedure and can produce misleading per-cell OOS values (especially with only 3 trials). If you want per-cell OOS breakdown for a per-condition LOO policy, compute the per-trial held-out predictions once per condition (with training including all other trials in that condition) and then aggregate those held-out errors by geometry_skew_bin.
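
One way to implement the reviewer's suggestion is sketched below. This is a hypothetical illustration: the row fields and the parallel list of held-out errors (one per trial, computed once per condition by the LOO procedure) are assumptions, not the PR's actual code.

```python
# Aggregate per-trial held-out errors (computed once for the whole condition)
# by geometry_skew_bin, rather than re-running LOO-CV inside each 3-trial cell.
from collections import defaultdict
from statistics import mean

def per_cell_oos_breakdown(rows: list[dict], held_out_errors: list[float]) -> dict[str, float]:
    by_bin: dict[str, list[float]] = defaultdict(list)
    for row, err in zip(rows, held_out_errors):
        by_bin[row["geometry_skew_bin"]].append(err)
    return {skew: mean(errs) for skew, errs in by_bin.items()}

# Toy data: three trials from one condition, with errors from a
# per-condition LOO pass (illustrative numbers only).
rows = [
    {"geometry_skew_bin": "low_skew"},
    {"geometry_skew_bin": "low_skew"},
    {"geometry_skew_bin": "high_skew"},
]
breakdown = per_cell_oos_breakdown(rows, [0.10, 0.20, 0.05])
```

This keeps the validation procedure per-condition while still letting the report break held-out errors down by cell.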

```python
    else:
        oos_errors.append(test_row.support_gated_alpha_error)

most_common = max(strategy_wins, key=lambda k: strategy_wins[k]) if strategy_wins else "support_gated"
```

Copilot AI Mar 25, 2026


oos_loocv_strategy is reported as the most common fold winner, but with small n (e.g., per-cell n=3) ties are likely and max(...) will pick an arbitrary first-inserted key. This makes the reported strategy label potentially misleading vs the actual fold-by-fold selections. Consider returning/reporting the full fold win counts (or the per-trial chosen strategy) instead of a single label.

Suggested change

```diff
-most_common = max(strategy_wins, key=lambda k: strategy_wins[k]) if strategy_wins else "support_gated"
+if strategy_wins:
+    # Compute the maximum win count and collect all strategies that achieve it.
+    max_wins = max(strategy_wins.values())
+    best_strategies = sorted([name for name, count in strategy_wins.items() if count == max_wins])
+    if len(best_strategies) == 1:
+        # Single clear winner.
+        most_common = best_strategies[0]
+    else:
+        # Explicitly report a tie rather than arbitrarily picking one strategy.
+        most_common = "tie:" + ",".join(best_strategies)
+else:
+    # Fall back to a sensible default if no strategies were recorded.
+    most_common = "support_gated"
```

Comment on lines +98 to +108
## Per-cell out-of-sample breakdown

| Condition | Skew | OOS LOO-CV | Support-gated | Strategy |
|-----------|------|-----------|---------------|----------|
| sparse_full_noisy | low_skew | 0.2068 | 0.1271 | joint |
| sparse_full_noisy | mid_skew | 0.2982 | 0.2210 | family |
| sparse_full_noisy | high_skew | 0.0202 | 0.0178 | joint |
| sparse_partial_high_noise | low_skew | 0.1761 | 0.1761 | conditioned |
| sparse_partial_high_noise | mid_skew | 0.1557 | 0.1557 | conditioned |
| sparse_partial_high_noise | high_skew | 0.2010 | 0.0174 | conditioned |


Copilot AI Mar 25, 2026


The per-cell “OOS LOO-CV” table is currently based on running LOO-CV separately within each cell (3 trials), which is not the same as a per-condition LOO policy and can yield unstable / misleading cell-level OOS numbers and strategy labels. If the intended validation is per-condition LOO-CV, it would be clearer to compute held-out predictions per condition once, then break those held-out errors down by cell for reporting.

Comment on lines +45 to +51
```python
evaluate_params, family_switching_refine = load_symbols(
    "run_family_switching_refinement_experiment",
    ROOT / "experiments/pose-anisotropy-interventions/family-switching-refinement/run.py",
    "evaluate_params",
    "family_switching_refine",
)
```


Copilot AI Mar 25, 2026


evaluate_params is loaded here but never referenced in this script. Since load_symbols executes the source module, dropping unused symbol loads will reduce startup overhead and make dependencies clearer.

Suggested change

```diff
-evaluate_params, family_switching_refine = load_symbols(
-    "run_family_switching_refinement_experiment",
-    ROOT / "experiments/pose-anisotropy-interventions/family-switching-refinement/run.py",
-    "evaluate_params",
-    "family_switching_refine",
-)
+family_switching_refine, = load_symbols(
+    "run_family_switching_refinement_experiment",
+    ROOT / "experiments/pose-anisotropy-interventions/family-switching-refinement/run.py",
+    "family_switching_refine",
+)
```

Comment on lines +98 to +102
```python
OBSERVATION_REGIMES, SIGNATURE_ANGLE_COUNT, write_csv = load_symbols(
    "run_weighted_multisource_inverse_experiment",
    ROOT / "experiments/multisource-control-objects/weighted-multisource-inverse/run.py",
    "OBSERVATION_REGIMES",
    "SIGNATURE_ANGLE_COUNT",
```

Copilot AI Mar 25, 2026


SIGNATURE_ANGLE_COUNT is loaded here but never used. Since load_symbols executes the referenced module, consider removing unused symbol loads to reduce startup time and avoid hidden side effects.

Suggested change

```diff
-OBSERVATION_REGIMES, SIGNATURE_ANGLE_COUNT, write_csv = load_symbols(
-    "run_weighted_multisource_inverse_experiment",
-    ROOT / "experiments/multisource-control-objects/weighted-multisource-inverse/run.py",
-    "OBSERVATION_REGIMES",
-    "SIGNATURE_ANGLE_COUNT",
+OBSERVATION_REGIMES, write_csv = load_symbols(
+    "run_weighted_multisource_inverse_experiment",
+    ROOT / "experiments/multisource-control-objects/weighted-multisource-inverse/run.py",
+    "OBSERVATION_REGIMES",
```

```python
import json
import math
import os
from dataclasses import dataclass, field
```

Copilot AI Mar 25, 2026


field is imported from dataclasses but never used. Dropping unused imports helps keep this experiment script easier to audit and avoids confusing readers about intended dataclass behavior.

Suggested change

```diff
-from dataclasses import dataclass, field
+from dataclasses import dataclass
```

```python
state, _ = improve_over_candidates(context, final_temperature, state, shotgun_candidates)

final_params = state_to_params(state)
final_score, final_signature, final_shift, final_entropy = context.score_params(final_params, base_temperature)
```

Copilot AI Mar 25, 2026


In the shotgun stage you select improvements using final_temperature, but then compute/return final_score using base_temperature. This means the perturbation acceptance criterion can optimize a different objective than the score used to choose the best seed/run (and the score you later compare across seeds). Consider evaluating shotgun candidates with the same temperature you use for the returned score (or return the score at final_temperature consistently) to keep selection consistent.

Suggested change

```diff
-final_score, final_signature, final_shift, final_entropy = context.score_params(final_params, base_temperature)
+final_score, final_signature, final_shift, final_entropy = context.score_params(final_params, final_temperature)
```

Comment on lines +17 to +18
- more anneal stages (5 vs 3) with a wider temperature sweep
- larger initial search radii (1.5x)

Copilot AI Mar 25, 2026


The module docstring claims the enhanced joint solver adds “more anneal stages (5 vs 3)” and “larger initial search radii (1.5x)”, but the code uses 3 anneal factors and radii increases of ~1.2–1.25× (vs joint-pose-marginalized-solver). Please update the docstring to match the actual parameters so readers aren’t misled about what changed.

Suggested change

```diff
-- more anneal stages (5 vs 3) with a wider temperature sweep
-- larger initial search radii (1.5x)
+- a slightly wider 3-stage anneal schedule (broader temperature sweep than the baseline)
+- moderately larger initial search radii (~1.2-1.25x vs the joint pose-marginalized solver)
```

Comment on lines +34 to +40
```python
ALPHA_STRENGTH_BIN_LABELS, GEOMETRY_SKEW_BIN_LABELS, candidate_conditioned_search, evaluate_candidate_alpha, sample_conditioned_parameters, top_k_indices = load_symbols(
    "run_candidate_conditioned_alignment_experiment",
    ROOT / "experiments/pose-anisotropy-interventions/candidate-conditioned-alignment/run.py",
    "ALPHA_STRENGTH_BIN_LABELS",
    "GEOMETRY_SKEW_BIN_LABELS",
    "candidate_conditioned_search",
    "evaluate_candidate_alpha",
```

Copilot AI Mar 25, 2026


This load_symbols call imports several names that are never used in this script (e.g., ALPHA_STRENGTH_BIN_LABELS, evaluate_candidate_alpha). Because load_symbols executes the target module, pulling unused symbols adds avoidable import/runtime overhead. Consider only requesting/assigning the symbols that are actually used here.

Suggested change

```diff
-ALPHA_STRENGTH_BIN_LABELS, GEOMETRY_SKEW_BIN_LABELS, candidate_conditioned_search, evaluate_candidate_alpha, sample_conditioned_parameters, top_k_indices = load_symbols(
-    "run_candidate_conditioned_alignment_experiment",
-    ROOT / "experiments/pose-anisotropy-interventions/candidate-conditioned-alignment/run.py",
-    "ALPHA_STRENGTH_BIN_LABELS",
-    "GEOMETRY_SKEW_BIN_LABELS",
-    "candidate_conditioned_search",
-    "evaluate_candidate_alpha",
+GEOMETRY_SKEW_BIN_LABELS, candidate_conditioned_search, sample_conditioned_parameters, top_k_indices = load_symbols(
+    "run_candidate_conditioned_alignment_experiment",
+    ROOT / "experiments/pose-anisotropy-interventions/candidate-conditioned-alignment/run.py",
+    "GEOMETRY_SKEW_BIN_LABELS",
+    "candidate_conditioned_search",
```


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f580b76b6


```python
    else:
        oos_errors.append(test_row.support_gated_alpha_error)

most_common = max(strategy_wins, key=lambda k: strategy_wins[k]) if strategy_wins else "support_gated"
```


P2: Represent tied LOO winners as a mixed strategy

oos_loocv_strategy is derived with max(strategy_wins, ...), which returns the first inserted key when multiple strategies have the same win count. With small fold counts (especially the 3-trial cell summaries), ties are common, so this can report a single strategy like conditioned even when fold winners are split across different strategies. That makes the exported strategy label and downstream interpretation inaccurate; tied winners should be surfaced explicitly (e.g., mixed or a list of tied strategies).



Development

Successfully merging this pull request may close these issues.

Resolve the anisotropic inverse bottleneck with validated out-of-sample solver performance
