
feat: error bars for BSEQ benchmark #594

Open
vprusso wants to merge 4 commits into main from 588-bseq-error-bars

Conversation


vprusso (Contributor) commented Oct 17, 2025

Closes: #588

  • BSEQ now wraps its metric in a BenchmarkScore and sources the uncertainty from the shared bootstrap helper so values and error bars travel together.
  • Added a reusable Monte Carlo utility for largest-component error estimates in the statistics module, matching the pattern used by WIT.
  • Updated the CLI workflow snippet to show the new BenchmarkScore output format for BSEQ runs.
  • Extended the BSEQ unit tests to cover the shared bootstrap helper.
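The first bullet hinges on a value and its error bar traveling together. As a rough stand-in for readers unfamiliar with `BenchmarkScore`, the idea is a small container pairing a metric with its uncertainty (the class and field names below are hypothetical illustrations, not metriq-gym's actual API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """Hypothetical stand-in for a BenchmarkScore-style container:
    the metric value and its uncertainty are carried as one object."""

    value: float
    uncertainty: float

    def __str__(self) -> str:
        # Render as "value ± uncertainty" so reports show both together.
        return f"{self.value} ± {self.uncertainty}"


score = Score(value=5.0, uncertainty=0.8)
print(score)  # 5.0 ± 0.8
```

Keeping the pair in one object means downstream consumers (CLI output, docs examples) cannot accidentally drop the error bar.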

@vprusso vprusso requested review from Copilot and cosenal October 17, 2025 19:33

Copilot AI left a comment


Pull Request Overview

This PR adds error bar support to the BSEQ benchmark by computing Monte Carlo uncertainties for the largest connected component metric. The implementation introduces a reusable bootstrap helper and updates the result structure to use BenchmarkScore containers that carry both values and uncertainties.

Key Changes:

  • Refactored BSEQResult to wrap the largest connected component metric in a BenchmarkScore with uncertainty
  • Implemented bootstrap_largest_component_stddev() for Monte Carlo error estimation using edge-level CHSH statistics
  • Extended BSEQ to compute and propagate per-edge variance through the bootstrap helper

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `metriq_gym/benchmarks/bseq.py` | Updated `chsh_subgraph` to return edge statistics, integrated bootstrap uncertainty computation, changed the result structure to use `BenchmarkScore` |
| `metriq_gym/helpers/statistics.py` | Added a union-find based largest-component calculator and a Monte Carlo bootstrap helper for uncertainty estimation |
| `tests/unit/benchmarks/test_bseq.py` | New test file covering bootstrap helper edge cases and `BSEQResult` score exposure |
| `docs/source/cli_workflows.rst` | Updated example output to show the new `BenchmarkScore` format with uncertainty values |


Comment on lines +65 to +70
```python
largest = 1
for idx in range(num_nodes):
    root = find(idx)
    if sizes[root] > largest:
        largest = sizes[root]
return largest
```

Copilot AI Oct 17, 2025


Initializing `largest` to 1 makes the edge cases hard to follow. When `num_nodes <= 0` the function already returns 0 earlier (lines 40-41), and for a non-empty graph with no edges the largest component is a single isolated node of size 1, so the current behavior is actually correct; the issue is semantic clarity. Rather than tracking the maximum during iteration, take `max(sizes)` directly to make the logic obvious.

Suggested change

```diff
-largest = 1
-for idx in range(num_nodes):
-    root = find(idx)
-    if sizes[root] > largest:
-        largest = sizes[root]
-return largest
+# The size of the largest component is the maximum in sizes
+return max(sizes) if num_nodes > 0 else 0
```
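The suggestion can be folded into a self-contained union-find sketch. The function name and signature here are illustrative, not the repository's actual helper:

```python
def largest_connected_component(num_nodes: int, edges: list[tuple[int, int]]) -> int:
    """Size of the largest connected component via union-find (illustrative sketch)."""
    if num_nodes <= 0:
        return 0
    parent = list(range(num_nodes))
    sizes = [1] * num_nodes

    def find(node: int) -> int:
        # Path halving: point each visited node at its grandparent.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        # Union by size: attach the smaller tree under the larger one.
        if sizes[ra] < sizes[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        sizes[ra] += sizes[rb]

    for u, v in edges:
        union(u, v)

    # sizes[i] is only authoritative at roots, but stale entries at
    # non-roots never exceed their root's size, so max(sizes) is safe.
    return max(sizes)
```

With no edges the entries of `sizes` stay at 1, so `max(sizes)` returns 1 for any non-empty graph, matching the behavior the reviewer calls correct.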

```python
for sample_idx in range(num_samples):
    active_edges: list[tuple[int, int]] = []
    for (u, v), (mean, std) in edges:
        sigma = 0.0 if std is None or np.isnan(std) else float(std)
```

Copilot AI Oct 17, 2025


The check std is None is redundant since the type annotation indicates edge_stats values are tuple[float, float], meaning std cannot be None. If NaN values are expected to represent missing standard deviations, document this in the function docstring or consider using Optional[float] in the edge_stats type annotation to make the contract explicit.
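One way to make that contract explicit, as the comment suggests, is a tiny normalizer whose annotation admits missing values. The helper name is hypothetical, not code from the PR:

```python
import math
from typing import Optional


def edge_sigma(std: Optional[float]) -> float:
    """Normalize a possibly-missing edge standard deviation.

    None and NaN both mean "unknown spread" and are treated as 0.0,
    so the check on `std` is documented by the type rather than implied.
    """
    if std is None or math.isnan(std):
        return 0.0
    return float(std)
```

With `Optional[float]` in the signature, a type checker will flag callers that forget the missing-value case instead of relying on a silent runtime guard.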

```python
# The benchmark checks whether the CHSH inequality is violated (i.e., the sum of
# correlations exceeds 2, indicating entanglement).
if exp_vals[idx] > 2:
    std = float(np.sqrt(variances[idx])) if not np.isnan(variances[idx]) else float("nan")
```

Copilot AI Oct 17, 2025


The variance-to-stddev conversion should validate that variances[idx] >= 0 before taking the square root to prevent potential issues with floating-point precision producing negative values due to the accumulation in line 144. Consider using max(variances[idx], 0.0) before np.sqrt.

Suggested change

```diff
-std = float(np.sqrt(variances[idx])) if not np.isnan(variances[idx]) else float("nan")
+std = float(np.sqrt(max(variances[idx], 0.0))) if not np.isnan(variances[idx]) else float("nan")
```
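The guard can live in a small helper. This is a sketch of the suggested clamp, not code from the PR:

```python
import math


def safe_std(variance: float) -> float:
    """Convert a variance to a standard deviation, clamping tiny negative
    values (from floating-point accumulation) to zero before sqrt."""
    if math.isnan(variance):
        # Preserve NaN as the explicit "unknown" marker.
        return float("nan")
    return math.sqrt(max(variance, 0.0))
```

Without the clamp, a variance like -1e-18 produced by summing rounding errors would raise a domain error (or yield NaN under NumPy) instead of the intended 0.0.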


cosenal (Contributor) commented Nov 3, 2025

Do you have a friendly explanation of how the error is calculated? That is, how do we go from measurement errors to the error on the final value? It's hard to understand it from the code, and it's hard for me to review the code without understanding what the calculation is supposed to do.

vprusso (Contributor, Author) commented Nov 3, 2025

> Do you have a friendly explanation of how the error is calculated? That is, how do we go from measurement errors to the error on the final value? It's hard to understand it from the code, and it's hard for me to review the code without understanding what the calculation is supposed to do.

That's a good point. To be honest, this PR was an attempt at something I was trying to figure out myself, but I think (and hope) the description now contained in the bseq.py docstring as of b4811a4 should be helpful in reviewing it.
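For readers following the thread, the calculation being discussed can be sketched end to end. This is an illustrative paraphrase under assumptions (the function names and the normal-resampling model are mine, not necessarily the repository's): each edge carries a mean CHSH value and a standard deviation from measurement; each Monte Carlo sample redraws every edge's value from N(mean, std), keeps only edges whose redrawn value still exceeds the classical bound of 2, recomputes the largest connected component, and the spread of those component sizes across samples becomes the reported uncertainty.

```python
import math
import random


def largest_component(num_nodes: int, edges: list[tuple[int, int]]) -> int:
    """Largest connected component size via adjacency lists and iterative DFS."""
    adj: dict[int, list[int]] = {i: [] for i in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen: set[int] = set()
    best = 0
    for start in range(num_nodes):
        if start in seen:
            continue
        stack, count = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            count += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, count)
    return best


def bootstrap_largest_component_stddev_sketch(
    num_nodes: int,
    edge_stats: dict[tuple[int, int], tuple[float, float]],  # edge -> (mean CHSH, std)
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Monte Carlo sketch: resample each edge's CHSH value from N(mean, std),
    keep edges that still violate the inequality (> 2), and report the
    standard deviation of the resulting largest-component sizes."""
    rng = random.Random(seed)
    sizes: list[int] = []
    for _ in range(num_samples):
        active = [
            edge
            for edge, (mean, std) in edge_stats.items()
            if rng.gauss(mean, max(std, 0.0)) > 2
        ]
        sizes.append(largest_component(num_nodes, active))
    mean_size = sum(sizes) / num_samples
    variance = sum((s - mean_size) ** 2 for s in sizes) / num_samples
    # Clamp to guard against tiny negative accumulation error.
    return math.sqrt(max(variance, 0.0))
```

Edges whose mean sits comfortably above 2 with a small std survive almost every resample, so they contribute little to the spread; edges hovering near 2 flicker in and out, and that flickering is exactly what widens the error bar on the component size.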



Development

Successfully merging this pull request may close these issues.

BSEQ returning only largest connected component (with error bars)

3 participants