Conversation
Pull Request Overview
This PR adds error bar support to the BSEQ benchmark by computing Monte Carlo uncertainties for the largest connected component metric. The implementation introduces a reusable bootstrap helper and updates the result structure to use BenchmarkScore containers that carry both values and uncertainties.
Key Changes:
- Refactored `BSEQResult` to wrap the largest connected component metric in a `BenchmarkScore` with uncertainty
- Implemented `bootstrap_largest_component_stddev()` for Monte Carlo error estimation using edge-level CHSH statistics
- Extended BSEQ to compute and propagate per-edge variance through the bootstrap helper
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `metriq_gym/benchmarks/bseq.py` | Updated `chsh_subgraph` to return edge statistics, integrated bootstrap uncertainty computation, changed result structure to use `BenchmarkScore` |
| `metriq_gym/helpers/statistics.py` | Added union-find based largest component calculator and Monte Carlo bootstrap helper for uncertainty estimation |
| `tests/unit/benchmarks/test_bseq.py` | New test file covering bootstrap helper edge cases and `BSEQResult` score exposure |
| `docs/source/cli_workflows.rst` | Updated example output to show new `BenchmarkScore` format with uncertainty values |
```python
largest = 1
for idx in range(num_nodes):
    root = find(idx)
    if sizes[root] > largest:
        largest = sizes[root]
return largest
```
Initializing `largest` to 1 is subtle rather than strictly wrong. When `num_nodes <= 0`, the function already returns 0 at line 40-41; when `num_nodes > 0` but `edges` is empty, the largest component genuinely is 1 (a single isolated node), so the current behavior is correct for non-empty graphs. The issue is semantic clarity: returning `max(sizes)` (with 0 for the empty graph) states the intent directly instead of tracking the maximum during iteration.
```diff
-largest = 1
-for idx in range(num_nodes):
-    root = find(idx)
-    if sizes[root] > largest:
-        largest = sizes[root]
-return largest
+# The size of the largest component is the maximum in sizes
+return max(sizes) if num_nodes > 0 else 0
```
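For illustration, a self-contained union-find sketch that follows the `max(sizes)` suggestion. The function name and signature here are assumptions for the example, not the PR's actual helper:

```python
def largest_component_size(num_nodes: int, edges: list[tuple[int, int]]) -> int:
    """Size of the largest connected component, via union-find by size."""
    if num_nodes <= 0:
        return 0
    parent = list(range(num_nodes))
    sizes = [1] * num_nodes

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Union by size: attach the smaller root under the larger one.
            if sizes[ru] < sizes[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            sizes[ru] += sizes[rv]

    # sizes[x] is only authoritative at roots, but every non-root entry is
    # <= its root's size, so max(sizes) is still the largest component.
    return max(sizes)
```

With no edges this returns 1 (a single isolated node), so the edge case discussed above falls out naturally.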
```python
for sample_idx in range(num_samples):
    active_edges: list[tuple[int, int]] = []
    for (u, v), (mean, std) in edges:
        sigma = 0.0 if std is None or np.isnan(std) else float(std)
```
The check `std is None` is redundant, since the type annotation indicates `edge_stats` values are `tuple[float, float]`, meaning `std` cannot be `None`. If NaN values are expected to represent missing standard deviations, document this in the function docstring, or use `Optional[float]` in the `edge_stats` type annotation to make the contract explicit.
```python
# The benchmark checks whether the CHSH inequality is violated (i.e., the sum
# of correlations exceeds 2, indicating entanglement).
if exp_vals[idx] > 2:
    std = float(np.sqrt(variances[idx])) if not np.isnan(variances[idx]) else float("nan")
```
The variance-to-stddev conversion should validate that `variances[idx] >= 0` before taking the square root, since floating-point accumulation at line 144 can produce slightly negative values. Consider using `max(variances[idx], 0.0)` before `np.sqrt`.
```diff
-std = float(np.sqrt(variances[idx])) if not np.isnan(variances[idx]) else float("nan")
+std = float(np.sqrt(max(variances[idx], 0.0))) if not np.isnan(variances[idx]) else float("nan")
```
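A tiny standalone helper illustrating the suggested clamp (the name is hypothetical, not part of the PR):

```python
import math


def variance_to_std(variance: float) -> float:
    """Convert a variance to a standard deviation, clamping tiny negative
    values that can arise from floating-point accumulation to zero."""
    if math.isnan(variance):
        return float("nan")
    return math.sqrt(max(variance, 0.0))
```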
Do you have a friendly explanation of how the error is calculated? That is, how do we go from measurement errors to the errors on the final value? It's hard to understand from the code, and it's hard for me to review the code without understanding what the calculation is supposed to do.
That's a good point. To be quite honest, this MR was an attempt at something that I was trying to figure out myself, but I think (hope) the description (now contained in the
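For reviewers, here is a hedged sketch of how such a parametric (Monte Carlo) bootstrap typically propagates per-edge measurement errors to an error bar on the largest-component size. All names, defaults, and the threshold are illustrative assumptions, not the PR's actual implementation:

```python
import math
import random
import statistics


def _largest_component(num_nodes: int, edges: list[tuple[int, int]]) -> int:
    """Largest connected component size via union-find (illustrative helper)."""
    if num_nodes <= 0:
        return 0
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max(size)


def bootstrap_largest_component_stddev(
    num_nodes: int,
    edge_stats: dict[tuple[int, int], tuple[float, float]],
    num_samples: int = 1000,
    threshold: float = 2.0,
    seed: int = 0,
) -> float:
    """Monte Carlo error propagation (a parametric bootstrap).

    For each of num_samples trials, every edge's CHSH value is redrawn from
    Normal(mean, std) using its measured statistics; an edge stays "active"
    only if the redrawn value still exceeds the classical bound (> 2). The
    largest component is recomputed per trial, and the spread (stddev) of
    those sizes is the reported uncertainty on the final metric.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_samples):
        active = []
        for (u, v), (mean, std) in edge_stats.items():
            sigma = 0.0 if math.isnan(std) else std
            if rng.gauss(mean, sigma) > threshold:
                active.append((u, v))
        sizes.append(_largest_component(num_nodes, active))
    return statistics.pstdev(sizes)
```

Intuitively: edges whose CHSH value sits far above 2 relative to their error survive nearly every resample, so the component size barely fluctuates and the error bar is small; edges hovering near the bound flip in and out, inflating the spread.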
Closes: #588
- Wraps the largest connected component metric in a `BenchmarkScore` and sources the uncertainty from the shared bootstrap helper so values and error bars travel together.
- Updates the documented `BenchmarkScore` output format for BSEQ runs.