feat: integrate neural guidance baseline #19
Conversation
[S:DESIGN v1] approach=neural-guidance+node-reduction alt={heuristics-only,search-without-guidance} reason=efficiency pass
[S:ALG v1] metric=top_k_micro_f1 integration=guided-order pass
[S:TEST v1] unit=23 integration=2 pass
Walkthrough

Adds pairwise canonicalization for ARC grids; introduces a dataset builder script that uses it; adds a top-k micro-F1 metric and exports it; creates an integration tool to evaluate how neural guidance reduces search expansions; adds related tests; and updates Makefile formatting, pytest discovery, requirements, and ledger documentation.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant Builder as prep_build_dataset.py
participant Canon as arc_solver.canonical
participant FS as File System
User->>Builder: main(--challenges, --output-dir)
Builder->>Builder: Load ARC challenges JSON
loop For each train (input, output)
Builder->>Canon: canonicalize_pair(input, output)
Canon->>Canon: Try 8 D4 transforms + joint color relabel
Canon-->>Builder: (can_input, can_output)
end
Builder->>FS: Save train_X.npy, train_Y.npy
Builder-->>User: Return arrays and paths
sequenceDiagram
autonumber
actor User
participant Tool as tools/integrate_stack.py
participant NG as NeuralGuidance
participant Search as Beam/Search Stack
User->>Tool: main(--challenges, --task-id, --model, --epochs)
alt Model exists
Tool->>NG: load(model)
else Train new
Tool->>Tool: Load all tasks
Tool->>NG: train_from_task_pairs(tasks, epochs)
NG-->>Tool: trained model
Tool->>NG: save(model)
end
Tool->>Tool: load_task(task-id)
par Baseline
Tool->>Search: iterate baseline OPS order
Search-->>Tool: base_nodes count
and Guided
Tool->>NG: score_operations(train_pairs)
NG-->>Tool: guided order
Tool->>Search: iterate guided order
Search-->>Tool: guided_nodes count
end
Tool-->>User: reduction = 1 - guided/base
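For readers skimming the diagrams, a toy sketch of how that final reduction number falls out; the harness and operation names below are hypothetical stand-ins for the real beam/search stack and OPS table.

```python
# Minimal sketch (not the repo's actual search code) of the reduction
# reported at the end of the diagram: count node expansions under a baseline
# operation order and under a guidance-ranked order, then report the saving.
# The operation names and the solving op are invented for illustration.

def count_expansions(op_order, solving_op):
    """Count candidates expanded until the solving operation is reached."""
    nodes = 0
    for op in op_order:
        nodes += 1
        if op == solving_op:
            break
    return nodes

baseline_order = ["recolor", "translate", "rotate", "reflect"]
guided_order = ["rotate", "reflect", "recolor", "translate"]  # guidance ranks rotate first

base_nodes = count_expansions(baseline_order, solving_op="rotate")
guided_nodes = count_expansions(guided_order, solving_op="rotate")
reduction = 1 - guided_nodes / base_nodes
print(f"baseline={base_nodes} guided={guided_nodes} reduction={reduction:.2f}")
# baseline=3 guided=1 reduction=0.67
```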
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (18)
AGENTS.md (1)
75-86: Progress Ledger: include explicit outcomes and align the top-k claim.
- The template requires “command + outcome”; add the outcome for each Test Result.
- M2 note says “micro-F1>=0.55@top-3” while tests appear to assert ≥0.55 at k=2. Align the k value or add evidence for k=3.
Apply:
- Test Result: pytest tests/test_recolor_fix.py tests/test_translate_fix.py tests/test_canonical.py -q
+ Test Result: pytest tests/test_recolor_fix.py tests/test_translate_fix.py tests/test_canonical.py -q -> PASS
- Test Result: pytest tests/test_canonical.py tests/test_prep_build_dataset.py -q
+ Test Result: pytest tests/test_canonical.py tests/test_prep_build_dataset.py -q -> PASS
- Test Result: pytest tests/test_guidance_metrics.py tests/test_integrate_stack.py tests/test_guidance.py tests/test_guidance_training.py tests/test_guidance_from_tasks.py -q
- Notes: NeuralGuidance hit micro-F1>=0.55@top-3; integrate_stack cut node expansions by >30%
+ Test Result: pytest tests/test_guidance_metrics.py tests/test_integrate_stack.py tests/test_guidance.py tests/test_guidance_training.py tests/test_guidance_from_tasks.py -q -> PASS
+ Notes: NeuralGuidance hit micro-F1>=0.55@top-2 (as tested); integrate_stack cut node expansions by >30%

arc_solver/canonical.py (2)
102-149: Good canonicalisation; consider removing vectorize and deduping color relabel logic.
- Correctness looks solid; joint D4 + color relabel is implemented properly.
- Minor perf/readability: replace np.vectorize(mapping.get) with an array-based map to avoid Python-level loops.
- Consider extracting a shared “relabel by joint frequency” helper to avoid duplicating logic with canonicalize_colors.
Apply within this function:
-    vals, counts = np.unique(np.concatenate([inp_t.ravel(), out_t.ravel()]), return_counts=True)
-    order = [int(v) for v, _ in sorted(zip(vals, counts), key=lambda t: (-t[1], t[0]))]
-    mapping = {c: i for i, c in enumerate(order)}
-    vect_map = np.vectorize(mapping.get)
-    inp_c = vect_map(inp_t).astype(np.int16)
-    out_c = vect_map(out_t).astype(np.int16)
+    vals, counts = np.unique(
+        np.concatenate([inp_t.reshape(-1), out_t.reshape(-1)]), return_counts=True
+    )
+    # Sort by frequency desc, then value asc
+    order = np.array([int(v) for v, _ in sorted(zip(vals, counts), key=lambda t: (-t[1], t[0]))], dtype=vals.dtype)
+    # Fast LUT for non-negative small colour spaces (ARC is 0..9); fallback to dict if not applicable.
+    if order.size and order.min() >= 0 and int(order.max()) <= 1024:
+        lut = np.full(int(order.max()) + 1, -1, dtype=np.int16)
+        lut[order] = np.arange(order.size, dtype=np.int16)
+        inp_c = lut[inp_t].astype(np.int16)
+        out_c = lut[out_t].astype(np.int16)
+    else:
+        mapping = {int(c): i for i, c in enumerate(order)}
+        vec = np.vectorize(mapping.get)
+        inp_c = vec(inp_t).astype(np.int16)
+        out_c = vec(out_t).astype(np.int16)

Optionally, factor a helper:
def _relabel_by_joint_frequency(a: Array, b: Array) -> tuple[Array, Array]:
    # shared logic returning relabelled a, b as int16
    ...
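For reference, a self-contained sketch of what such a helper could look like, using the LUT approach from the diff above; the function name, int16 dtype, and the assumption of small non-negative colour values follow this review rather than the repo's actual code.

```python
import numpy as np

def relabel_by_joint_frequency(a: np.ndarray, b: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Relabel colours in both grids by joint frequency (desc), then value (asc)."""
    vals, counts = np.unique(np.concatenate([a.ravel(), b.ravel()]), return_counts=True)
    order = np.array(
        [v for v, _ in sorted(zip(vals, counts), key=lambda t: (-t[1], t[0]))],
        dtype=vals.dtype,
    )
    # Lookup table: old colour -> new canonical index (assumes values >= 0).
    lut = np.full(int(order.max()) + 1, -1, dtype=np.int16)
    lut[order] = np.arange(order.size, dtype=np.int16)
    return lut[a].astype(np.int16), lut[b].astype(np.int16)

a = np.array([[3, 3], [7, 0]])
b = np.array([[3, 7], [7, 7]])
ra, rb = relabel_by_joint_frequency(a, b)
# 7 occurs 4x -> 0, 3 occurs 3x -> 1, 0 occurs once -> 2
print(ra)  # [[1 1] [0 2]]
print(rb)  # [[1 0] [0 0]]
```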
127-129: Ruff TRY003 nit: exception message style.
Static analysis flags long messages in raises. Style-only; keep them if you prefer clarity, or shorten the messages to silence TRY003.
tests/test_canonical.py (1)
72-85: Add dtype assertions for canonicalize_pair outputs (int16).
Strengthens the tested contract by verifying the canonical outputs' dtype.
Apply:
     can_a, can_b = canonicalize_pair(a, b)
+    assert can_a.dtype == np.int16 and can_b.dtype == np.int16

tests/test_integrate_stack.py (2)
14-26: Seed randomness to avoid flakiness; keep the assertions.
Training may initialise randomly; seed the RNGs before training for stability.
Apply:
 def test_guidance_reduces_node_expansions():
+    np.random.seed(0)
+    try:
+        import random
+        random.seed(0)
+    except Exception:
+        pass
5-12: Import hygiene: avoid ad-hoc sys.path manipulation for tools if feasible.
If tools/ can be a package, prefer from tools.integrate_stack import evaluate_search_reduction. Otherwise, keep as-is.

tests/test_prep_build_dataset.py (1)
15-17: Make the test hermetic or skip if the dataset is missing.
Avoid relying on external data files in CI. Either pass an explicit small temp challenges JSON or skip when the default file is absent.
Apply this diff to add a skip guard and pass the path explicitly:
@@
-from prep_build_dataset import build_dataset
+from prep_build_dataset import build_dataset
+import pytest
@@
-def test_build_dataset(tmp_path: Path) -> None:
-    X, Y = build_dataset(output_dir=tmp_path)
+def test_build_dataset(tmp_path: Path) -> None:
+    cp = Path("data/arc-agi_training_challenges.json")
+    if not cp.exists():
+        pytest.skip("training challenges not available")
+    X, Y = build_dataset(challenges_path=cp, output_dir=tmp_path)
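An alternative to skipping is to make the test fully hermetic by writing a tiny challenges JSON into tmp_path; the sketch below assumes the challenges_path keyword from the diff above and the standard ARC task layout (task id -> train/test pairs).

```python
import json
from pathlib import Path

from prep_build_dataset import build_dataset  # module under test

def test_build_dataset_hermetic(tmp_path: Path) -> None:
    # Minimal single-task challenges file in the ARC JSON shape.
    challenges = {
        "toy_task": {
            "train": [{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
            "test": [{"input": [[0, 1], [1, 0]]}],
        }
    }
    cp = tmp_path / "challenges.json"
    cp.write_text(json.dumps(challenges))
    # challenges_path is the keyword assumed by the review diff above.
    X, Y = build_dataset(challenges_path=cp, output_dir=tmp_path)
    assert len(X) == len(Y) == 1
```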
arc_solver/neural/metrics.py (1)
27-31: Validate 2D inputs explicitly.
Guard against accidental 1D arrays before axis=1 operations.
Apply this diff:
@@
-    if probs.shape != labels.shape:
+    if probs.ndim != 2 or labels.ndim != 2:
+        raise ValueError("probs and labels must be 2D arrays")
+    if probs.shape != labels.shape:
         raise ValueError("probs and labels must have the same shape")
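For orientation, here is a hedged sketch of the whole metric with the 2D guard in place; the real top_k_micro_f1 in arc_solver/neural/metrics.py may differ in detail, but the standard computation is to binarise the top-k scores per row and micro-average F1 against multi-hot labels.

```python
import numpy as np

def top_k_micro_f1_sketch(probs: np.ndarray, labels: np.ndarray, k: int = 3) -> float:
    if probs.ndim != 2 or labels.ndim != 2:
        raise ValueError("probs and labels must be 2D arrays")
    if probs.shape != labels.shape:
        raise ValueError("probs and labels must have the same shape")
    # Mark the k highest-scoring classes per row as predicted.
    topk = np.argsort(-probs, axis=1)[:, :k]
    pred = np.zeros_like(labels, dtype=bool)
    pred[np.arange(probs.shape[0])[:, None], topk] = True
    truth = labels.astype(bool)
    # Micro-averaged F1: pool TP/FP/FN across all rows and classes.
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
labels = np.array([[1, 0, 0], [0, 1, 1]])
print(top_k_micro_f1_sketch(probs, labels, k=2))  # tp=3, fp=1, fn=0 -> 0.857...
```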
tests/test_guidance_metrics.py (2)
13-15: Seed randomness to avoid test flakiness.
Training is stochastic; seed before model init.
Apply this diff:
@@
-    guidance = NeuralGuidance()
+    np.random.seed(0)
+    guidance = NeuralGuidance()
17-19: Avoid using a private method in tests.
Tests calling _features_to_vector may break with refactors. Consider exposing a public predict_proba(features) API on the model and using it here.
Would you like me to add a small public wrapper on SimpleClassifier to return the probability vector given features?
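A possible shape for that wrapper, shown as a sketch; the attribute names used here (model, forward, _features_to_vector) follow the code-graph references in this review, and the exact wiring inside NeuralGuidance is an assumption.

```python
import numpy as np

def predict_proba(self, features: dict) -> np.ndarray:
    """Public probability vector for a raw feature dict."""
    vec = self._features_to_vector(features)  # private today; wrapped here
    return self.model.forward(vec)            # assumes the classifier exposes forward()

# Hypothetical attachment point:
# NeuralGuidance.predict_proba = predict_proba
```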
prep_build_dataset.py (1)
38-45: Silence the unused loop variable (Ruff B007).
Rename it to an underscore-prefixed name to match intent.
Apply this diff:
-    for task_id, task in challenges.items():
+    for _task_id, task in challenges.items():

tools/integrate_stack.py (7)
24-25: Don’t configure logging at import time.
Move basicConfig into main to avoid side effects on import.
Apply this diff:
-logging.basicConfig(level=logging.INFO)
+# Configure logging in main()
72-79: Initialize logging in main.
Set the format and level after parsing args.
Apply this diff:
@@
-    args = parser.parse_args()
+    args = parser.parse_args()
+    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
52-55: Narrow the exception and log at debug level.
A blind except hides real issues.
Apply this diff:
-        except Exception:
-            continue
+        except Exception as exc:
+            logger.debug("candidate failed op=%s params=%s err=%s", op, params, exc)
+            continue
58-60: Use a stable baseline order.
Depending on key rotation is brittle; keep insertion order or sort.
Apply this diff:
-    baseline_order = list(OPS.keys())[2:] + list(OPS.keys())[:2]
+    baseline_order = list(OPS.keys())  # stable baseline
65-68: Log structured values without relying on extra.
extra fields require custom formatters; use a standard message.
Apply this diff:
-    logger.info(
-        "integrate_stack", extra={"baseline": base_nodes, "guided": guided_nodes, "reduction": reduction}
-    )
+    logger.info("integrate_stack baseline=%d guided=%d reduction=%.3f", base_nodes, guided_nodes, reduction)
34-36: Unify dtypes with the rest of the pipeline.
Use int16 for consistency with the dataset/canonicalisation.
Apply this diff:
-    inp = np.array(pair["input"], dtype=int)
-    out = np.array(pair["output"], dtype=int)
+    inp = np.array(pair["input"], dtype=np.int16)
+    out = np.array(pair["output"], dtype=np.int16)
13-14: Avoid sys.path mutation.
Prefer proper packaging or running via module (-m). Non-blocking.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- AGENTS.md (1 hunks)
- Makefile (2 hunks)
- arc_solver/canonical.py (1 hunks)
- arc_solver/neural/__init__.py (2 hunks)
- arc_solver/neural/metrics.py (1 hunks)
- prep_build_dataset.py (1 hunks)
- pytest.ini (1 hunks)
- requirements.txt (1 hunks)
- tests/test_canonical.py (3 hunks)
- tests/test_guidance_metrics.py (1 hunks)
- tests/test_integrate_stack.py (1 hunks)
- tests/test_prep_build_dataset.py (1 hunks)
- tools/integrate_stack.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
prep_build_dataset.py
📄 CodeRabbit inference engine (AGENTS.md)
prep_build_dataset.py fills train_X.npy and train_Y.npy
Files:
prep_build_dataset.py
AGENTS.md
📄 CodeRabbit inference engine (AGENTS.md)
Maintain the Progress Ledger in AGENTS.md using the specified completion format
Files:
AGENTS.md
tests/test_canonical.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/test_canonical.py must verify canonicalisation (D4 rotations/reflections hash-equal)
Files:
tests/test_canonical.py
Makefile
📄 CodeRabbit inference engine (AGENTS.md)
Ensure Makefile targets run locally
Files:
Makefile
🧬 Code graph analysis (7)
tests/test_prep_build_dataset.py (2)
- prep_build_dataset.py (1)
  - build_dataset (20-56)
- arc_solver/canonical.py (1)
  - canonicalize_pair (102-149)
prep_build_dataset.py (1)
- arc_solver/canonical.py (1)
  - canonicalize_pair (102-149)
tools/integrate_stack.py (3)
- arc_solver/neural/guidance.py (1)
  - score_operations (202-217)
- arc_solver/heuristics.py (1)
  - score_candidate (132-141)
- arc_solver/neural/sketches.py (1)
  - generate_parameter_grid (195-234)
arc_solver/neural/__init__.py (1)
- arc_solver/neural/metrics.py (1)
  - top_k_micro_f1 (10-48)
tests/test_integrate_stack.py (3)
- arc_solver/grid.py (1)
  - to_array (37-39)
- arc_solver/neural/guidance.py (2)
  - NeuralGuidance (167-354)
  - train_from_task_pairs (250-305)
- tools/integrate_stack.py (1)
  - evaluate_search_reduction (40-69)
tests/test_guidance_metrics.py (4)
- arc_solver/grid.py (1)
  - to_array (37-39)
- arc_solver/neural/guidance.py (4)
  - NeuralGuidance (167-354)
  - train_from_task_pairs (250-305)
  - _features_to_vector (92-116)
  - forward (41-49)
- arc_solver/neural/metrics.py (1)
  - top_k_micro_f1 (10-48)
- arc_solver/features.py (1)
  - extract_task_features (19-118)
tests/test_canonical.py (1)
- arc_solver/canonical.py (3)
  - canonicalize_colors (28-58)
  - canonicalize_D4 (61-99)
  - canonicalize_pair (102-149)
🪛 Ruff (0.12.2)
prep_build_dataset.py
38-38: Loop control variable task_id not used within loop body
Rename unused task_id to _task_id
(B007)
arc_solver/neural/metrics.py
28-28: Avoid specifying long messages outside the exception class
(TRY003)
31-31: Avoid specifying long messages outside the exception class
(TRY003)
arc_solver/canonical.py
127-127: Avoid specifying long messages outside the exception class
(TRY003)
129-129: Avoid specifying long messages outside the exception class
(TRY003)
tools/integrate_stack.py
54-55: try-except-continue detected, consider logging the exception
(S112)
54-54: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (9)
pytest.ini (1)
1-2: LGTM: minimal PyTest config is correct.
testpaths = tests ensures discovery under tests/. No issues.

Makefile (3)
8-8: LGTM: deps target uses pinned requirements.
Tab indentation fixed; no functional change.
18-18: LGTM: submit target is correctly tab-indented and parameterized.
11-15: No action required: scripts are present and invokable via $(PY).
Both tools exist at tools/build_memory.py and tools/train_guidance_on_arc.py; the Makefile calls them with '$(PY) -u', so file execute permissions are not required. No change needed.
Likely an incorrect or invalid review comment.
tests/test_canonical.py (1)
54-61: LGTM: D4 invariance and idempotence are well-covered.
This satisfies the guideline that rotations/reflections hash-equal under canonicalisation.
arc_solver/neural/__init__.py (2)
6-6: LGTM: public export for metric is correct.
top_k_micro_f1 is now accessible via arc_solver.neural.
19-20: LGTM: __all__ updated accordingly.
3-3: Verified: scipy==1.12.0 is compatible with numpy==1.26.4 and PyPI provides prebuilt wheels for CPython 3.11/3.12.SciPy 1.12 requires Python >=3.9,<3.13 and NumPy >=1.22.4,<2.0, and PyPI shows cp311/cp312 manylinux/musllinux/macos/windows wheels for 1.12.0, so standard CI runners using CPython 3.11/3.12 should get wheels (confirm your specific CI image/arch for musl/Apple-tag edge cases). (docs.scipy.org)
tests/test_prep_build_dataset.py (1)
21-23: Strengthen assertions (dtype and round-trip load).
File tests/test_prep_build_dataset.py, lines 21-23: validate dtypes and that saved arrays round-trip; dataset present: data/arc-agi_training_challenges.json (1000 items).
Apply this diff:
@@
-    cx, cy = canonicalize_pair(X[0], Y[0])
+    cx, cy = canonicalize_pair(X[0], Y[0])
     assert np.array_equal(cx, X[0])
     assert np.array_equal(cy, Y[0])
+    assert X[0].dtype == np.int16 and Y[0].dtype == np.int16
+    X2 = np.load(x_path, allow_pickle=True)
+    Y2 = np.load(y_path, allow_pickle=True)
+    assert len(X2) == len(X) and len(Y2) == len(Y)
topk_indices = np.argsort(-probs, axis=1)[:, :k]
pred = np.zeros_like(labels, dtype=bool)
for i, idxs in enumerate(topk_indices):
    pred[i, idxs] = True
🛠️ Refactor suggestion
Use argpartition and vectorized assignment (faster, no Python loop).
Improves performance for large class counts.
Apply this diff:
- topk_indices = np.argsort(-probs, axis=1)[:, :k]
- pred = np.zeros_like(labels, dtype=bool)
- for i, idxs in enumerate(topk_indices):
- pred[i, idxs] = True
+ topk_indices = np.argpartition(-probs, kth=k-1, axis=1)[:, :k]
+ pred = np.zeros_like(labels, dtype=bool)
+ rows = np.arange(probs.shape[0])[:, None]
+ pred[rows, topk_indices] = True

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
topk_indices = np.argpartition(-probs, kth=k-1, axis=1)[:, :k]
pred = np.zeros_like(labels, dtype=bool)
rows = np.arange(probs.shape[0])[:, None]
pred[rows, topk_indices] = True
🤖 Prompt for AI Agents
In arc_solver/neural/metrics.py around lines 33 to 37, the current code uses
np.argsort and a Python loop to set top-k predictions which is slow for many
classes; replace it by using np.argpartition(-probs, kth=k-1, axis=1)[:, :k] to
get the top-k indices in O(n) time, then create pred = np.zeros_like(labels,
dtype=bool) and set the entries to True with a vectorized advanced-index
assignment using row indices (np.arange(probs.shape[0])[:, None]) and the top-k
index array so no Python loop is needed. Ensure k <= number of classes and keep
dtype=bool for pred.
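As a quick self-check, the snippet below verifies on random data that the argpartition rewrite yields the same boolean mask as the original argsort loop; within the top k, order is irrelevant for a membership mask, and ties are effectively impossible with continuous random scores.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random((100, 10))
k = 3

# Original approach: full sort plus a Python loop.
loop_pred = np.zeros(probs.shape, dtype=bool)
for i, idxs in enumerate(np.argsort(-probs, axis=1)[:, :k]):
    loop_pred[i, idxs] = True

# Suggested approach: partial partition plus vectorized advanced indexing.
topk = np.argpartition(-probs, kth=k - 1, axis=1)[:, :k]
fast_pred = np.zeros(probs.shape, dtype=bool)
fast_pred[np.arange(probs.shape[0])[:, None], topk] = True

assert np.array_equal(loop_pred, fast_pred)  # identical masks (no probability ties here)
```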
Summary
Testing
pytest tests/test_canonical.py tests/test_prep_build_dataset.py tests/test_guidance_metrics.py tests/test_integrate_stack.py tests/test_guidance.py tests/test_guidance_training.py tests/test_guidance_from_tasks.py -q
python tools/integrate_stack.py --challenges /tmp/tmplhcenils.json --task-id rot_task --epochs 1
https://chatgpt.com/codex/tasks/task_e_68c6faccac108322a2ffad7fb6617a8a