Skip to content

Conversation

@tylerbessire
Copy link
Owner

@tylerbessire tylerbessire commented Sep 14, 2025

Summary

  • add top-k micro-F1 metric for guidance evaluation
  • integrate neural guidance with search ordering and node-reduction utility
  • tests ensuring guidance meets F1 and reduces node expansions

Testing

  • pytest tests/test_canonical.py tests/test_prep_build_dataset.py tests/test_guidance_metrics.py tests/test_integrate_stack.py tests/test_guidance.py tests/test_guidance_training.py tests/test_guidance_from_tasks.py -q
  • python tools/integrate_stack.py --challenges /tmp/tmplhcenils.json --task-id rot_task --epochs 1

https://chatgpt.com/codex/tasks/task_e_68c6faccac108322a2ffad7fb6617a8a

Summary by CodeRabbit

  • New Features
    • Added pairwise grid canonicalization to keep training examples aligned.
    • Introduced a top‑k micro‑F1 evaluation metric for guidance quality.
    • Added a dataset builder to produce canonicalized training arrays.
    • Provided a tool to integrate neural guidance with search and report node reductions.
  • Tests
    • Added coverage for canonicalization, guidance metrics, dataset building, and guided search reductions.
  • Documentation
    • Updated progress ledger and milestones.
  • Style
    • Cleaned up build script formatting and indentation.
  • Chores
    • Added SciPy dependency and PyTest configuration for test discovery.

[S:DESIGN v1] approach=neural-guidance+node-reduction alt={heuristics-only,search-without-guidance} reason=efficiency pass
[S:ALG v1] metric=top_k_micro_f1 integration=guided-order pass
[S:TEST v1] unit=23 integration=2 pass
@korbit-ai
Copy link

korbit-ai bot commented Sep 14, 2025

You've used up your 5 PR reviews for this month under the Korbit Starter Plan. You'll get 5 more reviews on September 18th, 2025 or you can upgrade to Pro for unlimited PR reviews and enhanced features in your Korbit Console.

@coderabbitai
Copy link

coderabbitai bot commented Sep 14, 2025

Walkthrough

Adds pairwise canonicalization for ARC grids; introduces dataset builder script using it; adds a top-k micro-F1 metric and exports it; creates an integration tool to evaluate neural guidance reducing search expansions; adds related tests; updates Makefile formatting, pytest discovery, requirements, and ledger documentation.

Changes

Cohort / File(s) Summary
Docs & Project Config
AGENTS.md, Makefile, pytest.ini, requirements.txt
Ledger text updated; Makefile recipe indentation/formatting normalized; PyTest configured to discover tests; SciPy pinned (scipy==1.12.0) added.
Canonicalization & Dataset Prep
arc_solver/canonical.py, prep_build_dataset.py, tests/test_canonical.py, tests/test_prep_build_dataset.py
New canonicalize_pair for joint D4+color canonicalization; dataset builder script creates canonical train_X.npy/train_Y.npy; tests cover type checks, invariance, idempotence, and dataset outputs.
Neural Metrics Export
arc_solver/neural/metrics.py, arc_solver/neural/__init__.py, tests/test_guidance_metrics.py
New top_k_micro_f1 metric; exported via package __all__; test trains NeuralGuidance on a toy task and asserts metric threshold.
Guided Search Integration
tools/integrate_stack.py, tests/test_integrate_stack.py
New tool to load/train NeuralGuidance, compare baseline vs guided operation orders, count node expansions, and report reduction; test asserts >30% reduction on a simple task.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Builder as prep_build_dataset.py
  participant Canon as arc_solver.canonical
  participant FS as File System

  User->>Builder: main(--challenges, --output-dir)
  Builder->>Builder: Load ARC challenges JSON
  loop For each train (input, output)
    Builder->>Canon: canonicalize_pair(input, output)
    Canon->>Canon: Try 8 D4 transforms + joint color relabel
    Canon-->>Builder: (can_input, can_output)
  end
  Builder->>FS: Save train_X.npy, train_Y.npy
  Builder-->>User: Return arrays and paths
Loading
sequenceDiagram
  autonumber
  actor User
  participant Tool as tools/integrate_stack.py
  participant NG as NeuralGuidance
  participant Search as Beam/Search Stack

  User->>Tool: main(--challenges, --task-id, --model, --epochs)
  alt Model exists
    Tool->>NG: load(model)
  else Train new
    Tool->>Tool: Load all tasks
    Tool->>NG: train_from_task_pairs(tasks, epochs)
    NG-->>Tool: trained model
    Tool->>NG: save(model)
  end
  Tool->>Tool: load_task(task-id)
  par Baseline
    Tool->>Search: iterate baseline OPS order
    Search-->>Tool: base_nodes count
  and Guided
    Tool->>NG: score_operations(train_pairs)
    NG-->>Tool: guided order
    Tool->>Search: iterate guided order
    Search-->>Tool: guided_nodes count
  end
  Tool-->>User: reduction = 1 - guided/base
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws—new trails appear,
Grids align, colors clear!
D4 spins, yet pairs stay true,
Guidance learns what ops to do.
Nodes fall faster, F1 climbs high—
Datasets saved; we hop and try.
Carrot-shaped stars in the training sky. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "feat: integrate neural guidance baseline" is concise and accurately reflects the PR’s primary change—adding integration of neural guidance with baseline/search ordering and supporting evaluation utilities (metrics, tools, and tests). It aligns with the PR objectives and the main files added/modified (metrics, integrate_stack, tests), avoids noisy detail, and is clear for teammates scanning the history. The phrasing is specific enough to convey the main intent without needing to list every file.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch codex/complete-section-m0-in-agents.md

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@tylerbessire tylerbessire merged commit c9d69e9 into main Sep 14, 2025
2 of 6 checks passed
@tylerbessire tylerbessire deleted the codex/complete-section-m0-in-agents.md branch September 14, 2025 19:03
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (18)
AGENTS.md (1)

75-86: Progress Ledger: include explicit outcomes and align the top-k claim.

  • The template requires “command + outcome”; add the outcome for each Test Result.
  • M2 note says “micro-F1>=0.55@top-3” while tests appear to assert ≥0.55 at k=2. Align the k value or add evidence for k=3.

Apply:

-    Test Result: pytest tests/test_recolor_fix.py tests/test_translate_fix.py tests/test_canonical.py -q
+    Test Result: pytest tests/test_recolor_fix.py tests/test_translate_fix.py tests/test_canonical.py -q -> PASS

-    Test Result: pytest tests/test_canonical.py tests/test_prep_build_dataset.py -q
+    Test Result: pytest tests/test_canonical.py tests/test_prep_build_dataset.py -q -> PASS

-    Test Result: pytest tests/test_guidance_metrics.py tests/test_integrate_stack.py tests/test_guidance.py tests/test_guidance_training.py tests/test_guidance_from_tasks.py -q
-    Notes: NeuralGuidance hit micro-F1>=0.55@top-3; integrate_stack cut node expansions by >30%
+    Test Result: pytest tests/test_guidance_metrics.py tests/test_integrate_stack.py tests/test_guidance.py tests/test_guidance_training.py tests/test_guidance_from_tasks.py -q -> PASS
+    Notes: NeuralGuidance hit micro-F1>=0.55@top-2 (as tested); integrate_stack cut node expansions by >30%
arc_solver/canonical.py (2)

102-149: Good canonicalisation; consider removing vectorize and deduping color relabel logic.

  • Correctness looks solid; joint D4 + color relabel is implemented properly.
  • Minor perf/readability: replace np.vectorize(mapping.get) with an array-based map to avoid Python-level loops.
  • Consider extracting a shared “relabel by joint frequency” helper to avoid duplicating logic with canonicalize_colors.

Apply within this function:

-        vals, counts = np.unique(np.concatenate([inp_t.ravel(), out_t.ravel()]), return_counts=True)
-        order = [int(v) for v, _ in sorted(zip(vals, counts), key=lambda t: (-t[1], t[0]))]
-        mapping = {c: i for i, c in enumerate(order)}
-        vect_map = np.vectorize(mapping.get)
-        inp_c = vect_map(inp_t).astype(np.int16)
-        out_c = vect_map(out_t).astype(np.int16)
+        vals, counts = np.unique(
+            np.concatenate([inp_t.reshape(-1), out_t.reshape(-1)]), return_counts=True
+        )
+        # Sort by frequency desc, then value asc
+        order = np.array([int(v) for v, _ in sorted(zip(vals, counts), key=lambda t: (-t[1], t[0]))], dtype=vals.dtype)
+        # Fast LUT for non-negative small colour spaces (ARC is 0..9); fallback to dict if not applicable.
+        if order.size and order.min() >= 0 and int(order.max()) <= 1024:
+            lut = np.full(int(order.max()) + 1, -1, dtype=np.int16)
+            lut[order] = np.arange(order.size, dtype=np.int16)
+            inp_c = lut[inp_t].astype(np.int16)
+            out_c = lut[out_t].astype(np.int16)
+        else:
+            mapping = {int(c): i for i, c in enumerate(order)}
+            vec = np.vectorize(mapping.get)
+            inp_c = vec(inp_t).astype(np.int16)
+            out_c = vec(out_t).astype(np.int16)

Optionally, factor a helper:

def _relabel_by_joint_frequency(a: Array, b: Array) -> tuple[Array, Array]:
    # shared logic returning relabelled a,b as int16
    ...

127-129: Ruff TRY003 nit: exception message style.

Static analysis flags long messages in raises. Style-only; keep if you prefer clarity, or shorten messages to silence TRY003.

tests/test_canonical.py (1)

72-85: Add dtype assertions for canonicalize_pair outputs (int16).

Strengthens the contract tested by verifying canonical outputs’ dtype.

Apply:

     can_a, can_b = canonicalize_pair(a, b)
+    assert can_a.dtype == np.int16 and can_b.dtype == np.int16
tests/test_integrate_stack.py (2)

14-26: Seed randomness to avoid flakiness; keep the assertions.

Training may initialise randomly; seed RNGs before training for stability.

Apply:

 def test_guidance_reduces_node_expansions():
+    np.random.seed(0)
+    try:
+        import random
+        random.seed(0)
+    except Exception:
+        pass

5-12: Import hygiene: avoid ad-hoc sys.path for tools if feasible.

If tools/ can be a package, prefer from tools.integrate_stack import evaluate_search_reduction. Otherwise, keep as-is.

tests/test_prep_build_dataset.py (1)

15-17: Make the test hermetic or skip if dataset is missing.

Avoid relying on external data files in CI. Either pass an explicit small temp challenges JSON or skip when the default file is absent.

Apply this diff to add a skip guard and pass the path explicitly:

@@
-from prep_build_dataset import build_dataset
+from prep_build_dataset import build_dataset
+import pytest
@@
-def test_build_dataset(tmp_path: Path) -> None:
-    X, Y = build_dataset(output_dir=tmp_path)
+def test_build_dataset(tmp_path: Path) -> None:
+    cp = Path("data/arc-agi_training_challenges.json")
+    if not cp.exists():
+        pytest.skip("training challenges not available")
+    X, Y = build_dataset(challenges_path=cp, output_dir=tmp_path)
arc_solver/neural/metrics.py (1)

27-31: Validate 2D inputs explicitly.

Guard against accidental 1D arrays before axis=1 ops.

Apply this diff:

@@
-    if probs.shape != labels.shape:
+    if probs.ndim != 2 or labels.ndim != 2:
+        raise ValueError("probs and labels must be 2D arrays")
+    if probs.shape != labels.shape:
         raise ValueError("probs and labels must have the same shape")
tests/test_guidance_metrics.py (2)

13-15: Seed randomness to avoid test flakiness.

Training is stochastic; seed before model init.

Apply this diff:

@@
-    guidance = NeuralGuidance()
+    np.random.seed(0)
+    guidance = NeuralGuidance()

17-19: Avoid using a private method in tests.

Tests calling _features_to_vector may break with refactors. Consider exposing a public “predict_proba(features)” API on the model and use it here.

Would you like me to add a small public wrapper on SimpleClassifier to return the probability vector given features?

prep_build_dataset.py (1)

38-45: Silence unused loop variable (Ruff B007).

Rename to underscore to match intent.

Apply this diff:

-    for task_id, task in challenges.items():
+    for _task_id, task in challenges.items():
tools/integrate_stack.py (7)

24-25: Don’t configure logging at import time.

Move basicConfig into main to avoid side effects on import.

Apply this diff:

-logging.basicConfig(level=logging.INFO)
+# Configure logging in main()

72-79: Initialize logging in main.

Set format and level after parsing args.

Apply this diff:

@@
-    args = parser.parse_args()
+    args = parser.parse_args()
+    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

52-55: Narrow exception and log at debug.

Blind except hides real issues.

Apply this diff:

-                except Exception:
-                    continue
+                except Exception as exc:
+                    logger.debug("candidate failed op=%s params=%s err=%s", op, params, exc)
+                    continue

58-60: Use a stable baseline order.

Depending on key rotation is brittle; keep insertion order or sort.

Apply this diff:

-    baseline_order = list(OPS.keys())[2:] + list(OPS.keys())[:2]
+    baseline_order = list(OPS.keys())  # stable baseline

65-68: Log structured values without relying on extra.

extra fields require custom formatters; use standard message.

Apply this diff:

-    logger.info(
-        "integrate_stack", extra={"baseline": base_nodes, "guided": guided_nodes, "reduction": reduction}
-    )
+    logger.info("integrate_stack baseline=%d guided=%d reduction=%.3f", base_nodes, guided_nodes, reduction)

34-36: Unify dtypes with the rest of the pipeline.

Use int16 for consistency with dataset/canonicalization.

Apply this diff:

-        inp = np.array(pair["input"], dtype=int)
-        out = np.array(pair["output"], dtype=int)
+        inp = np.array(pair["input"], dtype=np.int16)
+        out = np.array(pair["output"], dtype=np.int16)

13-14: Avoid sys.path mutation.

Prefer proper packaging or running via module (-m). Non-blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5966e44 and af49f58.

📒 Files selected for processing (13)
  • AGENTS.md (1 hunks)
  • Makefile (2 hunks)
  • arc_solver/canonical.py (1 hunks)
  • arc_solver/neural/__init__.py (2 hunks)
  • arc_solver/neural/metrics.py (1 hunks)
  • prep_build_dataset.py (1 hunks)
  • pytest.ini (1 hunks)
  • requirements.txt (1 hunks)
  • tests/test_canonical.py (3 hunks)
  • tests/test_guidance_metrics.py (1 hunks)
  • tests/test_integrate_stack.py (1 hunks)
  • tests/test_prep_build_dataset.py (1 hunks)
  • tools/integrate_stack.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
prep_build_dataset.py

📄 CodeRabbit inference engine (AGENTS.md)

prep_build_dataset.py fills train_X.npy and train_Y.npy

Files:

  • prep_build_dataset.py
AGENTS.md

📄 CodeRabbit inference engine (AGENTS.md)

Maintain the Progress Ledger in AGENTS.md using the specified completion format

Files:

  • AGENTS.md
tests/test_canonical.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/test_canonical.py must verify canonicalisation (D4 rotations/reflections hash-equal)

Files:

  • tests/test_canonical.py
Makefile

📄 CodeRabbit inference engine (AGENTS.md)

Ensure Makefile targets run locally

Files:

  • Makefile
🧬 Code graph analysis (7)
tests/test_prep_build_dataset.py (2)
prep_build_dataset.py (1)
  • build_dataset (20-56)
arc_solver/canonical.py (1)
  • canonicalize_pair (102-149)
prep_build_dataset.py (1)
arc_solver/canonical.py (1)
  • canonicalize_pair (102-149)
tools/integrate_stack.py (3)
arc_solver/neural/guidance.py (1)
  • score_operations (202-217)
arc_solver/heuristics.py (1)
  • score_candidate (132-141)
arc_solver/neural/sketches.py (1)
  • generate_parameter_grid (195-234)
arc_solver/neural/__init__.py (1)
arc_solver/neural/metrics.py (1)
  • top_k_micro_f1 (10-48)
tests/test_integrate_stack.py (3)
arc_solver/grid.py (1)
  • to_array (37-39)
arc_solver/neural/guidance.py (2)
  • NeuralGuidance (167-354)
  • train_from_task_pairs (250-305)
tools/integrate_stack.py (1)
  • evaluate_search_reduction (40-69)
tests/test_guidance_metrics.py (4)
arc_solver/grid.py (1)
  • to_array (37-39)
arc_solver/neural/guidance.py (4)
  • NeuralGuidance (167-354)
  • train_from_task_pairs (250-305)
  • _features_to_vector (92-116)
  • forward (41-49)
arc_solver/neural/metrics.py (1)
  • top_k_micro_f1 (10-48)
arc_solver/features.py (1)
  • extract_task_features (19-118)
tests/test_canonical.py (1)
arc_solver/canonical.py (3)
  • canonicalize_colors (28-58)
  • canonicalize_D4 (61-99)
  • canonicalize_pair (102-149)
🪛 Ruff (0.12.2)
prep_build_dataset.py

38-38: Loop control variable task_id not used within loop body

Rename unused task_id to _task_id

(B007)

arc_solver/neural/metrics.py

28-28: Avoid specifying long messages outside the exception class

(TRY003)


31-31: Avoid specifying long messages outside the exception class

(TRY003)

arc_solver/canonical.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)


129-129: Avoid specifying long messages outside the exception class

(TRY003)

tools/integrate_stack.py

54-55: try-except-continue detected, consider logging the exception

(S112)


54-54: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (9)
pytest.ini (1)

1-2: LGTM: minimal PyTest config is correct.

testpaths = tests ensures discovery under tests/. No issues.

Makefile (3)

8-8: LGTM: deps target uses pinned requirements.

Tab indentation fixed; no functional change.


18-18: LGTM: submit target is correctly tab-indented and parameterized.


11-15: No action required — scripts present and invokable via $(PY)

Both tools exist at tools/build_memory.py and tools/train_guidance_on_arc.py; the Makefile calls them with '$(PY) -u', so file execute permissions are not required. No change needed.

Likely an incorrect or invalid review comment.

tests/test_canonical.py (1)

54-61: LGTM: D4 invariance and idempotence are well-covered.

This satisfies the guideline that rotations/reflections hash-equal under canonicalisation.

arc_solver/neural/__init__.py (2)

6-6: LGTM: public export for metric is correct.

top_k_micro_f1 is now accessible via arc_solver.neural.


19-20: LGTM: all updated accordingly.

requirements.txt (1)

3-3: Verified: scipy==1.12.0 is compatible with numpy==1.26.4 and PyPI provides prebuilt wheels for CPython 3.11/3.12.

SciPy 1.12 requires Python >=3.9,<3.13 and NumPy >=1.22.4,<2.0, and PyPI shows cp311/cp312 manylinux/musllinux/macos/windows wheels for 1.12.0, so standard CI runners using CPython 3.11/3.12 should get wheels (confirm your specific CI image/arch for musl/Apple-tag edge cases). (docs.scipy.org)

tests/test_prep_build_dataset.py (1)

21-23: Strengthen assertions (dtype and round-trip load).

File: tests/test_prep_build_dataset.py Lines: 21-23 — validate dtypes and that saved arrays round-trip; dataset present: data/arc-agi_training_challenges.json (1000 items).

Apply this diff:

@@
-    cx, cy = canonicalize_pair(X[0], Y[0])
+    cx, cy = canonicalize_pair(X[0], Y[0])
     assert np.array_equal(cx, X[0])
     assert np.array_equal(cy, Y[0])
+    assert X[0].dtype == np.int16 and Y[0].dtype == np.int16
+    X2 = np.load(x_path, allow_pickle=True)
+    Y2 = np.load(y_path, allow_pickle=True)
+    assert len(X2) == len(X) and len(Y2) == len(Y)

Comment on lines +33 to +37
topk_indices = np.argsort(-probs, axis=1)[:, :k]
pred = np.zeros_like(labels, dtype=bool)
for i, idxs in enumerate(topk_indices):
pred[i, idxs] = True

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Use argpartition and vectorized assignment (faster, no Python loop).

Improves performance for large class counts.

Apply this diff:

-    topk_indices = np.argsort(-probs, axis=1)[:, :k]
-    pred = np.zeros_like(labels, dtype=bool)
-    for i, idxs in enumerate(topk_indices):
-        pred[i, idxs] = True
+    topk_indices = np.argpartition(-probs, kth=k-1, axis=1)[:, :k]
+    pred = np.zeros_like(labels, dtype=bool)
+    rows = np.arange(probs.shape[0])[:, None]
+    pred[rows, topk_indices] = True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
topk_indices = np.argsort(-probs, axis=1)[:, :k]
pred = np.zeros_like(labels, dtype=bool)
for i, idxs in enumerate(topk_indices):
pred[i, idxs] = True
topk_indices = np.argpartition(-probs, kth=k-1, axis=1)[:, :k]
pred = np.zeros_like(labels, dtype=bool)
rows = np.arange(probs.shape[0])[:, None]
pred[rows, topk_indices] = True
🤖 Prompt for AI Agents
In arc_solver/neural/metrics.py around lines 33 to 37, the current code uses
np.argsort and a Python loop to set top-k predictions which is slow for many
classes; replace it by using np.argpartition(-probs, kth=k-1, axis=1)[:, :k] to
get the top-k indices in O(n) time, then create pred = np.zeros_like(labels,
dtype=bool) and set the entries to True with a vectorized advanced-index
assignment using row indices (np.arange(probs.shape[0])[:, None]) and the top-k
index array so no Python loop is needed. Ensure k <= number of classes and keep
dtype=bool for pred.

@coderabbitai coderabbitai bot mentioned this pull request Sep 21, 2025
Merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants