labclaw · robotlearning123 · Mar 27, 2026
diff --git a/.claude/skills/autoresearch/SKILL.md b/.claude/skills/autoresearch/SKILL.md
@@ -54,16 +54,35 @@ Then:
 
 1. Create branch: `git checkout -b autoresearch/<tag>`
 2. Read all in-scope files for full context
-3. Add `results.tsv` and `run.log` to `.gitignore` (create if needed):
+3. Add `results.tsv`, `run.log`, and `.autoresearch_wandb.json` to `.gitignore` (create if needed):
    ```bash
-   echo -e "results.tsv\nrun.log" >> .gitignore
+   echo -e "results.tsv\nrun.log\n.autoresearch_wandb.json" >> .gitignore
    ```
 4. Create `results.tsv` with header row:
    ```
    commit	metric	status	description
    ```
-5. Run the baseline (no changes) and record it as the first row
-6. Confirm setup with the human, then begin the loop
+5. **Launch dashboard** (local web UI — zero deps, stdlib only):
+   ```bash
+   python dashboard/server.py --port 8420 --results results.tsv &
+   echo "Dashboard: http://localhost:8420"
+   ```
+   The dashboard auto-refreshes every 3s and shows metric trends, keep/discard/crash
+   counts, and the full experiment table. It works in any browser with no extra install.
+
+6. **Initialize wandb** (optional remote logging — requires `pip install wandb`):
+   ```bash
+   # Only if the user wants remote logging
+   python dashboard/wandb_logger.py --init \
+     --project autoresearch-<tag> \
+     --name "<objective>" \
+     --config '{"objective":"<objective>","direction":"<direction>","scope":"<edit-scope>"}'
+   ```
+   If wandb is not installed or not wanted, skip this step. All data is always
+   available locally via `results.tsv` and the dashboard regardless.
+
+7. Run the baseline (no changes) and record it as the first row
+8. Confirm setup with the human, then begin the loop
 
 **Once the human confirms, you are autonomous. Do not ask again.**
 
@@ -130,7 +149,15 @@ Append to `results.tsv` (tab-separated):
 <commit-hash-7char>	<metric-value-or-ERR>	<keep|discard|crash>	<description>
 ```
 
+**Wandb sync** (if initialized in Phase 0):
+```bash
+python dashboard/wandb_logger.py --log \
+  --metric <metric-value-or-nan> --status <keep|discard|crash> \
+  --desc "<description>"
+```
+
 **Do NOT commit results.tsv** — it's in `.gitignore` so reverts don't lose the log.
+The dashboard picks up changes automatically (3s poll). No manual refresh needed.
 
 ### Step 7 — Repeat
 Go back to Step 1. **NEVER STOP.**
@@ -210,7 +237,17 @@ after that is happening in parallel.
 
 ## Phase 2: Summary (when human returns)
 
-When the human interrupts or you detect they're back, produce a summary:
+When the human interrupts or you detect they're back:
+
+1. **Finish wandb run** (if initialized):
+   ```bash
+   python dashboard/wandb_logger.py --finish
+   ```
+2. **Stop the dashboard** (if still running):
+   ```bash
+   pkill -f "dashboard/server.py" 2>/dev/null
+   ```
+3. Produce a summary:
 
 ```
 === Autoresearch Summary ===

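The row format that the SKILL.md hunk above appends to `results.tsv` can be sketched in Python. This is a minimal illustration and not part of the change: `append_result`, `HEADER`, and `VALID_STATUSES` are hypothetical names. Truncating the hash to 7 characters and writing `ERR` for a missing metric follow the `<commit-hash-7char>` and `<metric-value-or-ERR>` placeholders shown in the skill.

```python
from pathlib import Path
from typing import Optional

# Header row as given in the skill's setup step.
HEADER = "commit\tmetric\tstatus\tdescription"
VALID_STATUSES = {"keep", "discard", "crash"}


def append_result(
    path: Path,
    commit: str,
    metric: Optional[float],
    status: str,
    description: str,
) -> str:
    """Append one experiment row to results.tsv and return the line written."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    # Tabs or newlines inside the description would break the column layout,
    # so collapse all whitespace runs to single spaces.
    clean = " ".join(description.split())
    metric_field = "ERR" if metric is None else repr(float(metric))
    line = "\t".join([commit[:7], metric_field, status, clean])
    # Create the file with its header row on first use.
    if not path.exists() or path.stat().st_size == 0:
        path.write_text(HEADER + "\n")
    with path.open("a") as fh:
        fh.write(line + "\n")
    return line
```

Sanitizing the description matters because the dashboard parser in this change splits each line on tabs without any quoting rules.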
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+__pycache__/
diff --git a/README.md b/README.md
@@ -34,18 +34,53 @@ A Claude Code skill that runs autonomous improvement loops on any codebase. Insp
 └─────────────────────────────────────────────────────┘
 ```
 
+## Dashboard & Observability
+
+### Local Web Dashboard
+
+A built-in web dashboard (zero Python dependencies, stdlib only) shows live experiment
+progress — metric trends, keep/discard/crash stats, and a full experiment table.
+
+```bash
+python dashboard/server.py --port 8420 --results results.tsv
+```
+
+Open `http://localhost:8420` in any browser. The page auto-refreshes every 3 seconds.
+
+The skill automatically launches this during setup. No manual action needed.
+
+### Remote Logging with Wandb
+
+Optional integration with [Weights & Biases](https://wandb.ai) for remote experiment tracking:
+
+```bash
+# Install wandb (optional)
+pip install wandb
+
+# During autoresearch, the skill can log to wandb
+python dashboard/wandb_logger.py --init --project my-experiment --config '{"objective":"minimize loss"}'
+
+# Replay all results.tsv data to wandb after the fact
+python dashboard/wandb_logger.py --replay --project my-experiment
+```
+
+Wandb is fully optional. All data is always available locally via `results.tsv` and the dashboard.
+
 ## Install
 
 Copy the skill to your Claude Code skills directory:
 
 ```bash
-# User-level (all projects)
+# User-level (all projects) — skill only
 mkdir -p ~/.claude/skills/autoresearch
 cp .claude/skills/autoresearch/SKILL.md ~/.claude/skills/autoresearch/SKILL.md
 
 # Or project-level (current project only, after cloning this repo)
 mkdir -p /path/to/your/project/.claude/skills/autoresearch
 cp .claude/skills/autoresearch/SKILL.md /path/to/your/project/.claude/skills/autoresearch/
+
+# For dashboard + wandb support, copy the dashboard/ directory into your project:
+cp -r dashboard /path/to/your/project/
 ```
 
 Or one-liner from GitHub:
@@ -56,6 +91,12 @@ curl -sL https://raw.githubusercontent.com/labclaw/autoresearch-skill/main/.clau
   -o ~/.claude/skills/autoresearch/SKILL.md
 ```
 
+Or use the install script:
+
+```bash
+./install.sh
+```
+
 ## Usage
 
 ### Interactive setup
@@ -156,6 +197,7 @@ Directly from [Karpathy's autoresearch](https://github.com/karpathy/autoresearch
 | ML architecture only | Any code changes |
 | Single GPU required | Any compute environment |
 | Claude Code / Cursor required | Claude Code only |
+| No UI | Local web dashboard + wandb integration |
 
 The loop structure is identical: **propose -> commit -> run -> evaluate -> keep/discard -> repeat**.
 

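The 3-second refresh described in the README can also be reproduced outside a browser. The sketch below polls the `/api/status` endpoint and its `results_modified` field, both defined in `dashboard/server.py` in this change. The names `fetch_json` and `watch_status`, and the injectable `fetch` and `sleep` parameters, are hypothetical helpers added for illustration and testing.

```python
import json
import time
from typing import Any, Callable, Iterator, Optional
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 2.0) -> Any:
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def watch_status(
    base_url: str,
    interval: float = 3.0,
    max_polls: Optional[int] = None,
    fetch: Callable[[str], Any] = fetch_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield the status payload whenever results_modified changes."""
    last_modified = None
    polls = 0
    while max_polls is None or polls < max_polls:
        status = fetch(f"{base_url}/api/status")
        polls += 1
        # Only report when the results file's mtime differs from the last poll.
        if status.get("results_modified") != last_modified:
            last_modified = status.get("results_modified")
            yield status
        # Skip the final sleep once the poll budget is used up.
        if max_polls is None or polls < max_polls:
            sleep(interval)
```

With the dashboard running, `for s in watch_status("http://localhost:8420"): print(s["total"], s["best_metric"])` would print a line each time a new row lands in `results.tsv`.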
diff --git a/dashboard/__init__.py b/dashboard/__init__.py
@@ -0,0 +1,6 @@
+# Autoresearch Dashboard & Logging
+
+from dashboard.server import DashboardHandler, parse_results_tsv
+from dashboard.wandb_logger import WandbLogger
+
+__all__ = ["DashboardHandler", "parse_results_tsv", "WandbLogger"]
diff --git a/dashboard/server.py b/dashboard/server.py
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""Lightweight web dashboard for autoresearch experiment tracking.
+
+Serves a live-updating HTML page that reads results.tsv and displays
+metric trends, experiment status, and run details. Uses only Python
+stdlib — no external dependencies required.
+
+Usage:
+    python dashboard/server.py [--port 8420] [--results results.tsv]
+"""
+
+import argparse
+import json
+import os
+from http.server import HTTPServer, SimpleHTTPRequestHandler
+from pathlib import Path
+from typing import Any
+
+HERE = Path(__file__).parent
+TEMPLATES = HERE / "templates"
+STATIC = HERE / "static"
+
+
+def parse_results_tsv(path: Path) -> list[dict[str, Any]]:
+    """Parse results.tsv into a list of dicts."""
+    rows = []
+    if not path.exists():
+        return rows
+    try:
+        text = path.read_text().strip()
+        if not text:
+            return rows
+        lines = text.split("\n")
+        if not lines:
+            return rows
+        header = lines[0].split("\t")
+        for line in lines[1:]:
+            parts = line.split("\t")
+            row = {}
+            for i, col in enumerate(header):
+                row[col] = parts[i] if i < len(parts) else ""
+            rows.append(row)
+    except (OSError, UnicodeDecodeError):
+        pass  # unreadable or partially written file: return rows parsed so far
+    return rows
+
+
+class DashboardHandler(SimpleHTTPRequestHandler):
+    """HTTP handler that serves the autoresearch dashboard."""
+
+    results_path: Path = Path("results.tsv")
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, directory=str(STATIC), **kwargs)
+
+    def do_GET(self):
+        if self.path == "/" or self.path == "/index.html":
+            self._serve_template()
+        elif self.path == "/api/results":
+            self._serve_results()
+        elif self.path == "/api/status":
+            self._serve_status()
+        else:
+            super().do_GET()
+
+    def _serve_template(self):
+        """Serve the main dashboard HTML."""
+        template = TEMPLATES / "index.html"
+        if template.exists():
+            content = template.read_bytes()
+            self.send_response(200)
+            self.send_header("Content-Type", "text/html; charset=utf-8")
+            self.send_header("Content-Length", str(len(content)))
+            self.end_headers()
+            self.wfile.write(content)
+        else:
+            self.send_error(404, "Template not found")
+
+    def _serve_results(self):
+        """Serve parsed results.tsv as JSON."""
+        rows = parse_results_tsv(self.results_path)
+        payload = json.dumps({"experiments": rows}).encode()
+        self.send_response(200)
+        self.send_header("Content-Type", "application/json")
+        self.send_header("Content-Length", str(len(payload)))
+        self.end_headers()
+        self.wfile.write(payload)
+
+    def _serve_status(self):
+        """Serve a lightweight status endpoint."""
+        rows = parse_results_tsv(self.results_path)
+        kept = sum(1 for r in rows if r.get("status") == "keep")
+        discarded = sum(1 for r in rows if r.get("status") == "discard")
+        crashed = sum(1 for r in rows if r.get("status") == "crash")
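+        metrics = []
+        for r in rows:
+            if r.get("status") != "keep":
+                continue
+            try:
+                value = float(r.get("metric", ""))
+            except ValueError:
+                continue  # "ERR", blank, or otherwise non-numeric
+            if float("-inf") < value < float("inf"):  # NaN/inf are not valid JSON
+                metrics.append(value)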
+        payload = json.dumps(
+            {
+                "total": len(rows),
+                "kept": kept,
+                "discarded": discarded,
+                "crashed": crashed,
+                "best_metric": max(metrics) if metrics else None,
+                "worst_metric": min(metrics) if metrics else None,
+                "results_modified": os.path.getmtime(self.results_path)
+                if self.results_path.exists()
+                else 0,
+            }
+        ).encode()
+        self.send_response(200)
+        self.send_header("Content-Type", "application/json")
+        self.send_header("Content-Length", str(len(payload)))
+        self.end_headers()
+        self.wfile.write(payload)
+
+    def log_message(self, format, *args):
+        """Quiet logging: drop routine access logs, keep 4xx/5xx responses."""
+        if len(args) >= 2 and str(args[1])[:1] in ("4", "5"):
+            super().log_message(format, *args)  # args: (requestline, code, size)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Autoresearch Dashboard")
+    parser.add_argument(
+        "--port",
+        "-p",
+        type=int,
+        default=8420,
+        help="Port to serve on (default: 8420)",
+    )
+    parser.add_argument(
+        "--results",
+        "-r",
+        type=str,
+        default="results.tsv",
+        help="Path to results.tsv (default: ./results.tsv)",
+    )
+    args = parser.parse_args()
+
+    results_path = Path(args.results).resolve()
+    DashboardHandler.results_path = results_path
+
+    # Serve from the autoresearch working directory for results.tsv access
+    os.chdir(results_path.parent)
+
+    # Note: "0.0.0.0" listens on every network interface, not only localhost.
+    server = HTTPServer(("0.0.0.0", args.port), DashboardHandler)
+    url = f"http://localhost:{args.port}"
+    print(f"Autoresearch Dashboard: {url}")
+    print(f"Watching: {results_path}")
+    print("Press Ctrl+C to stop")
+    try:
+        server.serve_forever()
+    except KeyboardInterrupt:
+        print("\nDashboard stopped.")
+        server.server_close()
+
+
+if __name__ == "__main__":
+    main()
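
`_serve_status` reports `best_metric` as the maximum kept metric, which matches a maximize objective only. The sketch below shows one way the same counts could be computed with an explicit objective direction. The function `summarize` and its `direction` parameter are hypothetical and not part of the server. The row shape is assumed to match what `parse_results_tsv` returns.

```python
import math
from typing import Any, Optional


def summarize(rows: list[dict[str, Any]], direction: str = "maximize") -> dict[str, Any]:
    """Count statuses and pick the best kept metric for the given direction."""
    if direction not in ("maximize", "minimize"):
        raise ValueError("direction must be 'maximize' or 'minimize'")
    counts = {"keep": 0, "discard": 0, "crash": 0}
    kept_metrics: list[float] = []
    for row in rows:
        status = row.get("status", "")
        if status in counts:
            counts[status] += 1
        if status != "keep":
            continue
        try:
            value = float(row.get("metric", ""))
        except (TypeError, ValueError):
            continue  # "ERR", blanks, and other non-numeric values
        # NaN and infinity cannot be compared or serialized meaningfully.
        if math.isfinite(value):
            kept_metrics.append(value)
    pick_best = max if direction == "maximize" else min
    best: Optional[float] = pick_best(kept_metrics) if kept_metrics else None
    return {"total": len(rows), **counts, "best_metric": best}
```

For a loss-style objective, `summarize(rows, "minimize")` returns the lowest kept value as `best_metric` instead of the highest.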