Skip to content

Latest commit

 

History

History
435 lines (333 loc) · 17.2 KB

File metadata and controls

435 lines (333 loc) · 17.2 KB

CLAUDE.md

This file provides guidance to Claude AI-Engineers when working with code in this repository: ReCodeAgent. Updated Date (date -u '+%Y-%m-%dT%H:%M:%SZ'): 2025-11-22T12:25:10Z

  • (./2510.23564v2.md) Core ReCode Paper!!! MUST READ!!!
  • ReCodeAgent-Codex-TB2 Harbor ENV Run terminal-bench-2 BENCHMARK Evaluate TASKs Operation Skill: (.claude/skills/ReCodeAgent-TB2-Evaluate/SKILL.md)
  • SPECIFICATION: (dev-spec/)
  • DEV-WORKLOGS: (.worklogs/)
  • RECODE-PAPER-DEMO: (.dev-docs/)
  • CODEX-CLI-Docs: (.knowledge/codex-cli/docs/)
  • CODEX-CLI-TypeScript-SDK: (.knowledge/codex-cli/sdk/typescript/)
  • RECODE-AGENT-ARCHITECTURE: (dev-spec/architecture/RECODE_ARCHITECTURE_V0.1.0.md)
  • ROADMAP: (dev-spec/roadmap/ROADMAP_V2_2025111602.md)

Your Core Skills: (.claude/skills/)


ReCodeAgent -> (codex exec) -> Harbor -> Terminal-Bench 2.0 Evaluate TASKs Run

ReCodeAgent now uses Harbor Container-Unified Architecture to run Terminal-Bench 2.0 tasks.

ReCodeAgent-Codex-TB2 Evolve Task Run Unified Runtime ENV Running Tasks

# Harbor containerized run (recommended)
cd ~/harbor-workspace
harbor run -d terminal-bench@2.0 -t <task-name> -a recode-agent

# Specify template
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent \
  --agent-kwarg template=recode_tb2_prompt.jinja2

ReCodeAgent-Codex-TB2 Evolve Task cargo run TB2 Test

cargo run --release --manifest-path recode-core/Cargo.toml -- \
  --task <task-name or none task for random task selection> \
  --prompt <prompt-template-name> \
  --max-steps <max-steps default: 99999>
./scripts/run_tb2_test.sh \
    --task <task-name or none task for random task selection> \
    --prompt <prompt-template-name> \
    --max-steps <max-steps default: 99999>

Key Files

Purpose Path
Rust CLI entry recode-core/src/main.rs
Python bridge script scripts/terminal_bench_bridge.py
Prompt templates recode-core/templates/*.jinja2
Harbor environment guide .claude/skills/ReCodeAgent-TB2-Evaluate/SKILL.md

References


Project Overview

ReCodeAgent is a research project implementing the ReCode paper (arXiv:2510.23564) - a novel LLM-based AI agent paradigm that achieves universal granularity control through recursive code generation. The project aims to productionize the academic prototype into a high-performance Rust Core + Codex CLI integrated system.

Core Innovation: ReCode unifies plans and actions into a single code representation, enabling dynamic granularity control. High-level tasks are represented as placeholder functions that recursively expand into finer-grained subtasks until reaching executable primitive actions.

Current Status: Production-ready Rust implementation with Harbor integration for Terminal-Bench 2.0 evaluation.

recode-core/
├── Cargo.toml
├── Cargo.lock
├── Dockerfile                 # Container image config
│
├── examples/
│   └── terminal_bench_smoke.rs  # Quick local test (10 steps)
│
├── harbor-assets/             # Harbor deployment assets
│   ├── recode-agent           # Linux x86_64 binary
│   └── templates/             # Deployment templates
│
├── templates/                 # Jinja2 ReCodeAgent Prompts templates (source)
│   ├── recode_tb2_agents_md.jinja2      # Default TB2 AGENTS.md template
│   ├── recode_tb2_prompt.jinja2         # Default TB2 terminal-bench-2 task prompt
│   ├── recode_microtexecute_tb2_prompt.jinja2  # Micro-Steps TB2 task version
│   ├── recode_tb2_checkpoint_minimal.jinja2    # Minimal version
│   └── fewshots/              # Few-shot examples (old version, need update)
│
├── src/
│   ├── lib.rs
│   ├── main.rs                # CLI entry (subcommands: run, render-template, execute)
│   │
│   ├── analysis/
│   │   ├── mod.rs
│   │   └── ast_splitter.rs
│   │
│   ├── codex/
│   │   ├── mod.rs
│   │   └── thread_manager.rs  # Codex CLI integration (codex exec --json)
│   │
│   ├── environments/
│   │   ├── mod.rs
│   │   ├── alfworld.rs
│   │   ├── process_bridge.rs
│   │   ├── terminal_bench.rs
│   │   └── webshop.rs
│   │
│   ├── execution/
│   │   ├── mod.rs
│   │   ├── env_adapter.rs
│   │   ├── env_factory.rs
│   │   └── python_executor.rs  # Note: includes import re, os fix
│   │
│   ├── orchestrator/
│   │   ├── mod.rs
│   │   ├── engine.rs          # Codex prompt assembly
│   │   └── runtime.rs         # DFS tree execution + checkpoint mechanism
│   │
│   └── tree/
│       ├── mod.rs
│       ├── context.rs
│       ├── node.rs
│       └── tree.rs
│
└── tests/
    ├── codex_turn_tests.rs
    ├── context_tests.rs
    ├── orchestrator_tests.rs
    └── fixtures/
        └── codex_echo.jsonl

Architecture

Harbor Container-Unified Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   macOS Development Environment (Host)          │
├─────────────────────────────────────────────────────────────────┤
│  ReCodeAgent Repository                                         │
│  ~/dev-space/ReCodeAgent/                                       │
│  ├── recode-core/src/           # Rust source code              │
│  └── recode-core/templates/     # Jinja2 Prompt templates (src) │
├─────────────────────────────────────────────────────────────────┤
│  Harbor Installation Directory                                  │
│  ~/.local/share/uv/tools/harbor/.../agents/installed/           │
│  ├── recode_agent.py            # Harbor Agent definition       │
│  └── recode-assets/             # Deployment assets             │
│      ├── recode-agent           # Linux x86_64 binary           │
│      ├── templates/             # Jinja2 templates              │
│      └── scripts/               # Python bridge scripts         │
└─────────────────────────────────────────────────────────────────┘
                              │ Docker Volume
                              │ ${HOME}/.codex:/tmp/host-codex:ro
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│             Docker Container (Terminal-Bench 2.0)               │
├─────────────────────────────────────────────────────────────────┤
│  /app/                        # Working directory               │
│  ├── recode-agent             # ReCodeAgent binary              │
│  ├── AGENTS.md                # Rendered system prompt          │
│  │                            # (auto-loaded by Codex)          │
│  ├── instruction.md           # Task instruction                │
│  ├── templates/               # Jinja2 templates                │
│  └── .codex/                  # Codex CLI config                │
│      ├── auth.json            # Auth (copied from host)         │
│      └── config.toml          # Model configuration             │
└─────────────────────────────────────────────────────────────────┘

Execution Flow (Harbor Mode)

harbor run -d terminal-bench@2.0 -t <task> -a recode-agent
    │
    ▼
┌─────────────────┐
│ 1. Upload Assets│  recode-agent binary, templates/, scripts/
└───────┬─────────┘
        │
        ▼
┌─────────────────┐
│ 2. Install Script│  install-recode-agent.sh.j2
└───────┬─────────┘
        │
        ▼
┌─────────────────┐
│ 3. Execute Steps│  command-0 → command-1 → command-2 → command-3
└───────┬─────────┘
        │
        ├──▶ Step 1: Setup Codex auth (copy auth.json, config.toml)
        │
        ├──▶ Step 2: Render AGENTS.md (recode-agent render-template)
        │
        ├──▶ Step 3: Execute task (recode-agent execute + codex exec)
        │             └── DFS tree traversal + checkpoint self-verification
        │
        └──▶ Step 4: Cleanup

CLI Subcommand Structure

enum Command {
    /// Legacy: Run with explicit bridge configuration
    Run { env_kind, python, bridge, bridge_args, instruction },

    /// Render AGENTS.md template (Harbor Step 2)
    RenderTemplate { template, output, task_name, instruction_path },

    /// Execute task (Harbor Step 3) - Core execution engine
    Execute { task_name, instruction, working_dir, max_steps, codex_home },
}

Development Commands

Harbor Test Run

"(~/harbor-workspace) is the default running harbor workspace path for TB2 test logs."

cd ~/harbor-workspace

# Basic run
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent

# Specify template
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent \
  --agent-kwarg template=recode_tb2_prompt.jinja2

# Limit steps + debug
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent \
  --agent-kwarg max_steps=50 --debug

# Batch run all tasks
harbor run -d terminal-bench@2.0 -a recode-agent -n 4

Rust Build

cd ~/dev-space/ReCodeAgent

# Local macOS build (development testing)
cargo build --release --manifest-path recode-core/Cargo.toml

# Linux x86_64 build (Harbor container)
docker build --platform linux/amd64 -f Dockerfile.build-x86 -t recode-builder .
docker create --name tmp recode-builder
docker cp tmp:/build/recode-core/target/release/recode-core ./recode-agent-linux-x86_64
docker rm tmp

# Verify
file ./recode-agent-linux-x86_64
# Should output: ELF 64-bit LSB pie executable, x86-64...

Sync to Harbor

HARBOR_ASSETS=~/.local/share/uv/tools/harbor/lib/python3.13/site-packages/harbor/agents/installed/recode-assets

# Sync binary
cp ./recode-agent-linux-x86_64 $HARBOR_ASSETS/recode-agent
chmod +x $HARBOR_ASSETS/recode-agent

# Sync templates
cp -r recode-core/templates/*.jinja2 $HARBOR_ASSETS/templates/

# Sync scripts
cp scripts/terminal_bench_bridge.py $HARBOR_ASSETS/scripts/

Local Quick Test

# Smoke test (10 steps)
cargo run --example terminal_bench_smoke --release

# CLI subcommand test
cargo run --release --manifest-path recode-core/Cargo.toml -- \
  execute --task-name test --instruction "test" --working-dir /tmp --max-steps 5

View Results

# Latest task results
ls -lt ~/harbor-workspace/jobs/ | head -3

# Execution logs
cat ~/harbor-workspace/jobs/<job-id>/<task-id>/agent/command-2/stdout.txt | tail -50

# Verification result
cat ~/harbor-workspace/jobs/<job-id>/<task-id>/verifier/reward.txt

RECODE Core Methodology

PRELIMINARY

Decision Process of LLM-based AI-Agent

We model the interaction between LLM-based AI-Agent and its environment as a simplified decision process:

$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R \rangle$

Where:

  • $\mathcal{S}$ : State space
  • $\mathcal{A}$ : Primitive action space
  • $\mathcal{O}$ : Observation space
  • $T: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ : Transition function
  • $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ : Reward function

At each step, the AI-Agent receives an observation $o \in \mathcal{O}$ and generates decisions that ultimately translate into executable primitive actions $a \in \mathcal{A}$.

Beyond the primitive action space, we introduce a plan space $\mathcal{P}$, which contains intentions and goals that cannot be directly executed but must be refined into a sequence of primitive actions or intermediate sub-plans with coarser granularity. We define the decision space as $\mathcal{D} = \mathcal{A} \cup \mathcal{P}$, representing all possible outputs that AI-Agent can produce at different granularities.

Decision Granularity: Real-world tasks require decisions at different granularities. Fine-grained decisions in $\mathcal{A}$ correspond to immediately executable primitive operations, such as run('crack egg'), while coarse-grained decisions in $\mathcal{P}$ represent high-level intentions that need decomposition, such as prepare_breakfast().

The granularity forms a natural hierarchy: Taking breakfast preparation as an example, the decision "prepare breakfast" is coarser than "cook eggs", which is coarser than "crack egg". Each level contains broader goals and longer time horizons.

METHOD OVERVIEW

ReCode achieves global control over decision granularity through recursive code generation. The core is unifying plan and action into the same representation.

Unify Plan and Action: We represent both plans and actions as Python function calls. Actions are executable, like run('click button'). Plans are represented as placeholder functions, like prepare_breakfast() and get_ingredients().

Recursive Code Generation:

Algorithm 1: The ReCode Algorithm

Procedure ReCode(T, π, E, c):
    if c is None:                      // Initialize
        o_0 ← Reset(E)                 // Reset environment
        c ← Text2Code(T, o_0)          // Convert task to root placeholder
    end if

    code block ← π(c)                  // LLM generates child code

    for each child u in code block:
        if IsPrimitive(u):             // Primitive action
            Execute(u, E)
        else:                          // Placeholder function
            ReCode(T, π, E, u)         // Recursive expansion
        end if
    end for
end procedure

IMPLEMENTATION DETAILS

Task Initialization: Task instruction → root placeholder function solve(instruction, observation)

Context Management: Unified variable namespace, persisted across recursion levels

Error Handling: Self-correction loop (max_rewrite=5)

Recursion Control: Maximum recursion depth 10

Checkpoint Mechanism (new): DFS tree completion ≠ task solved, inject checkpoint for agent self-verification


Key Configuration

Prompt Templates

Template Purpose Default
recode_tb2_agents_md.jinja2 AGENTS.md system prompt
recode_tb2_prompt.jinja2 TB2 task prompt
recode_microtexecute_tb2_prompt.jinja2 Codex expansion micro-execution
recode_tb2_checkpoint_minimal.jinja2 Checkpoint verification

Harbor Agent Parameters

Parameter Type Default Description
template string recode_tb2_agents_md.jinja2 Jinja2 template filename
max_steps int 99999 DFS tree maximum steps

Usage: --agent-kwarg template=xxx --agent-kwarg max_steps=100


Codex CLI Integration

  • Authentication: ~/.codex/auth.json (copied to container /app/.codex/)
  • Model config: ~/.codex/config.toml (default: gpt-5.1-codex-max)
  • AGENTS.md: Auto-discovered and loaded (95%+ token savings)
  • Command: codex exec --json for JSONL event streaming

Quick Reference

# Run task
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent

# Build Linux binary
docker build --platform linux/amd64 -f Dockerfile.build-x86 -t recode-builder .

# Sync templates (quick)
cp recode-core/templates/*.jinja2 ~/.local/share/uv/tools/harbor/.../recode-assets/templates/

# View latest results
cat ~/harbor-workspace/jobs/$(ls -t ~/harbor-workspace/jobs | head -1)/*/agent/command-2/stdout.txt | tail -50