Skip to content

Loodt/sea

Repository files navigation

SEA - Self-Evolving Agent

A self-improving research & synthesis agent that evolves its own playbook, expert personas, and tools through execution feedback. No human labels needed — the system learns from its own traces.

SEA synthesizes ideas from frontier research into one architecture:

Source What SEA takes from it
ACE (Stanford/SambaNova) Evolving playbook via Generator + Reflector + Curator
HyperAgents (Meta AI) Self-referential meta-agent that modifies its own improvement logic
Bilevel Autoresearch Outer meta-loop that injects new code mechanisms at runtime
Pattern Language Reusable skills repository extracted from successful runs
TurboQuant (Google) Context efficiency for long-running meta-reasoning
ARIS (Auto-claude-code-research) Cross-model adversarial review for evaluation integrity
Agent Lightning (Microsoft) Structured spans and credit assignment across multi-turn trajectories
AgentEvolver / GEA / Memento-Skills Novelty pressure in evolution, skill libraries with utility scores

How it works

SEA has two operating modes:

Conductor mode (recommended)

The two-loop architecture selects questions, creates bespoke expert personas, and dispatches research autonomously.

Terminal
  sea conduct <project>             thin CLI, runs until goal met
        |
        |  OUTER LOOP (Conductor)
        |
        |-- LLM "SELECT..."            reads knowledge frontier
        |   picks highest-value          ranks by info gain, feasibility,
        |   open question                domain data density
        |
        |-- LLM "CREATE..."            fresh context window
        |   crafts expert persona        200-300 lines, 6-section anatomy
        |   from creation framework      the "80% investment"
        |
        |   INNER LOOP (Expert)
        |   |-- LLM "RESEARCH..."       1-5 iterations
        |   |   plan → search → synth    until convergence or exhaustion
        |   |   writes findings           epistemic tags on every claim
        |   '-- converge? loop or exit
        |
        |-- LLM "INTEGRATE..."         validates + persists
        |   merges into knowledge store  deduplicates, checks contradictions
        |   updates summary              tracks goal progress
        |
        |-- (every ~3 iters) META       evolves the conductor itself
        |
        '-- repeat until goal met

Pipeline mode

The original 6-step pipeline for single-project deep-dive iterations with self-evaluation and persona evolution.

Terminal
  sea loop <project>                thin CLI, runs forever
        |
        |-- PLAN      → research plan from goal + knowledge
        |-- RESEARCH   → web search, source gathering
        |-- SUMMARIZE  → persist findings to knowledge store
        |-- SYNTHESIZE → write report from knowledge
        |-- EVALUATE   → score output (uses Sonnet for Axiom 1 separation)
        |-- EVOLVE     → improve persona (3 candidates, novelty-scored)
        |
        '-- repeat, auto-rollback on regression

Two-speed evolution

Layer File Evolves Purpose
Conductor CLAUDE.md or AGENTS.md Every ~3-5 iterations Orchestration protocols, meta-strategies
Expert Persona projects/{name}/persona.md Every iteration Domain expertise, heuristics, strategies

The conductor learns meta-lessons slowly across projects. Expert personas adapt rapidly to each project's domain. The conductor playbook filename matches the active provider (CLAUDE.md for Claude Code, AGENTS.md for Codex).

Recent improvements (v0.2)

Five improvements validated on sewage-gold research project (5 conductor dispatches + 2 pipeline iterations, score 4.4 → 6.7):

# Feature What it does
1 Cross-model evaluate Evaluate step uses Sonnet (different weights) while all other steps use Opus. Axiom 1: separate producer from evaluator.
2 Structured spans Every step records timing, token counts, and findings to metrics/spans.jsonl. Enables credit assignment analysis.
3 Success patterns High-IG dispatches auto-record strategy to success-patterns/. Loaded into expert creation alongside failure patterns.
4 Novelty pressure Evolution generates 3 candidates scored on performance×0.7 + novelty×0.3. Diversity budget every 5th iteration forces exploration.
5 Expert library Scores and reuses high-performing personas. Adapts existing persona instead of creating from scratch when a match exists.

Quick start

Prerequisites

Install

git clone https://github.com/Loodt/sea.git
cd sea
npm install

Create a project

npx tsx src/cli.ts new my-research

Answer the discovery questions (goal, success criteria, domain, source preferences, output format). SEA creates:

projects/my-research/
  goal.md              problem statement + acceptance criteria
  persona.md           expert persona tailored to your domain
  state.json           iteration tracker
  knowledge/           structured findings store (JSONL)
  experts/             per-dispatch expert personas
  expert-library/      persona reuse with utility scores
  references/          all sources (links, PDFs, notes)
  experiments/         what was tried + results + WHY analysis
  traces/              raw execution logs with timing
  reflections/         scored evaluations per iteration
  metrics/             scores + spans + conductor metrics (JSONL)
  lineage/             what changed, why, before/after scores
  output/              final deliverables

Run with conductor (recommended)

npx tsx src/cli.ts conduct my-research

The conductor selects questions, creates expert personas, dispatches research, and integrates results. Walk away — it runs until your goal criteria are met.

Run with pipeline

npx tsx src/cli.ts loop my-research

The pipeline runs plan → research → summarize → synthesize → evaluate → evolve in a loop. The evaluate step uses Sonnet by default (Axiom 1 separation).

Commands

sea new <project>                    # interactive project creation
sea conduct <project>                # two-loop conductor (recommended)
sea dispatch <project>               # single conductor iteration
sea loop <project>                   # continuous pipeline loop
sea run <project>                    # single pipeline iteration
sea status [project]                 # show current state and scores
sea history <project>                # evolution timeline with scores
sea rollback <project> [version]     # restore persona to earlier version
sea rollback conductor [version]     # restore conductor to earlier version

# Use a different provider
sea --provider codex conduct <project>
SEA_PROVIDER=codex sea conduct <project>

Replace sea with npx tsx src/cli.ts until you build and link globally.

Key options

Flag Commands Default Purpose
--provider <name> all claude LLM backend: claude or codex
-c, --cooldown <seconds> conduct, loop 30 Pause between iterations
-m, --max <n> conduct, loop unlimited Stop after N iterations
-e, --expert-max <n> conduct, dispatch 5 Max expert inner iterations
--meta-every <n> conduct, loop 3 / 5 Conductor/meta evolution frequency
--evaluate-model <model> run, loop, conduct, dispatch sonnet Model for evaluate step (Axiom 1)

Architecture

sea/
  CLAUDE.md / AGENTS.md        THE CONDUCTOR - evolving playbook (provider-dependent)
  conductor-history/           every version of the conductor playbook
  failure-patterns/            cross-project failure library
  success-patterns/            cross-project success library (gitignored — domain-specific)
  eval/
    rubrics.md                 5-dimension scoring rubrics
    integrity.md               10 truthfulness axioms (evolvable)
  templates/
    expert-creation-framework.md   6-section persona anatomy
  src/                         TypeScript CLI (~4,750 LOC)
    cli.ts                     command dispatcher
    conductor.ts               outer loop orchestration
    expert-loop.ts             inner loop iteration
    expert-factory.ts          persona creation + library lookup
    expert-library.ts          persona scoring and reuse
    knowledge.ts               findings + questions JSONL store
    context.ts                 prompt assembly for pipeline steps
    conductor-context.ts       prompt assembly for conductor steps
    metrics.ts                 scores + spans
    runner.ts                  spawn LLM CLI sessions (claude/codex)
    loop.ts                    pipeline iteration flow
    safety.ts                  regression detection + rollback
    versioner.ts               snapshot/restore
    discovery.ts               interactive project setup
    integrity.ts               knowledge store validation
    types.ts                   all interfaces + defaults
  projects/
    {name}/
      goal.md                  problem statement
      persona.md               expert persona (evolves every iteration)
      persona-history/         every version of persona.md
      knowledge/
        findings.jsonl         structured fact store with lifecycle
        questions.jsonl         open research frontier
        summary.md             compressed state (max 2KB)
      experts/                 per-dispatch expert personas
      expert-library/          persona reuse library (JSONL)
      metrics/
        scores.jsonl           pipeline score trajectory
        spans.jsonl            structured timing per step
        conductor-metrics.jsonl dispatch outcomes
      lineage/changes.jsonl    evolution decisions
      traces/                  raw session output
      reflections/             scored evaluations
      output/                  deliverables

Knowledge layer

Every finding carries a lifecycle and epistemic tag:

  • Tags: [SOURCE] (URL-backed), [DERIVED] (computed), [ESTIMATED] (judgment), [ASSUMED] (untrusted), [UNKNOWN] (honest gap)
  • Lifecycle: provisional → verified (auto after 3 iters if confidence ≥ 0.85 + SOURCE with URL) or refuted/superseded
  • Questions: open → resolved (by finding ID) or deferred

Scoring rubrics

Pipeline iterations are scored on five dimensions (1-10):

Dimension Weight What it measures
Accuracy 0.25 Factual correctness, proper sourcing
Coverage 0.20 Breadth of relevant topics addressed
Coherence 0.15 Logical flow, structure, readability
Insight Quality 0.20 Novel connections, depth, actionability
Process Compliance 0.20 Artifacts, epistemic tags, references

If the rolling 3-iteration average drops >15%, the persona auto-rollbacks to the previous version.

Design principles

  • Never delete anything. Every mutation snapshots the old file. Full audit trail.
  • Persona is strategy. 80% of research quality comes from expert persona fit. The creation framework is the core investment.
  • Exhaustion is knowledge. When a question exhausts, the negative result becomes a structured finding documenting what was searched.
  • Structure over rules. Constraints are architectural (iteration caps, separate scoring model, staged workflow) not instructional.
  • Kill fast, invest slow. Question types have iteration caps (data-hunt: 3, synthesis: 2) to prevent wasted iterations.
  • Learn bidirectionally. Both failure patterns AND success patterns feed into expert creation.

Running SEA

SEA spawns LLM CLI sessions for each step — every session gets a fresh context window. The TypeScript CLI orchestrates files and the loop; all intelligence lives in the LLM sessions.

Supported providers

Provider CLI Conductor playbook Example
Claude Code (default) claude -p CLAUDE.md sea conduct my-project
OpenAI Codex codex exec AGENTS.md sea --provider codex conduct my-project

When switching providers on an existing project, the conductor playbook is read from whichever file exists (falls back across providers). The first meta-evolution step writes the playbook to the new provider's native filename.

The simplest way to start

Open Claude Code (or Codex) in the sea/ directory:

cd sea
claude         # or: codex

Then:

Create a project called "lithium-recycling" with the goal below, then run
sea conduct for 5 iterations. Use npx tsx src/cli.ts to run commands.

Goal:
Find technically viable methods for recovering lithium from spent EV
batteries at >90% recovery rate. Compare hydrometallurgical vs
pyrometallurgical vs direct recycling routes.

The LLM will run sea new interactively, then start the conductor loop.

Monitoring

npx tsx src/cli.ts status my-research    # state + scores
npx tsx src/cli.ts history my-research   # evolution timeline

Or read files directly:

What you want to know Where to look
Current knowledge knowledge/summary.md
All findings knowledge/findings.jsonl
Open questions knowledge/questions.jsonl
Last iteration output output/
How it was scored reflections/iter-NNN.md
Persona evolution lineage/changes.jsonl
Step timing metrics/spans.jsonl

Roadmap

  • Wave 1: Scaffold + CLI + runner + context assembly
  • Wave 2: Live reflection + scoring pipeline
  • Wave 3: Evolution + never-delete versioning
  • Wave 4: Continuous loop + regression rollback
  • Wave 5: Meta-evolution (conductor self-improvement)
  • Wave 6: Two-loop conductor/expert architecture
  • Wave 7: Cross-model evaluate, structured spans, success patterns, novelty pressure, expert library
  • Wave 8: Skills repository (cross-project reusable patterns)
  • Wave 9: Bilevel code injection (runtime tool generation)
  • Wave 10: Context efficiency (trace summarization, skill filtering)

Research foundation

This project synthesizes ideas from:

  • ACE: Agentic Context Engineering — Turn static prompts into a living playbook updated via execution feedback
  • HyperAgents (Meta AI) — Self-referential multi-agent system where the Meta Agent edits its own improvement logic
  • Bilevel Autoresearch — Inner research loop + outer meta-loop that injects new mechanisms at runtime
  • Pattern Language for Skills-Based Agentic AI — Extract reusable patterns from real runs into a skills repository
  • TurboQuant (Google) — KV-cache compression for long-context meta-reasoning
  • ARIS (Auto-claude-code-research-in-sleep) — Cross-model adversarial review, markdown-native state machines
  • Agent Lightning (Microsoft) — Structured observability spans, RL credit assignment across trajectories
  • AgentEvolver — Self-questioning + self-attributing for autonomous improvement
  • GEA (Group-Evolving Agents) — Performance-Novelty scoring to escape local optima
  • Memento-Skills — Skill libraries with utility scores, Read-Execute-Reflect-Write loops
  • OpenSpace — Auto-fix, auto-improve, and cross-agent learning

License

MIT

About

Self-Evolving Agent — autonomous research engine with two-loop architecture, epistemic tagging, and cross-project learning. Runs unattended on Claude Code or OpenAI Codex.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors