Inngest Self-Learning Agent

A durable AI agent built with Inngest and pi-ai that experiments with its own prompts over time. It runs a normal think/act/observe loop, scores responses after the fact, and uses scheduled evaluation jobs to create, test, and promote better behavioral prompts.

The interesting part is not just that the agent can rewrite prompts. It is that the first version learned to game its own scoring system. When the evaluation pipeline asked an LLM to improve an underperforming prompt, the model started embedding scoring criteria directly into the generated SOUL.md, turning the metric into the target.

This repo explores that self-learning loop and the guardrails needed to keep it useful:

  • Score every response across relevance, completeness, tool efficiency, and tone
  • Attribute scores to prompt versions so improvements can be compared over time
  • A/B test prompt variants with weighted traffic instead of replacing prompts blindly
  • Run scheduled evaluation to rewrite underperformers and promote stronger versions
  • Block score gaming so generated prompts do not copy evaluation criteria or optimize for the test itself

Read the blog post about this project: https://www.inngest.com/blog/build-self-learning-agent

This project is a fork of Inngest's Utah agent example, extended with response scoring, prompt versioning, and an automated evaluation pipeline.

The result is simple TypeScript that gives you:

  • Durable agent loop — every LLM call and tool execution is an Inngest step
  • Automatic retries — LLM API timeouts are handled by Inngest, not your code
  • Singleton concurrency — one conversation at a time per chat, no race conditions
  • Cancel on new message — user sends again? Current run cancels, new one starts
  • Multi-channel — Slack, Telegram, and more via a simple channel interface
  • Local development — runs on your machine via connect(), no server needed
  • Response scoring — async LLM-based quality evaluation after every reply
  • Prompt versioning — A/B test behavioral prompts with weighted random selection
  • Evaluation pipeline — scheduled prompt improvement based on score analysis
  • Guardrails — keep generated prompts from leaking scoring criteria into agent behavior
  • Sub-agents — delegate tasks to isolated agent loops (sync or async)

Architecture

Channel (e.g. Telegram) → Inngest Cloud (webhook + transform) → WebSocket → Local Worker → LLM (Anthropic/OpenAI/Google) → Reply Event → Channel API

The worker connects to Inngest Cloud via WebSocket. No public endpoint. No ngrok. No VPS. Messages flow through Inngest as events, and the agent processes them locally with full filesystem access.

Prerequisites

  • Node.js and pnpm
  • An Anthropic API key (the default provider; OpenAI and Google are also supported)
  • An Inngest account

Setup

1. Create an Inngest Account

  1. Sign up at app.inngest.com
  2. Go to Settings → Keys and copy your:
    • Event Key (for sending events)
    • Signing Key (for authenticating your worker)

2. Configure and Run

git clone https://github.com/mitchellalderson/inngest-self-learning-agent
cd inngest-self-learning-agent
pnpm install
cp .env.example .env

Edit .env with your keys:

ANTHROPIC_API_KEY=sk-ant-...
INNGEST_EVENT_KEY=...
INNGEST_SIGNING_KEY=signkey-prod-...

Then add the environment variables for your channel(s) — see setup guides below.

Start the worker:

# Production mode (connects to Inngest Cloud via WebSocket)
pnpm run start

# Development mode (uses local Inngest dev server)
npx inngest-cli@latest dev &
pnpm run dev

On startup, the worker automatically sets up webhooks and transforms for each configured channel.

Channels

The agent supports multiple messaging channels. Each channel has its own setup guide:

  • Telegram — Fully automated setup. Just add your bot token and run.
  • Slack — Requires creating a Slack app and configuring Event Subscriptions.

Project Structure

src/
├── worker.ts                  # Entry point — connect() or serve()
├── client.ts                  # Inngest client
├── config.ts                  # Configuration from env vars
├── agent-loop.ts              # Core think → act → observe cycle
├── setup.ts                   # Channel setup orchestration
├── lib/
│   ├── llm.ts                 # pi-ai wrapper (multi-provider: Anthropic, OpenAI, Google)
│   ├── tools.ts               # Tool definitions (TypeBox schemas) + execution
│   ├── context.ts             # System prompt builder with workspace file injection
│   ├── session.ts             # JSONL session persistence
│   ├── memory.ts              # File-based memory system (daily logs + distillation)
│   ├── prompt-version.ts      # Prompt versioning and A/B testing
│   ├── scoring.ts             # Response quality evaluation
│   ├── evaluation.ts          # Prompt performance analysis and improvement
│   ├── compaction.ts          # LLM-powered conversation summarization
│   └── logger.ts              # Structured logging utility
├── functions/
│   ├── message.ts             # Main agent function (singleton + cancelOn)
│   ├── send-reply.ts          # Channel-agnostic reply dispatch
│   ├── score.ts               # Async response quality scoring
│   ├── acknowledge-message.ts # Message acknowledgment (typing indicator, etc.)
│   ├── heartbeat.ts           # Cron-based memory maintenance
│   ├── evaluate-prompts.ts    # Cron-based prompt improvement
│   ├── sub-agent.ts           # Isolated sub-agent loops (sync/async task delegation)
│   └── failure-handler.ts     # Global error handler with notifications
└── channels/
    ├── types.ts               # ChannelHandler interface
    ├── index.ts               # Channel registry
    ├── setup-helpers.ts       # Inngest REST API helpers for webhook setup
    └── <channel-name>/        # A channel implementation (see README for setup)
        ├── handler.ts         # ChannelHandler implementation
        ├── api.ts             # API client
        ├── setup.ts           # Webhook setup automation
        ├── transform.ts       # Webhook transform
        └── format.ts          # Formatting for channel messages
workspace/                     # Agent workspace (persisted across runs)
├── IDENTITY.md                # Agent name, role, emoji
├── SOUL.md                    # Agent personality and behavioral guidelines (fallback)
├── USER.md                    # User information
├── MEMORY.md                  # Long-term memory (agent-writable)
├── memory/                    # Daily logs (YYYY-MM-DD.md, auto-managed)
├── prompts/                   # Versioned prompts for A/B testing
│   ├── registry.json          # Version metadata
│   └── v1/SOUL.md             # Versioned behavioral prompts
├── scores/                    # Response quality scores (YYYY-MM-DD.jsonl)
└── sessions/                  # JSONL conversation files (gitignored)

How It Works

The Agent Loop

The core is a while loop where each iteration is an Inngest step:

  1. Think — step.run("think") calls the LLM via pi-ai's complete()
  2. Act — if the LLM wants tools, each tool runs as step.run("tool-read")
  3. Observe — tool results are fed back into the conversation
  4. Repeat — until the LLM responds with text (no tools) or max iterations

Inngest auto-indexes duplicate step IDs in loops (think:0, think:1, etc.), so you don't need to track iteration numbers in step names.
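A minimal sketch of that loop — step.run here stands in for Inngest's durable step API, and llm/runTool are hypothetical hooks (the real code uses pi-ai and the tools in src/lib/tools.ts):

```typescript
// Minimal sketch of the think → act → observe loop. `step`, `llm`, and
// `runTool` are stand-ins for the real Inngest/pi-ai wiring.
type ToolCall = { name: string; args: unknown };
type LlmTurn = { text?: string; toolCalls?: ToolCall[] };

interface Step {
  run<T>(id: string, fn: () => Promise<T> | T): Promise<T>;
}

async function agentLoop(
  step: Step,
  llm: (history: string[]) => Promise<LlmTurn>,
  runTool: (call: ToolCall) => Promise<string>,
  maxIterations = 10,
): Promise<string> {
  const history: string[] = [];
  for (let i = 0; i < maxIterations; i++) {
    // Think: one durable step per LLM call (Inngest auto-indexes "think:0", "think:1", …)
    const turn = await step.run("think", () => llm(history));
    if (!turn.toolCalls?.length) return turn.text ?? "";
    // Act + observe: each tool runs as its own step; results feed back into history
    for (const call of turn.toolCalls) {
      const result = await step.run(`tool-${call.name}`, () => runTool(call));
      history.push(result);
    }
  }
  return "(max iterations reached)";
}
```

Because each step's result is memoized by Inngest, a retry after a crash replays the function but skips already-completed think and tool steps.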

Event-Driven Composition

One incoming message triggers multiple independent functions:

| Function | Purpose | Config |
| --- | --- | --- |
| agent-handle-message | Run the agent loop | Singleton per chat, cancel on new message |
| acknowledge-message | Show "typing..." immediately | No retries (best effort) |
| send-reply | Format and send the response | 3 retries, channel dispatch |
| agent-handle-score | Evaluate response quality (async) | 1 retry, fires after reply sent |
| agent-heartbeat | Distill daily logs into long-term memory | Cron (every 30 min) |
| evaluate-prompts | Analyze scores, improve prompts | Cron (every 6 hours) |
| agent-sub-agent | Run isolated sub-agent for delegated tasks | 1 retry, sync or async |
| global-failure-handler | Catch errors, notify user | Triggered by inngest/function.failed |

Workspace Context Injection

The agent reads markdown files from the workspace directory and injects them into the system prompt:

| File | Purpose |
| --- | --- |
| IDENTITY.md | Agent name, role, and emoji |
| SOUL.md | Agent personality, behavioral guidelines, tone, boundaries |
| USER.md | Info about the user (name, timezone, preferences) |
| MEMORY.md | Curated long-term memory (agent-writable) |

Edit these files to customize your agent's personality and knowledge. The agent can also update MEMORY.md using the write tool to remember things across conversations.

Note: SOUL.md supports versioning for A/B testing — see Prompt Versioning below.
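A rough sketch of the injection step — buildSystemPrompt is a hypothetical name, and the real builder in src/lib/context.ts reads these files from disk rather than taking them as arguments:

```typescript
// Sketch of workspace context injection: each markdown file becomes a
// labeled section of the system prompt. Empty or missing files are skipped.
function buildSystemPrompt(
  files: Record<string, string>,
  basePrompt = "You are a helpful agent.",
): string {
  const sections = ["IDENTITY.md", "SOUL.md", "USER.md", "MEMORY.md"]
    .filter((name) => files[name]?.trim())
    .map((name) => `## ${name}\n\n${files[name].trim()}`);
  return [basePrompt, ...sections].join("\n\n");
}
```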

Memory System

The agent has a two-tier memory system:

  • Daily logs (workspace/memory/YYYY-MM-DD.md) — append-only notes written via the remember tool during conversations
  • Long-term memory (workspace/MEMORY.md) — curated summary distilled from daily logs by the heartbeat function

The agent-heartbeat function runs on a cron schedule (default: every 30 minutes). It checks if daily logs have accumulated enough content, then uses the LLM to distill them into MEMORY.md. Old daily logs are pruned after a configurable retention period (default: 30 days).

Response Scoring

After each agent response, a lightweight LLM evaluates response quality across four dimensions:

| Dimension | Scale | Description |
| --- | --- | --- |
| Relevance | 0-10 | Did the response address the user's question? |
| Completeness | 0-10 | Was anything important missing? |
| Tool Efficiency | 0-10 | Were tool calls necessary and well-targeted? |
| Tone Alignment | 0-10 | Did it match SOUL.md guidelines? |

Scores are persisted to workspace/scores/YYYY-MM-DD.jsonl as JSON lines:

{
  "timestamp": "2026-03-12T...",
  "sessionKey": "main",
  "promptVersion": "v1",
  "relevance": 8,
  "completeness": 7,
  "toolEfficiency": 9,
  "tone": 8,
  "composite": 8.0,
  "rationale": "Addressed the core question..."
}

How it works:

  • Runs asynchronously after the reply is sent (doesn't block delivery)
  • Composite score = simple average of the four dimensions
  • Failures don't affect reply delivery
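The composite calculation is simple enough to show directly; it matches the JSONL example above, where (8 + 7 + 9 + 8) / 4 = 8.0:

```typescript
// Composite score = plain average of the four scoring dimensions.
interface ScoreDimensions {
  relevance: number;
  completeness: number;
  toolEfficiency: number;
  tone: number;
}

function compositeScore(s: ScoreDimensions): number {
  return (s.relevance + s.completeness + s.toolEfficiency + s.tone) / 4;
}
```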

Configuration (env vars):

| Variable | Default | Description |
| --- | --- | --- |
| SCORING_ENABLED | true | Enable/disable scoring |
| SCORING_PROVIDER | anthropic | Provider for scoring LLM |
| SCORING_MODEL | claude-3-5-haiku-20241022 | Model for scoring |

Prompt Versioning

The agent supports A/B testing of behavioral prompts through versioned SOUL.md files:

workspace/prompts/
├── registry.json        # Version metadata + active assignments
├── v1/
│   └── SOUL.md          # Baseline prompt
├── v2/
│   └── SOUL.md          # First variation
└── v3/
    └── SOUL.md          # Second variation

registry.json schema:

{
  "versions": [
    {
      "id": "v1",
      "created": "2026-03-15T00:00:00Z",
      "source": "baseline",
      "active": true,
      "weight": 0.5
    },
    {
      "id": "v2",
      "created": "2026-03-16T00:00:00Z",
      "source": "evaluation-pipeline",
      "active": true,
      "weight": 0.5,
      "parentVersion": "v1"
    }
  ],
  "currentDefault": "v2"
}

How it works:

  • On each agent loop start, the context builder reads registry.json
  • Selects a prompt version using weighted random selection from active versions
  • Weights are normalized automatically if they don't sum to 1.0
  • The selected version's SOUL.md is injected into the system prompt
  • Version ID is stored in session metadata and scoring logs for analysis
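A sketch of the selection step, assuming the registry shape shown above; selectVersion is a hypothetical name, and the injectable rng parameter exists only to make the sketch testable:

```typescript
// Weighted random selection over active prompt versions. Dividing the roll
// by the weight sum normalizes weights that don't add up to 1.0.
interface PromptVersion {
  id: string;
  active: boolean;
  weight: number;
}

function selectVersion(versions: PromptVersion[], rng: () => number = Math.random): string {
  const active = versions.filter((v) => v.active);
  if (active.length === 0) throw new Error("no active prompt versions");
  const total = active.reduce((sum, v) => sum + v.weight, 0);
  let roll = rng() * total; // scale the roll by the weight sum (normalization)
  for (const v of active) {
    roll -= v.weight;
    if (roll <= 0) return v.id;
  }
  return active[active.length - 1].id; // floating-point safety fallback
}
```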

Automatic initialization:

  • If registry.json is missing, the system auto-creates a fresh v1 registry
  • Copies the current workspace/SOUL.md to workspace/prompts/v1/SOUL.md
  • If root SOUL.md doesn't exist, creates a default prompt

Configuration:

| Variable | Default | Description |
| --- | --- | --- |
| PROMPT_VERSIONING_ENABLED | true | Enable/disable prompt versioning |

Evaluation Pipeline

The evaluation pipeline automatically analyzes scored responses and generates improved prompt versions. In production you can run it on whatever cadence gives the agent enough fresh data, such as nightly; by default this repo runs the evaluator every 6 hours.

The main lesson from the first runs was Goodhart's Law in miniature: once the prompt generator saw enough performance data, it began producing prompts that mirrored the evaluation criteria instead of simply improving the agent's behavior. The prompt-generation step now includes explicit output rules that prohibit scoring targets, metrics, and evaluation data from appearing in generated SOUL.md files.

Testing with Smaller Models

For testing the evaluation pipeline, use a smaller model like Haiku instead of Sonnet. Sonnet produces consistently high-quality responses, making it difficult to generate the score variance needed to trigger prompt improvements:

# Switch to Haiku for testing
AGENT_MODEL=claude-3-5-haiku-20241022

Haiku is more likely to produce varied scores across different question types, which helps the evaluation pipeline identify underperforming prompt versions and generate meaningful improvements.

How it works:

A cron function (evaluate-prompts) runs on a configurable schedule (default: every 6 hours):

  1. Load scores — Reads all JSONL files from workspace/scores/
  2. Aggregate — Groups scores by prompt version, computes averages
  3. Identify underperformers — Finds versions with:
    • Composite score below target (default: 7.0), OR
    • Composite score significantly below best version (default: 1.0+ points gap)
  4. Generate improvements — Calls LLM with underperforming prompt + rationales to produce improved SOUL.md
  5. Create new versions — Writes improved prompts to workspace/prompts/vN/SOUL.md
  6. Promote winners — Versions with ≥80% traffic share and ≥1.0 point advantage over default become new default
  7. Enforce cap — If >5 active versions, retire lowest-scoring (v1 is never retired)
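Step 3's selection rule can be sketched as follows — findUnderperformers and the VersionStats shape are hypothetical, but the thresholds mirror the defaults above:

```typescript
// A version underperforms if its composite average is below the target, or
// sits a significant gap below the best version. Versions with too few
// samples are excluded entirely.
interface VersionStats {
  id: string;
  avgComposite: number;
  samples: number;
}

function findUnderperformers(
  stats: VersionStats[],
  target = 7.0,
  significantGap = 1.0,
  minDataPoints = 10,
): string[] {
  const eligible = stats.filter((s) => s.samples >= minDataPoints);
  if (eligible.length === 0) return [];
  const best = Math.max(...eligible.map((s) => s.avgComposite));
  return eligible
    .filter((s) => s.avgComposite < target || best - s.avgComposite >= significantGap)
    .map((s) => s.id);
}
```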

Weight redistribution:

  • New versions start at 50% weight (configurable)
  • Remaining weight distributed proportionally among other active versions
  • Weights normalized automatically
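A sketch of that redistribution, assuming weights are stored as a version-id-to-weight map (redistributeWeights is a hypothetical name):

```typescript
// The new version takes its initial share; the remainder is split among
// existing versions in proportion to their old weights, so the result
// always sums to 1.0.
function redistributeWeights(
  existing: Record<string, number>,
  newId: string,
  newWeight = 0.5,
): Record<string, number> {
  const oldTotal = Object.values(existing).reduce((a, b) => a + b, 0);
  const remaining = 1 - newWeight;
  const result: Record<string, number> = { [newId]: newWeight };
  for (const [id, w] of Object.entries(existing)) {
    result[id] =
      oldTotal > 0 ? (w / oldTotal) * remaining : remaining / Object.keys(existing).length;
  }
  return result;
}
```

For example, adding v3 at 50% on top of two versions at 0.5 each leaves v1 and v2 at 0.25 each.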

Safety rails:

  • Minimum data points required before any rewrite (default: 10)
  • Maximum active versions cap (default: 5)
  • Baseline (v1) is never deleted, only deprioritized
  • New versions never start at 100% — always keep a control
  • Generated prompts are forbidden from including scoring targets, metrics, or evaluation data

Configuration (env vars):

| Variable | Default | Description |
| --- | --- | --- |
| EVALUATION_CRON | 0 */6 * * * | Cron schedule for evaluation runs |
| EVAL_MIN_DATA_POINTS | 10 | Minimum scores before version can be rewritten |
| EVAL_TARGET_COMPOSITE | 7.0 | Target composite score threshold |
| EVAL_MAX_VERSIONS | 5 | Maximum active versions before retirement |
| EVAL_NEW_VERSION_WEIGHT | 0.5 | Initial weight for new versions |
| EVAL_PROMOTION_TRAFFIC | 0.8 | Traffic share required for promotion |
| EVAL_PROMOTION_SCORE_GAP | 1.0 | Score advantage required for promotion |
| EVAL_SIGNIFICANT_GAP | 1.0 | Points below best to trigger rewrite |

Conversation Compaction

Long conversations get summarized automatically so the agent doesn't lose context or hit token limits:

  • Token estimation: Uses a chars/4 heuristic to estimate conversation size
  • Threshold: Compaction triggers when estimated tokens exceed 80% of the configured max (150K)
  • LLM summarization: Old messages are summarized into a structured checkpoint (goals, progress, decisions, next steps)
  • Recent messages preserved: The most recent ~20K tokens of conversation are kept verbatim
  • Persisted: The compacted session replaces the JSONL file, so it survives restarts

Compaction runs as an Inngest step (step.run("compact")), so it's durable and retryable.
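The heuristic and trigger are easy to sketch; the chars/4 estimate, the 80% threshold, and the 150K max are taken from the description above, while the function names are hypothetical:

```typescript
// chars/4 is a rough token estimate; compaction triggers once the estimated
// conversation size exceeds 80% of the configured maximum.
const MAX_CONTEXT_TOKENS = 150_000;
const COMPACTION_THRESHOLD = 0.8;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function shouldCompact(conversation: string[], maxTokens = MAX_CONTEXT_TOKENS): boolean {
  const estimated = conversation.reduce((sum, m) => sum + estimateTokens(m), 0);
  return estimated > maxTokens * COMPACTION_THRESHOLD;
}
```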

Context Pruning

Long tool results bloat the conversation context and cause the LLM to lose focus. The agent uses two-tier pruning:

  • Soft trim: Tool results over 4K chars get head+tail trimmed (first 1,500 + last 1,500 chars)
  • Hard clear: When total old tool content exceeds 50K chars, old results are replaced entirely
  • Budget warnings: System messages are injected when iterations are running low
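The soft-trim rule can be sketched as follows (softTrim is a hypothetical name; the limits match the defaults above):

```typescript
// Tool results over the limit keep their head and tail, with an elision
// marker noting how many characters were dropped in between.
function softTrim(result: string, limit = 4_000, head = 1_500, tail = 1_500): string {
  if (result.length <= limit) return result;
  const omitted = result.length - head - tail;
  return `${result.slice(0, head)}\n…[${omitted} chars trimmed]…\n${result.slice(-tail)}`;
}
```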

Adding New Channels

The agent is channel-agnostic. Each channel implements a ChannelHandler interface (src/channels/types.ts) with methods for sending replies, acknowledging messages, and setup. Each channel directory follows the same structure:

src/channels/<name>/
├── handler.ts      # ChannelHandler implementation (sendReply, acknowledge)
├── api.ts          # API client for the channel's platform
├── setup.ts        # Webhook setup automation
├── transform.ts    # Plain JS transform for Inngest webhook
└── format.ts       # Markdown → channel-specific format conversion

To add Discord, WhatsApp, or any other channel:

  1. Create a new directory under src/channels/ following the structure above
  2. Implement the ChannelHandler interface in handler.ts
  3. Write a webhook transform that converts the channel's payload to agent.message.received
  4. Register the channel in src/channels/index.ts

The agent loop, reply dispatch, and acknowledgment functions are all channel-agnostic — no changes needed outside src/channels/.
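To make the interface concrete, here is a hypothetical shape for ChannelHandler with a no-op implementation — the real definition in src/channels/types.ts may differ in detail:

```typescript
// Hypothetical minimal ChannelHandler shape: send a reply, acknowledge receipt.
interface ChannelHandler {
  name: string;
  sendReply(chatId: string, markdown: string): Promise<void>;
  acknowledge(chatId: string): Promise<void>;
}

// A console-backed implementation showing the pattern a new channel follows.
const consoleChannel: ChannelHandler = {
  name: "console",
  async sendReply(chatId, markdown) {
    console.log(`[${chatId}] ${markdown}`);
  },
  async acknowledge() {
    // e.g. show a typing indicator; no-op here
  },
};
```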

Key Inngest Features Used

  • step.run() — every LLM call and tool execution is a durable, retryable step
  • Singleton concurrency — one run per chat, no race conditions
  • cancelOn — a new message event cancels the in-flight run
  • Cron triggers — drive the heartbeat and evaluate-prompts functions
  • connect() — WebSocket worker with no public endpoint
  • inngest/function.failed — global failure handling

Acknowledgments

This project uses pi-ai (@mariozechner/pi-ai) by Mario Zechner for its unified LLM interface and @mariozechner/pi-coding-agent for its standard tools. pi-ai provides a single complete() function that works across Anthropic, OpenAI, Google, and other providers — making it easy to swap models without changing any agent code. It's a great library.

License

Apache-2.0
