A durable AI agent built with Inngest and pi-ai that experiments with its own prompts. It runs a normal think/act/observe loop, scores responses after the fact, and uses scheduled evaluation jobs to create, test, and promote better behavioral prompts over time.
The interesting part is not just that the agent can rewrite prompts. It is that the first version learned to game its own scoring system. When the evaluation pipeline asked an LLM to improve an underperforming prompt, the model started embedding scoring criteria directly into the generated SOUL.md, turning the metric into the target.
This repo explores that self-learning loop and the guardrails needed to keep it useful:
- Score every response across relevance, completeness, tool efficiency, and tone
- Attribute scores to prompt versions so improvements can be compared over time
- A/B test prompt variants with weighted traffic instead of replacing prompts blindly
- Run scheduled evaluation to rewrite underperformers and promote stronger versions
- Block score gaming so generated prompts do not copy evaluation criteria or optimize for the test itself
Read the blog post about this project: https://www.inngest.com/blog/build-self-learning-agent
This project is a fork of Inngest's Utah agent example, extended with response scoring, prompt versioning, and an automated evaluation pipeline.
Simple TypeScript that gives you:
- Durable agent loop: every LLM call and tool execution is an Inngest step
- Automatic retries: LLM API timeouts are handled by Inngest, not your code
- Singleton concurrency: one conversation at a time per chat, no race conditions
- Cancel on new message: user sends again? Current run cancels, new one starts
- Multi-channel: Slack, Telegram, and more via a simple channel interface
- Local development: runs on your machine via connect(), no server needed
- Response scoring: async LLM-based quality evaluation after every reply
- Prompt versioning: A/B test behavioral prompts with weighted random selection
- Evaluation pipeline: scheduled prompt improvement based on score analysis
- Guardrails: keep generated prompts from leaking scoring criteria into agent behavior
- Sub-agents: delegate tasks to isolated agent loops (sync or async)
Channel (e.g. Telegram) → Inngest Cloud (webhook + transform) → WebSocket → Local Worker → LLM (Anthropic/OpenAI/Google) → Reply Event → Channel API
The worker connects to Inngest Cloud via WebSocket. No public endpoint. No ngrok. No VPS. Messages flow through Inngest as events, and the agent processes them locally with full filesystem access.
- Node.js 23+ (uses native TypeScript strip-types)
- LLM API key (e.g. an Anthropic API key from console.anthropic.com)
- Inngest account (app.inngest.com)
- At least one channel configured (see Channels below)
- Sign up at app.inngest.com
- Go to Settings → Keys and copy your:
- Event Key (for sending events)
- Signing Key (for authenticating your worker)
git clone https://github.com/mitchellalderson/inngest-self-learning-agent
cd inngest-self-learning-agent
pnpm install
cp .env.example .env
Edit .env with your keys:
ANTHROPIC_API_KEY=sk-ant-...
INNGEST_EVENT_KEY=...
INNGEST_SIGNING_KEY=signkey-prod-...
Then add the environment variables for your channel(s); see the setup guides below.
Start the worker:
# Production mode (connects to Inngest Cloud via WebSocket)
pnpm run start
# Development mode (uses local Inngest dev server)
npx inngest-cli@latest dev &
pnpm run dev
On startup, the worker automatically sets up webhooks and transforms for each configured channel.
The agent supports multiple messaging channels. Each channel has its own setup guide:
- Telegram: Fully automated setup. Just add your bot token and run.
- Slack: Requires creating a Slack app and configuring Event Subscriptions.
src/
├── worker.ts                    # Entry point: connect() or serve()
├── client.ts                    # Inngest client
├── config.ts                    # Configuration from env vars
├── agent-loop.ts                # Core think → act → observe cycle
├── setup.ts                     # Channel setup orchestration
├── lib/
│   ├── llm.ts                   # pi-ai wrapper (multi-provider: Anthropic, OpenAI, Google)
│   ├── tools.ts                 # Tool definitions (TypeBox schemas) + execution
│   ├── context.ts               # System prompt builder with workspace file injection
│   ├── session.ts               # JSONL session persistence
│   ├── memory.ts                # File-based memory system (daily logs + distillation)
│   ├── prompt-version.ts        # Prompt versioning and A/B testing
│   ├── scoring.ts               # Response quality evaluation
│   ├── evaluation.ts            # Prompt performance analysis and improvement
│   ├── compaction.ts            # LLM-powered conversation summarization
│   └── logger.ts                # Structured logging utility
├── functions/
│   ├── message.ts               # Main agent function (singleton + cancelOn)
│   ├── send-reply.ts            # Channel-agnostic reply dispatch
│   ├── score.ts                 # Async response quality scoring
│   ├── acknowledge-message.ts   # Message acknowledgment (typing indicator, etc.)
│   ├── heartbeat.ts             # Cron-based memory maintenance
│   ├── evaluate-prompts.ts      # Cron-based prompt improvement
│   ├── sub-agent.ts             # Isolated sub-agent loops (sync/async task delegation)
│   └── failure-handler.ts       # Global error handler with notifications
└── channels/
    ├── types.ts                 # ChannelHandler interface
    ├── index.ts                 # Channel registry
    ├── setup-helpers.ts         # Inngest REST API helpers for webhook setup
    └── <channel-name>/          # A channel implementation (see README for setup)
        ├── handler.ts           # ChannelHandler implementation
        ├── api.ts               # API client
        ├── setup.ts             # Webhook setup automation
        ├── transform.ts         # Webhook transform
        └── format.ts            # Formatting for channel messages
workspace/                       # Agent workspace (persisted across runs)
├── IDENTITY.md                  # Agent name, role, emoji
├── SOUL.md                      # Agent personality and behavioral guidelines (fallback)
├── USER.md                      # User information
├── MEMORY.md                    # Long-term memory (agent-writable)
├── memory/                      # Daily logs (YYYY-MM-DD.md, auto-managed)
├── prompts/                     # Versioned prompts for A/B testing
│   ├── registry.json            # Version metadata
│   └── v1/SOUL.md               # Versioned behavioral prompts
├── scores/                      # Response quality scores (YYYY-MM-DD.jsonl)
└── sessions/                    # JSONL conversation files (gitignored)
The core is a while loop where each iteration is an Inngest step:
- Think: step.run("think") calls the LLM via pi-ai's complete()
- Act: if the LLM wants tools, each tool runs as step.run("tool-read")
- Observe: tool results are fed back into the conversation
- Repeat: until the LLM responds with text (no tools) or max iterations
Inngest auto-indexes duplicate step IDs in loops (think:0, think:1, etc.), so you don't need to track iteration numbers in step names.
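The cycle above can be sketched roughly as follows. This is a minimal illustration, not the code in src/agent-loop.ts: StepLike is a stand-in for the part of Inngest's step API used here, think and runTool are hypothetical wrappers for the LLM call and tool execution, and a bounded for loop takes the place of the while loop plus iteration cap.

```typescript
// Stand-in for the subset of Inngest's step tools this sketch needs (assumption).
interface StepLike {
  run<T>(id: string, fn: () => Promise<T>): Promise<T>;
}

type ToolCall = { name: string; args: Record<string, unknown> };
type LlmReply = { text: string; toolCalls: ToolCall[] };
type Message = { role: "user" | "assistant" | "tool"; content: string };

async function agentLoop(
  step: StepLike,
  messages: Message[],
  think: (history: Message[]) => Promise<LlmReply>, // hypothetical LLM wrapper
  runTool: (call: ToolCall) => Promise<string>, // hypothetical tool executor
  maxIterations = 10, // illustrative default, not taken from the repo
): Promise<string> {
  for (let i = 0; i < maxIterations; i++) {
    // Think: one step per LLM call. The same ID is reused on every iteration.
    const reply = await step.run("think", () => think(messages));

    // A reply with no tool calls is the final answer.
    if (reply.toolCalls.length === 0) return reply.text;

    messages.push({ role: "assistant", content: reply.text });
    for (const call of reply.toolCalls) {
      // Act: each tool execution is its own step, e.g. "tool-read".
      const result = await step.run(`tool-${call.name}`, () => runTool(call));
      // Observe: feed the tool result back into the conversation.
      messages.push({ role: "tool", content: result });
    }
  }
  return "Stopped: reached max iterations.";
}
```

Because the step IDs repeat across iterations, the loop body needs no counter in the names.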
One incoming message triggers multiple independent functions:
| Function | Purpose | Config |
|---|---|---|
| agent-handle-message | Run the agent loop | Singleton per chat, cancel on new message |
| acknowledge-message | Show "typing..." immediately | No retries (best effort) |
| send-reply | Format and send the response | 3 retries, channel dispatch |
| agent-handle-score | Evaluate response quality (async) | 1 retry, fires after reply sent |
| agent-heartbeat | Distill daily logs into long-term memory | Cron (every 30 min) |
| evaluate-prompts | Analyze scores, improve prompts | Cron (every 6 hours) |
| agent-sub-agent | Run isolated sub-agent for delegated tasks | 1 retry, sync or async |
| global-failure-handler | Catch errors, notify user | Triggered by inngest/function.failed |
The agent reads markdown files from the workspace directory and injects them into the system prompt:
| File | Purpose |
|---|---|
| IDENTITY.md | Agent name, role, and emoji |
| SOUL.md | Agent personality, behavioral guidelines, tone, boundaries |
| USER.md | Info about the user (name, timezone, preferences) |
| MEMORY.md | Curated long-term memory (agent-writable) |
Edit these files to customize your agent's personality and knowledge. The agent can also update MEMORY.md using the write tool to remember things across conversations.
Note: SOUL.md supports versioning for A/B testing; see Prompt Versioning below.
The agent has a two-tier memory system:
- Daily logs (workspace/memory/YYYY-MM-DD.md): append-only notes written via the remember tool during conversations
- Long-term memory (workspace/MEMORY.md): curated summary distilled from daily logs by the heartbeat function
The agent-heartbeat function runs on a cron schedule (default: every 30 minutes). It checks if daily logs have accumulated enough content, then uses the LLM to distill them into MEMORY.md. Old daily logs are pruned after a configurable retention period (default: 30 days).
After each agent response, a lightweight LLM evaluates response quality across four dimensions:
| Dimension | Scale | Description |
|---|---|---|
| Relevance | 0-10 | Did the response address the user's question? |
| Completeness | 0-10 | Was anything important missing? |
| Tool Efficiency | 0-10 | Were tool calls necessary and well-targeted? |
| Tone Alignment | 0-10 | Did it match SOUL.md guidelines? |
Scores are persisted to workspace/scores/YYYY-MM-DD.jsonl as JSON lines:
{
"timestamp": "2026-03-12T...",
"sessionKey": "main",
"promptVersion": "v1",
"relevance": 8,
"completeness": 7,
"toolEfficiency": 9,
"tone": 8,
"composite": 8.0,
"rationale": "Addressed the core question..."
}
How it works:
- Runs asynchronously after the reply is sent (doesn't block delivery)
- Composite score = simple average of the four dimensions
- Failures don't affect reply delivery
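As a small illustration of the composite calculation, the sketch below averages the four dimensions from a score record. The field names follow the JSON example above; the function name and the range check are assumptions made for this example.

```typescript
// Field names mirror the score record shown above.
interface DimensionScores {
  relevance: number;
  completeness: number;
  toolEfficiency: number;
  tone: number;
}

// Simple (unweighted) average of the four 0-10 dimensions.
function compositeScore(s: DimensionScores): number {
  const values = [s.relevance, s.completeness, s.toolEfficiency, s.tone];
  for (const v of values) {
    // Assumption: reject values outside the documented 0-10 scale.
    if (!isFinite(v) || v < 0 || v > 10) {
      throw new RangeError("dimension scores must be between 0 and 10");
    }
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// The example record: (8 + 7 + 9 + 8) / 4 = 8
const composite = compositeScore({ relevance: 8, completeness: 7, toolEfficiency: 9, tone: 8 });
console.log(composite);
```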
Configuration (env vars):
| Variable | Default | Description |
|---|---|---|
| SCORING_ENABLED | true | Enable/disable scoring |
| SCORING_PROVIDER | anthropic | Provider for scoring LLM |
| SCORING_MODEL | claude-3-5-haiku-20241022 | Model for scoring |
The agent supports A/B testing of behavioral prompts through versioned SOUL.md files:
workspace/prompts/
├── registry.json    # Version metadata + active assignments
├── v1/
│   └── SOUL.md      # Baseline prompt
├── v2/
│   └── SOUL.md      # First variation
└── v3/
    └── SOUL.md      # Second variation
registry.json schema:
{
"versions": [
{
"id": "v1",
"created": "2026-03-15T00:00:00Z",
"source": "baseline",
"active": true,
"weight": 0.5
},
{
"id": "v2",
"created": "2026-03-16T00:00:00Z",
"source": "evaluation-pipeline",
"active": true,
"weight": 0.5,
"parentVersion": "v1"
}
],
"currentDefault": "v2"
}
How it works:
- On each agent loop start, the context builder reads registry.json
- Selects a prompt version using weighted random selection from active versions
- Weights are normalized automatically if they don't sum to 1.0
- The selected version's SOUL.md is injected into the system prompt
- Version ID is stored in session metadata and scoring logs for analysis
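Weighted random selection with implicit normalization can be sketched like this. It is an illustration rather than the code in src/lib/prompt-version.ts: the selectVersion name and the injectable rand parameter are assumptions, and the interface keeps only the registry fields the selection needs.

```typescript
// Subset of a registry.json version entry used for selection.
interface PromptVersion {
  id: string;
  active: boolean;
  weight: number;
}

function selectVersion(
  versions: PromptVersion[],
  rand: () => number = Math.random, // injectable for deterministic tests
): PromptVersion {
  const candidates = versions.filter((v) => v.active && v.weight > 0);
  if (candidates.length === 0) throw new Error("no active prompt versions");

  // Scaling the draw by the total weight normalizes the weights implicitly,
  // so they do not need to sum to 1.0.
  const total = candidates.reduce((sum, v) => sum + v.weight, 0);
  let remaining = rand() * total;

  for (const v of candidates) {
    remaining -= v.weight;
    if (remaining < 0) return v;
  }
  // Only reached through floating-point drift; fall back to the last candidate.
  return candidates[candidates.length - 1];
}

const picked = selectVersion([
  { id: "v1", active: true, weight: 0.5 },
  { id: "v2", active: true, weight: 0.5 },
]);
console.log(picked.id); // "v1" or "v2", each with probability 0.5
```

Passing a fixed rand makes the choice reproducible, which helps when testing that version IDs are attributed correctly in scoring logs.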
Automatic initialization:
- If registry.json is missing, the system auto-creates a fresh v1 registry
- Copies the current workspace/SOUL.md to workspace/prompts/v1/SOUL.md
- If root SOUL.md doesn't exist, creates a default prompt
Configuration:
| Variable | Default | Description |
|---|---|---|
| PROMPT_VERSIONING_ENABLED | true | Enable/disable prompt versioning |
The evaluation pipeline automatically analyzes scored responses and generates improved prompt versions. In production you can run it on whatever cadence gives the agent enough fresh data, such as nightly; by default this repo runs the evaluator every 6 hours.
The main lesson from the first runs was Goodhart's Law in miniature: once the prompt generator saw enough performance data, it began producing prompts that mirrored the evaluation criteria instead of simply improving the agent's behavior. The prompt-generation step now includes explicit output rules that prohibit scoring targets, metrics, and evaluation data from appearing in generated SOUL.md files.
For testing the evaluation pipeline, use a smaller model like Haiku instead of Sonnet. Sonnet produces consistently high-quality responses, making it difficult to generate the score variance needed to trigger prompt improvements:
# Switch to Haiku for testing
AGENT_MODEL=claude-3-5-haiku-20241022
Haiku is more likely to produce varied scores across different question types, which helps the evaluation pipeline identify underperforming prompt versions and generate meaningful improvements.
How it works:
A cron function (evaluate-prompts) runs on a configurable schedule (default: every 6 hours):
- Load scores: Reads all JSONL files from workspace/scores/
- Aggregate: Groups scores by prompt version, computes averages
- Identify underperformers: Finds versions with:
  - Composite score below target (default: 7.0), OR
  - Composite score significantly below best version (default: 1.0+ points gap)
- Generate improvements: Calls LLM with underperforming prompt + rationales to produce improved SOUL.md
- Create new versions: Writes improved prompts to workspace/prompts/vN/SOUL.md
- Promote winners: Versions with ≥80% traffic share and ≥1.0 point advantage over default become new default
- Enforce cap: If >5 active versions, retire lowest-scoring (v1 is never retired)
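The aggregate and identify steps might look roughly like the sketch below. The function and type names are hypothetical, the defaults come from the documented thresholds, and two details are assumptions: versions with too few data points are skipped entirely, and the "best version" is taken only from versions that have enough data.

```typescript
interface ScoreRecord {
  promptVersion: string;
  composite: number;
}

interface VersionStats {
  id: string;
  count: number;
  mean: number;
}

// Defaults mirror EVAL_MIN_DATA_POINTS, EVAL_TARGET_COMPOSITE, EVAL_SIGNIFICANT_GAP.
interface Thresholds {
  minDataPoints: number;
  target: number;
  significantGap: number;
}
const DEFAULTS: Thresholds = { minDataPoints: 10, target: 7.0, significantGap: 1.0 };

// Group scores by prompt version and compute the mean composite per version.
function aggregate(scores: ScoreRecord[]): VersionStats[] {
  const sums: Record<string, { sum: number; count: number }> = {};
  for (const s of scores) {
    if (!sums[s.promptVersion]) sums[s.promptVersion] = { sum: 0, count: 0 };
    sums[s.promptVersion].sum += s.composite;
    sums[s.promptVersion].count += 1;
  }
  return Object.keys(sums).map((id) => ({
    id,
    count: sums[id].count,
    mean: sums[id].sum / sums[id].count,
  }));
}

// A version underperforms if it is below target OR trails the best by the gap.
function findUnderperformers(stats: VersionStats[], t: Thresholds = DEFAULTS): string[] {
  const eligible = stats.filter((s) => s.count >= t.minDataPoints);
  if (eligible.length === 0) return [];
  const best = Math.max(...eligible.map((s) => s.mean));
  return eligible
    .filter((s) => s.mean < t.target || best - s.mean >= t.significantGap)
    .map((s) => s.id);
}
```

Note that the second condition can flag a version that is above target, as long as another version is clearly ahead of it.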
Weight redistribution:
- New versions start at 50% weight (configurable)
- Remaining weight distributed proportionally among other active versions
- Weights normalized automatically
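One way to read the redistribution rules is sketched below: the new version takes its configured share, and the existing active versions are rescaled proportionally to fill the rest. The addVersionWithWeight name and the Weighted type are hypothetical, and the sketch assumes at least one existing active version.

```typescript
interface Weighted {
  id: string;
  weight: number;
}

// Give the new version `newWeight` and rescale the others to share 1 - newWeight
// in proportion to their current weights. Returns a new array.
function addVersionWithWeight(existing: Weighted[], newId: string, newWeight = 0.5): Weighted[] {
  if (!(newWeight > 0 && newWeight < 1)) {
    // Matches the safety rail that a new version never starts at 100%.
    throw new RangeError("newWeight must be strictly between 0 and 1");
  }
  const total = existing.reduce((sum, v) => sum + v.weight, 0);
  const remaining = 1 - newWeight;
  const rescaled = existing.map((v) => ({
    id: v.id,
    // If all current weights are zero, split the remainder evenly instead.
    weight: total > 0 ? (v.weight / total) * remaining : remaining / existing.length,
  }));
  return rescaled.concat([{ id: newId, weight: newWeight }]);
}

// Two versions at 0.5 each, plus a new one at the default 0.5:
// v1 -> 0.25, v2 -> 0.25, v3 -> 0.5
console.log(addVersionWithWeight([{ id: "v1", weight: 0.5 }, { id: "v2", weight: 0.5 }], "v3"));
```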
Safety rails:
- Minimum data points required before any rewrite (default: 10)
- Maximum active versions cap (default: 5)
- Baseline (v1) is never deleted, only deprioritized
- New versions never start at 100% β always keep a control
- Generated prompts are forbidden from including scoring targets, metrics, or evaluation data
Configuration (env vars):
| Variable | Default | Description |
|---|---|---|
| EVALUATION_CRON | 0 */6 * * * | Cron schedule for evaluation runs |
| EVAL_MIN_DATA_POINTS | 10 | Minimum scores before version can be rewritten |
| EVAL_TARGET_COMPOSITE | 7.0 | Target composite score threshold |
| EVAL_MAX_VERSIONS | 5 | Maximum active versions before retirement |
| EVAL_NEW_VERSION_WEIGHT | 0.5 | Initial weight for new versions |
| EVAL_PROMOTION_TRAFFIC | 0.8 | Traffic share required for promotion |
| EVAL_PROMOTION_SCORE_GAP | 1.0 | Score advantage required for promotion |
| EVAL_SIGNIFICANT_GAP | 1.0 | Points below best to trigger rewrite |
Long conversations get summarized automatically so the agent doesn't lose context or hit token limits:
- Token estimation: Uses a chars/4 heuristic to estimate conversation size
- Threshold: Compaction triggers when estimated tokens exceed 80% of the configured max (150K)
- LLM summarization: Old messages are summarized into a structured checkpoint (goals, progress, decisions, next steps)
- Recent messages preserved: The most recent ~20K tokens of conversation are kept verbatim
- Persisted: The compacted session replaces the JSONL file, so it survives restarts
Compaction runs as an Inngest step (step.run("compact")), so it's durable and retryable.
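The trigger and the split between summarized and preserved messages can be sketched as follows. This assumes the configured max is 150,000 tokens, so compaction starts above 120,000 estimated tokens; the function names are hypothetical and the summarization call itself is left out.

```typescript
type ChatMessage = { content: string };

const MAX_TOKENS = 150000; // configured max, per the text above
const COMPACTION_RATIO = 0.8; // compact above 80% of the max
const KEEP_RECENT_TOKENS = 20000; // recent tail preserved verbatim

// chars/4 heuristic: cheap, but only a rough estimate.
function estimateTokens(messages: ChatMessage[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}

function shouldCompact(messages: ChatMessage[], maxTokens = MAX_TOKENS): boolean {
  return estimateTokens(messages) > maxTokens * COMPACTION_RATIO;
}

// Walk backwards from the newest message, keeping messages until the
// recent-token budget would be exceeded. Everything older gets summarized.
function splitForCompaction(messages: ChatMessage[], keepTokens = KEEP_RECENT_TOKENS) {
  let kept = 0;
  let cut = messages.length;
  while (cut > 0) {
    const cost = Math.ceil(messages[cut - 1].content.length / 4);
    if (kept + cost > keepTokens) break;
    kept += cost;
    cut--;
  }
  return { toSummarize: messages.slice(0, cut), recent: messages.slice(cut) };
}
```

Splitting on message boundaries keeps the preserved tail intact; the summary produced from toSummarize would then be placed ahead of recent when the session file is rewritten.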
Long tool results bloat the conversation context and cause the LLM to lose focus. The agent uses two-tier pruning:
- Soft trim: Tool results over 4K chars get head+tail trimmed (first 1,500 + last 1,500 chars)
- Hard clear: When total old tool content exceeds 50K chars, old results are replaced entirely
- Budget warnings: System messages are injected when iterations are running low
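The first two pruning tiers can be illustrated with the sketch below. The thresholds are the ones listed above; the function names and the placeholder strings left in place of removed content are assumptions for this example.

```typescript
// Soft trim: keep the head and tail of an oversized tool result.
function softTrim(result: string, limit = 4000, keep = 1500): string {
  if (result.length <= limit) return result;
  const omitted = result.length - 2 * keep;
  // Hypothetical marker so the model can tell content was removed.
  return result.slice(0, keep) + `\n[... ${omitted} chars trimmed ...]\n` + result.slice(-keep);
}

// Hard clear: once old tool output exceeds the budget, drop it entirely.
function hardClear(oldResults: string[], budget = 50000): string[] {
  const total = oldResults.reduce((n, r) => n + r.length, 0);
  if (total <= budget) return oldResults;
  return oldResults.map(() => "[tool result cleared]"); // hypothetical placeholder
}
```

Keeping both ends of a long result tends to preserve the parts that matter most in practice, such as a command's opening lines and its final status or error.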
The agent is channel-agnostic. Each channel implements a ChannelHandler interface (src/channels/types.ts) with methods for sending replies, acknowledging messages, and setup. Each channel directory follows the same structure:
src/channels/<name>/
├── handler.ts      # ChannelHandler implementation (sendReply, acknowledge)
├── api.ts          # API client for the channel's platform
├── setup.ts        # Webhook setup automation
├── transform.ts    # Plain JS transform for Inngest webhook
└── format.ts       # Markdown → channel-specific format conversion
To add Discord, WhatsApp, or any other channel:
- Create a new directory under src/channels/ following the structure above
- Implement the ChannelHandler interface in handler.ts
- Write a webhook transform that converts the channel's payload to agent.message.received
- Register the channel in src/channels/index.ts
The agent loop, reply dispatch, and acknowledgment functions are all channel-agnostic, so no changes are needed outside src/channels/.
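As a rough illustration of the transform step, the sketch below maps an incoming webhook payload to an agent.message.received event. Inngest webhook transforms are functions that receive the parsed payload and return an event with a name and data. Everything else here is an assumption: the ExamplePayload shape is invented for a generic chat platform, and the data fields (channel, chatId, userId, text) are hypothetical rather than the schema this repo uses. Types are shown for clarity; the transform uploaded to Inngest is plain JS, so they would be stripped.

```typescript
// Hypothetical payload from a generic chat platform (not a real API schema).
interface ExamplePayload {
  channel_id: string;
  author: { id: string };
  content: string;
}

// Inngest calls the transform with the parsed body, headers, and query params.
function transform(
  evt: ExamplePayload,
  headers: Record<string, string> = {},
  queryParams: Record<string, string> = {},
) {
  return {
    name: "agent.message.received",
    data: {
      channel: "example", // hypothetical channel identifier
      chatId: evt.channel_id, // hypothetical field names below
      userId: evt.author.id,
      text: evt.content,
    },
  };
}

console.log(transform({ channel_id: "c1", author: { id: "u1" }, content: "hello" }));
```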
- connect(): WebSocket-based worker
- Singleton execution: one run per chat at a time
- Step retries: automatic retry on LLM API failures
- Event-driven functions: compose behavior from small focused functions
- Webhook transforms: convert external payloads to typed events
- Checkpointing: near-zero inter-step latency
This project uses pi-ai (@mariozechner/pi-ai) by Mario Zechner for its unified LLM interface and @mariozechner/pi-coding-agent for its standard tools. pi-ai provides a single complete() function that works across Anthropic, OpenAI, Google, and other providers, making it easy to swap models without changing any agent code. It's a great library.
Apache-2.0