You've built an agent. You've added memory, tools, behavioral patterns.
Now: how do you know if it's actually better?
This is the hard part. Most people skip it and just vibe-check.
"Better" is fuzzy. Better at what? For whom? In what situation?
You can't improve what you can't measure. But measuring agent behavior is tricky because:
- Outputs are non-deterministic (same input → different outputs)
- "Good" is subjective (helpful to one user might be annoying to another)
- Context matters (a behavior that helps in one case might hurt in another)
Run it, see if it feels better.
Pros: Fast, intuitive Cons: Unreliable, biased, doesn't scale
You'll convince yourself it's better because you want it to be.
Create a set of test cases with expected behaviors:
test_cases = [
{
"input": "What's my favorite color?",
"setup": {"memory": "user likes blue"},
"expected_behavior": "should check memory before answering",
"expected_contains": "blue"
},
{
"input": "Write a todo app",
"expected_behavior": "should do pre-game routine",
"expected_contains": ["TASK:", "APPROACH:"]
},
{
"input": "What did we talk about yesterday?",
"setup": {"memory": None},
"expected_behavior": "should admit it doesn't know",
"expected_not_contains": ["we discussed", "you mentioned"]
}
]
Run all test cases, check if behavior matches expectations.
Pros: Repeatable, catches regressions Cons: Tedious to create, can miss emergent issues
Run two versions of your agent on the same inputs, compare outputs.
def compare_agents(prompt, agent_a, agent_b):
response_a = agent_a.run(prompt)
response_b = agent_b.run(prompt)
# Either: human judges which is better
# Or: use another LLM to judge
return get_preference(response_a, response_b)
Pros: Directly measures "better" Cons: Expensive, slow, still subjective
Use a model to evaluate another model's output:
judge_prompt = f"""
Rate this response on a scale of 1-5 for:
- Helpfulness
- Accuracy
- Conciseness
User asked: {user_input}
Agent responded: {agent_response}
Provide scores and brief justification.
"""
evaluation = judge_model.run(judge_prompt)
Pros: Scalable, consistent criteria Cons: Judge model has its own biases
Instead of judging output quality, check if specific behaviors happened:
def check_behaviors(conversation_log):
checks = {
"checked_memory_before_recall": False,
"did_pregame_routine": False,
"saved_checkpoint": False,
"admitted_uncertainty": False,
}
for turn in conversation_log:
if "memory_search" in turn.tool_calls:
checks["checked_memory_before_recall"] = True
if "TASK:" in turn.content and "APPROACH:" in turn.content:
checks["did_pregame_routine"] = True
# ... etc
return checks
Pros: Objective, easy to automate Cons: Behavior ≠ quality (can follow rules but still be unhelpful)
- Did it check memory before answering recall questions?
- Did it do a pre-game routine before complex tasks?
- Did it checkpoint during long tasks?
- Did it use available tools or just hallucinate?
- Did the response contain the correct information?
- Did it complete the requested task?
- Was it concise or rambling?
- Did it hallucinate facts?
- Did it claim to remember things it couldn't know?
- Did it skip safety checks?
For your agent, I'd suggest:
- Define 10-20 test cases covering key scenarios
- Run each weekly (or after changes)
- Check both behaviors AND outcomes
- Log everything — conversations, tool calls, timing
- Review failures — why did it mess up?
Start simple. A spreadsheet with pass/fail is better than nothing.
Evaluation is where "vibes-based development" becomes "engineering."
Without measurement, you're just guessing. With measurement, you can:
- Prove that a change helped (or hurt)
- Catch regressions before users do
- Make principled tradeoffs
The best agents aren't the ones with the most features. They're the ones that reliably do what they're supposed to.
Create 5 test cases for the memory agent:
- Does it check memory before answering recall questions?
- Does it admit when it doesn't know something?
- Does it do a pre-game routine for complex tasks?
- Does it save important information when asked to remember?
- Does it retrieve saved information correctly?
Then run them and see what passes.