Time: 10 minutes | Level: Beginner | Prerequisites: Completed the Quickstart
In the quickstart, you built a calculator with a single command and watched Babysitter iterate to quality. Now let's understand exactly what happened under the hood. This knowledge will help you use Babysitter more effectively and debug issues when they arise.
- The Anatomy of a Babysitter Run
- Understanding the Run Directory
- The Event Journal Explained
- How Quality Convergence Works
- The TDD Methodology in Action
- Configuration and Customization
- Verifying Success
- Next Steps
When you typed /babysitter:call create a calculator with TDD and 80% quality target, here's the sequence of events:
Your Command
|
v
+-------------------+
| 1. Parse Request | Babysitter interprets your natural language
+-------------------+
|
v
+-------------------+
| 2. Create Run | A unique run ID and directory are created
+-------------------+
|
v
+-------------------+
| 3. Load Process | TDD Quality Convergence process is loaded
+-------------------+
|
v
+-------------------+
| 4. Execute Phases | Research -> Specs -> TDD Loop
+-------------------+
|
v
+-------------------+
| 5. Quality Loop | Iterate until target (80%) is met
+-------------------+
|
v
+-------------------+
| 6. Complete | Final results and summary
+-------------------+
Babysitter analyzed your prompt and extracted:
- Goal: Create a calculator module
- Methodology: TDD (Test-Driven Development)
- Quality Target: 80%
- Max Iterations: 5 (default)
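A rough sketch of the kind of option extraction step 1 performs. The real parsing is model-driven, not regex-driven; the pattern below is purely illustrative:

```shell
# Illustrative only: pull a "<N>% quality" option out of a request string.
# Babysitter's actual request parsing is done by the model, not by regex.
prompt="create a calculator with TDD and 80% quality target"
target=$(echo "$prompt" | grep -oE '[0-9]+% quality' | grep -oE '^[0-9]+')
echo "quality target: ${target:-80 (default)}"
```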
A unique run was created with:
- Run ID: 01KFFTSF8TK8C9GT3YM9QYQ6WG (ULID format)
- Directory: .a5c/runs/01KFFTSF8TK8C9GT3YM9QYQ6WG/
- Journal: Empty, ready to record events
The TDD Quality Convergence process was loaded. This defines:
- Which phases to execute
- How to measure quality
- When to iterate vs. complete
The process ran through:
- Research Phase: Analyzed your codebase
- Specification Phase: Defined what to build
- Implementation Phase: TDD loop (write tests, implement, score)
Within the implementation phase:
- Iteration 1: Score 72/100 (below 80% target)
- Iteration 2: Score 88/100 (above target - success!)
Run marked as complete, final summary generated.
Let's explore what Babysitter created. Navigate to your run directory:
cd .a5c/runs/
ls
You'll see your run ID (e.g., 01KFFTSF8TK8C9GT3YM9QYQ6WG). Let's explore its structure:
.a5c/runs/01KFFTSF8TK8C9GT3YM9QYQ6WG/
|
+-- run.json # Run metadata
+-- inputs.json # Run inputs
|
+-- journal/
| +-- 000001.<ulid>.json # Event log (individual JSON files, source of truth)
| +-- 000002.<ulid>.json
| +-- ...
|
+-- state/
| +-- state.json # Current state cache (derived)
|
+-- tasks/
| +-- <effectId>/ # Task artifacts per effect
| +-- ...
|
+-- artifacts/
| +-- specifications.md # Generated specs
| +-- plan.md # Implementation plan
|
+-- code/
+-- main.js # Process definition used
The source of truth. Each event is stored as an individual JSON file named {SEQ}.{ULID}.json (e.g., 000001.01ARZ3NDEKTSV4RRFFQ69G5FAV.json). This directory is:
- Append-only (files are never modified, only new files are added)
- Human-readable (each file is a standalone JSON document)
- The basis for session resumption
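Because files are only ever added, the sequence numbers should be contiguous. A small integrity sketch, assuming the 6-digit zero-padded SEQ prefix (run against a stand-in directory here; point it at a real journal/ in practice):

```shell
# Check that journal SEQ numbers are contiguous: the highest SEQ should
# equal the number of files. Uses a temporary stand-in journal directory.
dir=$(mktemp -d)
touch "$dir"/000001.aaa.json "$dir"/000002.bbb.json "$dir"/000003.ccc.json
count=$(ls "$dir" | wc -l | tr -d ' ')
last=$(ls "$dir" | sort | tail -1 | cut -d. -f1 | sed 's/^0*//')  # strip leading zeros
if [ "$last" -eq "$count" ]; then echo "contiguous"; else echo "gap detected"; fi
```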
A derived cache of current state. This is:
- Rebuilt from journal if deleted
- Used for fast state access
- Not the source of truth (journal is)
Contains artifacts from each task:
- Input parameters
- Output results
- Logs and intermediate files
Generated documents like:
- Specifications
- Plans
- Reports
The journal is the heart of Babysitter's persistence. Let's examine it:
# List all journal events (each is an individual JSON file)
ls .a5c/runs/01KFFTSF8TK8C9GT3YM9QYQ6WG/journal/
# View a specific event
cat .a5c/runs/01KFFTSF8TK8C9GT3YM9QYQ6WG/journal/000001.*.json | jq .
Here's what each event type means:
Each event is stored as an individual JSON file in journal/ with the naming pattern {SEQ}.{ULID}.json. The event schema is:
// 000001.<ulid>.json
{"type":"RUN_CREATED","recordedAt":"2026-01-25T14:30:12Z","data":{"runId":"01KFF...","inputs":{}},"checksum":"sha256hex..."}
// (final event, e.g., 000012.<ulid>.json)
{"type":"RUN_COMPLETED","recordedAt":"2026-01-25T14:34:45Z","data":{"status":"success"},"checksum":"sha256hex..."}
- RUN_CREATED: A new run began with specific inputs
- RUN_COMPLETED: Run finished successfully
- RUN_FAILED: Run finished with an error
Note: The seq number is derived from the filename, not stored in the event body. Each event includes a checksum field (sha256 hex) for integrity verification.
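Since seq lives only in the filename, recovering it is a matter of string splitting. A sketch using POSIX parameter expansion:

```shell
# Split a journal filename of the form {SEQ}.{ULID}.json into its parts.
f="000012.01ARZ3NDEKTSV4RRFFQ69G5FAV.json"
seq="${f%%.*}"           # everything before the first dot
rest="${f#*.}"           # everything after the first dot
ulid="${rest%.json}"     # strip the .json suffix
echo "seq=$seq ulid=$ulid"
```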
Effects represent tasks and interactions that Babysitter delegates (agent calls, skill invocations, scripts, breakpoints). There are exactly two effect event types:
// EFFECT_REQUESTED: An effect (task) has been requested
// e.g., 000003.<ulid>.json
{"type":"EFFECT_REQUESTED","recordedAt":"2026-01-25T14:30:45Z","data":{"effectId":"<effectId>","kind":"agent","args":{}},"checksum":"sha256hex..."}
// EFFECT_RESOLVED: An effect has completed (successfully or with error)
// e.g., 000004.<ulid>.json
{"type":"EFFECT_RESOLVED","recordedAt":"2026-01-25T14:31:10Z","data":{"effectId":"<effectId>","status":"ok","result":{}},"checksum":"sha256hex..."}
// EFFECT_RESOLVED with error status
// e.g., 000005.<ulid>.json
{"type":"EFFECT_RESOLVED","recordedAt":"2026-01-25T14:31:15Z","data":{"effectId":"<effectId>","status":"error","error":"..."},"checksum":"sha256hex..."}
- EFFECT_REQUESTED: Records when a task, agent call, or breakpoint is initiated
- EFFECT_RESOLVED (status: ok): Records successful completion of an effect
- EFFECT_RESOLVED (status: error): Records when an effect fails
Task artifacts are stored in tasks/<effectId>/ directories containing task.json, input.json, result.json, output.json, stdout.log, and stderr.log.
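Since every effect produces exactly one EFFECT_REQUESTED and one EFFECT_RESOLVED, you can join the two by effectId to see each task's lifecycle. A sketch using jq over stand-in events (in a real run, point it at .a5c/runs/&lt;runId&gt;/journal/*.json):

```shell
# Pair each EFFECT_REQUESTED with its EFFECT_RESOLVED by effectId.
dir=$(mktemp -d)
echo '{"type":"EFFECT_REQUESTED","recordedAt":"2026-01-25T14:30:45Z","data":{"effectId":"fx1","kind":"agent"}}' > "$dir/000003.x.json"
echo '{"type":"EFFECT_RESOLVED","recordedAt":"2026-01-25T14:31:10Z","data":{"effectId":"fx1","status":"ok"}}' > "$dir/000004.y.json"
jq -s '
  group_by(.data.effectId)[]
  | {effectId: .[0].data.effectId,
     requested: map(select(.type=="EFFECT_REQUESTED"))[0].recordedAt,
     resolved:  map(select(.type=="EFFECT_RESOLVED"))[0].recordedAt,
     status:    map(select(.type=="EFFECT_RESOLVED"))[0].data.status}
' "$dir"/*.json
```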
Breakpoints are modeled as effects. When human approval is needed:
// Breakpoint requested as an effect
{"type":"EFFECT_REQUESTED","recordedAt":"...","data":{"effectId":"<effectId>","kind":"breakpoint","question":"Deploy to prod?"},"checksum":"sha256hex..."}
// Breakpoint resolved (approved or rejected)
{"type":"EFFECT_RESOLVED","recordedAt":"...","data":{"effectId":"<effectId>","status":"ok","approver":"user"},"checksum":"sha256hex..."}Note on breakpoint modes: These events are recorded regardless of whether the breakpoint was handled:
- Interactively (via AskUserQuestion in Claude Code chat), or
- Non-interactively (via the breakpoints web UI at http://localhost:3184)
Note on quality tracking: Quality scores and iteration/phase progress are not tracked as separate event types in the journal. Quality metrics can be tracked within effect data or via custom application logic on top of the five core event types: RUN_CREATED, EFFECT_REQUESTED, EFFECT_RESOLVED, RUN_COMPLETED, and RUN_FAILED.
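For example, if your process happens to store a numeric score inside each iteration's effect result (the result.score field below is a hypothetical convention, not part of the schema), jq can pull out the progression. A sketch over stand-in events:

```shell
# Hypothetical: extract a "score" field from effect results, in SEQ order.
# The result.score field is an assumed convention, not part of the schema.
dir=$(mktemp -d)
echo '{"type":"EFFECT_RESOLVED","data":{"effectId":"i1","result":{"score":72}}}' > "$dir/000005.a.json"
echo '{"type":"EFFECT_RESOLVED","data":{"effectId":"i2","result":{"score":88}}}' > "$dir/000009.b.json"
jq -r 'select(.type=="EFFECT_RESOLVED") | .data.result.score? // empty' "$dir"/*.json
```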
The journal enables:
- Deterministic Replay: Given the same inputs and journal, you get the same state
- Session Resumption: Replay events to restore exactly where you left off
- Audit Trail: Complete history of what happened and when
- Debugging: Trace through events to find issues
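A minimal illustration of replay: fold over the journal in filename order and derive a status. This simplified reducer only inspects lifecycle events; the real state rebuild also restores effect and task state:

```shell
# Toy replay: derive the run status from lifecycle events in a
# stand-in journal directory (logic deliberately simplified).
dir=$(mktemp -d)
echo '{"type":"RUN_CREATED"}'      > "$dir/000001.a.json"
echo '{"type":"EFFECT_REQUESTED"}' > "$dir/000002.b.json"
echo '{"type":"EFFECT_RESOLVED"}'  > "$dir/000003.c.json"
echo '{"type":"RUN_COMPLETED"}'    > "$dir/000004.d.json"
jq -s -r 'map(.type)
  | if any(. == "RUN_COMPLETED") then "completed"
    elif any(. == "RUN_FAILED") then "failed"
    else "running" end' "$dir"/*.json
```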
Quality convergence is Babysitter's core value proposition. Here's how it works:
+------------------+
| Write Tests |
+------------------+
|
v
+------------------+
| Implement Code |
+------------------+
|
v
+------------------+
| Run Quality |
| Checks |
+------------------+
|
v
+------------------+
| Score Quality |----> Score >= Target? --Yes--> Done!
+------------------+           |
         ^                     | No
         |                     v
         +---------------------+
               Continue loop
For your calculator run, these metrics were evaluated:
| Metric | Iteration 1 | Iteration 2 | Weight |
|---|---|---|---|
| Tests Passing | 9/12 (75%) | 12/12 (100%) | 40% |
| Code Coverage | 70% | 70% | 30% |
| Linting | 2 warnings (0.60) | 0 warnings (1.00) | 15% |
| Complexity | Medium (0.80) | Medium (0.80) | 15% |
Weighted Score Calculation (linting and complexity are mapped to 0-1 scores; values here are illustrative):
- Iteration 1: (0.75 * 40) + (0.70 * 30) + (0.60 * 15) + (0.80 * 15) = 30 + 21 + 9 + 12 = 72
- Iteration 2: (1.00 * 40) + (0.70 * 30) + (1.00 * 15) + (0.80 * 15) = 40 + 21 + 15 + 12 = 88
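The weighted-sum arithmetic is easy to check by hand. A generic sketch, with component scores as 0-1 fractions and weights summing to 100 (the component values below are arbitrary examples):

```shell
# Weighted quality score: tests 40%, coverage 30%, linting 15%, complexity 15%.
awk -v tests=0.90 -v cov=0.80 -v lint=1.00 -v cx=1.00 \
  'BEGIN { printf "%.1f\n", tests*40 + cov*30 + lint*15 + cx*15 }'
```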
The quality score isn't just automated metrics. An AI agent also evaluates:
- Code readability
- Best practices adherence
- Error handling quality
- Documentation completeness
This hybrid approach catches issues that pure metrics miss.
You can customize targets in your prompts:
# Conservative (high quality)
/babysitter:call build feature with TDD and 90% quality target
# Balanced (default-ish)
/babysitter:call build feature with TDD and 80% quality target
# Fast (lower quality, fewer iterations)
/babysitter:call build feature with TDD and 70% quality target
Higher targets = more iterations = longer runtime = higher quality
The quickstart used the TDD Quality Convergence methodology ("TDD" for short throughout this guide), which combines test-first development with iterative quality improvement. Here's what it does:
Purpose: Understand the context before coding
What happens:
- Analyze existing codebase structure
- Identify coding patterns and conventions
- Detect test framework (Jest, Mocha, etc.)
- Note dependencies and constraints
Output: Research summary with recommendations
Purpose: Define what to build before building it
What happens:
- Create detailed specifications from your request
- Define function signatures and interfaces
- List test cases to write
- Create implementation plan
Output: artifacts/specifications.md
Purpose: Build with quality through iteration
Each iteration:
1. Write Tests First
   - Create test files with test cases
   - Tests should fail (code doesn't exist yet)
2. Implement Code
   - Write minimal code to pass tests
   - Follow specifications from Phase 2
3. Run Quality Checks
   - Execute tests
   - Measure coverage
   - Run linting
   - Check complexity
4. Score Quality
   - Calculate weighted score
   - Compare to target
   - If below target, identify improvements
5. Iterate or Complete
   - Below target? Fix issues and repeat
   - Above target? Mark as complete
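The iterate-or-complete decision boils down to a bounded loop. A sketch with a stubbed quality check (the fixed +44 per iteration is a stand-in for real tests, coverage, and scoring):

```shell
# Bounded quality loop: iterate until the score meets the target or the
# iteration budget runs out. The score increment is a stub.
target=80; max_iters=5; score=0; i=0
while [ "$i" -lt "$max_iters" ] && [ "$score" -lt "$target" ]; do
  i=$((i + 1))
  score=$((score + 44))   # stub: pretend each iteration improves quality
  echo "iteration $i: score $score"
done
if [ "$score" -ge "$target" ]; then echo "target met"; else echo "max iterations reached"; fi
```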
TDD and Babysitter are a natural fit because:
- Clear success criteria: Tests define when you're done
- Measurable progress: Test pass rate and coverage are numbers
- Incremental improvement: Each iteration fixes specific test failures
- Quality guarantee: Passing tests = working code
You can customize Babysitter's behavior in several ways:
# Set quality target
/babysitter:call build API with 85% quality target
# Set max iterations
/babysitter:call build API with max 10 iterations
# Combine options
/babysitter:call build API with TDD, 90% quality, max 8 iterations
Different methodologies for different needs:
| Methodology | Best For | Quality Focus |
|---|---|---|
| TDD Quality Convergence | Feature development | High |
| GSD (Get Shit Done) | Quick prototypes | Medium |
| Spec-Kit | Complex specifications | High |
# Explicit methodology selection
/babysitter:call build feature using TDD methodology
/babysitter:call prototype using GSD methodology
Prevent runaway loops:
# Low limit (fast, may not reach target)
/babysitter:call build feature with max 3 iterations
# High limit (thorough, takes longer)
/babysitter:call build feature with max 15 iterations
If the maximum iteration count is reached before the quality target is met, Babysitter completes the run with a warning.
How do you know your Babysitter run succeeded? Here's a checklist:
| Check | How to Verify | Expected |
|---|---|---|
| Run completed | Check run summary | "Run completed successfully" |
| Quality met | Check final score | Score >= your target |
| Tests passing | Run npm test | All tests pass |
| Files created | ls your directory | New implementation files |
| Journal complete | Check last event | RUN_COMPLETED with success |
# Check run status
cat .a5c/runs/<runId>/state/state.json | jq '.status'
# Expected: "completed"
# View the last journal event (check for RUN_COMPLETED)
ls .a5c/runs/<runId>/journal/ | sort | tail -1 | xargs -I {} cat .a5c/runs/<runId>/journal/{} | jq '.type'
# Expected: "RUN_COMPLETED"
# Run tests manually
npm test
# Expected: All passing
# Check for the implementation
ls -la calculator.js calculator.test.js
# Expected: Both files exist
Run failed:
# Check the journal for RUN_FAILED or error events
for f in .a5c/runs/<runId>/journal/*.json; do cat "$f" | jq 'select(.type == "RUN_FAILED" or (.type == "EFFECT_RESOLVED" and .data.status == "error"))'; doneQuality not reached:
# View all EFFECT_RESOLVED events to check task results
for f in .a5c/runs/<runId>/journal/*.json; do cat "$f" | jq 'select(.type == "EFFECT_RESOLVED")'; doneIncomplete run:
# Resume and continue
claude "/babysitter:call resume the babysitter run <runId>"Let's practice what you've learned. Complete these exercises:
How many iterations did your run take?
# Your command here (count EFFECT_REQUESTED events as a proxy for tasks per iteration):
for f in .a5c/runs/<your-run-id>/journal/*.json; do cat "$f" | jq -r 'select(.type == "EFFECT_REQUESTED") | .type'; done | wc -l
Answer: The number of effects requested gives you insight into the work performed across iterations.
What was the quality score after each iteration?
# Your command here (view all EFFECT_RESOLVED events to see task results):
for f in .a5c/runs/<your-run-id>/journal/*.json; do cat "$f" | jq 'select(.type == "EFFECT_RESOLVED") | .data'; done
Expected: Effect results showing progression toward the quality target.
How many tasks were executed?
# Your command here:
for f in .a5c/runs/<your-run-id>/journal/*.json; do cat "$f" | jq -r 'select(.type == "EFFECT_REQUESTED") | .type'; done | wc -l
How long did the run take?
# Find start and end times (first and last journal files)
cat .a5c/runs/<your-run-id>/journal/$(ls .a5c/runs/<your-run-id>/journal/ | sort | head -1) | jq '.recordedAt'
cat .a5c/runs/<your-run-id>/journal/$(ls .a5c/runs/<your-run-id>/journal/ | sort | tail -1) | jq '.recordedAt'
| Term | Definition |
|---|---|
| Run | A single execution of a Babysitter workflow |
| Run ID | Unique identifier for a run (ULID format) |
| Journal | Append-only event log, source of truth |
| Iteration | One pass through the quality loop |
| Quality Score | Weighted metric combining tests, coverage, etc. |
| Breakpoint | Human approval checkpoint |
| Process | Definition of workflow phases and logic |
| File | Purpose |
|---|---|
| journal/*.json | Event log as individual JSON files (never delete!) |
| state/state.json | State cache (can be rebuilt) |
| tasks/ | Task artifacts |
| artifacts/ | Generated documents |
# View all journal events
for f in .a5c/runs/<runId>/journal/*.json; do cat "$f" | jq .; done
# Check run status
cat .a5c/runs/<runId>/state/state.json | jq '.status'
# Resume a run
claude "Resume the babysitter run <runId>"
# List all runs
ls -la .a5c/runs/
Now that you understand what happened in your first run, you're ready to explore more:
- Try Different Quality Targets
/babysitter:call add validation to calculator with 90% quality
- Experience Breakpoints
/babysitter:call refactor calculator with breakpoint approval before changes
Claude will ask you directly in the chat when approval is needed!
(For non-interactive mode, you'd approve at http://localhost:3184)
- Test Session Resumption
  - Start a longer run
  - Interrupt it (Ctrl+C or close Claude Code)
  - Resume with /babysitter:call resume
- Read TDD Methodology Deep Dive
- Try the GSD Methodology for faster prototyping
- Learn about Parallel Execution
- Create Custom Processes
Print this for your desk:
BABYSITTER QUICK REFERENCE
==========================
START A RUN:
/babysitter:call <request> with TDD and <X>% quality
RESUME A RUN:
/babysitter:call resume
/babysitter:call resume --run-id <id>
VIEW JOURNAL:
for f in .a5c/runs/<id>/journal/*.json; do cat "$f" | jq .; done
CHECK STATUS:
cat .a5c/runs/<id>/state/state.json | jq '.status'
BREAKPOINTS:
Interactive (Claude Code): Handled in chat - no setup!
Non-Interactive: npx -y @a5c-ai/babysitter-sdk@latest breakpoints:start
Web UI (non-interactive): http://localhost:3184
LIST ALL RUNS:
ls .a5c/runs/
KEY EVENT TYPES (exactly 5):
RUN_CREATED, RUN_COMPLETED, RUN_FAILED
EFFECT_REQUESTED, EFFECT_RESOLVED
JOURNAL FORMAT:
Individual JSON files: journal/{SEQ}.{ULID}.json
Fields: type, recordedAt, data, checksum
Congratulations! You now understand how Babysitter works under the hood. This knowledge will help you use it more effectively, debug issues when they arise, and eventually create your own custom processes.
Happy orchestrating!