A modern, minimal CLI for an AI coding assistant
A conversational REPL interface that makes AI-assisted coding feel natural and transparent. Clean aesthetics, clear tool visibility, and intelligent defaults that stay out of your way.
- Transparency - Always show what tools are being called and their results
- Minimal friction - Smart defaults, auto-permissions for trusted repos
- Clean aesthetic - Thoughtful color-coding, whitespace, no visual clutter
- Progressive disclosure - Simple by default, power features when needed
- Workflow-native - Built-in commands for common workflows (git, specs, docs)
- Self-healing - When tools fail, fix them automatically
On launch, show a welcome screen with session options:
┌─────────────────────────────────────────────────────┐
│ │
│ ██████╗ ██████╗ ██████╗ ███████╗ │
│ ██╔════╝██╔═══██╗██╔══██╗██╔════╝ │
│ ██║ ██║ ██║██║ ██║█████╗ │
│ ██║ ██║ ██║██║ ██║██╔══╝ │
│ ╚██████╗╚██████╔╝██████╔╝███████╗ │
│ ╚═════╝ ╚═════╝ ╚═════╝ ╚══════╝ │
│ │
│ coding-agent v0.1.0 │
│ │
│ [n] New session │
│ [r] Resume last session │
│ └─ "Adding auth to dashboard" (2h ago) │
│ │
│ [h] Help [c] Config │
│ │
└─────────────────────────────────────────────────────┘
Startup options:
- New session - Fresh context, new conversation
- Resume last - Restore previous session from SpecStory history
  - Shows preview of last session (title/summary + time ago)
- Conversational REPL - Multi-line input support, natural back-and-forth
- Input submission - `Enter` adds newline, `Double-Enter` submits
- Slash commands - Extensible command system (see Commands section)
- Command history - Persistent across sessions, searchable
- Auto-complete - For slash commands and file paths
- Tool call visibility - Standard verbosity: action + target (e.g., "Reading src/auth/mod.rs")
- Streaming responses - Real-time output as the agent thinks/writes
- Result summaries - Compact, scannable summaries of tool results
- Syntax highlighting - For code blocks in responses
- Collapsible sections - For verbose output (file contents, long diffs)
- Multi-agent status bar - Visual progress for parallel agent tasks
- Context bar - Sliding bar at bottom showing context used vs. remaining
A persistent visual indicator at the bottom of the CLI showing context window usage:
┌─────────────────────────────────────────────────────┐
│ Context: ████████████░░░░░░░░░░░░░░░░░░ 38% used │
│ 76k / 200k tokens │
└─────────────────────────────────────────────────────┘
- Position: Bottom of screen, always visible
- Shows cumulative token usage for the session
- Color-coded: green (0-60%), yellow (60-85%), red (85%+)
- Cost displayed separately via `/cost` command (not in bar)
- Updates after each message exchange
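A minimal sketch of the bar logic described above, using only the standard library. The names `BarColor`, `bar_color`, and `render_bar` are hypothetical, and the boundary handling is an assumption: the listed ranges overlap at 60% and 85%, so this sketch treats exactly 60% as yellow and exactly 85% as red.

```rust
/// Colors for the context bar, matching the thresholds listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BarColor {
    Green,
    Yellow,
    Red,
}

/// Maps a usage percentage to a color.
/// Assumption: 60% counts as yellow and 85% counts as red.
pub fn bar_color(percent_used: f64) -> BarColor {
    if percent_used >= 85.0 {
        BarColor::Red
    } else if percent_used >= 60.0 {
        BarColor::Yellow
    } else {
        BarColor::Green
    }
}

/// Renders a fixed-width bar followed by the rounded percentage.
pub fn render_bar(used_tokens: u64, total_tokens: u64, width: usize) -> String {
    // Guard against a zero-sized context window and clamp overflow to 100%.
    let ratio = if total_tokens == 0 {
        0.0
    } else {
        (used_tokens as f64 / total_tokens as f64).min(1.0)
    };
    let filled = (ratio * width as f64).round() as usize;
    let percent = (ratio * 100.0).round() as u64;
    format!(
        "{}{} {}% used",
        "█".repeat(filled),
        "░".repeat(width - filled),
        percent
    )
}
```

Keeping both functions pure (no terminal I/O) makes them straightforward to cover with unit and snapshot tests.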
- Trusted paths (auto-yes for all operations):
  - `/Users/charliposner/coding-agent` (this repo)
  - `~/Documents/Personal/` (Obsidian vault)
- Other locations - Prompt for permission before writing/modifying
- Read-only operations - Always allowed everywhere
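The trusted-path rule above could be sketched as follows. This is an illustration under stated assumptions, not the final design: `expand_tilde` and `is_trusted` are hypothetical names, the home directory is passed in explicitly to keep the functions testable, and the caller is assumed to have already resolved symlinks and relative segments.

```rust
use std::path::{Path, PathBuf};

/// Expands a leading "~" using the given home directory.
pub fn expand_tilde(path: &str, home: &Path) -> PathBuf {
    if path == "~" {
        home.to_path_buf()
    } else if let Some(rest) = path.strip_prefix("~/") {
        home.join(rest)
    } else {
        PathBuf::from(path)
    }
}

/// Returns true if `target` is a trusted path or lies inside one.
/// `Path::starts_with` compares whole components, so "/a/bc" is not
/// treated as being inside "/a/b".
pub fn is_trusted(target: &Path, trusted: &[String], home: &Path) -> bool {
    trusted
        .iter()
        .map(|entry| expand_tilde(entry, home))
        .any(|root| target.starts_with(&root))
}
```

Component-wise matching is the reason to prefer `Path::starts_with` over a string prefix check: a sibling directory whose name merely begins with a trusted directory's name stays untrusted.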
Conversations are automatically saved using SpecStory format:
- Auto-save location: `.specstory/history/` in project root
- Format: Markdown files, human-readable
- Behavior: Saves after each exchange, no manual intervention needed
- Benefits:
- Searchable conversation history
- Can be version-controlled
- Readable by humans and machines
- Resume sessions across restarts
| Command | Description |
|---|---|
| `/help` | Show available commands |
| `/clear` | Reset display AND context - starts fresh conversation |
| `/status` | Show agent status, active tasks, running agents |
| `/config` | Open/edit configuration |
| `/history` | Browse conversation history (SpecStory) |
| `Ctrl+C` | Cancel current operation |
| `Ctrl+D` | Exit |
/clear behavior:
- Clears the terminal display
- Resets the conversation context (fresh start)
- Saves current conversation to SpecStory before clearing
- Resets context bar to 0%
| Command | Description |
|---|---|
| `/undo` | Undo last file change |
| `/diff` | Show pending/recent changes |
| `/context` | Show current context (loaded files, working directory, token usage) |
| Command | Description |
|---|---|
| `/commit` | Agent analyzes changes and commits with purpose-focused message |
| `/commit --pick` | Interactive file picker for manual selection |
/commit behavior:
- Default (no flags): Agent analyzes all changes, decides what's logically related, commits
- With `--pick`: Shows interactive file picker for user selection
- Generates 3-sentence commit message focused on purpose, not code
- Preview message, allow edit, then commit
Example commit message style:
Add user authentication flow for the dashboard
This enables users to securely log in before accessing
sensitive data. Implements JWT-based session management
with automatic token refresh.
| Command | Description |
|---|---|
| `/model [name]` | Switch AI model (show current if no arg) |
| `/cost` | Show detailed token usage and cost breakdown |
| `/spec [name]` | Create new spec file and enter planning mode |
| `/document [topic]` | Add or update an Obsidian note |
/spec behavior:
- Create `specs/<name>.md` if it doesn't exist
- Enter planning/discussion mode
- Collaborative back-and-forth to build out the spec
- Agent suggests structure, user refines
/document behavior:
- Search existing notes in `~/Documents/Personal/` for related content
- If updating: show existing note, propose changes
- If new: suggest file location and initial structure
- Write/update the markdown file
Base palette (minimal, modern):
| Element | Color | Purpose |
|---|---|---|
| User input | White/default | Clean, neutral |
| Agent response | Cyan/light blue | Distinguish from user |
| Tool calls | Yellow/amber | Attention, action happening |
| Success | Green | Completion, confirmation |
| Error | Red | Problems, failures |
| Warning | Orange | Caution, needs attention |
| Muted/secondary | Gray | Less important info |
| Cost/tokens | Magenta | Resource usage |
| Context bar | Green → Yellow → Red | Usage level indicator |
Full screen layout:
┌─────────────────────────────────────────────────────┐
│ coding-agent v0.1.0 claude-3-opus │ <- Header (version + model)
├─────────────────────────────────────────────────────┤
│ │
│ You: What files handle authentication? │
│ │
│ ● Reading src/auth/mod.rs │
│ ● Reading src/auth/jwt.rs │
│ ✓ Found 3 authentication files │
│ │
│ Agent: Found 3 files related to authentication: │
│ │
│ src/auth/mod.rs - Main auth module │
│ src/auth/jwt.rs - JWT token handling │
│ src/middleware/auth.rs - Auth middleware │
│ │
│ │
│ │ <- Conversation area (scrolls)
│ │
├─────────────────────────────────────────────────────┤
│ > _ │ <- Input area
├─────────────────────────────────────────────────────┤
│ Context: ████████░░░░░░░░░░░░░░░░░░░░ 25% │ 50k │ <- Context bar (bottom)
└─────────────────────────────────────────────────────┘
Multi-agent status bar (when active):
┌─────────────────────────────────────────────────────┐
│ AGENTS ─────────────────────────────────────────── │
│ ● search-agent Searching for auth files... ██░░ │
│ ● refactor-agent Analyzing dependencies... ███░ │
│ ○ test-agent Queued │
└─────────────────────────────────────────────────────┘
When the agent is working, show contextual "thinking" messages:
Standard thinking messages (rotate):
- "Pondering..."
- "Percolating..."
- "Cogitating..."
- "Mulling it over..."
- "Connecting dots..."
Long wait mode (>10 seconds): When operations take longer, fetch and display fun facts via API:
● Refactoring auth module...
Did you know? The first computer bug was an actual bug—
a moth found in Harvard's Mark II computer in 1947.
━━━━━━━━━━━━━━━━━━━━━━░░░░░░░░░░ 65%
Configurable: Users can disable fun facts if they prefer minimal output.
Fun facts are fetched from external APIs for variety:
Potential API sources:
- uselessfacts.jsph.pl - Random fun facts
- official-joke-api - Programming jokes
- quotable.io - Motivational quotes
- Custom curated API (future) - Programming-specific facts
Implementation:
- Cache facts locally for offline use
- Fetch new batch periodically in background
- Fallback to curated list if API unavailable
- Filter for programming/tech relevance where possible
- Spinners - Contextual spinners showing current operation
- Progress bars - For long operations, multi-agent tasks
- Subtle transitions - Smooth appearance of new content
- No excessive animation - Keep it professional (except fun facts)
When a tool fails, the system doesn't just report the error—it tries to fix it.
1. Tool fails with error
2. Analyze error type:
   - Code error? → Spawn fix-agent
   - Permission error? → Request permission or suggest fix
   - Network error? → Retry with backoff
   - Resource error? → Suggest alternatives
3. If code error:
   a. fix-agent diagnoses the issue
   b. Proposes and applies fix
   c. Writes regression test to prevent recurrence
   d. Re-runs original tool
4. Report resolution to user
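The analysis step of this flow can be sketched as a pure function from an error message to a recovery action. The names `ToolErrorKind`, `Recovery`, `categorize`, and `recovery_for` are hypothetical, the substring patterns are illustrative only, and treating unrecognized messages as code errors is an assumption rather than something the flow specifies.

```rust
/// The four error categories named in the flow above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolErrorKind {
    Code,
    Permission,
    Network,
    Resource,
}

/// The recovery action paired with each category.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Recovery {
    SpawnFixAgent,
    RequestPermission,
    RetryWithBackoff,
    SuggestAlternatives,
}

/// Classifies a raw error message by simple substring matching.
/// Assumption: anything unrecognized is handed to the fix-agent as a code error.
pub fn categorize(message: &str) -> ToolErrorKind {
    let m = message.to_lowercase();
    if m.contains("permission denied") {
        ToolErrorKind::Permission
    } else if m.contains("connection refused") || m.contains("timed out") {
        ToolErrorKind::Network
    } else if m.contains("no space left") || m.contains("out of memory") {
        ToolErrorKind::Resource
    } else {
        ToolErrorKind::Code
    }
}

/// Maps each category to the action listed in step 2.
pub fn recovery_for(kind: ToolErrorKind) -> Recovery {
    match kind {
        ToolErrorKind::Code => Recovery::SpawnFixAgent,
        ToolErrorKind::Permission => Recovery::RequestPermission,
        ToolErrorKind::Network => Recovery::RetryWithBackoff,
        ToolErrorKind::Resource => Recovery::SuggestAlternatives,
    }
}
```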
● Running build...
✗ Build failed: missing dependency 'serde_json'
→ Diagnosing issue...
→ Found: Cargo.toml missing serde_json dependency
→ Applying fix...
+ serde_json = "1.0" (added to Cargo.toml)
→ Writing test to verify fix...
→ Re-running build...
✓ Build succeeded (auto-fixed missing dependency)
| Error Type | Auto-Fix Action |
|---|---|
| Missing dependency | Add to Cargo.toml/package.json |
| Type mismatch | Suggest/apply type annotation |
| Import missing | Add import statement |
| Syntax error | Attempt correction |
| Test failure | Analyze and propose fix |
| Permission denied | Request elevation or suggest chmod |
When auto-fixing code issues, the system generates a regression test:
// Auto-generated test for fix: missing serde_json dependency
#[test]
fn test_json_serialization_available() {
    // Ensures serde_json is properly configured
    let value = serde_json::json!({"test": true});
    assert!(value.is_object());
}
- Primary: macOS
- Goal: Cross-platform (macOS, Linux, Windows)
- Use cross-platform crates, test on multiple platforms
| Layer | Crate | Purpose |
|---|---|---|
| Prompts & UI | `cliclack` | Beautiful, minimal prompts and styled output |
| Spinners/Progress | `indicatif` | Spinners, progress bars, multi-progress |
| Terminal handling | `crossterm` | Cross-platform terminal control |
| Colors/Styling | `console` | Terminal styling (used by indicatif) |
| TUI (status bar) | `ratatui` | Multi-agent status bar, context bar |
| Async runtime | `tokio` | Async operations, streaming |
| HTTP client | `reqwest` | Fun facts API calls |
| Serialization | `serde` + `toml` | Config files, history persistence |
| CLI parsing | `clap` | Argument parsing, subcommands |
| Git operations | `git2` | Native git integration for /commit |
| File watching | `notify` | Watch for external file changes |
| Token counting | `tiktoken-rs` | Token counting for context bar (OpenAI encodings; approximate for other models) |
src/
├── main.rs # Entry point
├── cli/
│ ├── mod.rs
│ ├── repl.rs # Main REPL loop, input handling
│ ├── startup.rs # Welcome screen, session selection
│ ├── input.rs # Multi-line input, double-enter detection
│ ├── output.rs # Rendering, streaming, formatting
│ ├── history.rs # SpecStory integration
│ └── commands/
│ ├── mod.rs # Command registry
│ ├── help.rs
│ ├── clear.rs # Clear + reset context
│ ├── status.rs
│ ├── config.rs
│ ├── model.rs
│ ├── cost.rs
│ ├── context.rs
│ ├── commit.rs # Git commit workflow
│ ├── spec.rs # Spec creation workflow
│ └── document.rs # Obsidian integration
├── ui/
│ ├── mod.rs
│ ├── theme.rs # Colors, styling
│ ├── spinner.rs # Tool call spinners
│ ├── thinking.rs # Thinking messages
│ ├── fun_facts.rs # API integration for fun facts
│ ├── context_bar.rs # Context usage visualization (bottom)
│ ├── status_bar.rs # Multi-agent status bar
│ └── components.rs # Reusable UI pieces
├── agents/
│ ├── mod.rs
│ ├── manager.rs # Multi-agent orchestration
│ ├── status.rs # Agent state tracking
│ └── fix_agent.rs # Self-healing error recovery
├── tools/
│ ├── mod.rs
│ ├── executor.rs # Tool execution with error handling
│ └── recovery.rs # Error recovery strategies
├── permissions/
│ ├── mod.rs
│ └── trusted.rs # Trusted path logic
├── integrations/
│ ├── mod.rs
│ ├── git.rs # Git operations
│ ├── obsidian.rs # Obsidian vault operations
│ └── specstory.rs # Conversation persistence
├── tokens/
│ ├── mod.rs
│ ├── counter.rs # Token counting
│ └── context.rs # Context window management
└── config/
├── mod.rs
└── settings.rs # User configuration
Location: ~/.config/coding-agent/config.toml
[permissions]
trusted_paths = [
"/Users/charliposner/coding-agent",
"~/Documents/Personal/",
]
auto_read = true
[model]
default = "claude-3-opus"
available = ["claude-3-opus", "claude-3-sonnet", "gpt-4", "gpt-4-turbo"]
context_window = 200000 # tokens
[theme]
style = "minimal" # minimal | colorful | monochrome
[persistence]
enabled = true
format = "specstory" # specstory | markdown | json
path = ".specstory/history/"
[behavior]
streaming = true
tool_verbosity = "standard" # minimal | standard | verbose
show_context_bar = true
fun_facts = true # Show fun facts during long waits
fun_fact_delay = 10 # Seconds before showing fun facts
[fun_facts_api]
enabled = true
sources = ["uselessfacts", "jokes", "quotes"]
cache_size = 100
refresh_interval = 3600 # seconds
[error_recovery]
auto_fix = true
generate_tests = true
max_retry_attempts = 3
[integrations.obsidian]
vault_path = "~/Documents/Personal/"
[integrations.git]
auto_stage = false
commit_style = "purpose" # purpose | conventional | simple
/commit [--pick] [--all] [--amend] [files...]
Options:
--pick, -p Interactive file picker (override agent decision)
--all, -a Stage all modified files
--amend Amend the previous commit
files... Specific files to stage
Default Flow (agent decides):
1. Agent analyzes all staged/unstaged changes
2. Groups logically related changes
3. Commits with purpose-focused message
4. May suggest splitting into multiple commits
Interactive Flow (--pick):
1. Show git status
2. Interactive file selection with checkboxes
3. Generate commit message for selected files
4. Preview, edit, commit
Example:
> /commit
Analyzing changes...
Agent recommends committing:
✓ src/auth/login.rs - Core login logic
✓ src/auth/jwt.rs - JWT handling
○ tests/auth_test.rs - (suggests separate commit)
Proposed commit message:
┌────────────────────────────────────────────────┐
│ Add JWT token validation to login flow │
│ │
│ Users can now stay logged in across sessions │
│ with secure token refresh. This improves UX │
│ by eliminating repeated login prompts. │
└────────────────────────────────────────────────┘
[Enter] Commit [e] Edit [a] Add more files [c] Cancel
/spec <name>
Creates: specs/<name>.md
Flow:
1. Create spec file with template
2. Enter planning mode
3. Back-and-forth discussion to refine
4. Agent suggests structure, user guides direction
5. Exit planning mode when satisfied
Example:
> /spec authentication
Created specs/authentication.md
Entering planning mode...
Let's design the authentication system.
What's the primary auth method you want to support?
1. Email/password
2. OAuth (Google, GitHub, etc.)
3. Magic links
4. Multiple options
/document <topic> [--new] [--search]
Options:
--new Force create new note (don't search existing)
--search Just search, don't create/edit
Flow:
1. Search vault for related notes
2. If matches found:
- Show matches
- User selects to update or create new
3. If updating: show current content, propose changes
4. If new: suggest location, create with structure
5. Write to vault
Example:
> /document rust error handling
Found related notes:
1. Programming/Rust/Basics.md (mentions error handling)
2. Programming/Rust/Result-Type.md
[1-2] Update existing [n] New note [c] Cancel
/cost
Shows detailed breakdown of session costs:
Session Cost Breakdown
──────────────────────────────────────────────
Model: claude-3-opus
Input tokens: 45,230 ($0.675)
Output tokens: 12,450 ($0.935)
──────────────────────────────────────────────
Total: 57,680 ($1.61)
Context used: 57,680 / 200,000 (29%)
This session: 45 messages over 2h 15m
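The arithmetic behind a breakdown like this is small enough to sketch. The types `Pricing` and `CostBreakdown` and the functions below are hypothetical; per-million-token prices are passed in as parameters because this spec does not fix them, and any concrete rates a caller supplies are that caller's assumption.

```rust
/// Per-million-token prices in USD, supplied by the caller.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Pricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
}

/// Totals shown by a cost breakdown.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CostBreakdown {
    pub input_usd: f64,
    pub output_usd: f64,
    pub total_usd: f64,
    pub total_tokens: u64,
}

/// Computes input, output, and total cost for one session.
pub fn session_cost(input_tokens: u64, output_tokens: u64, pricing: Pricing) -> CostBreakdown {
    let input_usd = input_tokens as f64 / 1_000_000.0 * pricing.input_per_mtok;
    let output_usd = output_tokens as f64 / 1_000_000.0 * pricing.output_per_mtok;
    CostBreakdown {
        input_usd,
        output_usd,
        total_usd: input_usd + output_usd,
        total_tokens: input_tokens + output_tokens,
    }
}

/// Context usage as a whole-number percentage, rounded to nearest.
pub fn context_percent(used_tokens: u64, window_tokens: u64) -> u64 {
    if window_tokens == 0 {
        return 0;
    }
    (used_tokens as f64 / window_tokens as f64 * 100.0).round() as u64
}
```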
Each phase has clear deliverables, testing requirements, and stopping conditions. Phases build on each other—complete each fully before moving on.
tests/
├── unit/ # Pure function tests (no I/O)
│ ├── config_test.rs
│ ├── token_counter_test.rs
│ ├── command_parser_test.rs
│ └── ...
├── integration/ # Tests with real I/O (files, git)
│ ├── git_integration_test.rs
│ ├── specstory_test.rs
│ └── ...
├── ui/ # Snapshot tests for UI output
│ ├── snapshots/
│ ├── spinner_test.rs
│ ├── context_bar_test.rs
│ └── ...
├── e2e/ # End-to-end workflow tests
│ ├── repl_session_test.rs
│ ├── commit_workflow_test.rs
│ └── ...
└── mocks/ # Shared test doubles
├── mock_api.rs
├── mock_terminal.rs
└── mock_git.rs
Test separation from agent tools: CLI tests live in tests/ and test the CLI itself. Agent tool tests (if any) would be in a separate agent/tests/ directory and test the AI agent's tool implementations. The CLI tests should never depend on actual AI responses—mock the agent interface.
Goal: Runnable binary that accepts multi-line input with double-enter submission.
Deliverables:
- Cargo project with workspace structure
- Dependencies: `crossterm`, `tokio`, `clap`
- Terminal raw mode handling (enter/exit cleanly)
- Multi-line input buffer
- Double-enter detection for submission
- `Ctrl+C` cancellation, `Ctrl+D` exit
- Basic input echo (what you type appears on screen)
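One way to model the input buffer from the deliverables above is a small state machine that is independent of the terminal library. The names `InputBuffer` and `InputEvent` are hypothetical, and stripping the newline added by the first Enter before submitting is an assumption about the desired behavior.

```rust
/// Result of feeding an Enter key press into the buffer.
#[derive(Debug, PartialEq, Eq)]
pub enum InputEvent {
    /// Keep editing; nothing to submit yet.
    Pending,
    /// Enter was pressed twice in a row; the payload is the message.
    Submit(String),
}

/// Multi-line input buffer with double-Enter submission.
#[derive(Default)]
pub struct InputBuffer {
    text: String,
    last_was_enter: bool,
}

impl InputBuffer {
    /// Appends a typed character and resets Enter tracking.
    pub fn push_char(&mut self, c: char) {
        self.text.push(c);
        self.last_was_enter = false;
    }

    /// First Enter adds a newline; a second consecutive Enter submits.
    pub fn press_enter(&mut self) -> InputEvent {
        if self.last_was_enter {
            // Drop the newline added by the first Enter before submitting.
            self.text.pop();
            self.last_was_enter = false;
            InputEvent::Submit(std::mem::take(&mut self.text))
        } else {
            self.text.push('\n');
            self.last_was_enter = true;
            InputEvent::Pending
        }
    }

    /// Ctrl+C: discard the current input.
    pub fn clear(&mut self) {
        self.text.clear();
        self.last_was_enter = false;
    }
}
```

Keeping key handling in a plain struct like this lets the unit tests run without a real terminal; the raw-mode layer only has to translate key events into these calls.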
Files to create:
src/
├── main.rs
├── cli/
│ ├── mod.rs
│ ├── input.rs # Input handling, key events
│ └── terminal.rs # Raw mode, cleanup
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_double_enter_detection` | Unit | Two consecutive enters trigger submit |
| `test_single_enter_adds_newline` | Unit | Single enter adds `\n` to buffer |
| `test_ctrl_c_clears_input` | Unit | Ctrl+C empties current input buffer |
| `test_backspace_removes_char` | Unit | Backspace works across lines |
| `test_terminal_cleanup_on_panic` | Integration | Terminal restored after panic |
| `test_unicode_input` | Unit | Handles emoji, CJK characters |
Edge cases to handle:
- Terminal resize during input
- Very long lines (horizontal scroll or wrap?)
- Paste with embedded newlines
- Non-UTF8 input (reject gracefully)
Stopping condition:
✓ cargo run starts successfully
✓ Can type multi-line input
✓ Double-enter submits and prints "You entered: {input}"
✓ Ctrl+C clears current input
✓ Ctrl+D exits cleanly
✓ All unit tests pass
✓ Terminal is always restored (even on crash)
Goal: Working REPL with slash command infrastructure and basic commands.
Deliverables:
- REPL loop (read → parse → execute → display → repeat)
- Slash command parser (`/command arg1 arg2`)
- Command registry pattern (easy to add new commands)
- `/help` - lists all commands
- `/clear` - clears screen (context reset comes later)
- `/exit` - clean exit
- Config system with `serde` + `toml`
- Config file loading from `~/.config/coding-agent/config.toml`
- Default config generation on first run
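A possible shape for the parser and registry, sketched with the standard library only. `ParsedCommand`, `Handler`, and `CommandRegistry` are hypothetical names, and the handler signature (arguments in, display text out) is an assumption made to keep the example small; a real command would likely need access to session state.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq)]
pub struct ParsedCommand {
    pub name: String,
    pub args: Vec<String>,
}

/// Parses "/command arg1 arg2". Returns None if the input is not a slash command.
pub fn parse_slash_command(input: &str) -> Option<ParsedCommand> {
    let rest = input.trim().strip_prefix('/')?;
    let mut parts = rest.split_whitespace();
    let name = parts.next()?.to_string();
    let args = parts.map(str::to_string).collect();
    Some(ParsedCommand { name, args })
}

/// Assumed handler signature: takes the arguments, returns text to display.
pub type Handler = fn(&[String]) -> String;

/// Registry keyed by command name; BTreeMap keeps /help output sorted.
#[derive(Default)]
pub struct CommandRegistry {
    commands: BTreeMap<String, (String, Handler)>,
}

impl CommandRegistry {
    pub fn register(&mut self, name: &str, description: &str, handler: Handler) {
        self.commands
            .insert(name.to_string(), (description.to_string(), handler));
    }

    /// Runs a slash command. Returns None for ordinary (non-command) input.
    pub fn dispatch(&self, input: &str) -> Option<String> {
        let cmd = parse_slash_command(input)?;
        Some(match self.commands.get(&cmd.name) {
            Some((_, handler)) => handler(&cmd.args),
            None => "Unknown command. Try /help".to_string(),
        })
    }

    /// One line per registered command, suitable for /help.
    pub fn help(&self) -> String {
        self.commands
            .iter()
            .map(|(name, (desc, _))| format!("/{} - {}", name, desc))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

Adding a command is then a single `register` call, which is the property the registry pattern is meant to provide.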
Files to create:
src/
├── cli/
│ ├── repl.rs # Main loop
│ └── commands/
│ ├── mod.rs # Command trait, registry
│ ├── help.rs
│ ├── clear.rs
│ └── exit.rs
├── config/
│ ├── mod.rs
│ └── settings.rs
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_slash_command_parsing` | Unit | `/cmd arg1 arg2` parses correctly |
| `test_command_registry_lookup` | Unit | Commands found by name |
| `test_unknown_command_error` | Unit | Unknown `/foo` gives helpful error |
| `test_config_default_generation` | Unit | Missing config creates valid default |
| `test_config_load_valid` | Unit | Valid TOML loads correctly |
| `test_config_load_invalid` | Unit | Invalid TOML gives clear error |
| `test_config_merge_partial` | Unit | Partial config merges with defaults |
| `test_help_lists_all_commands` | Integration | `/help` output contains all registered commands |
Edge cases to handle:
- Config file has unknown keys (ignore with warning, don't fail)
- Config directory doesn't exist (create it)
- Config file has wrong permissions (warn, use defaults)
- Slash command with no arguments vs with arguments
Stopping condition:
✓ REPL loop runs continuously until /exit or Ctrl+D
✓ /help shows all available commands with descriptions
✓ /clear clears the terminal
✓ Unknown commands show "Unknown command. Try /help"
✓ Config file created on first run at ~/.config/coding-agent/config.toml
✓ Config changes take effect (e.g., change a value, restart, verify)
✓ All tests pass
Goal: Beautiful, styled output with spinners and colors.
Deliverables:
- Theme system (colors defined in one place)
- Colored output for different elements (user input, agent, tools, errors)
- Spinner component with customizable messages
- Progress bar component
- Styled message boxes (for commit messages, etc.)
- Syntax highlighting for code blocks (use `syntect`)
- `/config` command to open config in `$EDITOR`
Files to create:
src/
├── ui/
│ ├── mod.rs
│ ├── theme.rs # Color definitions
│ ├── spinner.rs # Animated spinner
│ ├── progress.rs # Progress bar
│ ├── output.rs # Styled printing functions
│ └── components.rs # Message boxes, etc.
├── cli/commands/
│ └── config.rs
Dependencies to add: indicatif, console, syntect
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_theme_all_colors_defined` | Unit | No missing color definitions |
| `test_spinner_messages_cycle` | Unit | Spinner cycles through messages |
| `test_spinner_stops_cleanly` | Unit | No artifacts left after spinner stops |
| `test_progress_bar_0_to_100` | Unit | Progress bar renders at boundaries |
| `test_syntax_highlight_rust` | Unit | Rust code gets highlighted |
| `test_syntax_highlight_unknown` | Unit | Unknown language falls back gracefully |
| `test_output_no_color_mode` | Unit | Respects `NO_COLOR` env var |
| Snapshot: `spinner_states.snap` | UI | Spinner looks correct at each frame |
| Snapshot: `progress_bar.snap` | UI | Progress bar renders correctly |
| Snapshot: `code_block.snap` | UI | Syntax highlighting looks right |
Edge cases to handle:
- Terminal doesn't support colors (detect and fallback)
- `NO_COLOR` environment variable (respect it)
- Very narrow terminal (truncate gracefully)
- Spinner running when program exits (cleanup)
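The color-fallback decision from the first two edge cases can be isolated in one pure function. This is a sketch: `colors_enabled` is a hypothetical name, the environment values are passed in rather than read directly so the function stays testable, and treating `TERM=dumb` as "no color support" is an assumed heuristic rather than a requirement of this spec.

```rust
/// Decides whether styled output should be emitted.
///
/// - `no_color`: value of the NO_COLOR variable, if set
/// - `term`: value of the TERM variable, if set
/// - `is_tty`: whether stdout is attached to a terminal
pub fn colors_enabled(no_color: Option<&str>, term: Option<&str>, is_tty: bool) -> bool {
    // NO_COLOR convention: any non-empty value disables color.
    if matches!(no_color, Some(value) if !value.is_empty()) {
        return false;
    }
    // Assumed heuristic: "dumb" terminals cannot render escape sequences.
    if term == Some("dumb") {
        return false;
    }
    // Never emit escape codes into pipes or files.
    is_tty
}
```

At startup the caller would feed it `std::env::var("NO_COLOR").ok().as_deref()`, the same for `TERM`, and a terminal check such as `std::io::IsTerminal::is_terminal` on stdout (available in recent stable Rust).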
Stopping condition:
✓ Output is visually styled with colors
✓ Spinner animates smoothly during "thinking"
✓ Progress bar shows 0% to 100% correctly
✓ Code blocks have syntax highlighting
✓ /config opens config file in $EDITOR
✓ Works in terminals without color support
✓ All tests pass, snapshots approved
Goal: Conversations saved automatically, can resume on restart.
Deliverables:
- SpecStory format writer (markdown files)
- Session save after each exchange
- Session loader (parse markdown back to conversation)
- Startup screen with ASCII logo
- `[n]` New session option
- `[r]` Resume last session option
- Last session preview (title, time ago)
- `/history` command to browse past sessions
- Update `/clear` to save before clearing context
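The "time ago" part of the session preview might be formatted as below. `time_ago` and `resume_preview` are hypothetical names, and the bucket boundaries (seconds, minutes, hours, days) are an assumption; the welcome screen only shows the hour case.

```rust
use std::time::Duration;

/// Formats an elapsed duration as a short relative label.
pub fn time_ago(elapsed: Duration) -> String {
    let secs = elapsed.as_secs();
    match secs {
        0..=59 => "just now".to_string(),
        60..=3_599 => format!("{}m ago", secs / 60),
        3_600..=86_399 => format!("{}h ago", secs / 3_600),
        _ => format!("{}d ago", secs / 86_400),
    }
}

/// Builds the preview line shown under "Resume last session".
pub fn resume_preview(title: &str, elapsed: Duration) -> String {
    format!("\"{}\" ({})", title, time_ago(elapsed))
}
```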
Files to create:
src/
├── cli/
│ ├── startup.rs # Welcome screen
│ └── commands/
│ └── history.rs
├── integrations/
│ ├── mod.rs
│ └── specstory.rs # Save/load sessions
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_session_save_format` | Unit | Output matches SpecStory markdown format |
| `test_session_load_roundtrip` | Unit | Save then load gives identical conversation |
| `test_session_load_corrupted` | Unit | Corrupted file gives clear error, doesn't crash |
| `test_session_filename_format` | Unit | Files named with timestamp correctly |
| `test_startup_no_previous_session` | Integration | Shows only "New session" if no history |
| `test_startup_with_previous` | Integration | Shows resume option with preview |
| `test_history_lists_sessions` | Integration | `/history` shows all saved sessions |
| `test_clear_saves_first` | Integration | `/clear` saves before clearing |
Edge cases to handle:
- No previous sessions (don't show resume option)
- Corrupted session file (skip it, don't crash)
- Very long conversation (pagination in /history)
- Session from different version (version marker in files)
- Disk full when saving (warn but don't crash)
Stopping condition:
✓ On startup, see ASCII logo and session options
✓ Selecting "New" starts fresh conversation
✓ Selecting "Resume" loads previous conversation with full context
✓ Every exchange auto-saves to .specstory/history/
✓ /history shows list of past sessions
✓ /clear saves current session before clearing
✓ Can recover from corrupted session files
✓ All tests pass
Goal: Know exactly how much context you've used and what it costs.
Deliverables:
- Token counting with `tiktoken-rs`
- Context bar component (bottom of screen)
- Color-coded usage (green/yellow/red)
- Real-time updates after each message
- `/cost` command with detailed breakdown
- `/context` command showing loaded files, working dir
- Cost calculation based on model pricing
Files to create:
src/
├── tokens/
│ ├── mod.rs
│ ├── counter.rs # Token counting
│ └── pricing.rs # Cost calculation
├── ui/
│ └── context_bar.rs
├── cli/commands/
│ ├── cost.rs
│ └── context.rs
Dependencies to add: tiktoken-rs
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_token_count_empty` | Unit | Empty string = 0 tokens |
| `test_token_count_simple` | Unit | "Hello world" = expected tokens |
| `test_token_count_code` | Unit | Code snippet = expected tokens |
| `test_token_count_unicode` | Unit | Unicode/emoji counted correctly |
| `test_context_bar_color_green` | Unit | 0-60% = green |
| `test_context_bar_color_yellow` | Unit | 60-85% = yellow |
| `test_context_bar_color_red` | Unit | 85%+ = red |
| `test_cost_calculation_opus` | Unit | Claude opus pricing correct |
| `test_cost_calculation_sonnet` | Unit | Claude sonnet pricing correct |
| `test_context_bar_at_100_percent` | Unit | Full bar renders correctly |
| Snapshot: `context_bar_states.snap` | UI | Bar looks right at 25%, 50%, 75%, 100% |
Edge cases to handle:
- Unknown model (default to highest pricing, warn)
- Context overflow (what happens at 100%? warn user)
- Token count changes between models (show warning when switching)
- Very fast typing (debounce updates)
Stopping condition:
✓ Context bar visible at bottom of screen
✓ Bar updates after each message
✓ Colors change at 60% and 85% thresholds
✓ /cost shows accurate token counts and costs
✓ /context shows current working directory and loaded files
✓ Switching models updates context window size
✓ All tests pass, snapshots approved
Goal: Safe by default—ask before writing to untrusted locations.
Deliverables:
- Trusted paths configuration
- Path matching logic (supports `~`, globs)
- Permission check before file write/modify
- Permission prompt UI (Y/n/always/never)
- "Always" adds to trusted paths in config
- Read operations always allowed
- Permission decision caching (per-session)
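The prompt choices and the per-session cache could be modeled as below. All names (`Choice`, `Outcome`, `SessionDecisions`, `parse_choice`) are hypothetical. Two behaviors are assumptions worth confirming: an empty answer defaults to "yes" (suggested by the capital `Y` in `Y/n`), and only "always" and "never" are cached, while plain yes/no answers apply to a single operation.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Choice {
    Yes,
    No,
    Always,
    Never,
}

/// Parses the user's answer to the permission prompt.
pub fn parse_choice(input: &str) -> Option<Choice> {
    match input.trim().to_lowercase().as_str() {
        "" | "y" | "yes" => Some(Choice::Yes), // assumed default
        "n" | "no" => Some(Choice::No),
        "a" | "always" => Some(Choice::Always),
        "never" => Some(Choice::Never),
        _ => None,
    }
}

/// What the caller should do after a choice is recorded.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Outcome {
    /// Whether the pending write may proceed.
    pub allowed: bool,
    /// Whether the path should be appended to trusted_paths in the config.
    pub persist: bool,
}

/// Remembers "always"/"never" answers for the current session.
#[derive(Default)]
pub struct SessionDecisions {
    cached: HashMap<PathBuf, bool>,
}

impl SessionDecisions {
    pub fn record(&mut self, path: &Path, choice: Choice) -> Outcome {
        match choice {
            Choice::Yes => Outcome { allowed: true, persist: false },
            Choice::No => Outcome { allowed: false, persist: false },
            Choice::Always => {
                self.cached.insert(path.to_path_buf(), true);
                Outcome { allowed: true, persist: true }
            }
            Choice::Never => {
                self.cached.insert(path.to_path_buf(), false);
                Outcome { allowed: false, persist: false }
            }
        }
    }

    /// Returns a cached decision, or None if the user must be prompted.
    pub fn lookup(&self, path: &Path) -> Option<bool> {
        self.cached.get(path).copied()
    }
}
```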
Files to create:
src/
├── permissions/
│ ├── mod.rs
│ ├── trusted.rs # Path matching
│ └── prompt.rs # Permission UI
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_trusted_path_exact_match` | Unit | Exact path matches |
| `test_trusted_path_subdirectory` | Unit | Child of trusted path is trusted |
| `test_trusted_path_tilde_expansion` | Unit | `~/foo` expands correctly |
| `test_trusted_path_glob` | Unit | `~/projects/*` matches children |
| `test_untrusted_path_prompts` | Integration | Writing to untrusted triggers prompt |
| `test_trusted_path_no_prompt` | Integration | Writing to trusted doesn't prompt |
| `test_always_adds_to_config` | Integration | "Always" choice persists to config |
| `test_read_always_allowed` | Unit | Read operations never prompt |
Edge cases to handle:
- Symlinks (resolve to real path before checking)
- Relative paths (resolve to absolute)
- Path traversal attempts (`../../../etc/passwd`)
- Permission denied by OS (handle gracefully)
- Config file itself (always writable)
Stopping condition:
✓ Writing to trusted paths works without prompts
✓ Writing to untrusted paths shows permission prompt
✓ "Always" option adds path to trusted_paths in config
✓ "Never" option blocks and remembers for session
✓ Symlinks resolved correctly
✓ All tests pass
Goal: Tools that fix their own failures and write tests to prevent recurrence.
Deliverables:
- Tool executor framework
- Error categorization (code, permission, network, resource)
- Fix-agent spawning for code errors
- Diagnostic analysis (parse compiler errors)
- Auto-fix application
- Regression test generation
- Retry logic with backoff for network errors
- Tool execution spinner with status
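For the retry deliverable, a minimal exponential-backoff sketch follows. The function names and the injected `sleep` callback are hypothetical; injecting the sleep lets a unit test record delays instead of blocking. The doubling schedule with a cap is one common choice, not something this spec prescribes.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Runs `op` until it succeeds or `max_attempts` calls have failed.
/// `op` is always called at least once. `sleep` is called between attempts.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
        }
    }
}
```

In production code the `sleep` argument would wrap the runtime's timer; with `max_retry_attempts = 3` from the config, the operation runs at most three times.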
Files to create:
src/
├── tools/
│ ├── mod.rs
│ ├── executor.rs # Run tools, handle errors
│ ├── recovery.rs # Error recovery strategies
│ └── diagnostics.rs # Parse error messages
├── agents/
│ ├── mod.rs
│ └── fix_agent.rs # Self-healing agent
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_error_categorization_missing_dep` | Unit | "cannot find crate" = code error |
| `test_error_categorization_permission` | Unit | "permission denied" = permission error |
| `test_error_categorization_network` | Unit | "connection refused" = network error |
| `test_fix_missing_dependency` | Integration | Missing dep auto-added to Cargo.toml |
| `test_fix_missing_import` | Integration | Missing import auto-added |
| `test_generated_test_compiles` | Integration | Auto-generated test is valid Rust |
| `test_retry_backoff_timing` | Unit | Exponential backoff works correctly |
| `test_max_retries_exceeded` | Unit | Gives up after max attempts |
| `test_fix_agent_no_infinite_loop` | Integration | Doesn't loop forever on unfixable error |
Edge cases to handle:
- Unfixable error (give up after N attempts, report clearly)
- Fix introduces new error (detect loops, abort)
- Multiple errors at once (fix in order of dependency)
- Generated test fails (don't commit broken test)
- No write permission for fix (fall back to suggestion)
Stopping condition:
✓ Missing dependency auto-fixed with test generated
✓ Missing import auto-added
✓ Network errors retry with backoff
✓ Unfixable errors reported clearly after max attempts
✓ Fix loops detected and aborted
✓ All tests pass
Goal: Connect Claude to actual coding tools so it can read, write, and execute code.
Deliverables:
- Wire tools from `coding-agent-core` into CLI's API calls
- `read_file` tool - Read file contents
- `write_file` tool - Create/overwrite files
- `edit_file` tool - Make targeted edits to existing files
- `list_files` tool - List directory contents
- `bash` tool - Execute shell commands
- `code_search` tool - Search codebase with ripgrep patterns
- Tool result display in REPL (formatted, syntax highlighted)
- Tool call visibility (show "Reading src/main.rs..." etc.)
- Permission checks before write operations (use Phase 6 system)
Files to modify:
src/
├── cli/
│ └── repl.rs # Add tools to API calls, display tool results
├── tools/
│ ├── mod.rs # Re-export tool definitions
│ └── definitions.rs # Tool schemas for Claude API
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_read_file_tool` | Integration | Claude can read file and discuss contents |
| `test_write_file_tool` | Integration | Claude can create new file |
| `test_edit_file_tool` | Integration | Claude can modify existing file |
| `test_bash_tool` | Integration | Claude can run shell commands |
| `test_list_files_tool` | Integration | Claude can explore directory structure |
| `test_code_search_tool` | Integration | Claude can search for patterns |
| `test_tool_permission_denied` | Integration | Write to untrusted path prompts user |
| `test_tool_result_display` | Unit | Tool results formatted correctly |
| `test_multi_tool_conversation` | Integration | Claude uses multiple tools in sequence |
Edge cases to handle:
- Large file (truncate with message, or refuse)
- Binary file (detect and refuse to read as text)
- File doesn't exist (clear error message)
- Permission denied (use permission system)
- Tool timeout (bash commands that hang)
- Dangerous commands (rm -rf, etc.) - warn or block
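For the binary-file edge case, one common heuristic is to sample the start of the file and look for NUL bytes or invalid UTF-8. The sketch below assumes that heuristic; `looks_binary` and the 8192-byte sample size are hypothetical choices, and a file in a non-UTF-8 text encoding would be classified as binary under this rule.

```rust
/// Heuristically decides whether file contents should be refused as text.
pub fn looks_binary(bytes: &[u8]) -> bool {
    // Assumed sample size; only the beginning of the file is inspected.
    const SAMPLE_LEN: usize = 8192;
    let sample = &bytes[..bytes.len().min(SAMPLE_LEN)];

    // NUL bytes essentially never appear in source or prose files.
    if sample.contains(&0) {
        return true;
    }

    match std::str::from_utf8(sample) {
        Ok(_) => false,
        Err(e) => {
            // `error_len() == None` means the input ended in the middle of a
            // character. That is expected if our truncation split a
            // multi-byte sequence, so it does not count as binary.
            let cut_mid_char = e.error_len().is_none() && sample.len() < bytes.len();
            !cut_mid_char
        }
    }
}
```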
Stopping condition:
✓ "Read src/main.rs and summarize it" → Claude reads and discusses file
✓ "Create a hello.txt file with 'Hello World'" → File created
✓ "Run cargo test" → Tests execute, output shown
✓ "Find all TODO comments in the codebase" → Search results displayed
✓ Tool calls shown with status (● Reading... ✓ Read 150 lines)
✓ Write operations respect permission system
✓ All tests pass
Goal: Smart commits with purpose-focused messages.
Deliverables:
- Git status reading with `git2`
- `/commit` command (agent decides what to commit)
- `/commit --pick` (interactive file picker)
- Smart file grouping (logically related changes)
- 3-sentence purpose-focused commit message generation
- Commit message preview and edit
- `/diff` command
- `/undo` command (revert last commit or file change)
Files to create:
src/
├── integrations/
│ └── git.rs
├── cli/commands/
│ ├── commit.rs
│ ├── diff.rs
│ └── undo.rs
Dependencies to add: git2
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_git_status_clean` | Integration | Clean repo returns empty changes |
| `test_git_status_modified` | Integration | Modified files detected |
| `test_git_status_untracked` | Integration | Untracked files detected |
| `test_file_grouping_same_dir` | Unit | Files in same dir grouped |
| `test_file_grouping_related` | Unit | test + impl grouped together |
| `test_commit_message_format` | Unit | Message has title + 2 body sentences |
| `test_commit_creates_commit` | Integration | Git log shows new commit |
| `test_undo_reverts_commit` | Integration | Undo removes last commit |
| `test_diff_shows_changes` | Integration | /diff shows staged + unstaged |
| `test_pick_mode_ui` | Integration | Checkboxes work correctly |
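A check like `test_commit_message_format` needs some definition of "title + 2 body sentences". One possible reading is sketched below; `has_purpose_format` is a hypothetical helper, and the sentence splitting is deliberately naive (it would miscount abbreviations such as "e.g." or version numbers).

```rust
/// Returns true when `message` has a single-line title, a blank separator
/// line, and a body made of exactly two sentences.
/// Hypothetical helper: the sentence split is a naive punctuation count.
pub fn has_purpose_format(message: &str) -> bool {
    let mut parts = message.splitn(2, "\n\n");
    let title = parts.next().unwrap_or("").trim();
    let body = parts.next().unwrap_or("").trim();
    if title.is_empty() || title.contains('\n') {
        return false;
    }
    // Count non-empty chunks terminated by '.', '!' or '?'.
    let sentences = body
        .split_terminator(|c: char| c == '.' || c == '!' || c == '?')
        .filter(|s| !s.trim().is_empty())
        .count();
    sentences == 2
}
```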
Edge cases to handle:
- Not in a git repo (helpful error message)
- Merge conflict in progress (detect and warn)
- Detached HEAD (warn before committing)
- No changes to commit (don't create empty commit)
- Commit hook fails (report error, don't retry)
- Binary files (warn, skip from message analysis)
Stopping condition:
✓ /commit analyzes changes and commits with good message
✓ /commit --pick shows interactive file picker
✓ Commit message follows 3-sentence purpose format
✓ /diff shows current changes
✓ /undo reverts last change
✓ Handles edge cases (no repo, no changes, conflicts)
✓ All tests pass
Goal: Full workflow support for specs, docs, and model switching.
Deliverables:
- `/spec <name>` - create spec file, enter planning mode
- Planning mode (different prompt behavior)
- Spec file template generation
- `/document <topic>` - Obsidian integration
- Vault search functionality
- Note creation and update
- `/model [name]` - switch AI model
- Model availability validation
Files to create:
src/
├── integrations/
│ └── obsidian.rs
├── cli/commands/
│ ├── spec.rs
│ ├── document.rs
│ └── model.rs
├── cli/
│ └── modes.rs # Normal vs planning mode
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_spec_creates_file` | Integration | /spec foo creates specs/foo.md |
| `test_spec_template_valid` | Unit | Generated template has required sections |
| `test_spec_existing_file` | Integration | Opening existing spec loads content |
| `test_document_search` | Integration | Finds notes by content |
| `test_document_create` | Integration | Creates note in correct location |
| `test_document_update` | Integration | Updates existing note |
| `test_model_switch_valid` | Unit | Switching to valid model works |
| `test_model_switch_invalid` | Unit | Invalid model shows error with suggestions |
| `test_planning_mode_indicator` | UI | Planning mode shows visual indicator |
Edge cases to handle:
- Spec file already exists (load it, don't overwrite)
- Obsidian vault not configured (helpful setup message)
- Vault doesn't exist (create? or error?)
- Multiple matching notes (show picker)
- Model not available (suggest alternatives)
- Planning mode with no spec file (error)
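The "load it, don't overwrite" rule for existing spec files can be enforced by the filesystem itself rather than by a separate existence check. The sketch below assumes a hypothetical `open_or_create_spec` function and `SpecFile` type; only the standard library's `OpenOptions::create_new` behaviour is relied on.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Whether the spec was freshly created or already on disk (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum SpecFile {
    Created(String),
    Loaded(String),
}

/// Creates `path` from `template` unless it already exists, in which case
/// the existing content is returned untouched.
pub fn open_or_create_spec(path: &Path, template: &str) -> io::Result<SpecFile> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    // `create_new` fails with AlreadyExists instead of truncating, so an
    // existing spec cannot be clobbered between a check and a write.
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            file.write_all(template.as_bytes())?;
            Ok(SpecFile::Created(template.to_string()))
        }
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {
            Ok(SpecFile::Loaded(fs::read_to_string(path)?))
        }
        Err(e) => Err(e),
    }
}
```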
Stopping condition:
✓ /spec creates file and enters planning mode
✓ Planning mode shows visual indicator
✓ /document searches and finds related notes
✓ /document creates new notes with proper structure
✓ /model switches model and updates context window
✓ All tests pass
Goal: Delightful experience with entertainment during waits.
Deliverables:
- Fun facts API integration (`reqwest`)
- Fact caching (local storage)
- Fallback to curated list
- Thinking messages rotation
- Long-wait detection (>10s)
- Fun fact display during long waits
- Configurable enable/disable
- `/status` command (active tasks, running agents)
Files to create:
src/
├── ui/
│ ├── thinking.rs
│ └── fun_facts.rs
├── cli/commands/
│ └── status.rs
Dependencies to add: reqwest
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_api_fetch_success` | Integration | API returns valid fact (mocked) |
| `test_api_fetch_timeout` | Integration | Timeout falls back to cache |
| `test_api_fetch_error` | Integration | Error falls back to cache |
| `test_cache_save_load` | Unit | Cache persists and loads |
| `test_cache_expiry` | Unit | Old cache items refreshed |
| `test_thinking_messages_rotate` | Unit | Messages cycle, don't repeat consecutively |
| `test_long_wait_threshold` | Unit | Fun fact shown only after 10s |
| `test_fun_facts_disabled` | Unit | No facts when config disabled |
| `test_status_shows_tasks` | Integration | /status lists active operations |
Edge cases to handle:
- No internet (use cached facts)
- Cache empty and no internet (use hardcoded fallback)
- API rate limited (respect limits, use cache)
- Very fast operation (don't flash fact then hide it)
- User disabled fun facts (respect config)
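Taken together, these edge cases describe a fallback chain: API first, then cache, then a hardcoded list, with the config switch overriding everything. A minimal sketch follows, assuming a hypothetical `pick_fact` function that receives the network fetch as a closure so the chain can be exercised without HTTP. The fallback fact and all names are illustrative.

```rust
/// Where a fact came from (hypothetical type, handy for tests).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum FactSource {
    Api,
    Cache,
    Hardcoded,
}

/// Stand-in for the curated fallback list.
const FALLBACK_FACTS: &[&str] = &["Rust 1.0 was released in May 2015."];

/// Tries the API, then the cache, then the hardcoded list.
/// Returns `None` only when fun facts are disabled in config.
pub fn pick_fact<F>(enabled: bool, fetch: F, cache: &[String]) -> Option<(String, FactSource)>
where
    F: FnOnce() -> Result<String, String>,
{
    if !enabled {
        return None;
    }
    if let Ok(fact) = fetch() {
        return Some((fact, FactSource::Api));
    }
    if let Some(fact) = cache.first() {
        return Some((fact.clone(), FactSource::Cache));
    }
    Some((FALLBACK_FACTS[0].to_string(), FactSource::Hardcoded))
}
```

Timeouts and rate limits both surface as `Err` from the closure, so they take the same cache path as a missing connection.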
Stopping condition:
✓ Thinking messages rotate during waits
✓ Fun facts appear after 10+ seconds
✓ Works offline with cached facts
✓ Facts disabled via config works
✓ /status shows current operations
✓ All tests pass
Goal: Visual orchestration of parallel agent tasks.
Deliverables:
- Agent manager (spawn, track, cancel agents)
- Agent state machine (queued → running → complete/failed)
- Multi-agent status bar UI
- Progress tracking per agent
- Agent result aggregation
- Cancel individual agents
- Parallel execution with `tokio`
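The agent state machine (queued → running → complete/failed) could be encoded so that illegal transitions are rejected instead of silently applied. The sketch below is one possible shape, not the project's `AgentManager`: the `Cancelled` state is an assumption added to model "cancel individual agents", and all type and method names are hypothetical.

```rust
/// Lifecycle states; `Cancelled` is an assumed extra terminal state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentState {
    Queued,
    Running,
    Complete,
    Failed,
    Cancelled,
}

impl AgentState {
    /// Terminal states accept no further transitions.
    pub fn is_terminal(self) -> bool {
        matches!(self, AgentState::Complete | AgentState::Failed | AgentState::Cancelled)
    }

    /// Allowed edges of the state machine.
    pub fn can_transition_to(self, next: AgentState) -> bool {
        use AgentState::*;
        matches!(
            (self, next),
            (Queued, Running)
                | (Queued, Cancelled)
                | (Running, Complete)
                | (Running, Failed)
                | (Running, Cancelled)
        )
    }
}

pub struct Agent {
    pub name: String,
    state: AgentState,
}

impl Agent {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), state: AgentState::Queued }
    }

    pub fn state(&self) -> AgentState {
        self.state
    }

    /// Applies a transition, or reports why it is not allowed.
    pub fn transition(&mut self, next: AgentState) -> Result<(), String> {
        if self.state.can_transition_to(next) {
            self.state = next;
            Ok(())
        } else {
            Err(format!("{}: cannot go from {:?} to {:?}", self.name, self.state, next))
        }
    }
}
```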
Files to create:
src/
├── agents/
│ ├── manager.rs
│ └── status.rs
├── ui/
│ └── status_bar.rs
Tests:
| Test | Type | What it verifies |
|---|---|---|
| `test_agent_lifecycle` | Unit | queued → running → complete |
| `test_agent_cancellation` | Unit | Cancel stops agent, updates status |
| `test_parallel_execution` | Integration | Multiple agents run concurrently |
| `test_status_bar_single_agent` | UI | One agent shows correctly |
| `test_status_bar_multiple_agents` | UI | Multiple agents stacked correctly |
| `test_status_bar_progress` | UI | Progress bar updates |
| `test_agent_failure_handling` | Unit | Failed agent shows error state |
| `test_result_aggregation` | Unit | Results from multiple agents combined |
| Snapshot: `status_bar_states.snap` | UI | Various states render correctly |
Edge cases to handle:
- Agent crashes (mark failed, don't crash CLI)
- All agents cancelled (clean state)
- Too many agents to display (pagination/scrolling)
- Agent stuck (timeout detection)
- Memory pressure (limit concurrent agents)
Stopping condition:
✓ Status bar shows all running agents
✓ Progress updates in real-time
✓ Agents run in parallel
✓ Can cancel individual agents
✓ Failed agents show error state
✓ Status bar handles many agents gracefully
✓ All tests pass, snapshots approved
Goal: Full integration testing that validates that the entire CLI works correctly: graphics, LLM interactions, tool calls, commands, and user workflows.
Testing Strategy Overview:
This phase establishes a comprehensive test suite using multiple testing approaches:
- Interactive PTY Testing - Simulate real terminal sessions
- Snapshot Testing - Capture and compare terminal output
- Mock LLM Server - Test API interactions without costs
- Workflow Integration - End-to-end user journey tests
Deliverables:
- PTY test harness using `expectrl` or `rexpect`
- Mock Claude API server for deterministic testing
- Terminal output snapshot tests with `insta` + `term-transcript`
- Full workflow integration tests
- Visual regression tests for UI components
- Performance benchmarks for startup and response times
- CI/CD integration with test matrix
Dependencies to add:
[dev-dependencies]
expectrl = "0.7" # PTY-based interactive testing
insta = "1.34" # Snapshot testing
term-transcript = "0.3" # Terminal output capture with ANSI
assert_cmd = "2.0" # CLI subprocess testing
predicates = "3.0" # Output assertions
wiremock = "0.6" # Mock HTTP server for API
tokio-test = "0.4" # Async test utilities
Files to create:
tests/
├── e2e/
│ ├── mod.rs
│ ├── harness.rs # Test harness setup (mock server, PTY)
│ ├── mock_claude.rs # Mock Claude API responses
│ ├── pty_helpers.rs # PTY interaction utilities
│ │
│ ├── startup_test.rs # Startup screen tests
│ ├── input_test.rs # Multi-line input, double-enter
│ ├── commands_test.rs # All slash commands
│ ├── conversation_test.rs # Multi-turn LLM interactions
│ ├── tools_test.rs # Tool execution flows
│ ├── session_test.rs # Save/resume workflows
│ ├── context_bar_test.rs # Token tracking display
│ ├── error_recovery_test.rs # Self-healing flows
│ └── full_workflow_test.rs # Complete user journeys
│
├── snapshots/ # insta snapshot files
│ ├── startup_screen.snap
│ ├── help_output.snap
│ ├── context_bar_states.snap
│ ├── tool_execution.snap
│ └── error_messages.snap
│
└── fixtures/
├── mock_responses/ # Canned Claude API responses
│ ├── simple_response.json
│ ├── tool_call_response.json
│ ├── multi_turn_conversation.json
│ └── streaming_chunks.json
└── test_projects/ # Minimal test codebases
└── rust_project/
├── Cargo.toml
└── src/main.rs
Purpose: Test interactive terminal behavior in a real pseudo-terminal.
Implementation:
// tests/e2e/harness.rs
use expectrl::Session; // `Eof` was imported but never used
use std::time::Duration;
pub struct CliTestSession {
session: Session,
}
impl CliTestSession {
pub fn spawn() -> Result<Self, Box<dyn std::error::Error>> {
let session = expectrl::spawn("cargo run -p coding-agent-cli")?;
Ok(Self { session })
}
pub fn expect_startup_screen(&mut self) -> Result<(), Box<dyn std::error::Error>> {
// Wait for ASCII logo
self.session.expect("CODE")?;
self.session.expect("[n] New session")?;
self.session.expect("[r] Resume last session")?;
Ok(())
}
pub fn select_new_session(&mut self) -> Result<(), Box<dyn std::error::Error>> {
self.session.send_line("n")?;
self.session.expect(">")?; // Wait for prompt
Ok(())
}
pub fn send_message(&mut self, msg: &str) -> Result<(), Box<dyn std::error::Error>> {
self.session.send_line(msg)?;
self.session.send_line("")?; // Double-enter to submit
Ok(())
}
pub fn expect_response(&mut self, timeout: Duration) -> Result<String, Box<dyn std::error::Error>> {
self.session.set_expect_timeout(Some(timeout));
let output = self.session.expect(">")?; // Wait for next prompt
Ok(String::from_utf8_lossy(output.as_bytes()).to_string())
}
pub fn run_command(&mut self, cmd: &str) -> Result<String, Box<dyn std::error::Error>> {
self.session.send_line(cmd)?;
self.session.send_line("")?;
self.expect_response(Duration::from_secs(5))
}
}
Tests:
| Test | What it verifies |
|---|---|
| `test_pty_startup_displays_logo` | ASCII art renders correctly in PTY |
| `test_pty_double_enter_submits` | Input only submits on double-enter |
| `test_pty_ctrl_c_cancels` | Ctrl+C clears current input |
| `test_pty_ctrl_d_exits` | Ctrl+D exits cleanly |
| `test_pty_terminal_resize` | UI reflows on resize |
| `test_pty_raw_mode_cleanup` | Terminal restored after exit/crash |
| `test_pty_unicode_rendering` | Emoji and CJK display correctly |
| `test_pty_long_output_scrolls` | Long responses scroll properly |
Purpose: Test LLM interactions without hitting real API (cost-free, deterministic).
Implementation:
// tests/e2e/mock_claude.rs
use wiremock::{MockServer, Mock, ResponseTemplate};
use wiremock::matchers::{method, path, header};
pub struct MockClaudeServer {
server: MockServer,
}
impl MockClaudeServer {
pub async fn start() -> Self {
let server = MockServer::start().await;
Self { server }
}
pub fn url(&self) -> String {
self.server.uri()
}
pub async fn mock_simple_response(&self, content: &str) {
Mock::given(method("POST"))
.and(path("/v1/messages"))
.respond_with(ResponseTemplate::new(200).set_body_json(serde_json::json!({
"id": "msg_test123",
"type": "message",
"role": "assistant",
"content": [{"type": "text", "text": content}],
"model": "claude-3-opus-20240229",
"stop_reason": "end_turn",
"usage": {"input_tokens": 10, "output_tokens": 20}
})))
.mount(&self.server)
.await;
}
pub async fn mock_tool_call(&self, tool_name: &str, tool_input: serde_json::Value) {
Mock::given(method("POST"))
.and(path("/v1/messages"))
.respond_with(ResponseTemplate::new(200).set_body_json(serde_json::json!({
"id": "msg_test456",
"type": "message",
"role": "assistant",
"content": [{
"type": "tool_use",
"id": "tool_call_123",
"name": tool_name,
"input": tool_input
}],
"model": "claude-3-opus-20240229",
"stop_reason": "tool_use",
"usage": {"input_tokens": 15, "output_tokens": 30}
})))
.mount(&self.server)
.await;
}
pub async fn mock_streaming_response(&self, chunks: Vec<&str>) {
// SSE streaming format
let body = chunks.iter()
.map(|chunk| format!("data: {{\"type\":\"content_block_delta\",\"delta\":{{\"text\":\"{}\"}}}}\n\n", chunk))
.collect::<String>();
Mock::given(method("POST"))
.and(path("/v1/messages"))
.and(header("accept", "text/event-stream"))
.respond_with(ResponseTemplate::new(200)
.set_body_string(body)
.insert_header("content-type", "text/event-stream"))
.mount(&self.server)
.await;
}
pub async fn mock_rate_limit(&self) {
Mock::given(method("POST"))
.and(path("/v1/messages"))
.respond_with(ResponseTemplate::new(429)
.set_body_json(serde_json::json!({
"type": "error",
"error": {"type": "rate_limit_error", "message": "Rate limited"}
})))
.mount(&self.server)
.await;
}
pub async fn mock_network_error(&self) {
Mock::given(method("POST"))
.and(path("/v1/messages"))
.respond_with(ResponseTemplate::new(500))
.mount(&self.server)
.await;
}
}
Tests:
| Test | What it verifies |
|---|---|
| `test_api_simple_conversation` | Basic request/response flow |
| `test_api_multi_turn_context` | Conversation history sent correctly |
| `test_api_tool_call_execution` | Tool calls parsed and executed |
| `test_api_tool_result_sent` | Tool results returned to Claude |
| `test_api_streaming_display` | Streaming responses render in real-time |
| `test_api_rate_limit_retry` | 429 triggers exponential backoff |
| `test_api_network_error_recovery` | Network errors handled gracefully |
| `test_api_token_counting_accurate` | Usage tracked matches API response |
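`test_api_rate_limit_retry` expects a 429 to trigger exponential backoff. The spec does not give base delay, cap, or retry count, so the sketch below treats all three as parameters. `backoff_delay` and `retry_on_429` are hypothetical names, and the sleep is injected as a closure so a test can record delays instead of waiting.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Calls `send` until it returns a status other than 429 or the retry budget
/// is spent. Returns the last status seen.
pub fn retry_on_429<S, W>(max_retries: u32, mut send: S, mut sleep: W) -> u16
where
    S: FnMut() -> u16,
    W: FnMut(Duration),
{
    let mut attempt = 0;
    loop {
        let status = send();
        if status != 429 || attempt >= max_retries {
            return status;
        }
        // Assumed defaults: 500 ms base delay, 30 s cap.
        sleep(backoff_delay(attempt, Duration::from_millis(500), Duration::from_secs(30)));
        attempt += 1;
    }
}
```

Against the mock server, `send` would issue the request and `sleep` would await a timer; in a unit test both can be plain closures.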
Purpose: Capture terminal output including ANSI colors for visual regression testing.
Implementation:
// tests/e2e/snapshots.rs
use term_transcript::{Transcript, UserInput, ShellOptions};
use insta::assert_snapshot;
#[test]
fn test_startup_screen_snapshot() {
let transcript = Transcript::from_inputs(
ShellOptions::default().with_cargo_path(),
vec![UserInput::command("cargo run -p coding-agent-cli")],
).unwrap();
// Capture terminal output as text
let output = transcript.to_string();
insta::assert_snapshot!("startup_screen", output);
}
#[test]
fn test_help_command_snapshot() {
let transcript = Transcript::from_inputs(
ShellOptions::default(),
vec![
UserInput::command("cargo run -p coding-agent-cli"),
UserInput::command("n"), // New session
UserInput::command("/help"),
UserInput::command(""), // Double-enter
],
).unwrap();
insta::assert_snapshot!("help_output", transcript.to_string());
}
#[test]
fn test_context_bar_colors() {
// Test at 25%, 60%, 85% thresholds
// Verify green → yellow → red transitions
}
#[test]
fn test_error_message_formatting() {
// Verify error messages are red, properly formatted
}
Snapshot files generated:
- `startup_screen.snap` - ASCII logo and session options
- `help_output.snap` - All available commands
- `context_bar_25.snap` - Green bar at low usage
- `context_bar_70.snap` - Yellow bar at medium usage
- `context_bar_90.snap` - Red bar at high usage
- `tool_execution.snap` - "● Reading..." / "✓ Read 150 lines"
- `error_permission.snap` - Permission denied error
- `error_unknown_command.snap` - Unknown command suggestion
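The context-bar snapshots imply a mapping from usage percentage to colour. The exact boundaries are not stated, so the sketch below assumes yellow starts at 60% and red at 85%. That choice matches both the 25/60/85 test points and the 25/70/90 snapshot names, but it remains an assumption, as are the `BarColor` and `bar_color` names.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum BarColor {
    Green,
    Yellow,
    Red,
}

/// Maps token usage to a colour. Boundaries (60% and 85%) are assumed.
/// A zero limit is treated as fully used so the bar never divides by zero.
pub fn bar_color(used: u64, limit: u64) -> BarColor {
    let percent = if limit == 0 {
        100
    } else {
        used.saturating_mul(100) / limit
    };
    match percent {
        0..=59 => BarColor::Green,
        60..=84 => BarColor::Yellow,
        _ => BarColor::Red,
    }
}
```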
Purpose: Verify all slash commands work correctly.
Tests:
| Test | Command | What it verifies |
|---|---|---|
| `test_cmd_help` | `/help` | Lists all commands with descriptions |
| `test_cmd_clear` | `/clear` | Clears screen, resets context, saves session |
| `test_cmd_exit` | `/exit` | Exits cleanly |
| `test_cmd_quit_alias` | `/quit`, `/q` | Aliases work |
| `test_cmd_config` | `/config` | Opens editor with config file |
| `test_cmd_history` | `/history` | Lists past sessions |
| `test_cmd_cost` | `/cost` | Shows token breakdown and costs |
| `test_cmd_context` | `/context` | Shows loaded files and token usage |
| `test_cmd_commit` | `/commit` | Analyzes changes, generates message |
| `test_cmd_commit_pick` | `/commit --pick` | Shows file picker |
| `test_cmd_diff` | `/diff` | Shows staged/unstaged changes |
| `test_cmd_undo` | `/undo` | Reverts last change |
| `test_cmd_spec` | `/spec auth` | Creates spec file, enters planning mode |
| `test_cmd_document` | `/document topic` | Searches/creates Obsidian notes |
| `test_cmd_model` | `/model sonnet` | Switches model |
| `test_cmd_status` | `/status` | Shows active tasks/agents |
| `test_cmd_unknown` | `/foobar` | Shows helpful error message |
| `test_cmd_bare_slash` | `/` | Shows error, not sent as message |
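The last two rows (`/foobar` and a bare `/`) both depend on classifying input before anything is sent to the model. A minimal sketch of that step is shown below. `Input`, `parse_input`, and the error wording are hypothetical; the command list is taken from the table above.

```rust
/// Classification of one submitted line (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Input<'a> {
    Command { name: &'a str, args: Vec<&'a str> },
    Message(&'a str),
    Invalid(String),
}

/// Commands listed in the table above, including the quit aliases.
const KNOWN: &[&str] = &[
    "help", "clear", "exit", "quit", "q", "config", "history", "cost", "context",
    "commit", "diff", "undo", "spec", "document", "model", "status",
];

/// Anything starting with '/' is handled locally and never forwarded as a chat message.
pub fn parse_input(line: &str) -> Input<'_> {
    let trimmed = line.trim();
    let Some(rest) = trimmed.strip_prefix('/') else {
        return Input::Message(trimmed);
    };
    let mut words = rest.split_whitespace();
    match words.next() {
        None => Input::Invalid("Empty command. Type /help for a list.".to_string()),
        Some(name) if KNOWN.contains(&name) => Input::Command { name, args: words.collect() },
        Some(name) => Input::Invalid(format!("Unknown command: /{name}. Type /help for a list.")),
    }
}
```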
Purpose: Test complete user journeys from start to finish.
Test scenarios:
#[tokio::test]
async fn test_workflow_new_session_conversation() {
// 1. Start CLI
// 2. Select new session
// 3. Send message, get response
// 4. Send follow-up, verify context retained
// 5. Check /cost shows token usage
// 6. Exit with /exit
// 7. Verify session saved to .specstory/
}
#[tokio::test]
async fn test_workflow_resume_session() {
// 1. Create a session with conversation
// 2. Exit
// 3. Restart CLI
// 4. Select resume
// 5. Verify previous context loaded
// 6. Continue conversation
}
#[tokio::test]
async fn test_workflow_tool_execution() {
// 1. Ask Claude to read a file
// 2. Verify tool call displayed ("● Reading...")
// 3. Verify tool result displayed ("✓ Read 150 lines")
// 4. Verify Claude summarizes file contents
}
#[tokio::test]
async fn test_workflow_self_healing() {
// 1. Run a command that fails (missing dependency)
// 2. Verify error categorized correctly
// 3. Verify fix-agent spawned
// 4. Verify fix applied
// 5. Verify regression test generated
// 6. Verify retry succeeds
}
#[tokio::test]
async fn test_workflow_git_commit() {
// Setup: Create test repo with changes
// 1. Run /commit
// 2. Verify agent analyzes changes
// 3. Verify commit message generated
// 4. Verify preview shown
// 5. Confirm commit
// 6. Verify git log shows commit
}
#[tokio::test]
async fn test_workflow_permission_prompt() {
// 1. Ask Claude to write to untrusted path
// 2. Verify permission prompt appears
// 3. Select "Always"
// 4. Verify added to config
// 5. Verify subsequent writes don't prompt
}
#[tokio::test]
async fn test_workflow_long_wait_fun_facts() {
// 1. Mock slow API response (>10s)
// 2. Verify thinking messages rotate
// 3. Verify fun fact appears after threshold
// 4. Verify response eventually displays
}
Purpose: Ensure CLI is responsive and doesn't regress.
Benchmarks:
| Metric | Target | Test |
|---|---|---|
| Startup time | <500ms | bench_startup_to_prompt |
| Input latency | <16ms | bench_keypress_to_display |
| Token counting | <10ms | bench_token_count_10k |
| Context bar update | <5ms | bench_context_bar_render |
| Session save | <100ms | bench_session_save_large |
| Session load | <200ms | bench_session_load_large |
Implementation:
use criterion::{criterion_group, criterion_main, Criterion};
fn bench_startup(c: &mut Criterion) {
c.bench_function("startup_to_prompt", |b| {
b.iter(|| {
let mut session = CliTestSession::spawn().unwrap();
session.expect_startup_screen().unwrap();
});
});
}
fn bench_token_counting(c: &mut Criterion) {
let large_text = "word ".repeat(10_000); // roughly 10k tokens, assuming about one token per repeated word
c.bench_function("token_count_10k", |b| {
b.iter(|| {
count_tokens(&large_text)
});
});
}
criterion_group!(benches, bench_startup, bench_token_counting);
criterion_main!(benches);
GitHub Actions workflow:
# .github/workflows/e2e.yml
name: E2E Tests
on: [push, pull_request]
jobs:
  e2e:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust
        uses: dtolnay/rust-toolchain@stable
      - name: Cache cargo
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
      - name: Run E2E tests
        run: cargo test --test e2e -- --test-threads=1
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY_TEST }}
          # Or use mock server:
          ANTHROPIC_BASE_URL: http://localhost:8080
      - name: Run snapshot tests
        # --check fails on mismatch without prompting (--review is interactive); requires cargo-insta on the runner
        run: cargo insta test --check
      - name: Upload snapshots on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: failed-snapshots-${{ matrix.os }}
          path: tests/snapshots/*.snap.new
  performance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - name: Run benchmarks
        run: cargo bench -- --noplot
      - name: Check for regressions
        run: |
          # Compare against baseline
          cargo bench -- --baseline main --noplot
Tests Summary:
| Category | Test Count | Tools Used |
|---|---|---|
| PTY Interactive | 8 | expectrl |
| Mock API | 8 | wiremock |
| Terminal Snapshots | 10+ | insta, term-transcript |
| Command Tests | 18 | assert_cmd |
| Workflow Integration | 7 | Combined |
| Performance | 6 | criterion |
| Total | ~57 | |
Edge cases to handle:
- CI environment has no TTY (use PTY emulation)
- Mock server port conflicts (use dynamic ports)
- Snapshot tests on different terminal widths (normalize output)
- Flaky tests from timing issues (add retries, increase timeouts)
- Platform-specific behavior (test matrix for macOS/Linux/Windows)
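For the port-conflict case, asking the OS for port 0 yields a free ephemeral port, which avoids a fixed address such as `localhost:8080` colliding between parallel test runs. wiremock's `MockServer::start()` already chooses a random free port, so this applies mainly to any hand-rolled helper servers. The function names below are hypothetical.

```rust
use std::net::{SocketAddr, TcpListener};

/// Binds to an OS-assigned free port on loopback and reports the address.
/// The listener must stay alive for as long as the port should remain reserved.
pub fn bind_ephemeral() -> std::io::Result<(TcpListener, SocketAddr)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    Ok((listener, addr))
}

/// Base URL to hand to the CLI under test, for example via `ANTHROPIC_BASE_URL`.
pub fn base_url(addr: SocketAddr) -> String {
    format!("http://{addr}")
}
```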
Stopping condition:
✓ All PTY tests pass on all platforms
✓ Mock server tests cover happy path and error cases
✓ Snapshots reviewed and approved
✓ All commands tested
✓ Full workflows pass end-to-end
✓ Performance within targets
✓ CI pipeline green on all platforms
✓ No flaky tests
Goal: Wire up all implemented-but-unused modules to eliminate compiler warnings and activate advanced features.
Context: Phases 1-13 implemented all the code for the CLI specification. However, many advanced features were built as complete modules but not yet integrated into the main execution flow. This phase systematically wires everything together.
Deliverables:
- AgentManager Integration
  - Create `AgentManager` instance in `Repl::new()`
  - Store as `Arc<AgentManager>` field in `Repl` struct
  - Pass to `CommandContext` so all commands can access it
  - Update `/status` command to query and display agent status
- Permission System Activation
  - Wire `PermissionPrompt` into `PermissionChecker`
  - When write operation hits untrusted path, show interactive prompt
  - Handle "Always" responses by updating config file
  - Handle "Never" responses by caching in session permissions
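The difference between "Always" (persisted to config) and "Never" (cached for the session only) can be sketched as a small state object. This is an illustration, not the real `PermissionChecker`: `PermissionState`, `Decision`, `PromptAnswer`, and the extra `Once` answer are assumptions.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Answers from the interactive prompt; `Once` is an assumed third option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PromptAnswer {
    Once,
    Always,
    Never,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
    Ask,
}

#[derive(Default)]
pub struct PermissionState {
    /// Directories trusted in the config file (persisted).
    trusted: Vec<PathBuf>,
    /// Directories refused for this session only (never persisted).
    session_denied: HashSet<PathBuf>,
}

impl PermissionState {
    /// Session denials win over trusted paths; unknown paths need a prompt.
    pub fn check(&self, path: &Path) -> Decision {
        if self.session_denied.iter().any(|d| path.starts_with(d)) {
            return Decision::Deny;
        }
        if self.trusted.iter().any(|t| path.starts_with(t)) {
            return Decision::Allow;
        }
        Decision::Ask
    }

    /// Records an answer. Returns true when the config file must be rewritten.
    pub fn record(&mut self, dir: &Path, answer: PromptAnswer) -> bool {
        match answer {
            PromptAnswer::Always => {
                self.trusted.push(dir.to_path_buf());
                true
            }
            PromptAnswer::Never => {
                self.session_denied.insert(dir.to_path_buf());
                false
            }
            PromptAnswer::Once => false,
        }
    }
}
```

`Path::starts_with` compares whole components, so trusting `/tmp` does not accidentally trust `/tmpfoo`.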
Files to modify:
src/cli/repl.rs # Add agent_manager field
src/tools/definitions.rs # Pass permission checker through
src/permissions/checker.rs # Call PermissionPrompt::prompt()
Tests:
| Test | What it verifies |
|---|---|
| `test_agent_manager_initialized` | AgentManager exists in REPL |
| `test_status_shows_active_agents` | /status displays running agents |
| `test_permission_prompt_untrusted` | Prompt shown for untrusted writes |
| `test_permission_always_saves` | "Always" updates config |
Stopping condition:
✓ AgentManager successfully initialized
✓ /status shows "No active agents" on fresh start
✓ Permission prompts appear for untrusted file writes
✓ Config updated when "Always" selected
Deliverables:
- ToolExecutor Integration
  - Create `ToolExecutor` instance in `Repl::new()`
  - Register all tools (read, write, edit, list, bash, search)
  - Replace `execute_tool_with_permissions()` calls with `executor.execute()`
  - Handle `ToolExecutionResult` with retry logic
  - Display retry attempts to user
- ToolExecutionSpinner Integration
  - Replace simple "● Running..." with `ToolExecutionSpinner::with_target()`
  - Show spinner during tool execution
  - Call `finish_success()` or `finish_failed()` on completion
  - Display elapsed time for long operations
Implementation approach:
// In process_conversation(), replace:
self.print_line(&format!("● {}", self.format_tool_call(&name, &input)));
let result = execute_tool_with_permissions(...);
// With:
let spinner = ToolExecutionSpinner::with_target(
&name,
extract_target(&name, &input),
self.theme.clone()
);
let result = self.tool_executor.execute(id.clone(), &name, input);
if result.is_success() {
spinner.finish_success();
} else {
spinner.finish_failed(result.error().unwrap());
}
Tests:
| Test | What it verifies |
|---|---|
| `test_tool_executor_retry_transient` | Network errors retried 3 times |
| `test_tool_spinner_shows_target` | Spinner displays file path |
| `test_tool_spinner_timing` | Elapsed time shown on completion |
Stopping condition:
✓ ToolExecutor handles all tool execution
✓ Spinners animate during tool calls
✓ Retry attempts displayed to user
✓ Success/failure with timing shown
Deliverables:
- FixAgent Integration
  - Check `ToolExecutionResult::is_auto_fixable()` after errors
  - Spawn `FixAgent::spawn()` for fixable errors
  - Display fix agent progress via status bar
  - Apply fix if successful, re-run original tool
  - Generate regression test via `generate_regression_test()`
- Error Recovery Flow
  - Categorize errors via `ToolError::category`
  - For `ErrorCategory::Code`, attempt auto-fix
  - For `ErrorCategory::Network`, retry with backoff (already in ToolExecutor)
  - For `ErrorCategory::Permission`, show permission prompt
  - For `ErrorCategory::Resource`, suggest alternatives
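The error recovery flow above amounts to a dispatch table from category to action. A compact sketch follows, with a fix-attempt budget folded in so auto-fix cannot loop forever. The `ErrorCategory` variants mirror the list, while `Recovery`, `plan_recovery`, and the budget parameters are hypothetical.

```rust
/// Categories named in the deliverables above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    Code,
    Network,
    Permission,
    Resource,
}

/// What the REPL should do next (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    AutoFix,
    RetryWithBackoff,
    PromptForPermission,
    SuggestAlternatives,
    GiveUp,
}

/// Chooses a recovery action. Code errors are auto-fixed only while the
/// attempt budget lasts, which bounds the fix/retry cycle.
pub fn plan_recovery(category: ErrorCategory, fix_attempts: u32, max_fix_attempts: u32) -> Recovery {
    match category {
        ErrorCategory::Code if fix_attempts < max_fix_attempts => Recovery::AutoFix,
        ErrorCategory::Code => Recovery::GiveUp,
        ErrorCategory::Network => Recovery::RetryWithBackoff,
        ErrorCategory::Permission => Recovery::PromptForPermission,
        ErrorCategory::Resource => Recovery::SuggestAlternatives,
    }
}
```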
Implementation approach:
// In process_conversation(), after tool execution:
match result.result {
Ok(output) => { /* success path */ },
Err(error) if error.is_auto_fixable() => {
// Spawn fix agent
if let Some(fix_agent) = FixAgent::spawn(result, FixAgentConfig::default()) {
let fix_result = fix_agent.attempt_fix(
|fix_info| apply_fix(fix_info, &config),
|_| verify_fix()
);
if fix_result.success {
// Re-run the original tool
let retry_result = self.tool_executor.execute(...);
// ... handle retry
}
}
},
Err(error) => { /* show error */ }
}
Tests:
| Test | What it verifies |
|---|---|
| `test_auto_fix_missing_dependency` | Adds missing crate to Cargo.toml |
| `test_auto_fix_missing_import` | Adds missing import statement |
| `test_regression_test_generated` | Test written after successful fix |
| `test_no_fix_infinite_loop` | Gives up after max attempts |
Stopping condition:
✓ Missing dependency errors trigger auto-fix
✓ Fix agent progress shown in /status
✓ Regression tests generated in tests/auto_fix/
✓ Original operation retried after fix
✓ Max retry limit prevents infinite loops
Deliverables:
- StatusBar for Multi-Agent Display
  - Create `StatusBar` instance
  - Query `agent_manager.get_all_statuses()` periodically
  - Render status bar above input prompt when agents active
  - Clear status bar when all agents complete
- LongWaitDetector Integration
  - Replace manual `elapsed.as_secs()` check with `LongWaitDetector`
  - Call `detector.start()` before API call
  - Call `detector.check()` periodically
  - Trigger fun fact on first threshold crossing
- Enhanced Tool Result Formatting
  - Use `ToolResultFormatter::format_result()` consistently
  - Add syntax highlighting for code in results
  - Implement collapsible sections for long outputs
  - Add line numbers to file reads
Implementation approach:
// Status bar rendering
if !agent_statuses.is_empty() {
self.status_bar.render(&agent_statuses)?;
// Move cursor below status bar
execute!(stdout(), cursor::MoveDown(agent_statuses.len() as u16 + 2))?;
}
// Long wait detection
let mut detector = LongWaitDetector::new(Duration::from_secs(self.fun_fact_delay as u64));
detector.start();
// ... start API call ...
// Periodically check during wait
if detector.check() && !triggered {
self.display_fun_fact();
triggered = true;
}
Tests:
| Test | What it verifies |
|---|---|
| `test_status_bar_multiple_agents` | Displays all active agents |
| `test_status_bar_progress_updates` | Progress bars update correctly |
| `test_long_wait_detector_threshold` | Fun fact triggers after 10s |
| `test_tool_result_syntax_highlight` | Code highlighted in results |
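A test such as `test_long_wait_detector_threshold` is easier to write if the detector takes the current time as an argument instead of reading the clock itself. The sketch below is one possible shape and is not the project's `LongWaitDetector`: the `start_at`/`check_at` methods are hypothetical variants of `start()`/`check()`, and firing only once is built into the detector, so no external `triggered` flag is needed.

```rust
use std::time::{Duration, Instant};

/// Fires once when a wait crosses `threshold` (illustrative sketch).
pub struct LongWaitDetector {
    threshold: Duration,
    started: Option<Instant>,
    fired: bool,
}

impl LongWaitDetector {
    pub fn new(threshold: Duration) -> Self {
        Self { threshold, started: None, fired: false }
    }

    /// Marks the beginning of a wait and re-arms the detector.
    pub fn start_at(&mut self, now: Instant) {
        self.started = Some(now);
        self.fired = false;
    }

    /// Returns true exactly once: the first call where `now` is at or past
    /// the threshold. Returns false if the detector was never started.
    pub fn check_at(&mut self, now: Instant) -> bool {
        let Some(started) = self.started else {
            return false;
        };
        if self.fired || now.duration_since(started) < self.threshold {
            return false;
        }
        self.fired = true;
        true
    }
}
```

Production code would pass `Instant::now()` to both methods; tests can pass synthetic instants and never sleep.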
Stopping condition:
✓ Status bar appears when agents spawn
✓ Progress bars update in real-time
✓ Fun facts trigger at exact threshold
✓ Tool results nicely formatted with syntax highlighting
Deliverables:
- File Grouping in /commit
  - Use `FileGrouper::group_files()` in commit command
  - Suggest logical splits for unrelated changes
  - Show grouping rationale to user
  - Ask which group to commit
- Smart Commit Message Generation
  - Analyze all files in group via `suggest_commit_splits()`
  - Generate purpose-focused message from changes
  - Show commit message preview with edit option
  - Support multi-commit workflow for logical separation
Implementation approach:
// In commit.rs
let repo = GitRepo::open(".")?;
let status = repo.status()?;
let files = status.files;
// Group related files
let groups = FileGrouper::group_files(&files);
if groups.len() > 1 {
// Suggest splits
println!("Found {} logical groups:", groups.len());
for (i, group) in groups.iter().enumerate() {
println!(" {}. {} ({} files)", i+1, group.reason.description(), group.files.len());
}
// Let user pick which to commit
}
Tests:
| Test | What it verifies |
|---|---|
| `test_file_grouping_same_dir` | Files grouped by directory |
| `test_file_grouping_test_impl` | Tests paired with implementations |
| `test_commit_split_suggestion` | Suggests separating unrelated changes |
| `test_commit_message_purpose` | Message focuses on "why" not "what" |
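To make "tests paired with implementations" concrete, here is a minimal sketch of one grouping rule: files share a group when their stems match after a `_test` suffix is removed. The suffix convention and the free function below are assumptions for illustration. The real `FileGrouper` may use different rules, and same-directory grouping is not covered here.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Grouping key: the file stem with an assumed `_test` suffix stripped,
/// so `auth_test.rs` and `auth.rs` land in the same group.
fn group_key(path: &str) -> String {
    let stem = Path::new(path)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or(path);
    stem.strip_suffix("_test").unwrap_or(stem).to_string()
}

/// Groups changed paths by key. A BTreeMap keeps the output order stable.
pub fn group_files(paths: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for p in paths {
        groups.entry(group_key(p)).or_default().push(p.to_string());
    }
    groups
}
```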
Stopping condition:
✓ /commit suggests logical file groups
✓ User can commit subsets of changes
✓ Commit messages explain purpose clearly
✓ Test/impl files grouped intelligently
Deliverables:
- /document Command Enhancement
  - Use `ObsidianVault::search()` to find related notes
  - Show search results with relevance scores
  - Support creating new notes in suggested locations
  - Support updating existing notes with diffs
- Note Template System
  - Generate structured note templates by topic
  - Include metadata (date, tags, backlinks)
  - Support different note types (meeting, concept, reference)
Tests:
| Test | What it verifies |
|---|---|
| `test_document_search_finds_related` | Finds notes by content |
| `test_document_creates_in_location` | Suggests correct subdirectory |
| `test_document_shows_diff` | Preview changes before applying |
Stopping condition:
✓ /document finds existing notes
✓ New notes created in appropriate subdirs
✓ Updates show diff preview
✓ Metadata included in new notes
Deliverables:
- Build Verification
  - `cargo build --all-targets`
  - `cargo test --all`
  - `cargo clippy --all-targets -- -D warnings`
  - `cargo fmt --all -- --check`
- Integration Test Suite
  - Test agent spawning and status
  - Test tool execution with retries
  - Test auto-fix workflow end-to-end
  - Test permission prompts and config updates
  - Test multi-agent coordination
- Manual Workflow Tests
  - Spawn multiple agents, check /status
  - Trigger permission prompt, select "Always"
  - Cause tool error, verify auto-fix attempt
  - Let long operation run, see fun fact
  - Commit with file grouping
Test scenarios:
#[tokio::test]
async fn test_end_to_end_auto_fix() {
// 1. Set up project missing a dependency
// 2. Ask Claude to use that dependency
// 3. Tool execution fails
// 4. Verify FixAgent spawned
// 5. Verify dependency added
// 6. Verify regression test created
// 7. Verify retry succeeded
}
#[tokio::test]
async fn test_multi_agent_status_display() {
// 1. Spawn 3 test agents
// 2. Run /status
// 3. Verify all 3 shown with progress
// 4. Cancel one agent
// 5. Run /status again
// 6. Verify only 2 active
}
#[tokio::test]
async fn test_permission_prompt_workflow() {
// 1. Attempt write to /tmp/test.txt
// 2. Verify prompt appears
// 3. Simulate "Always" response
// 4. Verify config updated
// 5. Attempt another write to /tmp/
// 6. Verify no prompt (cached)
}
Stopping condition:
✓ cargo build produces NO warnings
✓ All 210+ tests pass
✓ Clippy produces no warnings
✓ Code formatted consistently
✓ All integration tests pass
✓ Manual workflows verified
Deliverables:
- Code Documentation
  - Add doc comments to all public APIs
  - Document integration points between modules
  - Add examples to complex functions
  - Update README with new features
- Architecture Documentation
  - Update `docs/CLI_ARCHITECTURE.md` with final structure
  - Document agent lifecycle and coordination
  - Document error recovery flow
  - Document tool execution pipeline
- Remove Temporary Allows
  - Remove any `#[allow(dead_code)]` annotations added temporarily
  - Verify all code is actually used
  - Clean up any debug prints or temporary code
Stopping condition:
✓ All public APIs documented
✓ Architecture docs updated
✓ No #[allow(dead_code)] annotations
✓ No TODO or FIXME comments
✓ README reflects actual features
The integration is complete when:
□ All modules wired into main execution flow
□ Zero compiler warnings on `cargo build`
□ Zero clippy warnings on `cargo clippy`
□ All 210+ tests pass
□ New integration tests added and passing
□ Manual workflow tests verified
□ Code formatted with rustfmt
□ Documentation updated
□ All advanced features functional:
  □ Self-healing error recovery
  □ Multi-agent coordination
  □ Permission prompts with config updates
  □ Tool execution with retries and spinners
  □ Smart git commit grouping
  □ Obsidian vault integration
  □ Status bar for agent progress
  □ Long-wait fun facts
For each feature, handle edge cases in this order of priority:
- Prevent data loss - Never lose user's work (always save before destructive operations)
- Graceful degradation - Feature unavailable? Fall back to simpler version
- Clear communication - Tell user what went wrong and how to fix it
- No crashes - Catch all errors, log them, keep CLI running
- Recoverable state - After any error, CLI should be usable
Universal edge cases (handle in Phase 1-2):
- Terminal resize → reflow UI
- SIGINT/SIGTERM → save state, exit cleanly
- Panic → restore terminal, save state, show error
- Disk full → warn, don't crash
- Permission denied → explain which permission, suggest fix
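The "Panic → restore terminal, save state, show error" case above can be approached with a panic hook from the standard library. The sketch below is an assumption about how that might look, not the project's actual code: `install_panic_hook` is a hypothetical helper, and the two callbacks stand in for whatever the CLI uses to leave raw mode and persist the session.

```rust
use std::panic;

/// Install a process-wide panic hook that cleans up before reporting.
///
/// `restore_terminal` and `save_state` are caller-supplied callbacks
/// (hypothetical here); they run in that order, then the error is shown.
pub fn install_panic_hook<R, S>(restore_terminal: R, save_state: S)
where
    R: Fn() + Send + Sync + 'static,
    S: Fn() + Send + Sync + 'static,
{
    panic::set_hook(Box::new(move |info| {
        // 1. Leave raw mode first so the message below renders normally.
        restore_terminal();
        // 2. Persist the session so the user's work is not lost.
        save_state();
        // 3. Show the error. "\r\n" is used in case the terminal is still
        //    in raw mode because the restore step did not fully succeed.
        eprint!("coding-agent hit an internal error: {}\r\n", info);
    }));
}
```

The hook runs before unwinding starts, so it fires even when the panic is later caught. Both callbacks should avoid panicking themselves, since a panic inside the hook aborts the process.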
A phase is complete when:
□ All deliverables implemented
□ All tests written and passing
□ Snapshot tests reviewed and approved
□ Edge cases documented and handled
□ No compiler warnings
□ Code formatted with rustfmt
□ Documentation comments on public APIs
□ Manual testing completed (checklist above)
□ Changes committed with purpose-focused message
- Split pane view - Code preview alongside conversation
- Vim keybindings - For power users
- Custom themes - User-definable color schemes
- Plugin system - Custom slash commands via Lua/Rust
- Fuzzy file picker - Quick file selection with fzf-style interface
- Web UI mode - Optional browser-based interface
- Voice input - Whisper integration for voice commands
- Snippets - Save and reuse common prompts
- Custom fun facts - User-contributed facts/jokes
- Session branching - Fork conversations to explore alternatives
- Obsidian vault path: `~/Documents/Personal/`
- Platform: Cross-platform (macOS primary)
- Multi-line submit: Double-enter
- Multi-agent status bar: Core feature
- Tool verbosity: Standard (action + target)
- Session persistence: SpecStory auto-save
- `/commit` default: Agent decides, `--pick` for interactive
- Token display: Cumulative session with context bar
- Fun facts: Fetch from API (with local cache fallback)
- Context bar position: Bottom of CLI
- Cost display: Separate (`/cost` command), not in context bar
- Startup: Welcome message with new/resume session options
- `/clear` behavior: Resets both display AND context
Claude API Integration (Critical)
- CLI now makes real API calls to Claude (was previously just echoing input)
- Added `ureq` and `dotenvy` dependencies for HTTP requests and env loading
- Loads `ANTHROPIC_API_KEY` from environment or `.env` file
- Multi-turn conversation memory (Claude remembers context)
- Shows "Thinking..." indicator during API calls
Terminal Raw Mode Fix (Bug Fix)
- Fixed broken layout where ASCII logo and text drifted right
- Root cause: In raw mode, `\n` only moves cursor down, doesn't return to column 0
- Solution: Changed all newlines to `\r\n` in startup screen, REPL, and input handler
- Added `print_line()` and `print_newline()` helpers to REPL for consistent handling
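To illustrate the raw-mode fix, here is a minimal sketch of what such helpers might look like. The names match the helpers mentioned above, but the signatures are assumptions: these versions take any `Write` target so they can be exercised without a terminal, which may differ from the real REPL code.

```rust
use std::io::{self, Write};

/// Move to the start of the next line. In raw mode "\n" alone only moves
/// the cursor down, so the carriage return must be sent explicitly.
pub fn print_newline<W: Write>(out: &mut W) -> io::Result<()> {
    out.write_all(b"\r\n")?;
    out.flush()
}

/// Write `text` and end every line in it with "\r\n".
pub fn print_line<W: Write>(out: &mut W, text: &str) -> io::Result<()> {
    if text.is_empty() {
        return print_newline(out);
    }
    // `str::lines` splits on "\n" and on "\r\n", so text that already uses
    // CRLF line endings is not given a doubled carriage return.
    for line in text.lines() {
        out.write_all(line.as_bytes())?;
        out.write_all(b"\r\n")?;
    }
    out.flush()
}
```

Routing all output through helpers like these keeps the `\r\n` rule in one place, so multi-line content such as the ASCII logo cannot drift right when an embedded `\n` slips through.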
Command Aliases (UX Improvement)
- Added `/quit` as alias for `/exit` (common user expectation)
- Added `/q` as short alias for `/exit`
- Added error message for bare `/` input (was being treated as regular message)
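The alias and bare-`/` behavior described above could be handled in a single parsing step. The following is a sketch under assumptions: `Input` and `parse_input` are hypothetical names, the error text is illustrative, and treating `/` followed by whitespace as a bare slash is a choice made here rather than something the notes specify.

```rust
/// What a line of REPL input turned out to be.
#[derive(Debug, PartialEq, Eq)]
pub enum Input<'a> {
    /// `/exit` or one of its aliases (`/quit`, `/q`).
    Exit,
    /// Any other slash command, without the leading '/'.
    Command(&'a str),
    /// Plain text to send to the model.
    Message(&'a str),
    /// Input that looks like a command but is not usable.
    Error(&'static str),
}

/// Classify one line of user input.
pub fn parse_input(raw: &str) -> Input<'_> {
    let trimmed = raw.trim();
    match trimmed.strip_prefix('/') {
        // No leading slash: an ordinary message.
        None => Input::Message(trimmed),
        // A bare "/" (or "/" followed by a space) is reported as an error
        // instead of being sent as a regular message.
        Some(rest) if rest.is_empty() || rest.starts_with(char::is_whitespace) => {
            Input::Error("Empty command: type a command name after '/'")
        }
        Some(rest) => {
            // The command name is the first whitespace-separated word.
            let name = rest.split_whitespace().next().unwrap_or("");
            match name {
                "exit" | "quit" | "q" => Input::Exit,
                _ => Input::Command(rest),
            }
        }
    }
}
```

Resolving aliases at parse time means the rest of the REPL only ever sees `Input::Exit`, so adding another alias later is a one-line change.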
Test Count: 210 tests passing (206 unit + 3 integration + 1 doc test)
Last updated: 2026-02-13