
Test Case Format Improvement Plan

Overview

This plan defines an improved test case format for the existing prompt-tester framework. This is not a new project, but an evolution that simplifies the test case schema while leveraging the existing infrastructure (parsers, runners, reporters).

Key Goal: Simplify test case creation for opencode agent testing while maintaining compatibility with the existing framework architecture.

Current Format vs. Improved Format

Current Format (Complex)

test_suite:
  name: "Suite Name"
  test_cases:
    - id: "unique_id"
      name: "Human-readable description"
      input:
        prompt: "The actual LLM prompt template"
        parameters:
          param_name: "param_value"
      assertions:
        - type: "substring"
          value: "expected text"
        - type: "regex"
          value: "pattern"
        - type: "metric"
          metric_name: "execution_time"
          operator: "<"
          threshold: 5.0
      expected_metrics:
        execution_time:
          max: 5.0

Improved Format (Simplified)

agent: test  # Per-file agent specification

test_cases:
  - description: "Basic greeting test"
    prompt: "hello"
    expected: "hello world!"
    
  - description: "Regex test"
    prompt: "tell me about cats"
    expected: "/cat|feline/"

Why This is an Improvement

  1. Simpler to write: No nested structures, just description, prompt, expected
  2. Per-file agent: Single agent declaration at file level (not repeated per test case)
  3. Straightforward assertions: Only exact match or regex (no complex assertion types needed for opencode)
  4. Whitespace normalization: Built-in handling of whitespace differences
  5. Faster iteration: Less boilerplate, quicker test case creation
  6. Backward compatible: Uses existing parser, runner, and reporter infrastructure

Test File Format

Structure

Each test file contains:

  • agent: Specifies which opencode agent to use (required, per-file)
  • test_cases: Array of individual test cases

YAML Syntax

agent: test

test_cases:
  - description: "Basic greeting test"
    prompt: "hello"
    expected: "hello world!"
    
  - description: "Regex test"
    prompt: "tell me about cats"
    expected: "/cat|feline/"
    
  - description: "Complex regex"
    prompt: "generate JSON"
    expected: "/\\{.*\\}/"

Assertion Logic

String Comparison

  • Exact match between actual response and expected value (default)
  • Whitespace comparison: runs of whitespace (spaces, tabs, newlines) are normalized before comparison
  • Future enhancement: integrate an agent to evaluate semantic equivalence of responses
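The normalization step could live in the planned compare.sh. Below is a minimal sketch; the function names (`normalize_ws`, `compare_normalized`) are illustrative, not part of the plan:

```shell
#!/usr/bin/env bash
# Sketch of whitespace normalization for compare.sh (names illustrative).
# Collapses runs of spaces/tabs/newlines to a single space and trims the
# ends, so "hello   world\n" and "hello world" compare equal.
normalize_ws() {
  local s="$1"
  # Squeeze every whitespace run down to one space...
  s="$(printf '%s' "$s" | tr -s '[:space:]' ' ')"
  # ...then strip the single leading/trailing space left behind.
  s="${s# }"
  s="${s% }"
  printf '%s' "$s"
}

compare_normalized() {
  [ "$(normalize_ws "$1")" = "$(normalize_ws "$2")" ]
}
```

With this, `compare_normalized '  hello   world ' 'hello world'` succeeds while a byte-for-byte comparison would not.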

Regex Support

  • The expected field can be a plain string (exact match) or a regex pattern
  • Regex patterns are enclosed in forward slashes: /pattern/
  • If the pattern doesn't match the response, the test fails with a regex mismatch message
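One way the assertion layer could dispatch between the two modes, assuming a value wrapped in forward slashes is an extended regex (the `match_expected` name is illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the expected-value dispatch: /pattern/ is treated as an
# extended regex, anything else as an exact string match.
match_expected() {
  local actual="$1" expected="$2"
  case "$expected" in
    /*/)
      # Strip the surrounding slashes and test as an extended regex.
      local pattern="${expected#/}"
      pattern="${pattern%/}"
      printf '%s' "$actual" | grep -Eq -- "$pattern"
      ;;
    *)
      [ "$actual" = "$expected" ]
      ;;
  esac
}
```

For the examples above, `match_expected "I like cats" "/cat|feline/"` succeeds, while the same pattern against a response about dogs fails.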

Error Output

  • If the opencode CLI returns a non-zero exit code → fail (include the exit code)
  • If the response doesn't match the expected value → fail, showing actual vs. expected
  • Include the test description in all error messages

Execution Flow

1. Load test file: Parse YAML to extract agent and test cases
2. For each test case:
   - Execute: opencode run --agent <agent> <prompt>
   - Capture stdout response
   - Compare response to expected (with whitespace normalization)
   - Mark test as pass/fail
3. Report results: Summary of passed/failed tests
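The per-test step could be sketched as below. Function and variable names are illustrative, and the opencode invocation is injected as `$runner` so the logic can be exercised without the CLI; in the real execute_test_case.sh it would be `actual="$(opencode run --agent "$agent" "$prompt")"`:

```shell
#!/usr/bin/env bash
# Sketch of one iteration of the execution flow (names illustrative).
run_one_test() {
  local description="$1" prompt="$2" expected="$3" runner="$4"
  local actual rc
  actual="$($runner "$prompt")"
  rc=$?
  if [ "$rc" -ne 0 ]; then
    # Non-zero exit from the CLI: fail and include the exit code.
    printf '✗ %s\n  opencode exited with code %d\n' "$description" "$rc"
    return 1
  fi
  if [ "$actual" = "$expected" ]; then
    printf '✓ %s\n' "$description"
  else
    # Mismatch: fail with actual vs. expected.
    printf '✗ %s\n  Expected: "%s"\n  Got:      "%s"\n' \
      "$description" "$expected" "$actual"
    return 1
  fi
}
```

The exact-string comparison here would be replaced by the whitespace-normalized/regex comparison described under Assertion Logic.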

File Organization

project/
├── tests/
│   ├── basic.yaml
│   ├── advanced.yaml
│   └── edge_cases.yaml
├── bin/
│   └── prompt-tester (existing, enhanced)
├── lib/
│   ├── parser/
│   │   ├── load_test_case.sh (existing, enhanced)
│   │   ├── parse_yaml.sh (existing, enhanced)
│   │   └── parse_json.sh (existing, enhanced)
│   ├── runner/
│   │   ├── execute_test_case.sh (new)
│   │   ├── test_aggregator.sh (existing, enhanced)
│   │   └── run_parallel_tests.sh (existing)
│   └── assertions/
│       ├── assertion_substring.sh (existing)
│       ├── assertion_regex.sh (existing)
│       ├── assertion_equality.sh (new/simplified)
│       └── compare.sh (new - whitespace normalization)
└── reporters/
    ├── console.sh (existing)
    ├── json.sh (existing)
    └── junit.sh (existing)

CLI Tool Integration

opencode Command

opencode run --agent <agent_name> <prompt>

Example

opencode run --agent test "hello"
# Output: "hello world!"

prompt-tester Usage (Enhanced)

# Run with new simplified format
./bin/prompt-tester --test-file tests/basic.yaml --reporter console

# Run with verbose output
./bin/prompt-tester -f tests/advanced.yaml -r console -v

# Run and save results
./bin/prompt-tester --test-file tests/all.yaml --reporter junit --output results.xml

Output Format

Console Reporter

Running tests...

✓ Basic greeting test
✗ Another test
  Expected: "expected response"
  Got:      "actual response"

Results: 1 passed, 1 failed (50.0%)

Note: Test description is used for display in reports.

Migration Notes

What Changes

  • Test case format (simplified schema)
  • Agent specification (per-file instead of per-test-case)
  • Assertion logic (simplified to exact/regex with whitespace normalization)

What Stays the Same

  • CLI interface and arguments
  • Reporter infrastructure (console, JSON, JUnit)
  • Aggregation and parallel execution support
  • Overall architecture and folder structure

Backward Compatibility

  • The existing complex format will continue to work (deprecated but supported)
  • New format is the recommended approach
  • Clear documentation distinguishes between the two formats
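The parser could tell the two schemas apart by their top-level keys (legacy files start with `test_suite:`, new files with `agent:`). A grep-based sketch follows; the real load_test_case.sh would query the YAML with yq instead, and `detect_format` is an illustrative name:

```shell
#!/usr/bin/env bash
# Sketch of schema detection for backward compatibility (illustrative).
detect_format() {
  local file="$1"
  if grep -q '^test_suite:' "$file"; then
    echo "legacy"
  elif grep -q '^agent:' "$file"; then
    echo "simplified"
  else
    echo "unknown"
    return 1
  fi
}
```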

Assumptions

  1. YAML parsing: Will use yq tool (YAML parser)
  2. opencode CLI: Available in PATH, supports run --agent <name> <prompt>
  3. Agent configuration: Already set up in opencode (e.g., "test" agent)
  4. Response format: stdout contains the LLM response
  5. Timeout: 10 minutes default for opencode responses
  6. Whitespace: Normalized during comparison (runs of spaces, tabs, and newlines collapse to a single space)
  7. File extension: .yaml extension enforced for test files
  8. Future evaluation: An LLM agent may be incorporated later for semantic response comparison
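The timeout in assumption 5 could be enforced with coreutils `timeout`, which exits 124 when the wrapped command exceeds the limit. A sketch, with `OPENCODE_TIMEOUT` and `run_with_timeout` as illustrative names:

```shell
#!/usr/bin/env bash
# Sketch of enforcing the response timeout (assumption 5) via coreutils
# `timeout`. Names are illustrative.
OPENCODE_TIMEOUT="${OPENCODE_TIMEOUT:-600}"  # seconds; 10 minutes by default

run_with_timeout() {
  # Usage: run_with_timeout <command> [args...]
  # Exits 124 if the command is killed for exceeding the limit.
  timeout "$OPENCODE_TIMEOUT" "$@"
}
```

The runner would then invoke `run_with_timeout opencode run --agent "$agent" "$prompt"` instead of calling opencode directly.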

Next Steps

  1. Review and refine this plan
  2. Create TODO.md with implementation tasks
  3. Update parser to support new simplified format
  4. Implement whitespace-normalized comparison logic
  5. Create sample test files to validate the new format
  6. Add migration documentation for existing test cases