Prompt Tester - Architecture Overview

Project Purpose

A bash-based framework for testing LLM prompts with support for dual-format test cases (YAML/JSON), automated assertions, metrics tracking, and multiple output reporters.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         prompt-tester (bin/)                     │
│                    Main Entry Point & CLI                        │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      lib/parser/                                │
│         ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│         │ YAML Parser  │  │ JSON Parser  │  │  Load Case   │    │
│         │   (yq)       │  │   (jq)       │  │ (Auto-Detect)│    │
│         └──────────────┘  └──────────────┘  └──────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      lib/runner/                                │
│         ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│         │ Execute Test │  │   Aggregate  │  │   Parallel   │    │
│         │   (LLM Call) │  │   Results    │  │  Execution   │    │
│         └──────────────┘  └──────────────┘  └──────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   lib/assertions/                               │
│         ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│         │   Regex      │  │   Equality   │  │  Substring   │    │
│         └──────────────┘  └──────────────┘  └──────────────┘    │
│         ┌──────────────┐  ┌──────────────┐                      │
│         │   Metric     │  │  JSON        │                      │
│         │  Validator   │  │ Structure    │                      │
│         └──────────────┘  └──────────────┘                      │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   lib/metrics/                                  │
│         ┌──────────────┐  ┌──────────────┐                      │
│         │  Execution   │  │   Pass/Fail  │                      │
│         │    Time      │  │  Calculator  │                      │
│         └──────────────┘  └──────────────┘                      │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   reporters/                                    │
│         ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│         │   Console    │  │    JSON      │  │    JUnit     │    │
│         │   Reporter   │  │  Reporter    │  │   Reporter   │    │
│         └──────────────┘  └──────────────┘  └──────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Core Components

1. Parser Layer (lib/parser/)

Purpose: Load and parse test cases from YAML or JSON format

Files:

  • load_test_case.sh - Unified loader with format auto-detection (based on file extension)
  • parse_yaml.sh - YAML parser using yq with pure bash fallback
  • parse_json.sh - JSON parser using jq

Test Case Schema:

test_suite:
  name: "Suite Name"
  test_cases:
    - id: "unique_id"
      name: "Human-readable description"
      input:
        prompt: "The actual LLM prompt template"
        parameters:
          param_name: "param_value"
      assertions:
        - type: "substring"
          value: "expected text"
        - type: "regex"
          value: "pattern"
        - type: "field_exists"
          field: "path.to.field"
        - type: "array_length"
          min: 1
          max: 10
        - type: "unique_values"
        - type: "metric"
          metric_name: "execution_time"
          operator: "<"
          threshold: 5.0
      expected_metrics:
        execution_time:
          max: 5.0
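
Since test cases may also be written in JSON, fields of this schema can be read with jq. The following is a minimal sketch, assuming the JSON form mirrors the YAML keys one for one; the function names list_case_ids and list_assertion_types are hypothetical and not taken from the project.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the JSON form mirrors the YAML schema key for key.
# list_case_ids and list_assertion_types are hypothetical helper names.

# Print the id of every test case in a JSON suite file, one per line.
list_case_ids() {
  jq -r '.test_suite.test_cases[].id' "$1"
}

# Print the assertion types declared for a single test case id.
list_assertion_types() {
  local file="$1" case_id="$2"
  jq -r --arg id "$case_id" \
    '.test_suite.test_cases[] | select(.id == $id) | .assertions[].type' "$file"
}
```

For example, list_assertion_types suite.json unique_id would print substring, regex, and so on, one type per line, for a suite shaped like the schema above.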

Key Design Decisions:

  • Format detection based on file extension (.yaml, .yml, .json)
  • YAML parser uses yq command (if available) with pure bash fallback
  • JSON parser uses jq for reliable parsing
  • Unified interface through load_test_case.sh regardless of format
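
Extension-based detection could look roughly like the sketch below. The function names, the PARSER_DIR variable, and the idea that each parser script takes the file path as its only argument are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: detect_format, load_test_case, and PARSER_DIR are hypothetical.

# Print "yaml" or "json" based on the file extension; fail on anything else.
detect_format() {
  local file="$1"
  case "${file##*.}" in
    yaml|yml) echo "yaml" ;;
    json)     echo "json" ;;
    *)        echo "unsupported test case format: $file" >&2; return 1 ;;
  esac
}

# Dispatch to the parser script for the detected format.
# Assumes parse_yaml.sh and parse_json.sh are executable and accept a path.
load_test_case() {
  local file="$1" format
  format="$(detect_format "$file")" || return 1
  "${PARSER_DIR:-lib/parser}/parse_${format}.sh" "$file"
}
```

Keeping detection separate from dispatch means a new format only needs one extra case branch and one new parser script.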

2. Runner Layer (lib/runner/)

Purpose: Execute test cases, aggregate results, support parallel execution

Files:

  • execute_test_case.sh - Executes individual test case (includes LLM API call)
  • test_aggregator.sh - Aggregates results from multiple test runs
  • run_parallel_tests.sh - Runs tests concurrently for performance

Flow:

  1. Parse test case
  2. Substitute parameters into prompt template
  3. Call LLM API (external)
  4. Run assertions against output
  5. Calculate metrics
  6. Aggregate results
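
Step 2 of the flow can be sketched as below. The source does not specify the template syntax, so the {{name}} placeholder form and the render_prompt function are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: the {{name}} placeholder syntax and render_prompt are
# assumptions; the project's real template syntax may differ.

# Usage: render_prompt "<template>" key=value [key=value ...]
render_prompt() {
  local result="$1" pair key value placeholder
  shift
  for pair in "$@"; do
    key="${pair%%=*}"      # text before the first '='
    value="${pair#*=}"     # text after the first '='
    placeholder="{{${key}}}"
    # Quoting both sides keeps the placeholder and the value literal.
    result=${result//"$placeholder"/"$value"}
  done
  printf '%s\n' "$result"
}

render_prompt "Summarize {{topic}} in {{count}} bullets." topic="bash testing" count=3
# prints: Summarize bash testing in 3 bullets.
```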

Key Design Decisions:

  • Parallel execution for faster test suites
  • Aggregation supports multiple test suite runs
  • Metrics tracking is integrated into execution flow
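
One plausible way to run cases concurrently in Bash is with background jobs and wait. The sketch below uses hypothetical names (run_parallel, run_one_case) and a stand-in worker in place of the real LLM call.

```shell
#!/usr/bin/env bash
# Sketch only: run_parallel and run_one_case are hypothetical names. A real
# runner would call the LLM API inside run_one_case.

# Stand-in for one test execution: fails when the case id starts with "bad".
run_one_case() {
  case "$1" in
    bad*) return 1 ;;
    *)    return 0 ;;
  esac
}

# Start every case as a background job, then wait on each PID so that one
# failure never prevents the remaining results from being collected.
run_parallel() {
  local pids=() failed=0 case_id pid
  for case_id in "$@"; do
    run_one_case "$case_id" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "total=$# failed=$failed"
  [ "$failed" -eq 0 ]
}
```

Waiting on each PID individually, rather than a bare wait, is what makes the per-case exit status available for counting.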

3. Assertions Layer (lib/assertions/)

Purpose: Validate LLM output against expected criteria

Assertion Types:

  • assertion_substring.sh - Check if output contains expected substring
  • assertion_regex.sh - Validate output matches regex pattern
  • assertion_equality.sh - Exact match validation
  • validate_metric.sh - Validate calculated metrics (execution time, etc.)
  • validate_json_structure.sh - Validate JSON structure/schema
  • assertion_json.sh - General JSON assertions
  • assertion_unique_values.sh - Verify unique values in arrays

Key Design Decisions:

  • Each assertion type has a dedicated script
  • Return codes indicate pass (0) or fail (non-zero)
  • Assertion failures are logged but don't stop execution
  • Supports metric-based assertions (e.g., execution time < threshold)
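
The return-code convention and the log-but-continue behavior might be sketched as follows. The function names and the "type|value" argument format are illustrative assumptions, not the project's actual script interfaces.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the "type|value" format are hypothetical.

# Pass (0) when the output contains the expected text.
assert_substring() {
  local output="$1" expected="$2"
  [[ "$output" == *"$expected"* ]]
}

# Pass (0) when the output matches the extended regular expression.
assert_regex() {
  local output="$1" pattern="$2"
  [[ "$output" =~ $pattern ]]
}

# Pass (0) only on an exact match.
assert_equality() {
  [[ "$1" == "$2" ]]
}

# Run "type|value" assertions against one output. Failures are logged to
# stderr and counted, but the loop always continues to the next assertion.
run_assertions() {
  local output="$1" spec type value passed=0 total=0
  shift
  for spec in "$@"; do
    type="${spec%%|*}"
    value="${spec#*|}"
    total=$((total + 1))
    if "assert_${type}" "$output" "$value"; then
      passed=$((passed + 1))
    else
      echo "FAIL: ${type} assertion with value '${value}'" >&2
    fi
  done
  echo "passed=${passed} total=${total}"
  [ "$passed" -eq "$total" ]
}
```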

4. Metrics Layer (lib/metrics/)

Purpose: Calculate and track performance metrics

Files:

  • calculate_execution_time.sh - Measure time from prompt to response
  • calculate_pass_fail.sh - Calculate pass/fail rates from test results
  • usage_example.sh - Example usage patterns

Metrics Tracked:

  • execution_time - Time to generate response
  • pass_rate - Percentage of assertions that passed
  • custom_metrics - Extensible metric framework

Key Design Decisions:

  • Metrics are calculated during test execution
  • Metrics can be used as assertions
  • Support for custom metric definitions
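
Because Bash arithmetic is integer-only, a metric assertion such as execution_time < 5.0 needs an external comparison. The sketch below delegates that to awk; the function names and argument order are assumptions, and time_command only has whole-second resolution.

```shell
#!/usr/bin/env bash
# Sketch only: function names and argument order are hypothetical.

# Compare a metric value against a threshold. awk performs the
# floating-point comparison that bash arithmetic cannot.
# Returns 0 on pass, 1 on fail, 2 on an unknown operator.
validate_metric() {
  local value="$1" operator="$2" threshold="$3"
  awk -v a="$value" -v op="$operator" -v b="$threshold" 'BEGIN {
    a += 0; b += 0
    if      (op == "<")  ok = (a <  b)
    else if (op == "<=") ok = (a <= b)
    else if (op == ">")  ok = (a >  b)
    else if (op == ">=") ok = (a >= b)
    else if (op == "==") ok = (a == b)
    else                 exit 2
    exit !ok
  }'
}

# Integer percentage of passed assertions; 0 when nothing ran.
pass_rate() {
  local passed="$1" total="$2"
  if [ "$total" -eq 0 ]; then
    echo 0
  else
    echo $(( passed * 100 / total ))
  fi
}

# Time a command in whole seconds using the SECONDS shell variable.
# The command's own output is discarded so only the duration is printed.
time_command() {
  local start=$SECONDS
  "$@" >/dev/null 2>&1
  echo $(( SECONDS - start ))
}
```

For instance, validate_metric 3.2 "<" 5.0 succeeds, while validate_metric 7.5 "<" 5.0 returns a non-zero status.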

5. Reporters (reporters/)

Purpose: Generate human-readable and machine-parseable reports

Reporters:

  • console.sh - Terminal output with colors, progress bars, summary
  • json.sh - JSON output for programmatic consumption
  • junit.sh - JUnit XML format for CI/CD integration

Key Design Decisions:

  • Multiple output formats for different use cases
  • Console reporter provides interactive feedback
  • JSON reporter enables custom tooling
  • JUnit reporter integrates with existing CI/CD pipelines
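
As an illustration of the JUnit reporter's job, the sketch below emits a minimal testsuite document. The "name|status|message" record format and both function names are assumptions; real CI systems accept additional attributes (time, errors, and so on) that are omitted here.

```shell
#!/usr/bin/env bash
# Sketch only: the "name|status|message" record format, xml_escape, and
# junit_report are hypothetical, not the project's reporter interface.

# Escape the characters that are unsafe in XML attribute values.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' \
                         -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# Usage: junit_report <suite-name> "name|status|message" ...
# status is "pass" or "fail"; message is only used for failures.
junit_report() {
  local suite="$1" record name status message failures=0
  shift
  # First pass: count failures for the testsuite header.
  for record in "$@"; do
    IFS='|' read -r name status message <<< "$record"
    if [ "$status" = "fail" ]; then
      failures=$((failures + 1))
    fi
  done
  printf '<?xml version="1.0" encoding="UTF-8"?>\n'
  printf '<testsuite name="%s" tests="%d" failures="%d">\n' \
    "$(xml_escape "$suite")" "$#" "$failures"
  # Second pass: one testcase element per record.
  for record in "$@"; do
    IFS='|' read -r name status message <<< "$record"
    if [ "$status" = "fail" ]; then
      printf '  <testcase name="%s">\n' "$(xml_escape "$name")"
      printf '    <failure message="%s"/>\n' "$(xml_escape "$message")"
      printf '  </testcase>\n'
    else
      printf '  <testcase name="%s"/>\n' "$(xml_escape "$name")"
    fi
  done
  printf '</testsuite>\n'
}
```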

6. Main Entry Point (bin/prompt-tester)

Purpose: CLI interface, orchestrates entire test flow

Key Functions:

  • Argument parsing (--test-case, --output, --reporter, --parallel, etc.)
  • Test suite discovery and loading
  • Execution orchestration
  • Report generation and output

CLI Usage:

./bin/prompt-tester --test-case test.yaml --reporter console
./bin/prompt-tester --test-case tests/*.yaml --reporter junit --output results.xml
./bin/prompt-tester --test-case suite.yaml --parallel --reporter json
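
A possible shape for the argument parsing is sketched below. Only the flag names come from the usage above; parse_args, the variable names, and the console default are assumptions. Note that an unquoted glob such as tests/*.yaml is expanded by the shell into several arguments, so this sketch collects every non-flag argument that follows --test-case.

```shell
#!/usr/bin/env bash
# Sketch only: parse_args, the variable names, and the defaults are
# hypothetical; only the flag names are taken from the usage examples.

parse_args() {
  TEST_CASES=()
  REPORTER="console"   # assumed default
  OUTPUT=""
  PARALLEL=false
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --test-case)
        shift
        # Collect every following non-flag argument, so a glob that the
        # shell expanded into several files is handled.
        while [ "$#" -gt 0 ] && [[ "$1" != --* ]]; do
          TEST_CASES+=("$1")
          shift
        done
        ;;
      --reporter|--output)
        [ "$#" -ge 2 ] || { echo "$1 needs a value" >&2; return 2; }
        if [ "$1" = "--reporter" ]; then REPORTER="$2"; else OUTPUT="$2"; fi
        shift 2
        ;;
      --parallel)
        PARALLEL=true
        shift
        ;;
      *)
        echo "unknown argument: $1" >&2
        return 2
        ;;
    esac
  done
  [ "${#TEST_CASES[@]}" -gt 0 ] || { echo "--test-case is required" >&2; return 2; }
}
```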

Data Flow

1. User invokes: prompt-tester --test-case <file>
   │
   ▼
2. bin/prompt-tester parses arguments
   │
   ▼
3. load_test_case.sh detects format and parses test case
   │
   ▼
4. execute_test_case.sh runs:
   ├── Substitute parameters into prompt
   ├── Call LLM API (external)
   ├── Run all assertions
   ├── Calculate metrics
   └── Store results
   │
   ▼
5. test_aggregator.sh aggregates results (if multiple runs)
   │
   ▼
6. Selected reporter generates output:
   ├── console.sh → terminal
   ├── json.sh → JSON file
   └── junit.sh → JUnit XML
   │
   ▼
7. Exit with appropriate status code
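
Steps 5 and 7 could be combined roughly as in the sketch below. The "passed=N total=M" line format, the aggregate_results name, and the choice of exit status 0 for all-passed and 1 otherwise are assumptions; the source does not define the exact status codes.

```shell
#!/usr/bin/env bash
# Sketch only: the input line format, function name, and exit-code mapping
# are hypothetical.

# Sum "passed=N total=M" lines from stdin, print the combined totals, and
# return 0 only when every assertion in every run passed.
aggregate_results() {
  local line passed=0 total=0 p t
  while IFS= read -r line; do
    p="${line#passed=}"     # drop the leading "passed="
    p="${p%% *}"            # keep the number before the first space
    t="${line##*total=}"    # keep the number after "total="
    passed=$((passed + p))
    total=$((total + t))
  done
  echo "passed=${passed} total=${total}"
  [ "$passed" -eq "$total" ]
}
```

A CI job can then rely on the script's exit status alone, without parsing the report.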

Test Case Format Evolution

Original Plan (PLAN.md)

  • YAML-based test cases with test_suite → test_cases structure
  • Support for input, assertions, expected_metrics
  • Multiple assertion types (substring, regex, field_exists, array_length)

Current Implementation

  • Dual-format support (YAML and JSON)
  • Enhanced schema with parameters substitution
  • Metric validation as assertions
  • JSON structure validation (listed as pending under Implementation Status)
  • Parallel execution support (listed as pending under Implementation Status)
  • Multiple reporter options

Migration Path

The current Bash implementation is designed as a stepping stone toward a Go-based system with:

  • Improved performance
  • Better type safety
  • Enhanced parallelism
  • More robust error handling

Implementation Status

Completed Components

  • ✅ Main entry point (bin/prompt-tester)
  • ✅ YAML parser with yq + bash fallback
  • ✅ JSON parser (jq)
  • ✅ Unified test case loader
  • ✅ Console reporter
  • ✅ JSON reporter
  • ✅ JUnit reporter
  • ✅ Substring assertions
  • ✅ Regex assertions
  • ✅ Metric validation

Pending Components (from TODO.md)

  • ⏳ Execute test case runner
  • ⏳ Test result aggregator
  • ⏳ Parallel execution
  • ⏳ Equality assertions
  • ⏳ JSON structure validation
  • ⏳ Metrics calculation
  • ⏳ Full test suite integration

Key Design Principles

  1. Modularity: Clear separation of concerns across layers
  2. Extensibility: Easy to add new assertion types, reporters, or metrics
  3. Format Flexibility: Support both YAML and JSON test case formats
  4. CI/CD Friendly: JUnit output, exit codes, structured logging
  5. Migration Path: Bash implementation designed to inform Go version

Technology Stack

  • Language: Bash 4.0+
  • Parsing: yq (YAML), jq (JSON)
  • Execution: External LLM API calls
  • Reporting: Console (colors), JSON, JUnit XML

Integration Points

  • LLM APIs: Called from execute_test_case.sh (external dependency)
  • CI/CD: JUnit reporter for integration with Jenkins, GitHub Actions, etc.
  • Custom Tooling: JSON reporter for custom dashboards/analysis
  • Test Data: testdata/ directory for sample test cases