
feat: Duck Regression Test - CI for LLM behavior drift detection #43

@nesquikm


πŸ¦† Duck Enhancement Proposal

πŸ’‘ The Problem

Silent LLM API updates are a real risk. One day your prompt works perfectly; the next day it rambles or refuses innocuous requests. No existing MCP server offers "CI for LLM behavior".

πŸš€ Proposed Solution

```js
// Store a test case
duck_regression_add({
  name: "code_review_format",
  prompt: "Review this function: function add(a,b) { return a+b }",
  provider: "openai",
  expected_behavior: {
    contains: ["return type", "parameter types"],
    not_contains: ["error", "cannot"],
    max_length: 500,
    sentiment: "constructive"
  }
})

// Run regression tests
duck_regression_run({
  suite: "code_review",  // or "all"
  threshold: 0.8  // 80% similarity to baseline
})

// Returns
{
  passed: 4,
  failed: 1,
  drifted: [
    {
      name: "code_review_format",
      baseline_date: "2025-01-15",
      similarity: 0.62,
      changes: ["Now includes emoji", "Missing type suggestions"],
      recommendation: "Update baseline or investigate model change"
    }
  ]
}
```
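The `expected_behavior` rules above could be evaluated with a small rule checker. This is a minimal sketch, assuming a shape like the proposed tool input; `ExpectedBehavior` and `checkBehavior` are illustrative names, not the actual tool API, and the `sentiment` field is omitted since it would need a classifier:

```typescript
// Hypothetical rule-based half of duck_regression_run:
// returns a list of human-readable failures (empty = pass).
interface ExpectedBehavior {
  contains?: string[];      // phrases the response must include
  not_contains?: string[];  // phrases the response must avoid
  max_length?: number;      // upper bound on response length (chars)
}

function checkBehavior(response: string, expected: ExpectedBehavior): string[] {
  const failures: string[] = [];
  const lower = response.toLowerCase();
  for (const phrase of expected.contains ?? []) {
    if (!lower.includes(phrase.toLowerCase())) {
      failures.push(`missing expected phrase: "${phrase}"`);
    }
  }
  for (const phrase of expected.not_contains ?? []) {
    if (lower.includes(phrase.toLowerCase())) {
      failures.push(`found forbidden phrase: "${phrase}"`);
    }
  }
  if (expected.max_length !== undefined && response.length > expected.max_length) {
    failures.push(`response length ${response.length} exceeds ${expected.max_length}`);
  }
  return failures;
}
```

Collecting failures rather than returning a boolean lets the run report explain *why* a case drifted, which feeds the `changes` array in the example result above.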

πŸ¦† Duck Use Cases

  • Detect when provider silently updates their model
  • Ensure prompts still work after config changes
  • CI/CD integration for prompt engineering

πŸ“‹ Implementation

  1. src/services/regression.ts - Test storage and comparison
  2. src/tools/duck-regression.ts - add/run/list/baseline tools
  3. Storage: JSON file in ~/.mcp-rubber-duck/regression/
  4. Comparison: Semantic similarity + rule-based checks
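As a stand-in for the semantic-similarity score in step 4, here is a simple token-level Jaccard overlap between the baseline response and a new response. A real implementation would likely use embeddings; this only illustrates how a 0..1 score can be compared against the `threshold` parameter of `duck_regression_run` (both function names here are hypothetical):

```typescript
// Token-overlap (Jaccard) similarity: |A ∩ B| / |A ∪ B| over word sets.
function jaccardSimilarity(baseline: string, current: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const a = tokens(baseline);
  const b = tokens(current);
  let overlap = 0;
  for (const t of a) {
    if (b.has(t)) overlap++;
  }
  const union = a.size + b.size - overlap;
  return union === 0 ? 1 : overlap / union;
}

// A case is flagged as drifted when similarity falls below the threshold.
function hasDrifted(baseline: string, current: string, threshold = 0.8): boolean {
  return jaccardSimilarity(baseline, current) < threshold;
}
```

Jaccard is cheap and dependency-free, but it misses paraphrases (same meaning, different words), which is why step 4 pairs the similarity score with the rule-based checks.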

🌟 Research Backing
