Skip to content

Latest commit

 

History

History

README.md

WebMCP Evals

A TypeScript framework for evaluating the tool-calling capabilities of Large Language Models (LLMs). This project allows you to define test cases (evals) and schemas to verify if an interactive agent correctly calls tools based on user inputs.

Features

  • Backends: Execution support for @google/genai (gemini). Integrations for local models (ollama), as well as full Vercel AI SDK (vercel) integration for robust multi-provider support (OpenAI, Anthropic) are available but currently considered experimental.
  • Evaluation Loop: Automatically runs defined test cases against the model and compares actual tool calls with expected ones.
  • Extensible: Easy to add new backends or evaluation sets.

Architecture

The project is structured as follows:

  • src/: Source code.
    • bin/: CLI command entry points.
      • runevals.ts: Runs the evaluation loop using static tool schemas from a JSON file.
      • webmcpevals.ts: Runs the evaluation loop by fetching tool schemas live from a browser page via the WebMCP API.
      • serve.ts: Starts the WebMCP Evals Web UI sidecar.
    • evaluator/: The core evaluation engine.
      • index.ts: Orchestrates the evaluation execution loops.
      • models.ts: LLM instantiation mappings (Gemini, OpenAI, Anthropic, Ollama).
      • browser.ts: Browser automation (Puppeteer) for testing and discovering WebMCP tools.
      • mappers.ts & prompts.ts: Data standardization utilities and system prompts.
    • server/: Express backend for the sidecar Web UI.
    • report/: Generates HTML execution reports.
    • types/: TypeScript definitions for configurations, evaluations, and tool schemas.
    • matcher.ts: Evaluates the correctness of AI-generated tool calls against expected constraints.
  • examples/: Detailed examples and test data.
    • travel/: A travel agent example containing schema.json and evals.json.

Prerequisites

  • Node.js (v18+ recommended)
  • Appropriate API Keys (Google GenAI, OpenAI, or Anthropic, depending on the backend used)
  • Chrome Canary 146+ with the #enable-webmcp-testing flag enabled (for webmcpevals only)

Setup

  1. Install Dependencies

    npm install

    Note: This will also automatically install dependencies for the ui sub-project via a postinstall script.

  2. Configure Environment

    Create a .env file in the root directory and add any required configuration or API keys:

    GOOGLE_AI=your_gemini_api_key
    OPENAI_API_KEY=your_openai_api_key
    ANTHROPIC_API_KEY=your_anthropic_api_key
    # OLLAMA_HOST=http://localhost:11434 (if using a remote or non-standard port Ollama)
  3. Build the Project

    Compile the TypeScript code to JavaScript:

    npm run build

Usage

runevals — file-based tool schemas

Loads tool schemas from a local JSON file.

node dist/bin/runevals.js --tools=examples/travel/schema.json --evals=examples/travel/evals.json

With Ollama:

node dist/bin/runevals.js --model=qwen3:8b --backend=ollama --tools=examples/travel/schema.json --evals=examples/travel/evals.json
Argument Required Default Description
--tools Yes Path to the tool schema JSON file
--evals Yes Path to the evals JSON file
--backend No gemini Execution backend (gemini, ollama, or vercel)
--provider No google Model provider (e.g., openai, anthropic, google) if using vercel
--model No gemini-2.5-flash Model name

webmcpevals — live tool schemas via WebMCP

Launches Chrome Canary, navigates to the given URL, and retrieves tool schemas live from the page via navigator.modelContextTesting.listTools(). Requires Chrome Canary 146+ with the chrome://flags/#enable-webmcp-testing flag enabled.

node dist/bin/webmcpevals.js --url=https://example.com/my-webmcp-app --evals=examples/travel/evals.json
Argument Required Default Description
--url Yes URL of the page exposing WebMCP tools
--evals Yes Path to the evals JSON file
--backend No vercel Must be vercel (live browser evaluation requires ToolLoopAgent)
--provider No gemini Model provider to use with Vercel (e.g., openai, anthropic)
--model No gemini-2.5-flash Model name

serve — WebMCP Evals UI sidecar

Starts a local web server to provide a visual interface for configuring and running evaluations.

node dist/bin/serve.js --port=8080
Argument Required Default Description
--port No 8080 Port to run the server on

Expected Call Evaluation

By default, an array of expected tool calls in expectedCall are checked in the exact sequential order they appear (an implicit ordered list).

You can use explicit ordered and unordered wrappers to validate trajectories that may occur in a different, non-sequential order, and nest them as deeply as needed.

Complex Orderings Example

{
  "expectedCall": [
    { "functionName": "login" },
    {
      "unordered": [
        { "functionName": "set_pizza_style", "arguments": { "style": "Pesto" } },
        { "functionName": "set_pizza_size", "arguments": { "size": "Small" } }
      ]
    },
    { "functionName": "checkout" }
  ]
}
  • ordered: Validates that all nested models must be executed in exactly the sequential order provided.
  • unordered: Validates that all nested models must be executed, but they can arrive in any order.

Argument Constraints

You can use constraint operators to match argument values flexibly. A constraint object is identified when all its keys start with $.

Supported Operators

Operator Description Example
$pattern Regex match {"$pattern": "^2026-\\d{2}$"}
$contains Substring match {"$contains": "York"}
$gt, $gte Greater than (or equal) {"$gte": 1}
$lt, $lte Less than (or equal) {"$lt": 100}
$type Type check {"$type": "string"}
$any Presence check {"$any": true}

Example

{
  "expectedCall": {
    "functionName": "searchFlights",
    "arguments": {
      "destination": "NYC",
      "outboundDate": { "$pattern": "^2026-01-\\d{2}$" },
      "passengers": { "$gte": 1 },
      "preferences": { "$any": true }
    }
  }
}

License

Apache-2.0