Name	Name	Last commit message	Last commit date
parent directory ..
examples	examples
src	src
ui	ui
.gitignore	.gitignore
.oxfmtrc.json	.oxfmtrc.json
.oxlintrc.json	.oxlintrc.json
LICENSE	LICENSE
README.md	README.md
package-lock.json	package-lock.json
package.json	package.json
tsconfig.json	tsconfig.json

WebMCP Evals

A TypeScript framework for evaluating the tool-calling capabilities of Large Language Models (LLMs). This project allows you to define test cases (evals) and schemas to verify if an interactive agent correctly calls tools based on user inputs.

Features

Backends: Execution support for @google/genai (gemini). Integrations for local models (ollama), as well as full Vercel AI SDK (vercel) integration for robust multi-provider support (OpenAI, Anthropic) are available but currently considered experimental.
Evaluation Loop: Automatically runs defined test cases against the model and compares actual tool calls with expected ones.
Extensible: Easy to add new backends or evaluation sets.

Architecture

The project is structured as follows:

src/: Source code.
- bin/: CLI command entry points.
  - runevals.ts: Runs the evaluation loop using static tool schemas from a JSON file.
  - webmcpevals.ts: Runs the evaluation loop by fetching tool schemas live from a browser page via the WebMCP API.
  - serve.ts: Starts the WebMCP Evals Web UI sidecar.
- evaluator/: The core evaluation engine.
  - index.ts: Orchestrates the evaluation execution loops.
  - models.ts: LLM instantiation mappings (Gemini, OpenAI, Anthropic, Ollama).
  - browser.ts: Browser automation (Puppeteer) for testing and discovering WebMCP tools.
  - mappers.ts & prompts.ts: Data standardization utilities and system prompts.
- server/: Express backend for the sidecar Web UI.
- report/: Generates HTML execution reports.
- types/: TypeScript definitions for configurations, evaluations, and tool schemas.
- matcher.ts: Evaluates the correctness of AI-generated tool calls against expected constraints.
examples/: Detailed examples and test data.
- travel/: A travel agent example containing schema.json and evals.json.

Prerequisites

Node.js (v18+ recommended)
Appropriate API Keys (Google GenAI, OpenAI, or Anthropic, depending on the backend used)
Chrome Canary 146+ with the #enable-webmcp-testing flag enabled (for webmcpevals only)

Setup

Install Dependencies
```
npm install
```
Note: This will also automatically install dependencies for the ui sub-project via a postinstall script.

Configure Environment

Create a .env file in the root directory and add any required configuration or API keys:

GOOGLE_AI=your_gemini_api_key
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
# OLLAMA_HOST=http://localhost:11434 (if using a remote or non-standard port Ollama)

Build the Project

Compile the TypeScript code to JavaScript:
```
npm run build
```

Usage

`runevals` — file-based tool schemas

Loads tool schemas from a local JSON file.

node dist/bin/runevals.js --tools=examples/travel/schema.json --evals=examples/travel/evals.json

With Ollama:

node dist/bin/runevals.js --model=qwen3:8b --backend=ollama --tools=examples/travel/schema.json --evals=examples/travel/evals.json

Argument	Required	Default	Description
`--tools`	Yes	—	Path to the tool schema JSON file
`--evals`	Yes	—	Path to the evals JSON file
`--backend`	No	`gemini`	Execution backend (`gemini`, `ollama`, or `vercel`)
`--provider`	No	`google`	Model provider (e.g., `openai`, `anthropic`, `google`) if using `vercel`
`--model`	No	`gemini-2.5-flash`	Model name

`webmcpevals` — live tool schemas via WebMCP

Launches Chrome Canary, navigates to the given URL, and retrieves tool schemas live from the page via navigator.modelContextTesting.listTools(). Requires Chrome Canary 146+ with the chrome://flags/#enable-webmcp-testing flag enabled.

node dist/bin/webmcpevals.js --url=https://example.com/my-webmcp-app --evals=examples/travel/evals.json

Argument	Required	Default	Description
`--url`	Yes	—	URL of the page exposing WebMCP tools
`--evals`	Yes	—	Path to the evals JSON file
`--backend`	No	`vercel`	Must be `vercel` (live browser evaluation requires `ToolLoopAgent`)
`--provider`	No	`gemini`	Model provider to use with Vercel (e.g., `openai`, `anthropic`)
`--model`	No	`gemini-2.5-flash`	Model name

`serve` — WebMCP Evals UI sidecar

Starts a local web server to provide a visual interface for configuring and running evaluations.

node dist/bin/serve.js --port=8080

Argument	Required	Default	Description
`--port`	No	`8080`	Port to run the server on

Expected Call Evaluation

By default, an array of expected tool calls in expectedCall are checked in the exact sequential order they appear (an implicit ordered list).

You can use explicit ordered and unordered wrappers to validate trajectories that may occur in a different, non-sequential order, and nest them as deeply as needed.

Complex Orderings Example

{
  "expectedCall": [
    { "functionName": "login" },
    {
      "unordered": [
        { "functionName": "set_pizza_style", "arguments": { "style": "Pesto" } },
        { "functionName": "set_pizza_size", "arguments": { "size": "Small" } }
      ]
    },
    { "functionName": "checkout" }
  ]
}

ordered: Validates that all nested models must be executed in exactly the sequential order provided.
unordered: Validates that all nested models must be executed, but they can arrive in any order.

Argument Constraints

You can use constraint operators to match argument values flexibly. A constraint object is identified when all its keys start with $.

Supported Operators

Operator	Description	Example
`$pattern`	Regex match	`{"$pattern": "^2026-\\d{2}$"}`
`$contains`	Substring match	`{"$contains": "York"}`
`$gt`, `$gte`	Greater than (or equal)	`{"$gte": 1}`
`$lt`, `$lte`	Less than (or equal)	`{"$lt": 100}`
`$type`	Type check	`{"$type": "string"}`
`$any`	Presence check	`{"$any": true}`

Example

{
  "expectedCall": {
    "functionName": "searchFlights",
    "arguments": {
      "destination": "NYC",
      "outboundDate": { "$pattern": "^2026-01-\\d{2}$" },
      "passengers": { "$gte": 1 },
      "preferences": { "$any": true }
    }
  }
}

License

Apache-2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

WebMCP Evals

Features

Architecture

Prerequisites

Setup

Usage

`runevals` — file-based tool schemas

`webmcpevals` — live tool schemas via WebMCP

`serve` — WebMCP Evals UI sidecar

Expected Call Evaluation

Complex Orderings Example

Argument Constraints

Supported Operators

Example

License

FilesExpand file tree

evals-cli

Directory actions

More options

Directory actions

More options

Latest commit

History

evals-cli

Folders and files

parent directory

README.md

WebMCP Evals

Features

Architecture

Prerequisites

Setup

Usage

runevals — file-based tool schemas

webmcpevals — live tool schemas via WebMCP

serve — WebMCP Evals UI sidecar

Expected Call Evaluation

Complex Orderings Example

Argument Constraints

Supported Operators

Example

License

`runevals` — file-based tool schemas

`webmcpevals` — live tool schemas via WebMCP

`serve` — WebMCP Evals UI sidecar