[kbn-evals] Improve `@kbn/evals` CLI for running evals locally by spong · Pull Request #254855 · elastic/kibana

spong · 2026-02-25T07:53:08Z

Summary

While we made strides to improve the local eval running experience in #245064, the ergonomics were still no bueno. Running evals took setting multiple environment variables, running three different commands in separate terminal sessions, and some manual connector management. Well, this fixes all that 🙂

Also added two new agent skills under .agents/skills/ to guide AI agents through the LLM evaluation workflow:

evals-create-suite -- Scaffolding skill for creating new eval suite packages from scratch. Covers the full boilerplate: kibana.jsonc, package.json, tsconfig.json, playwright.config.ts, src/evaluate.ts, suite registration in evals.suites.json, and a decision tree for when to extend the base evaluate fixture vs use it directly. Includes real-world reference examples from llm-tasks, agent-builder, and security-solution-evals.
evals-write-spec -- Authoring skill for writing eval spec files. Covers spec file anatomy (evaluate.describe/evaluate/beforeAll/afterAll), Scout tags, dataset structure, task functions, all evaluator types (CODE, LLM-as-judge criteria, correctness, groundedness, trace-based, RAG), the evaluateDataset helper pattern, available fixtures, setup/teardown patterns, and local running with --grep/--model/--judge. Includes a references/evaluator-patterns.md with extended real-world evaluator examples extracted from existing suites.

new_evals_cli.mov

How to test

Set up connectors (or export KIBANA_TESTING_AI_CONNECTORS manually)

node scripts/evals init

Start the interactive runner

node scripts/evals start

Run a quick eval with grep filter

node scripts/evals start --suite agent-builder --grep "product documentation"

Check services are running

node scripts/evals logs

Stop when done

node scripts/evals stop

New commands

| Command | Description |
| --- | --- |
| `init` | Interactive wizard for connector setup (EIS via Vault or `kibana.dev.yml` preconfigured connectors) |
| `start` | Orchestrates EDOT + Scout as persistent background daemons, enables EIS CCM, runs Playwright eval suite |
| `stop` | Gracefully shuts down backgrounded EDOT and Scout services |
| `logs` | Tails log output from background services |
| `scout` | Convenience wrapper for `node scripts/scout.js start-server` with evals defaults |
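The `start` and `stop` rows rely on services that keep running after the CLI exits. As a rough illustration of that pattern in Node (not the code from this PR), the sketch below spawns a detached process and returns a record that could be persisted to a state file. The `ServiceRecord` shape and the helper names `startDetached` and `isRunning` are hypothetical.

```typescript
import { spawn } from 'node:child_process';

// Hypothetical shape of one entry in a service state file (assumption, not the PR's schema).
interface ServiceRecord {
  name: string;
  pid: number;
  startedAt: string;
}

// Spawn a command so that it outlives the current CLI invocation.
function startDetached(name: string, command: string, args: string[]): ServiceRecord {
  const child = spawn(command, args, {
    detached: true, // run the child in its own process group/session
    stdio: 'ignore', // do not keep the parent's stdio handles open
  });
  // Avoid an unhandled 'error' event if the command cannot be started.
  child.on('error', (err) => console.error(`[${name}] failed to start:`, err.message));
  // Let the parent exit without waiting for the child.
  child.unref();
  if (child.pid === undefined) {
    throw new Error(`could not start service "${name}"`);
  }
  return { name, pid: child.pid, startedAt: new Date().toISOString() };
}

// Signal 0 performs an existence check without actually sending a signal.
function isRunning(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === 'EPERM';
  }
}
```

A `stop`-style command could then read the stored record, check `isRunning(record.pid)`, and send `SIGTERM` with `process.kill(record.pid, 'SIGTERM')`.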

Key improvements

Persistent daemon services: EDOT and Scout run as detached background processes that survive between start runs, cutting iteration time significantly. Service state is tracked in target/evals/services.json.
Interactive prompts: When flags are omitted in a TTY, the CLI prompts for suite, judge connector, and model selection -- merging connectors from both KIBANA_TESTING_AI_CONNECTORS and kibana.dev.yml.
Flag aliases: --model is now an alias for --project, and --judge is an alias for --evaluation-connector-id, for clarity.
--grep support: Filter which tests run within a suite (passes through to Playwright), enabling fast iteration on specific test cases.
Trace routing: EDOT exports traces to the user's local ES (from kibana.dev.yml defaults), and TRACING_ES_URL is automatically set so trace-based evaluators query the correct cluster. EVALUATIONS_ES_URL is also auto-configured so eval results are visible in local Kibana.
Robust readiness checks: waitForScoutReady now probes ES and Kibana HTTP endpoints (not just config file existence) to avoid premature EIS CCM enablement.
Portable process detection: Replaced pgrep with ps ax for cross-platform compatibility.
Stale service detection: Scout is automatically restarted if KIBANA_TESTING_AI_CONNECTORS has changed since it was launched.
Enhanced doctor: Structured check results with pass/fail/warn status, interactive auto-fix prompts, and --fix flag.
CLI reference doc: New CLI.md with full command reference, flag tables, and usage tips.
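To make the stale service detection item concrete, here is a minimal sketch of one way such a check could work: store a fingerprint of `KIBANA_TESTING_AI_CONNECTORS` when Scout is launched and compare it on the next `start`. This is an assumption about the approach, not the PR's implementation. The `ScoutState` shape and the `fingerprint` and `isStale` helpers are hypothetical.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical state recorded at launch time (assumption, not the PR's schema).
interface ScoutState {
  pid: number;
  connectorsHash: string;
}

// Hash the env value so the state file never stores connector secrets in plain text.
function fingerprint(value: string | undefined): string {
  return createHash('sha256').update(value ?? '').digest('hex');
}

// A running service is stale when the connectors it was launched with
// no longer match the current environment.
function isStale(state: ScoutState | undefined, currentConnectors: string | undefined): boolean {
  if (state === undefined) {
    return false; // nothing recorded, so there is nothing to restart
  }
  return state.connectorsHash !== fingerprint(currentConnectors);
}

// Usage: decide whether to restart before running a suite.
const recorded: ScoutState = { pid: 12345, connectorsHash: fingerprint('{"my-connector":{}}') };
const restart = isStale(recorded, process.env.KIBANA_TESTING_AI_CONNECTORS);
console.log(restart ? 'restart Scout' : 'reuse running Scout');
```

Comparing hashes rather than raw values keeps the state file small and avoids writing credentials to disk.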

PR developed with cursor-cli + Claude-4.6-opus-high

patrykkopycinski

😍

elasticmachine · 2026-02-25T09:30:46Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 3e1ff51

Failed CI Steps

Jest Tests #8

Test Failures

[job] [logs] Jest Tests #8 / When a Console command is entered by the user should clear the command output history when clear is entered

Metrics [docs]

✅ unchanged

History

💔 Build #400488 failed 4745868

cc @spong

spong added 2 commits February 25, 2026 00:34

Improve evals CLI

4705359

Fix the doctor fixing things

658f6f4

spong requested review from SrdjanLL, abhi-elastic and kderusso February 25, 2026 07:53

spong self-assigned this Feb 25, 2026

spong requested a review from a team as a code owner February 25, 2026 07:53

spong added the release_note:skip, backport:skip, and v9.4.0 labels Feb 25, 2026

Add skills for creating suites and writing specs

4745868

spong requested a review from a team as a code owner February 25, 2026 07:58

patrykkopycinski approved these changes Feb 25, 2026


Remove unused import

3e1ff51

spong and others added 3 commits February 25, 2026 16:02

Update skill WRT scout config generation and location

536ff3a

Merge branch 'main' of github.com:elastic/kibana into evals-cli

4b11e2d

Merge branch 'main' into evals-cli

eeb166d

spong wants to merge 7 commits into elastic:main from spong:evals-cli

3 participants
