
[kbn-evals] Improve @kbn/evals CLI for running evals locally #254855

Open
spong wants to merge 7 commits into elastic:main from spong:evals-cli


@spong spong commented Feb 25, 2026

Summary

While #245064 improved the experience of running evals locally, the ergonomics were still no bueno. Running a suite meant setting multiple environment variables, running three different commands in separate terminal sessions, and managing connectors by hand. Well, this fixes all that 🙂

Also added two new agent skills under .agents/skills/ to guide AI agents through the LLM evaluation workflow:

  • evals-create-suite -- Scaffolding skill for creating new eval suite packages from scratch. Covers the full boilerplate: kibana.jsonc, package.json, tsconfig.json, playwright.config.ts, src/evaluate.ts, suite registration in evals.suites.json, and a decision tree for when to extend the base evaluate fixture vs use it directly. Includes real-world reference examples from llm-tasks, agent-builder, and security-solution-evals.

  • evals-write-spec -- Authoring skill for writing eval spec files. Covers spec file anatomy (evaluate.describe/evaluate/beforeAll/afterAll), Scout tags, dataset structure, task functions, all evaluator types (CODE, LLM-as-judge criteria, correctness, groundedness, trace-based, RAG), the evaluateDataset helper pattern, available fixtures, setup/teardown patterns, and local running with --grep/--model/--judge. Includes a references/evaluator-patterns.md with extended real-world evaluator examples extracted from existing suites.

new_evals_cli.mov

How to test

Set up connectors (or export KIBANA_TESTING_AI_CONNECTORS manually)

node scripts/evals init

Start the interactive runner

node scripts/evals start

Run a quick eval with grep filter

node scripts/evals start --suite agent-builder --grep "product documentation"

Check services are running

node scripts/evals logs

Stop when done

node scripts/evals stop

New commands

| Command | Description |
| --- | --- |
| init | Interactive wizard for connector setup (EIS via Vault or kibana.dev.yml preconfigured connectors) |
| start | Orchestrates EDOT + Scout as persistent background daemons, enables EIS CCM, runs Playwright eval suite |
| stop | Gracefully shuts down backgrounded EDOT and Scout services |
| logs | Tails log output from background services |
| scout | Convenience wrapper for node scripts/scout.js start-server with evals defaults |
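For reviewers who want a mental model of how start and stop can keep services alive between runs, here is a minimal sketch of the detached-daemon pattern. It is not the code in this PR: the state-file shape and the names ensureService, stopService, and isAlive are hypothetical, and only Node's built-in child_process and fs modules are used.

```typescript
import { spawn } from 'node:child_process';
import { closeSync, existsSync, mkdirSync, openSync, readFileSync, writeFileSync } from 'node:fs';
import { dirname } from 'node:path';

// Hypothetical state-file shape; the real services.json may differ.
interface ServiceState {
  pid: number;
  logFile: string;
  startedAt: string;
}
type ServicesFile = Record<string, ServiceState>;

function readState(stateFile: string): ServicesFile {
  return existsSync(stateFile) ? JSON.parse(readFileSync(stateFile, 'utf8')) : {};
}

function writeState(stateFile: string, state: ServicesFile): void {
  mkdirSync(dirname(stateFile), { recursive: true });
  writeFileSync(stateFile, JSON.stringify(state, null, 2));
}

// Signal 0 checks that a process exists without delivering a signal.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === 'EPERM';
  }
}

// Reuse the recorded service if it is still running; otherwise start it detached.
function ensureService(
  stateFile: string,
  name: string,
  command: string,
  args: string[],
  logFile: string
): ServiceState {
  const state = readState(stateFile);
  const existing = state[name];
  if (existing && isAlive(existing.pid)) return existing;

  mkdirSync(dirname(logFile), { recursive: true });
  const out = openSync(logFile, 'a');
  const child = spawn(command, args, { detached: true, stdio: ['ignore', out, out] });
  child.on('error', (err) => console.error(`${name} failed to start: ${err.message}`));
  closeSync(out); // the child holds its own copy of the descriptor
  if (child.pid === undefined) throw new Error(`Could not spawn ${name}`);
  child.unref(); // let the CLI exit while the service keeps running

  const entry: ServiceState = { pid: child.pid, logFile, startedAt: new Date().toISOString() };
  writeState(stateFile, { ...state, [name]: entry });
  return entry;
}

// Send SIGTERM to the recorded process (if alive) and forget it.
function stopService(stateFile: string, name: string): boolean {
  const state = readState(stateFile);
  const existing = state[name];
  if (!existing) return false;
  if (isAlive(existing.pid)) process.kill(existing.pid, 'SIGTERM');
  delete state[name];
  writeState(stateFile, state);
  return true;
}
```

One caveat with this approach: a bare PID check can be fooled by PID reuse after a reboot, so a real implementation would likely also confirm that the process is the expected one.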

Key improvements

  • Persistent daemon services: EDOT and Scout run as detached background processes that survive between start runs, cutting iteration time significantly. Service state is tracked in target/evals/services.json.
  • Interactive prompts: When flags are omitted in a TTY, the CLI prompts for suite, judge connector, and model selection -- merging connectors from both KIBANA_TESTING_AI_CONNECTORS and kibana.dev.yml.
  • Flag aliases: --model is an alias for --project, and --judge is an alias for --evaluation-connector-id, for clarity.
  • --grep support: Filter which tests run within a suite (passes through to Playwright), enabling fast iteration on specific test cases.
  • Trace routing: EDOT exports traces to the user's local ES (from kibana.dev.yml defaults), and TRACING_ES_URL is automatically set so trace-based evaluators query the correct cluster. EVALUATIONS_ES_URL is also auto-configured so eval results are visible in local Kibana.
  • Robust readiness checks: waitForScoutReady now probes ES and Kibana HTTP endpoints (not just config file existence) to avoid premature EIS CCM enablement.
  • Portable process detection: Replaced pgrep with ps ax for cross-platform compatibility.
  • Stale service detection: Scout is automatically restarted if KIBANA_TESTING_AI_CONNECTORS has changed since it was launched.
  • Enhanced doctor: Structured check results with pass/fail/warn status, interactive auto-fix prompts, and --fix flag.
  • CLI reference doc: New CLI.md with full command reference, flag tables, and usage tips.
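To illustrate the readiness-check item above, here is a rough sketch of what probing HTTP endpoints (rather than checking for a config file) can look like. This is not the PR's waitForScoutReady; the function name waitForAllReady, its options, and the rule that any status below 500 counts as ready are assumptions made for the example. It relies on the global fetch and AbortSignal.timeout available in recent Node versions.

```typescript
type Probe = (url: string) => Promise<boolean>;

// Default probe (assumption): a response below 500 means the server is serving
// requests, even if it answers 401. Connection errors and 5xx count as not ready.
const httpProbe: Probe = async (url) => {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(2_000) });
    return res.status < 500;
  } catch {
    return false;
  }
};

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface WaitOptions {
  timeoutMs?: number;
  intervalMs?: number;
  probe?: Probe; // injectable so the loop can be tested without a network
}

// Poll every URL until all of them report ready, or throw once the deadline passes.
async function waitForAllReady(urls: string[], opts: WaitOptions = {}): Promise<void> {
  const { timeoutMs = 60_000, intervalMs = 1_000, probe = httpProbe } = opts;
  const deadline = Date.now() + timeoutMs;
  let pending = [...urls];

  while (true) {
    const results = await Promise.all(pending.map((url) => probe(url)));
    pending = pending.filter((_, i) => !results[i]);
    if (pending.length === 0) return;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out waiting for: ${pending.join(', ')}`);
    }
    await sleep(intervalMs);
  }
}
```

A caller would pass the ES and Kibana URLs for its own setup, for example http://localhost:9200 and http://localhost:5601/api/status (illustrative values, not taken from this PR), and only enable EIS CCM after the promise resolves.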

PR developed with cursor-cli + Claude-4.6-opus-high

@spong spong self-assigned this Feb 25, 2026
@spong spong requested a review from a team as a code owner February 25, 2026 07:53
@spong spong added the release_note:skip, backport:skip, and v9.4.0 labels Feb 25, 2026
@spong spong requested a review from a team as a code owner February 25, 2026 07:58

@patrykkopycinski patrykkopycinski left a comment


😍

@elasticmachine

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • Jest Tests #8 / When a Console command is entered by the user should clear the command output history when clear is entered

Metrics [docs]

✅ unchanged

History

cc @spong

