[kbn-evals] Improve @kbn/evals CLI for running evals locally#254855
Open
spong wants to merge 7 commits into elastic:main
💛 Build succeeded, but was flaky
cc @spong
Summary
While we made strides to improve the local eval running experience in #245064, the ergonomics were still no bueno. It'd take setting multiple ENV variables, running three different commands in different terminal sessions and some manual connector management. Well, this fixes all that 🙂
Also added two new agent skills under `.agents/skills/` to guide AI agents through the LLM evaluation workflow:
- `evals-create-suite` -- Scaffolding skill for creating new eval suite packages from scratch. Covers the full boilerplate: `kibana.jsonc`, `package.json`, `tsconfig.json`, `playwright.config.ts`, `src/evaluate.ts`, suite registration in `evals.suites.json`, and a decision tree for when to extend the base `evaluate` fixture vs use it directly. Includes real-world reference examples from `llm-tasks`, `agent-builder`, and `security-solution-evals`.
- `evals-write-spec` -- Authoring skill for writing eval spec files. Covers spec file anatomy (`evaluate.describe` / `evaluate` / `beforeAll` / `afterAll`), Scout tags, dataset structure, task functions, all evaluator types (CODE, LLM-as-judge criteria, correctness, groundedness, trace-based, RAG), the `evaluateDataset` helper pattern, available fixtures, setup/teardown patterns, and local running with `--grep` / `--model` / `--judge`. Includes a `references/evaluator-patterns.md` with extended real-world evaluator examples extracted from existing suites.

new_evals_cli.mov
How to test
Set up connectors (or export KIBANA_TESTING_AI_CONNECTORS manually)
Start the interactive runner
Run a quick eval with grep filter
Check services are running
Stop when done
New commands
- `init`: … (`kibana.dev.yml` preconfigured connectors)
- `start`
- `stop`
- `logs`
- `scout`: … `node scripts/scout.js start-server` with evals defaults

Key improvements
- … `start` runs, cutting iteration time significantly. Service state is tracked in `target/evals/services.json`.
- … `KIBANA_TESTING_AI_CONNECTORS` and `kibana.dev.yml`.
- … `--model` for `--project`, `--judge` for `--evaluation-connector-id` for clarity.
- `--grep` support: Filter which tests run within a suite (passes through to Playwright), enabling fast iteration on specific test cases.
- … `kibana.dev.yml` defaults), and `TRACING_ES_URL` is automatically set so trace-based evaluators query the correct cluster. `EVALUATIONS_ES_URL` is also auto-configured so eval results are visible in local Kibana.
- `waitForScoutReady` now probes ES and Kibana HTTP endpoints (not just config file existence) to avoid premature EIS CCM enablement.
- … `pgrep` with `ps ax` for cross-platform compatibility.
- … `KIBANA_TESTING_AI_CONNECTORS` has changed since it was launched.
- `doctor`: Structured check results with pass/fail/warn status, interactive auto-fix prompts, and `--fix` flag.
- … `CLI.md` with full command reference, flag tables, and usage tips.

PR developed with cursor-cli + Claude-4.6-opus-high
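To illustrate the flag aliasing mentioned above (`--model` for `--project`, `--judge` for `--evaluation-connector-id`), here is a minimal sketch of how alias flags could be rewritten to their canonical names before being passed on. It is not the code from this PR: `FLAG_ALIASES`, `normalizeFlags`, and the connector IDs in the example are hypothetical names chosen for illustration, and the real CLI may handle its arguments differently.

```typescript
// Hypothetical alias table. The PR description only states that --model stands
// in for --project and --judge for --evaluation-connector-id.
const FLAG_ALIASES = new Map<string, string>([
  ['--model', '--project'],
  ['--judge', '--evaluation-connector-id'],
]);

// Rewrites alias flags to their canonical names. Handles both the
// "--flag value" and "--flag=value" forms; anything else passes through as-is.
function normalizeFlags(argv: string[]): string[] {
  return argv.map((arg) => {
    const eq = arg.indexOf('=');
    const name = eq === -1 ? arg : arg.slice(0, eq);
    const rest = eq === -1 ? '' : arg.slice(eq);
    const canonical = FLAG_ALIASES.get(name);
    return canonical === undefined ? arg : canonical + rest;
  });
}

// Example with made-up connector IDs.
const normalized = normalizeFlags([
  '--model',
  'my-connector',
  '--judge=my-judge',
  '--grep',
  'smoke',
]);
console.log(normalized.join(' '));
```

Keeping the aliases in a single table means the friendlier names can be documented and tested in one place, while flags such as `--grep` pass through untouched.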