This project evaluates A2UI (v0.9) against various LLMs.
This version embeds the JSON schemas directly into the prompt and instructs the LLM to output a JSON object within a markdown code block. The framework then extracts and validates this JSON.
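The extraction step can be sketched as follows. This is an illustrative assumption, not the framework's actual extractor: it pulls the first fenced code block out of an LLM reply and parses it as JSON (validation against the embedded schemas would follow).

```typescript
// Sketch (an assumption, not the project's actual code): extract the first
// fenced code block from an LLM reply and parse its contents as JSON.
const FENCE = "`".repeat(3); // built programmatically to avoid literal backticks here

function extractJson(reply: string): unknown {
  // Match an optional "json" language tag, then capture up to the closing fence.
  const pattern = new RegExp(`${FENCE}(?:json)?\\s*\\n([\\s\\S]*?)${FENCE}`);
  const match = reply.match(pattern);
  if (!match) throw new Error("no fenced JSON block found in reply");
  return JSON.parse(match[1]);
}
```

A reply that fails to parse (or that contains no fenced block at all) would be counted as a validation failure.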
To use the models, you need to set the following environment variables with your API keys:
- `GEMINI_API_KEY`
- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
You can set these in a .env file in the root of the project, or in your shell's configuration file (e.g., .bashrc, .zshrc).
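A `.env` file using the variables above might look like this (placeholder values shown):

```
GEMINI_API_KEY=your-gemini-key
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
```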
A .env.example file is provided as a template:
```shell
cp .env.example .env
# Edit .env with your API keys (do not commit .env)
```

You also need to install dependencies before running:
```shell
pnpm install
```

To run the flow, use the following command:
```shell
pnpm run evalAll
```

You can run the script for a single model and data point by using the `--model` and `--prompt` command-line flags. This is useful for quick tests and debugging.
```shell
pnpm run eval --model=<model_name> --prompt=<prompt_name>
```

For example, to run the test with the `gemini-2.5-flash-lite` model and the `loginForm` prompt, use the following command:
```shell
pnpm run eval --model=gemini-2.5-flash-lite --prompt=loginForm
```

By default, the script prints a progress bar and the final summary table to the console. Detailed logs are written to `output.log` in the results directory.
The following command-line flags are available:

- `--log-level=<level>`: Sets the console logging level (default: `info`). Options: `error`, `warn`, `info`, `http`, `verbose`, `debug`, `silly`.
  - Note: The file log (`output.log` in the results directory) always captures `debug`-level logs regardless of this setting.
- `--results=<output_dir>`: (Default: `results/output-<model>`, or `results/output-combined` if multiple models are specified) Preserves output files. To specify a custom directory, use `--results=my_results`.
- `--clean-results`: If set, cleans the results directory before running tests.
- `--runs-per-prompt=<number>`: Number of times to run each prompt (default: 1).
- `--model=<model_name>`: (Default: all models) Run only the specified model(s). Can be specified multiple times.
- `--prompt=<prompt_name>`: (Default: all prompts) Run only the specified prompt.
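Since `--model` can be repeated, the runner has to collect every occurrence rather than keep only the last one. A minimal sketch of that behavior (an assumption for illustration, not the project's actual argument parsing):

```typescript
// Sketch: collect every occurrence of a repeatable --name=value flag from argv.
// This is illustrative; the real CLI parsing lives elsewhere in the project.
function collectFlag(argv: string[], name: string): string[] {
  const prefix = `--${name}=`;
  return argv
    .filter((arg) => arg.startsWith(prefix))
    .map((arg) => arg.slice(prefix.length));
}
```

An empty result would then fall back to the default (all models, or all prompts).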
Run with debug output in console:
```shell
pnpm run eval -- --log-level=debug
```

Run 5 times per prompt and clean previous results:

```shell
pnpm run eval -- --runs-per-prompt=5 --clean-results
```

The framework includes a two-tiered rate-limiting system:
- Proactive Limiting: Locally tracks token and request usage to stay within configured limits (defined in `src/models.ts`).
- Reactive Circuit Breaker: Automatically pauses requests to a model if a `RESOURCE_EXHAUSTED` (429) error is received, resuming only after the requested retry duration.
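The two tiers above can be sketched as a single per-model gate. This is a minimal illustration under assumed names and a one-minute window, not the actual implementation in `src/models.ts`:

```typescript
// Illustrative sketch of the two-tiered scheme; limits, names, and the
// 1-minute window are assumptions, not the project's real configuration.
interface Limits {
  requestsPerMinute: number;
  tokensPerMinute: number;
}

class ModelGate {
  private requests = 0;
  private tokens = 0;
  private windowStart: number;
  private pausedUntil = 0; // reactive circuit breaker state

  // `now` is injectable so the gate can be tested with a fake clock.
  constructor(private limits: Limits, private now: () => number = Date.now) {
    this.windowStart = this.now();
  }

  // Tier 1, proactive: refuse to send if the local window budget is spent.
  canSend(estimatedTokens: number): boolean {
    const t = this.now();
    if (t < this.pausedUntil) return false; // circuit is open
    if (t - this.windowStart >= 60_000) {
      // Roll the 1-minute accounting window.
      this.windowStart = t;
      this.requests = 0;
      this.tokens = 0;
    }
    return (
      this.requests + 1 <= this.limits.requestsPerMinute &&
      this.tokens + estimatedTokens <= this.limits.tokensPerMinute
    );
  }

  // Record usage after a request is actually sent.
  record(estimatedTokens: number): void {
    this.requests += 1;
    this.tokens += estimatedTokens;
  }

  // Tier 2, reactive: on a RESOURCE_EXHAUSTED (429) response, pause this
  // model for the retry duration the API requested.
  onRateLimited(retryAfterMs: number): void {
    this.pausedUntil = this.now() + retryAfterMs;
  }
}
```

The caller checks `canSend` before each request, calls `record` on success, and calls `onRateLimited` with the server-supplied retry delay whenever a 429 comes back.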