Evaluator Contribution

1. Prepare

To set up the development environment, see Installation.

2. Run Evaluator

rush dev:eval

3. Evaluator Guide

For the main structure of web-bench, see Evaluator. The evaluator source is organized as follows:

├─ tools
│ ├─ evaluator
│ │ ├─ src
│ │ │ ├─ ignore
│ │ │ ├─ log
│ │ │ ├─ parser
│ │ │ ├─ plugins
│ │ │ │ ├─ evaluator-runner.ts
│ │ │ │ ├─ project-runner.ts
│ │ │ │ ├─ task-runner
│ │ │ ├─ runner
│ │ │ ├─ settings
│ │ │ ├─ utils

runner:
- evaluator-runner: The evaluation entry point. It runs m × n project-runners (m: number of projects, n: number of models).
- project-runner: Processes tasks in sequence and terminates once the retry limit (2 attempts) is reached. Corresponds to Evaluator-Workflow Step 2 and Step 9.
- task-runner: Calls the agent, rewrites files, initializes the environment, builds files, runs tests, and retries. Corresponds to Evaluator-Workflow Step 1 and Steps 3-8.
plugins:
- Each step of the Evaluation Workflow is injected as a plugin. This directory contains both the plugin scheduling and the concrete implementation of each step plugin.
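
The relationship between the three runners and the step plugins can be sketched as follows. This is only an illustration under stated assumptions, not the actual web-bench implementation: every type and function name here (StepPlugin, runTask, runProject, runEvaluation) is hypothetical, and the sketch assumes that the retry limit of 2 attempts applies per task and that a project stops at the first task that exhausts it.

```typescript
// Hypothetical sketch: none of these names are taken from the web-bench source.
type TaskResult = { task: string; passed: boolean; attempts: number };
type Project = { name: string; tasks: string[] };

// A plugin supplies the implementation of one workflow step.
interface StepPlugin {
  name: string;
  // Resolves to false when the step fails for this attempt.
  run(task: string, attempt: number): Promise<boolean>;
}

const RETRY_LIMIT = 2; // assumed to mean 2 attempts per task

// task-runner: run each step plugin in order; retry the whole task on failure.
async function runTask(task: string, plugins: StepPlugin[]): Promise<TaskResult> {
  for (let attempt = 1; attempt <= RETRY_LIMIT; attempt++) {
    let ok = true;
    for (const plugin of plugins) {
      if (!(await plugin.run(task, attempt))) {
        ok = false;
        break;
      }
    }
    if (ok) return { task, passed: true, attempts: attempt };
  }
  return { task, passed: false, attempts: RETRY_LIMIT };
}

// project-runner: process tasks in sequence, stop once a task exhausts its retries.
async function runProject(tasks: string[], plugins: StepPlugin[]): Promise<TaskResult[]> {
  const results: TaskResult[] = [];
  for (const task of tasks) {
    const result = await runTask(task, plugins);
    results.push(result);
    if (!result.passed) break;
  }
  return results;
}

// evaluator-runner: one project run per (project, model) pair, m * n runs in total.
async function runEvaluation(
  projects: Project[],
  models: string[],
  pluginsFor: (model: string) => StepPlugin[],
): Promise<Map<string, TaskResult[]>> {
  const report = new Map<string, TaskResult[]>();
  for (const project of projects) {
    for (const model of models) {
      const results = await runProject(project.tasks, pluginsFor(model));
      report.set(`${project.name}::${model}`, results);
    }
  }
  return report;
}
```

Keeping the steps behind a small plugin interface means the task loop does not need to know what a step does, only whether it succeeded, which is one plausible reason for injecting workflow steps as plugins.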

4. Test

Execute the following command to run evaluations and view the results in apps/eval/report:

rush eval

5. Tips

In the development environment, configure parameters in apps/eval/src/config.json5:

logLevel: 'debug' prints more detailed log output.
projects: ['@web-bench/xxxx'] evaluates only the listed projects instead of all projects.

For more details, see Config Parameters.
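
As a minimal sketch, the two settings above might be combined in apps/eval/src/config.json5 like this. The project name is the placeholder used above, and any other fields the file may require are omitted here:

```json5
{
  // Print more detailed logs while developing.
  logLevel: 'debug',
  // Evaluate only the listed projects instead of all of them.
  projects: ['@web-bench/xxxx'],
}
```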

