Commit 522e957

Add README for Agent Evals (#9356)
1 parent dde651b commit 522e957

1 file changed

+80
-0
lines changed

scripts/agent-evals/README.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Agent Evals

This codebase evaluates the Firebase MCP server running in various coding agents.

## Running Tests

Agent Evals use [mocha](https://www.npmjs.com/package/mocha) to run tests, similar to how the Firebase CLI unit tests are implemented. The test commands will automatically instrument the Firebase MCP Server.

WARNING: Running evals will remove any existing Firebase MCP Servers and the Firebase Gemini CLI Extension from your user account so that they don't interfere with the test.

For running tests during development, run:

```bash
# Link and build the CLI so that the `firebase` command is built with your changes
$ npm link
$ npm run build:watch

# In a separate terminal, run the test suite.
# Running test:dev will skip rebuilding the Firebase CLI (because your watch
# command is doing that for you)
$ cd scripts/agent-evals
$ npm run test:dev
```

For running in CI, the eval system will do a clean install of the Firebase CLI before running tests:

```bash
$ npm run test
```

## Writing Tests

Add a new file in `src/tests`:

```typescript
import { AgentTestRunner, startAgentTest } from "../runner/index.js";

// Import the hooks module, which registers an afterEach block that cleans up
// the agent and the pseudo-terminal.
import "../helpers/hooks.js";

describe("<prompt-or-tool-name>", function (this: Mocha.Suite) {
  // Setting retries > 0 is recommended because LLMs are nondeterministic
  this.retries(2);

  it("<use-case>", async function (this: Mocha.Context) {
    // Start the AgentTestRunner, which starts the coding agent in a
    // pseudo-terminal, waits for it to load the Firebase MCP server, and
    // waits until it accepts keystrokes
    const run: AgentTestRunner = await startAgentTest(this);

    // Simulate typing in the terminal. This awaits until the "turn" is over,
    // so any assertions apply to the current "turn"
    await run.type("/firebase:init");
    // Assert that the agent output "Backend Services"
    await run.expectText("Backend Services");

    await run.type("Use Firebase Project `project-id-1000`");
    // Assert that a tool was called with the given arguments, and that it was
    // successful
    // The `name` key below is an assumption; check the matcher type exported
    // by the runner for the exact shape
    await run.expectToolCalls([
      {
        name: "firebase_update_environment",
        argumentContains: "project-id-1000",
        isSuccess: true,
      },
    ]);

    // Important: Expectations apply to the last "turn". Each time you type, it
    // creates a new turn. This ensures you are only asserting against the most
    // recent actions of the agent
    await run.type("Hello world");
    // This will fail, because "Hello world" doesn't trigger a tool call
    await run.expectToolCalls([
      {
        name: "firebase_update_environment",
        argumentContains: "project-id-1000",
        isSuccess: true,
      },
    ]);
  });
});
```
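
The "turn" behavior described in the comments above can be easier to reason about with a small standalone model. The sketch below is not the real `AgentTestRunner`: `TurnLog`, `ToolCall`, `startTurn`, `recordToolCall`, and `lastTurnHasToolCall` are hypothetical names, and the `active_project` argument key is made up for illustration. It only shows why an assertion that passes right after one prompt can fail after the next, assuming expectations look at the most recent turn only.

```typescript
// Hypothetical model of turn-scoped assertions; not the real AgentTestRunner.
interface ToolCall {
  name: string;
  args: string; // serialized arguments (the shape is an assumption)
  isSuccess: boolean;
}

class TurnLog {
  private turns: ToolCall[][] = [];

  // Each typed prompt opens a fresh turn with no recorded tool calls.
  startTurn(): void {
    this.turns.push([]);
  }

  // Record a tool call against the current (most recent) turn.
  recordToolCall(call: ToolCall): void {
    if (this.turns.length === 0) {
      throw new Error("recordToolCall called before any turn was started");
    }
    this.turns[this.turns.length - 1].push(call);
  }

  // Only the most recent turn is consulted; earlier turns are ignored.
  lastTurnHasToolCall(name: string, argumentContains: string): boolean {
    const last = this.turns[this.turns.length - 1] ?? [];
    return last.some(
      (c) => c.name === name && c.args.includes(argumentContains) && c.isSuccess,
    );
  }
}

const log = new TurnLog();

// Turn 1: the agent calls the tool in response to the prompt.
log.startTurn();
log.recordToolCall({
  name: "firebase_update_environment",
  args: '{"active_project":"project-id-1000"}',
  isSuccess: true,
});
const matchedInTurn1 = log.lastTurnHasToolCall(
  "firebase_update_environment",
  "project-id-1000",
);

// Turn 2: "Hello world" triggers no tool calls, so the same check now fails.
log.startTurn();
const matchedInTurn2 = log.lastTurnHasToolCall(
  "firebase_update_environment",
  "project-id-1000",
);

console.log(matchedInTurn1, matchedInTurn2);
```

In a real test, this means every `expectText` or `expectToolCalls` call belongs directly after the `type` call whose effects it checks.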
