description = "Check status of nightly evals, fix failures for key models, and re-run."
prompt = """
You are an expert at fixing behavioral evaluations.

1. **Investigate**:
- Use the 'gh' CLI to fetch the results of the latest run on the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml (see the sketch after this list).
- DO NOT push any changes or start any runs. The rest of your evaluation will be local.
- Evals are in the evals/ directory and are documented in evals/README.md.
- The test case trajectory logs are written to evals/logs.
- You should also enable and review the verbose agent logs by setting the GEMINI_DEBUG_LOG_FILE environment variable.
- Identify the relevant test. Confine your investigation and validation to just this test.
- Proactively add logging that will help you gather information and validate your hypotheses.
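For example, a minimal sketch of the fetch step using standard 'gh' subcommands (the jq filter and the debug log path are illustrative):

```sh
# Look up the most recent evals-nightly run on main, then print logs for
# the failed steps only.
RUN_ID=$(gh run list --workflow=evals-nightly.yml --branch=main --limit=1 \
  --json databaseId --jq '.[0].databaseId')
gh run view "$RUN_ID" --log-failed

# Enable verbose agent logs for local reproduction (path is an example).
export GEMINI_DEBUG_LOG_FILE=/tmp/gemini-debug.log
```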

2. **Fix**:
- If a relevant test is failing, locate the test file and the corresponding prompt/code.
- It's often helpful to make an extreme, brute-force change to confirm you are editing the right place to make an improvement, then scope it back iteratively.
- Your **final** change should be **minimal and targeted**.
- Keep in mind the following:
  - The prompt has multiple configurations and pieces. Take care that your changes end up in the final prompt for the selected model and configuration (see the sketch after this list).
  - The prompt chosen for the eval is intentional. It's often vague or indirect to see how the agent performs with ambiguous instructions. Changing it should be a last resort.
  - When changing the test prompt, carefully consider whether the prompt still tests the same scenario. We don't want to lose test fidelity by making the prompts too direct (i.e., too easy).
  - Your primary mechanism for improving the agent's behavior is to make changes to tool instructions, prompt.ts, and/or the modules that contribute to the prompt.
  - If prompt and description changes are unsuccessful, use logs and debugging to confirm that everything is working as expected.
  - If you are unable to fix the test, you can recommend architecture changes that might help stabilize it. Be sure to THINK DEEPLY if offering architecture guidance. Some facts that might help with this are:
    - Agents may be composed of one or more agent loops.
    - AgentLoop == 'context + toolset + prompt'. Subagents are one type of agent loop.
    - Agent loops perform better when:
      - They have direct, unambiguous, and non-contradictory prompts.
      - They have fewer irrelevant tools.
      - They have fewer goals or steps to perform.
      - They have less low-value or irrelevant context.
    - You may suggest compositions of existing primitives, like subagents, or propose a new one.
    - These recommendations should be high confidence and grounded in observed deficient behaviors rather than just parroting the facts above. Investigate as needed to ground your recommendations.
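For example, after one local run with GEMINI_DEBUG_LOG_FILE set, you can check whether an edited instruction actually reached the assembled prompt. A minimal sketch, assuming the debug log records the final prompt text (confirm this against the actual log contents):

```sh
# Search the verbose agent log for the exact sentence you added or changed.
# A match suggests the edit landed in the final assembled prompt for this
# model/configuration; no match means it was likely dropped during assembly.
grep -n "the exact sentence you edited" "$GEMINI_DEBUG_LOG_FILE"
```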

3. **Verify**:
- Run just that one test, if needed, to validate that it is fixed. Be sure to run vitest in non-interactive mode.
- Test runs can take a long time, so consider whether you can diagnose the failure via other means, or log diagnostics, before committing the time. You must minimize the number of test runs needed to diagnose the failure.
- After the test completes, check whether the behavior has improved.
- Run the test 3 times for each of Gemini 3.0, Gemini 3 Flash, and Gemini 2.5 Pro to ensure that it is truly stable. Run these runs in parallel, using scripts if needed (see the sketch after this list).
- Some flakiness is expected; if the failure looks like a transient issue, or the test is inherently unstable but passes 2/3 times, you may decide it cannot be improved.
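For example, a sketch of the parallel verification runs. The test path, model IDs, and model-selection variable are placeholders; check evals/README.md for the real mechanism:

```sh
# Placeholder path; substitute the actual failing eval test file.
TEST="evals/path/to/failing.test.ts"

# Hypothetical model IDs and selection variable; confirm both before use.
for MODEL in gemini-3.0 gemini-3-flash gemini-2.5-pro; do
  for RUN in 1 2 3; do
    # 'vitest run' executes once and exits (non-interactive, no watch mode).
    GEMINI_MODEL="$MODEL" npx vitest run "$TEST" \
      > "evals/logs/$MODEL-run$RUN.txt" 2>&1 &
  done
done
wait  # then tally passes per model for the report (e.g., 3/3 = 100%)
```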

4. **Report**:
- Provide a summary of the test success rate for each of the tested models.
- Success rate is calculated over 3 runs per model (e.g., 3/3 = 100%).
- If you couldn't fix the test due to persistent flakiness, explain why.
"""