
Commit 5259ce8

gundermanc authored and Yuna Seol committed
Slash command for helping in debugging (#17609)
1 parent bf9821b commit 5259ce8

File tree: 2 files changed, +105 -3 lines

New file (60 additions, 0 deletions): the slash command definition.

```toml
description = "Check status of nightly evals, fix failures for key models, and re-run."
prompt = """
You are an expert at fixing behavioral evaluations.

1. **Investigate**:
   - Use the 'gh' CLI to fetch the results from the latest run on the main branch:
     https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
   - DO NOT push any changes or start any runs. The rest of your evaluation will be local.
   - Evals are in the evals/ directory and are documented in evals/README.md.
   - The test case trajectory logs are written to evals/logs.
   - You should also enable and review the verbose agent logs by setting the
     GEMINI_DEBUG_LOG_FILE environment variable.
   - Identify the relevant test. Confine your investigation and validation to just this test.
   - Proactively add logging that will aid in gathering information or validating your hypotheses.

2. **Fix**:
   - If a relevant test is failing, locate the test file and the corresponding prompt/code.
   - It's often helpful to make an extreme, brute-force change to see whether you are
     changing the right place, and then scope it back iteratively.
   - Your **final** change should be **minimal and targeted**.
   - Keep in mind the following:
     - The prompt has multiple configurations and pieces. Take care that your changes
       end up in the final prompt for the selected model and configuration.
     - The prompt chosen for the eval is intentional. It's often vague or indirect
       to see how the agent performs with ambiguous instructions. Changing it should
       be a last resort.
     - When changing the test prompt, carefully consider whether the prompt still tests
       the same scenario. We don't want to lose test fidelity by making the prompts too
       direct (i.e., easy).
     - Your primary mechanism for improving the agent's behavior is to make changes to
       tool instructions, prompt.ts, and/or modules that contribute to the prompt.
     - If prompt and description changes are unsuccessful, use logs and debugging to
       confirm that everything is working as expected.
     - If unable to fix the test, you can make recommendations for architecture changes
       that might help stabilize the test. Be sure to THINK DEEPLY if offering
       architecture guidance. Some facts that might help with this are:
       - Agents may be composed of one or more agent loops.
       - AgentLoop == 'context + toolset + prompt'. Subagents are one type of agent loop.
       - Agent loops perform better when:
         - They have direct, unambiguous, and non-contradictory prompts.
         - They have fewer irrelevant tools.
         - They have fewer goals or steps to perform.
         - They have less low-value or irrelevant context.
       - You may suggest compositions of existing primitives, like subagents, or
         propose a new one.
       - These recommendations should be high confidence and should be grounded
         in observed deficient behaviors rather than just parroting the facts above.
         Investigate as needed to ground your recommendations.

3. **Verify**:
   - Run just that one test if needed to validate that it is fixed. Be sure to run
     vitest in non-interactive mode.
   - Running the tests can take a long time, so consider whether you can diagnose via
     other means or log diagnostics before committing the time. You must minimize the
     number of test runs needed to diagnose the failure.
   - After the test completes, check whether it seems to have improved.
   - You will need to run the test 3 times each for Gemini 3.0, Gemini 3 Flash, and
     Gemini 2.5 Pro to ensure that it is truly stable. Run these in parallel, using
     scripts if needed.
   - Some flakiness is expected; if it looks like a transient issue or the test is
     inherently unstable but passes 2/3 times, you might decide it cannot be improved.

4. **Report**:
   - Provide a summary of the test success rate for each of the tested models.
   - Success rate is calculated based on 3 runs per model (e.g., 3/3 = 100%).
   - If you couldn't fix it due to persistent flakiness, explain why.

{{args}}
"""
```

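Similarly, for the Verify step, `vitest run` (as opposed to the default watch
mode) is the non-interactive invocation the prompt asks for; the eval file path
below is a hypothetical placeholder:

```bash
# "vitest run" executes the tests once and exits, instead of entering watch mode.
npx vitest run evals/example-eval.test.ts  # hypothetical eval file path
```
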
evals/README.md

Lines changed: 45 additions & 3 deletions.

Context (unchanged):

A significant drop in the pass rate for a `USUALLY_PASSES` test—even if it
doesn't drop to 0%—often indicates that a recent change to a system prompt or
tool definition has made the model's behavior less reliable.

Removed:

You may be able to investigate the regression using Gemini CLI by giving it the
link to the runs before and after the change and the name of the test and asking
it to investigate what changes may have impacted the test.

Added:

## Fixing Evaluations

If an evaluation is failing or has a regressed pass rate, you can use the
`/fix-behavioral-eval` command within Gemini CLI to help investigate and fix the
issue.

### `/fix-behavioral-eval`

This command is designed to automate the investigation and fixing process for
failing evaluations. It will:

1. **Investigate**: Fetch the latest results from the nightly workflow using
   the `gh` CLI, identify the failing test, and review test trajectory logs in
   `evals/logs`.
2. **Fix**: Suggest and apply targeted fixes to the prompt or tool definitions.
   It prioritizes minimal changes to `prompt.ts`, tool instructions, and
   modules that contribute to the prompt. It generally tries to avoid changing
   the test itself.
3. **Verify**: Re-run the test 3 times across multiple models (e.g., Gemini
   3.0, Gemini 3 Flash, Gemini 2.5 Pro) to ensure stability and calculate a
   success rate (a parallel-run sketch follows this list).
4. **Report**: Provide a summary of the success rate for each model and details
   on the applied fixes.
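
A rough sketch of what the parallel verification in step 3 could look like as a
shell script. The eval file path is hypothetical, and `GEMINI_MODEL` is an
assumed model-selection mechanism, not something this commit defines:

```bash
#!/usr/bin/env bash
# Run the same eval once per model, in parallel. GEMINI_MODEL is an assumed
# selection mechanism; check evals/README.md for how the harness actually
# picks a model. Repeat 3 times per model to compute the success rate.
for model in gemini-3.0 gemini-3-flash gemini-2.5-pro; do
  GEMINI_MODEL="$model" npx vitest run evals/example-eval.test.ts \
    > "eval-$model.log" 2>&1 &
done
wait
```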

To use it, run:

```bash
gemini /fix-behavioral-eval
```

You can also provide a link to a specific GitHub Action run or the name of a
specific test to focus the investigation:

```bash
gemini /fix-behavioral-eval https://github.com/google-gemini/gemini-cli/actions/runs/123456789
```
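
The same applies when passing a test name instead of a run URL; the name below
is a hypothetical placeholder:

```bash
gemini /fix-behavioral-eval "my-failing-eval"  # hypothetical test name
```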

When investigating failures manually, you can also enable verbose agent logs by
setting the `GEMINI_DEBUG_LOG_FILE` environment variable.
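
For example, a minimal way to capture those logs while running the command (the
log file path is an arbitrary choice):

```bash
# Write verbose agent logs to a file; the path here is arbitrary.
GEMINI_DEBUG_LOG_FILE=/tmp/gemini-debug.log gemini /fix-behavioral-eval
```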

It's highly recommended to manually review and/or ask the agent to iterate on
any prompt changes, even if they pass all evals. The prompt should prefer
positive traits ('do X') and resort to negative traits ('do not do X') only when
unable to accomplish the goal with positive traits. Gemini is quite good at
introspecting on its prompt when asked the right questions.
