description = "Check status of nightly evals, fix failures for key models, and re-run."
prompt = """
You are an expert at fixing behavioral evaluations.

1. **Investigate**:
- Use the 'gh' CLI to fetch the results of the latest run on the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml (see the sketch after this list).
- DO NOT push any changes or start any runs. The rest of your evaluation will be local.
- Evals are in the evals/ directory and are documented in evals/README.md.
- The test case trajectory logs are written to evals/logs.
- You should also enable and review the verbose agent logs by setting the GEMINI_DEBUG_LOG_FILE environment variable.
- Identify the relevant test. Confine your investigation and validation to just this test.
- Proactively add logging that will help you gather information and validate your hypotheses.
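For example, a minimal sketch of the fetch step using standard 'gh' subcommands (the jq filter and the debug log path are illustrative):

```sh
# Look up the most recent evals-nightly run on main, then print logs for
# the failed steps only.
RUN_ID=$(gh run list --workflow=evals-nightly.yml --branch=main --limit=1 \
  --json databaseId --jq '.[0].databaseId')
gh run view "$RUN_ID" --log-failed

# Enable verbose agent logs for local reproduction (path is an example).
export GEMINI_DEBUG_LOG_FILE=/tmp/gemini-debug.log
```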

2. **Fix**:
- If a relevant test is failing, locate the test file and the corresponding prompt/code.
- It's often helpful to make an extreme, brute-force change to confirm you are editing the right place to make an improvement, then scope it back iteratively.
- Your **final** change should be **minimal and targeted**.
- Keep in mind the following:
  - The prompt has multiple configurations and pieces. Take care that your changes end up in the final prompt for the selected model and configuration (see the sketch after this list).
  - The prompt chosen for the eval is intentional. It's often vague or indirect to see how the agent performs with ambiguous instructions. Changing it should be a last resort.
  - When changing the test prompt, carefully consider whether the prompt still tests the same scenario. We don't want to lose test fidelity by making the prompts too direct (i.e., too easy).
  - Your primary mechanism for improving the agent's behavior is to make changes to tool instructions, prompt.ts, and/or the modules that contribute to the prompt.
  - If prompt and description changes are unsuccessful, use logs and debugging to confirm that everything is working as expected.
  - If you are unable to fix the test, you can recommend architecture changes that might help stabilize it. Be sure to THINK DEEPLY if offering architecture guidance. Some facts that might help with this are:
    - Agents may be composed of one or more agent loops.
    - AgentLoop == 'context + toolset + prompt'. Subagents are one type of agent loop.
    - Agent loops perform better when:
      - They have direct, unambiguous, and non-contradictory prompts.
      - They have fewer irrelevant tools.
      - They have fewer goals or steps to perform.
      - They have less low-value or irrelevant context.
    - You may suggest compositions of existing primitives, like subagents, or propose a new one.
    - These recommendations should be high confidence and grounded in observed deficient behaviors rather than just parroting the facts above. Investigate as needed to ground your recommendations.
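For example, after one local run with GEMINI_DEBUG_LOG_FILE set, you can check whether an edited instruction actually reached the assembled prompt. A minimal sketch, assuming the debug log records the final prompt text (confirm this against the actual log contents):

```sh
# Search the verbose agent log for the exact sentence you added or changed.
# A match suggests the edit landed in the final assembled prompt for this
# model/configuration; no match means it was likely dropped during assembly.
grep -n "the exact sentence you edited" "$GEMINI_DEBUG_LOG_FILE"
```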

3. **Verify**:
- Run just that one test, if needed, to validate that it is fixed. Be sure to run vitest in non-interactive mode.
- Test runs can take a long time, so consider whether you can diagnose the failure via other means, or log diagnostics, before committing the time. You must minimize the number of test runs needed to diagnose the failure.
- After the test completes, check whether the behavior has improved.
- Run the test 3 times for each of Gemini 3.0, Gemini 3 Flash, and Gemini 2.5 Pro to ensure that it is truly stable. Run these runs in parallel, using scripts if needed (see the sketch after this list).
- Some flakiness is expected; if the failure looks like a transient issue, or the test is inherently unstable but passes 2/3 times, you may decide it cannot be improved.
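For example, a sketch of the parallel verification runs. The test path, model IDs, and model-selection variable are placeholders; check evals/README.md for the real mechanism:

```sh
# Placeholder path; substitute the actual failing eval test file.
TEST="evals/path/to/failing.test.ts"

# Hypothetical model IDs and selection variable; confirm both before use.
for MODEL in gemini-3.0 gemini-3-flash gemini-2.5-pro; do
  for RUN in 1 2 3; do
    # 'vitest run' executes once and exits (non-interactive, no watch mode).
    GEMINI_MODEL="$MODEL" npx vitest run "$TEST" \
      > "evals/logs/$MODEL-run$RUN.txt" 2>&1 &
  done
done
wait  # then tally passes per model for the report (e.g., 3/3 = 100%)
```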

4. **Report**:
- Provide a summary of the test success rate for each of the tested models.
- Success rate is calculated over 3 runs per model (e.g., 3/3 = 100%).
- If you couldn't fix the test due to persistent flakiness, explain why.
"""