|
| 1 | +# API Review Command Eval Test Suite |
| 2 | + |
| 3 | +Design document for a Go/Ginkgo-based evaluation framework to test the `/api-review` Claude command against known API review scenarios. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +This test suite validates that the `/api-review` Claude command correctly identifies API documentation issues. Each test case consists of a patch file and expected issues. The suite applies patches to a clean clone of the repository, runs the API review command, and verifies the output matches expectations exactly (no missing issues, no hallucinated issues). |
| 8 | + |
| 9 | +## Directory Structure |
| 10 | + |
| 11 | +``` |
| 12 | +tests/eval/ |
| 13 | +├── eval_test.go # Main Ginkgo test suite |
| 14 | +├── DESIGN.md # This file |
| 15 | +├── testdata/ |
| 16 | +│ ├── missing-optional-doc/ |
| 17 | +│ │ ├── patch.diff # Git patch to apply |
| 18 | +│ │ └── expected.txt # Expected issues (one per line) |
| 19 | +│ ├── undocumented-enum/ |
| 20 | +│ │ ├── patch.diff |
| 21 | +│ │ └── expected.txt |
| 22 | +│ └── valid-api-change/ |
| 23 | +│ ├── patch.diff |
| 24 | +│ └── expected.txt # Empty file = no issues expected |
| 25 | +``` |
| 26 | + |
| 27 | +## Test Case Format |
| 28 | + |
| 29 | +### patch.diff |
| 30 | + |
| 31 | +Standard git diff format: |
| 32 | + |
| 33 | +```diff |
| 34 | +diff --git a/config/v1/types.go b/config/v1/types.go |
| 35 | +--- a/config/v1/types.go |
| 36 | ++++ b/config/v1/types.go |
| 37 | +@@ -10,0 +11,5 @@ |
| 38 | ++// MyField does something |
| 39 | ++// +optional |
| 40 | ++// +kubebuilder:validation:Enum=Foo;Bar |
| 41 | ++MyField string `json:"myField"` |
| 42 | +``` |
| 43 | + |
| 44 | +### expected.txt |
| 45 | + |
| 46 | +One expected issue per line: |
| 47 | + |
| 48 | +``` |
| 49 | +enum values Foo and Bar not documented in comment |
| 50 | +optional field does not explain behavior when omitted |
| 51 | +``` |
| 52 | + |
| 53 | +Empty file means the API change should pass review with no issues. |
| 54 | + |
| 55 | +**Note**: Order of issues in `expected.txt` does not matter. Comparison uses semantic matching, not exact string matching. |
| 56 | + |
| 57 | +## Test Flow |
| 58 | + |
| 59 | +``` |
| 60 | +┌─────────────────────────────────────────────────────────────┐ |
| 61 | +│ 1. Pre-flight: │ |
| 62 | +│ a. Verify local AGENTS.md and │ |
| 63 | +│ .claude/commands/api-review.md exist │ |
| 64 | +│ b. These will be copied to temp dir after clone │ |
| 65 | +│ (ensures local changes are tested) │ |
| 66 | +├─────────────────────────────────────────────────────────────┤ |
| 67 | +│ 2. Setup (once): │ |
| 68 | +│ a. Shallow clone openshift/api to temp dir │ |
| 69 | +│ b. Copy local AGENTS.md and .claude/ to temp dir │ |
| 70 | +├─────────────────────────────────────────────────────────────┤ |
| 71 | +│ 3. For each test case (sequential): │ |
| 72 | +│ a. Reset repo to clean state │ |
| 73 | +│ b. Apply patch.diff │ |
| 74 | +│ c. Run claude api-review on changed files │ |
| 75 | +│ d. Run claude to compare output vs expected.txt │ |
| 76 | +│ e. Parse true/false response, assert │ |
| 77 | +├─────────────────────────────────────────────────────────────┤ |
| 78 | +│ 4. Teardown: Remove temp dir │ |
| 79 | +└─────────────────────────────────────────────────────────────┘ |
| 80 | +``` |
| 81 | + |
| 82 | +## Reset Between Tests |
| 83 | + |
| 84 | +```bash |
| 85 | +git reset --hard origin/master && git clean -fd |
| 86 | +``` |
| 87 | + |
| 88 | +- `git reset --hard origin/master`: Resets all tracked files to match the remote master branch, discarding any local commits and staged/unstaged changes |
| 89 | +- `git clean -f`: Force remove untracked files (files not in git) |
| 90 | +- `git clean -d`: Also remove untracked directories |
| 91 | + |
| 92 | +After reset, re-copy `AGENTS.md` and `.claude/` from local source to preserve local modifications being tested. |
| 93 | + |
| 94 | +**Remote origin handling**: The shallow clone creates `origin` automatically. Before reset, verify remote exists: |
| 95 | + |
| 96 | +```bash |
| 97 | +git remote get-url origin || git remote add origin https://github.com/openshift/api.git |
| 98 | +``` |
| 99 | + |
| 100 | +## Claude Invocations |
| 101 | + |
| 102 | +### Step 1 - Run API Review |
| 103 | + |
| 104 | +```bash |
| 105 | +claude --print -p "/api-review" --allowedTools "Bash,Read,Grep,Glob,Task" <files> |
| 106 | +``` |
| 107 | + |
| 108 | +### Step 2 - Compare Results |
| 109 | + |
| 110 | +```bash |
| 111 | +claude --print -p "Given this API review output: |
| 112 | +<output> |
| 113 | +
|
| 114 | +Expected issues (one per line): |
| 115 | +<contents of expected.txt> |
| 116 | +
|
| 117 | +Compare the review output against the expected issues list. |
| 118 | +The result is 'true' ONLY if: |
| 119 | +1. ALL expected issues are identified in the output |
| 120 | +2. NO additional issues are reported beyond what is expected |
| 121 | +
|
| 122 | +If any expected issue is missing, reply 'false'. |
| 123 | +If any issue is reported that is NOT in the expected list, reply 'false'. |
| 124 | +
|
| 125 | +Reply with exactly 'true' or 'false' (no other text)." |
| 126 | +``` |
| 127 | + |
| 128 | +Parse response, trim whitespace, check for `true` or `false`. |
| 129 | + |
| 130 | +## Pre-flight Check |
| 131 | + |
| 132 | +Before cloning, verify these local files exist in the source repo: |
| 133 | + |
| 134 | +- `AGENTS.md` |
| 135 | +- `.claude/commands/api-review.md` |
| 136 | + |
| 137 | +These are copied into the temp clone so that any local modifications to the review command are tested, not the remote versions. |
| 138 | + |
| 139 | +## Configuration |
| 140 | + |
| 141 | +| Setting | Value | |
| 142 | +|---------|-------| |
| 143 | +| Timeout per Claude call | 5 minutes | |
| 144 | +| Execution mode | Sequential | |
| 145 | +| Clone depth | 1 (shallow) | |
| 146 | +| Clone source | `https://github.com/openshift/api.git` | |
| 147 | +| Reset between tests | Verify origin remote exists, `git reset --hard origin/master && git clean -fd`, re-copy local AGENTS.md and .claude/ | |
| 148 | + |
| 149 | +--- |
| 150 | + |
| 151 | +## Phase 2 (Future Work) |
| 152 | + |
| 153 | +### Patch Stability |
| 154 | + |
| 155 | +Patches may fail to apply as `origin/master` evolves over time. Need a strategy to handle this (e.g., pinning to a specific commit). |
| 156 | + |
| 157 | +### Error Handling |
| 158 | + |
| 159 | +Current design does not address failure scenarios: |
| 160 | + |
| 161 | +- Patch application failures |
| 162 | +- Claude CLI timeouts or crashes |
| 163 | +- Empty or malformed output from Claude |
| 164 | +- Authentication failures |
| 165 | +- Resource cleanup on test failures |
0 commit comments