Commit 68f9016
fix(eval): repair eval-conflicts CLI and expand dataset to 100 pairs (#377)
The eval-conflicts CLI never worked end-to-end because all three HTTP
response decoders (auth token, validator eval, scorer eval) failed to
unwrap the server's {"data":...} envelope. Fix all three.
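A minimal sketch of the kind of unwrapping described above, written in Python for illustration only; this page does not show the CLI's actual language or decoder names. The helper `unwrap_envelope` and the `token` field are hypothetical. The only detail taken from the commit message is that the server wraps its responses in a {"data":...} envelope.

```python
import json
from typing import Any


def unwrap_envelope(body: str) -> Any:
    """Parse a JSON response body and return the value under its "data" key.

    Hypothetical helper: assumes every successful response is a JSON object
    of the form {"data": ...}, as the commit message describes.
    """
    parsed = json.loads(body)
    if not isinstance(parsed, dict) or "data" not in parsed:
        raise ValueError('response is missing the {"data": ...} envelope')
    return parsed["data"]


# Illustrative payload; the field name "token" is an assumption.
raw = '{"data": {"token": "abc123"}}'
token_response = unwrap_envelope(raw)
print(token_response["token"])
```

Routing all three decoders through one shared helper like this would keep the envelope handling in a single place, so a missing unwrap fails loudly instead of silently yielding empty fields.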
Expand the validator eval dataset from 63 to 100 labeled pairs by
adding 37 production-labeled conflicts (17 genuine, 16
related_not_contradicting, 4 unrelated_false_positive) from a
judge + meta-judge audit of real conflict data.
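As a quick consistency check on the counts above, the per-label breakdown can be tallied in a few lines. The variable names are illustrative; the numbers are the ones stated in the message.

```python
# Counts of the 37 added pairs by label, as stated in the commit message.
added = {
    "genuine": 17,
    "related_not_contradicting": 16,
    "unrelated_false_positive": 4,
}
previous_total = 63  # dataset size before this commit

added_total = sum(added.values())
new_total = previous_total + added_total
print(added_total, new_total)  # 37 100
```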
Baseline metrics on the expanded dataset:
- Scorer precision: 45.9% (embedding-only, no threshold separates classes)
- Validator (gpt-4o-mini) precision: 91.5%, recall: 95.6%, F1: 93.5%
- Relationship accuracy: 77.4% (5-class exact match)
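For reference, the validator's F1 follows from its precision and recall as their harmonic mean. The sketch below uses hypothetical helper names and assumes the standard binary definitions; it is not the project's eval code. The confusion counts in the second example are made up purely to show the formulas.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard binary precision and recall from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported validator metrics from the commit message.
reported_f1 = f1_score(0.915, 0.956)
print(round(reported_f1 * 100, 1))  # 93.5

# Hypothetical counts, only to illustrate the formulas.
p, r = precision_recall(tp=3, fp=1, fn=1)
print(p, r)  # 0.75 0.75
```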
Refs: #376
1 parent 07ba1a6 commit 68f9016
2 files changed, +501 -8 lines changed