The result of the AI-assisted quality evaluators for a query and response pair is a dictionary containing:
- `{metric_name}` provides a numerical score, on a Likert scale (integer 1 to 5) or a float between 0 and 1.
- `{metric_name}_label` provides a binary label (if the metric naturally outputs a binary score).
- `{metric_name}_reason` explains why a certain score or label was given for each data point.
To further improve intelligibility, all evaluators accept a binarization threshold and output two additional keys. A default threshold is set, which the user can override. The two additional keys are:

- `{metric_name}_result` is a "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold` is the numerical binarization threshold, set by default or overridden by the user.

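Taken together, the output for a single metric might look like the following sketch. The metric name `relevance` and all values here are illustrative assumptions, not output from a real evaluator run:

```python
# Hypothetical output dictionary for one AI-assisted quality metric
# ("relevance"); keys follow the pattern described above, values are made up.
result = {
    "relevance": 4.0,                                          # numerical score (Likert 1-5 here)
    "relevance_reason": "The response directly addresses the query.",
    "relevance_result": "pass",                                # "pass"/"fail" after binarization
    "relevance_threshold": 3,                                  # default, user-overridable
}

# Simple consistency check mirroring how the "pass"/"fail" result
# is derived from the score and the binarization threshold:
passed = result["relevance"] >= result["relevance_threshold"]
assert passed == (result["relevance_result"] == "pass")
```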
#### Comparing quality and custom evaluators
For NLP evaluators, only a score is given in the `{metric_name}` key.
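By contrast with the quality evaluators above, an NLP evaluator's output might look like this sketch, with a single score key and none of the label, reason, result, or threshold keys (the metric name `bleu` and the value are illustrative assumptions):

```python
# Hypothetical NLP evaluator output: only the numerical score key is present.
nlp_result = {"bleu": 0.27}

# No *_label, *_reason, *_result, or *_threshold keys for this metric name.
assert set(nlp_result) == {"bleu"}
```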