Merge pull request #393 from ShayakSarkar/patch-1

v-dirichards · web-flow · commit cc3ac656dbbe · 2025-06-10T15:13:38.000-05:00
Possible error in evaluator pass / fail condition
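
The corrected condition can be sketched as below. This is a minimal illustration, not the evaluation SDK's implementation: the function name `pass_fail` is hypothetical, and the inverted comparison for lower-is-better metrics (such as `holes` and `holes_ratio`) is an assumption that the diff does not state.

```python
def pass_fail(score: float, threshold: float, high_is_better: bool = True) -> str:
    """Return "pass" or "fail" for a score compared against a threshold.

    Hypothetical helper for illustration only; not part of any SDK.
    """
    if high_is_better:
        # Corrected wording from this change: pass when score >= threshold.
        return "pass" if score >= threshold else "fail"
    # Assumption (not stated in the diff): lower-is-better metrics invert the test.
    return "pass" if score <= threshold else "fail"


# Likert-scale evaluator (integer 1 to 5) with the default threshold of 3.
print(pass_fail(4, 3))      # pass
print(pass_fail(3, 3))      # pass
print(pass_fail(2, 3))      # fail

# 0-1 float evaluator (for example F1) with the default threshold of 0.5.
print(pass_fail(0.4, 0.5))  # fail
```

With the previous `<=` wording, a score of 4 against a threshold of 3 would have read as "fail", which contradicts "a higher score is better".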
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md
@@ -70,7 +70,7 @@ intent_resolution(
 
 ### Intent resolution output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
 
 ```python
 {
@@ -137,7 +137,7 @@ tool_call_accuracy(
 
 ### Tool call accuracy output
 
-The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
+The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
 
 ```python
 {
@@ -174,7 +174,7 @@ task_adherence(
 
 ### Task adherence output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/general-purpose-evaluators.md
@@ -59,7 +59,7 @@ coherence(
 
 ### Coherence output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -88,7 +88,7 @@ fluency(
 
 ### Fluency output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -127,7 +127,7 @@ qa_eval(
 
 ### QA output
 
-While F1 score outputs a numerical score on 0-1 float scale, the other evaluators output numerical scores on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+While F1 score outputs a numerical score on 0-1 float scale, the other evaluators output numerical scores on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md
@@ -63,7 +63,7 @@ retrieval(
 
 ### Retrieval output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (a default is set), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (a default is set), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -163,7 +163,7 @@ document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth, retr
 
 ### Document retrieval output
 
-All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio` which have `high_is_better=False`. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. 
+All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio` which have `high_is_better=False`. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. 
 
 ```python
 {
@@ -206,7 +206,7 @@ groundedness(
 
 ### Groundedness output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -276,7 +276,7 @@ relevance(
 
 ### Relevance output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -306,7 +306,7 @@ response_completeness(
 
 ### Response completeness output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md
@@ -58,7 +58,7 @@ similarity(
 
 ### Similarity output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -87,7 +87,7 @@ f1_score(
 
 ### F1 score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -115,7 +115,7 @@ bleu_score(
 
 ### BLEU output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -144,7 +144,7 @@ gleu_score(
 
 ### GLEU score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -173,7 +173,7 @@ rouge(
 
 ### ROUGE score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -208,7 +208,7 @@ meteor_score(
 
 ### METEOR score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
