Other possible errors

Shayak Sarkar · Shayak Sarkar · commit 35b52641484a · 2025-05-24T23:16:20.000+05:30
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/agent-evaluators.md
@@ -67,7 +67,7 @@ intent_resolution(
 
 ### Intent resolution output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.
 
 ```python
 {
@@ -134,7 +134,7 @@ tool_call_accuracy(
 
 ### Tool call accuracy output
 
-The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
+The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.
 
 ```python
 {
@@ -171,7 +171,7 @@ task_adherence(
 
 ### Task adherence output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/rag-evaluators.md
@@ -60,7 +60,7 @@ retrieval(
 
 ### Retrieval output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (a default is set), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (a default is set), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -160,7 +160,7 @@ document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth, retr
 
 ### Document retrieval output
 
-All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio` which have `high_is_better=False`. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. 
+All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio` which have `high_is_better=False`. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. 
 
 ```python
 {
@@ -203,7 +203,7 @@ groundedness(
 
 ### Groundedness output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -273,7 +273,7 @@ relevance(
 
 ### Relevance output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -303,7 +303,7 @@ response_completeness(
 
 ### Response completeness output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
diff --git a/articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md b/articles/ai-foundry/concepts/evaluation-evaluators/textual-similarity-evaluators.md
@@ -55,7 +55,7 @@ similarity(
 
 ### Similarity output
 
-The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
+The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
 
 ```python
 {
@@ -84,7 +84,7 @@ f1_score(
 
 ### F1 score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -112,7 +112,7 @@ bleu_score(
 
 ### BLEU output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -141,7 +141,7 @@ gleu_score(
 
 ### GLEU score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -170,7 +170,7 @@ rouge(
 
 ### ROUGE score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {
@@ -205,7 +205,7 @@ meteor_score(
 
 ### METEOR score output
 
-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.
+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
 
 ```python
 {

Original file line number	Diff line number	Diff line change
`@@ -67,7 +67,7 @@ intent_resolution(`
`67`	`67`
`68`	`68`	`### Intent resolution output`
`69`	`69`
`70`		`-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.`
	`70`	`+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and additional fields can help you understand why the score is high or low.`
`71`	`71`
`72`	`72`	```python
`73`	`73`	`{`
`@@ -134,7 +134,7 @@ tool_call_accuracy(`
`134`	`134`
`135`	`135`	`### Tool call accuracy output`
`136`	`136`
`137`		`-The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.`
	`137`	`+The numerical score (passing rate of correct tool calls) is 0-1 and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason and tool call detail fields can help you understand why the score is high or low.`
`138`	`138`
`139`	`139`	```python
`140`	`140`	`{`
`@@ -171,7 +171,7 @@ task_adherence(`
`171`	`171`
`172`	`172`	`### Task adherence output`
`173`	`173`
`174`		`-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
	`174`	`+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
`175`	`175`
`176`	`176`	```python
`177`	`177`	`{`
Original file line number	Diff line number	Diff line change
`@@ -60,7 +60,7 @@ retrieval(`
`60`	`60`
`61`	`61`	`### Retrieval output`
`62`	`62`
`63`		`-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (a default is set), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
	`63`	`+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (a default is set), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
`64`	`64`
`65`	`65`	```python
`66`	`66`	`{`
`@@ -160,7 +160,7 @@ document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth, retr`
`160`	`160`
`161`	`161`	`### Document retrieval output`
`162`	`162`
`163`		-All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio` which have `high_is_better=False`. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise.
	`163`	+All numerical scores have `high_is_better=True` except for `holes` and `holes_ratio` which have `high_is_better=False`. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise.
`164`	`164`
`165`	`165`	```python
`166`	`166`	`{`
`@@ -203,7 +203,7 @@ groundedness(`
`203`	`203`
`204`	`204`	`### Groundedness output`
`205`	`205`
`206`		`-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
	`206`	`+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
`207`	`207`
`208`	`208`	```python
`209`	`209`	`{`
`@@ -273,7 +273,7 @@ relevance(`
`273`	`273`
`274`	`274`	`### Relevance output`
`275`	`275`
`276`		`-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
	`276`	`+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
`277`	`277`
`278`	`278`	```python
`279`	`279`	`{`
`@@ -303,7 +303,7 @@ response_completeness(`
`303`	`303`
`304`	`304`	`### Response completeness output`
`305`	`305`
`306`		`-The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
	`306`	`+The numerical score on a likert scale (integer 1 to 5) and a higher score is better. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
`307`	`307`
`308`	`308`	```python
`309`	`309`	`{`
Original file line number	Diff line number	Diff line change
`@@ -55,7 +55,7 @@ similarity(`
`55`	`55`
`56`	`56`	`### Similarity output`
`57`	`57`
`58`		`-The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score <= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
	`58`	`+The numerical score on a likert scale (integer 1 to 5) and a higher score means a higher degree of similarity. Given a numerical threshold (default to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.`
`59`	`59`
`60`	`60`	```python
`61`	`61`	`{`
`@@ -84,7 +84,7 @@ f1_score(`
`84`	`84`
`85`	`85`	`### F1 score output`
`86`	`86`
`87`		`-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.`
	`87`	`+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.`
`88`	`88`
`89`	`89`	```python
`90`	`90`	`{`
`@@ -112,7 +112,7 @@ bleu_score(`
`112`	`112`
`113`	`113`	`### BLEU output`
`114`	`114`
`115`		`-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.`
	`115`	`+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.`
`116`	`116`
`117`	`117`	```python
`118`	`118`	`{`
`@@ -141,7 +141,7 @@ gleu_score(`
`141`	`141`
`142`	`142`	`### GLEU score output`
`143`	`143`
`144`		`-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.`
	`144`	`+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.`
`145`	`145`
`146`	`146`	```python
`147`	`147`	`{`
`@@ -170,7 +170,7 @@ rouge(`
`170`	`170`
`171`	`171`	`### ROUGE score output`
`172`	`172`
`173`		`-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.`
	`173`	`+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.`
`174`	`174`
`175`	`175`	```python
`176`	`176`	`{`
`@@ -205,7 +205,7 @@ meteor_score(`
`205`	`205`
`206`	`206`	`### METEOR score output`
`207`	`207`
`208`		`-The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score <= threshold, or "fail" otherwise.`
	`208`	`+The numerical score is a 0-1 float and a higher score is better. Given a numerical threshold (default to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.`
`209`	`209`
`210`	`210`	```python
`211`	`211`	`{`