articles/machine-learning/prompt-flow/how-to-bulk-test-evaluate-flow.md (2 additions, 2 deletions)
@@ -146,7 +146,7 @@ In Prompt flow, we provide multiple built-in evaluation methods to help you meas
 | Classification Accuracy Evaluation | Accuracy | Measures the performance of a classification system by comparing its outputs to ground truth. | No | prediction, ground truth | in the range [0, 1]. |
 | QnA Relevance Scores Pairwise Evaluation | Score, win/lose | Assesses the quality of answers generated by a question answering system. It involves assigning relevance scores to each answer based on how well it matches the user question, comparing different answers to a baseline answer, and aggregating the results to produce metrics such as averaged win rates and relevance scores. | Yes | question, answer (no ground truth or context) | Score: 0-100, win/lose: 1/0 |
 | QnA Groundedness Evaluation | Groundedness | Measures how grounded the model's predicted answers are in the input source. Even if the LLM's responses are true, they're ungrounded if they can't be verified against the source. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
-| QnA Ada Similarity Evaluation | Similarity | Measures similarity between user-provided ground truth answers and the model predicted answer. | Yes | question, answer, ground truth (context not needed) | in the range [0, 1]. |
+| QnA GPT Similarity Evaluation | GPT Similarity | Measures similarity between user-provided ground truth answers and the model predicted answer using a GPT model. | Yes | question, answer, ground truth (context not needed) | in the range [0, 1]. |
 | QnA Relevance Evaluation | Relevance | Measures how relevant the model's predicted answers are to the questions asked. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
 | QnA Coherence Evaluation | Coherence | Measures the quality of all sentences in a model's predicted answer and how they fit together naturally. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
 | QnA Fluency Evaluation | Fluency | Measures how grammatically and linguistically correct the model's predicted answer is. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
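
The row swapped in this hunk replaces the embedding-based Ada similarity metric with a GPT-judged similarity metric. To make that distinction concrete, here is a minimal, hypothetical Python sketch of the two approaches; the `embed` and `chat` callables, the prompt wording, and the function names are illustrative assumptions, not the built-in evaluation flows' actual code.

```python
# Hypothetical sketch contrasting embedding-based ("Ada") similarity with
# model-judged ("GPT") similarity. `embed` and `chat` are assumed callables
# supplied by the caller, not a real SDK API.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def ada_style_similarity(embed, ground_truth: str, answer: str) -> float:
    # Embedding-based: embed both texts with an embedding model
    # (e.g. text-embedding-ada-002) and compare the vectors.
    return cosine_similarity(embed(ground_truth), embed(answer))


GPT_SIMILARITY_PROMPT = """\
On a scale of 0 to 1, how semantically similar is the answer to the ground truth?
Question: {question}
Ground truth: {ground_truth}
Answer: {answer}
Reply with a single number."""


def gpt_style_similarity(chat, question: str, ground_truth: str, answer: str) -> float:
    # Model-judged: ask a GPT model to grade the match directly.
    # Assumes the model replies with just a number, per the prompt.
    reply = chat(GPT_SIMILARITY_PROMPT.format(
        question=question, ground_truth=ground_truth, answer=answer))
    return float(reply.strip())
```

Both variants yield a score in roughly the [0, 1] range the table documents; the GPT-judged variant trades the determinism of embedding cosine distance for a judge that can weigh paraphrases and factual equivalence in context of the question.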
@@ -174,4 +174,4 @@ In this document, you learned how to run a bulk test and use a built-in evaluati
 
 - [Develop a customized evaluation flow](how-to-develop-an-evaluation-flow.md)
 - [Tune prompts using variants](how-to-tune-prompts-using-variants.md)
-- [Deploy a flow](how-to-deploy-for-real-time-inference.md)
+- [Deploy a flow](how-to-deploy-for-real-time-inference.md)