articles/machine-learning/prompt-flow/how-to-bulk-test-evaluate-flow.md (2 additions, 2 deletions)
@@ -146,7 +146,7 @@ In Prompt flow, we provide multiple built-in evaluation methods to help you meas
 | Classification Accuracy Evaluation | Accuracy | Measures the performance of a classification system by comparing its outputs to ground truth. | No | prediction, ground truth | in the range [0, 1]. |
 | QnA Relevance Scores Pairwise Evaluation | Score, win/lose | Assesses the quality of answers generated by a question answering system. It involves assigning relevance scores to each answer based on how well it matches the user question, comparing different answers to a baseline answer, and aggregating the results to produce metrics such as averaged win rates and relevance scores. | Yes | question, answer (no ground truth or context) | Score: 0-100, win/lose: 1/0 |
 | QnA Groundedness Evaluation | Groundedness | Measures how grounded the model's predicted answers are in the input source. Even if the LLM's responses are true, they're ungrounded if they can't be verified against the source. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
-| QnA Ada Similarity Evaluation | Similarity | Measures similarity between user-provided ground truth answers and the model predicted answer. | Yes | question, answer, ground truth (context not needed) | in the range [0, 1]. |
+| QnA GPT Similarity Evaluation | GPT Similarity | Measures similarity between user-provided ground truth answers and the model predicted answer using a GPT model. | Yes | question, answer, ground truth (context not needed) | in the range [0, 1]. |
 | QnA Relevance Evaluation | Relevance | Measures how relevant the model's predicted answers are to the questions asked. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
 | QnA Coherence Evaluation | Coherence | Measures the quality of all sentences in a model's predicted answer and how they fit together naturally. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
 | QnA Fluency Evaluation | Fluency | Measures how grammatically and linguistically correct the model's predicted answer is. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
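
The row swapped in this hunk replaces the embedding-based Ada similarity metric with a GPT-judged similarity metric. To make that distinction concrete, here is a minimal, hypothetical Python sketch of the two approaches; the `embed` and `chat` callables, the prompt wording, and the function names are illustrative assumptions, not the built-in evaluation flows' actual code.

```python
# Hypothetical sketch contrasting embedding-based ("Ada") similarity with
# model-judged ("GPT") similarity. `embed` and `chat` are assumed callables
# supplied by the caller, not a real SDK API.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def ada_style_similarity(embed, ground_truth: str, answer: str) -> float:
    # Embedding-based: embed both texts with an embedding model
    # (e.g. text-embedding-ada-002) and compare the vectors.
    return cosine_similarity(embed(ground_truth), embed(answer))


GPT_SIMILARITY_PROMPT = """\
On a scale of 0 to 1, how semantically similar is the answer to the ground truth?
Question: {question}
Ground truth: {ground_truth}
Answer: {answer}
Reply with a single number."""


def gpt_style_similarity(chat, question: str, ground_truth: str, answer: str) -> float:
    # Model-judged: ask a GPT model to grade the match directly.
    # Assumes the model replies with just a number, per the prompt.
    reply = chat(GPT_SIMILARITY_PROMPT.format(
        question=question, ground_truth=ground_truth, answer=answer))
    return float(reply.strip())
```

Both variants yield a score in roughly the [0, 1] range the table documents; the GPT-judged variant trades the determinism of embedding cosine distance for a judge that can weigh paraphrases and factual equivalence in context of the question.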
@@ -174,4 +174,4 @@ In this document, you learned how to run a bulk test and use a built-in evaluati
 
 - [Develop a customized evaluation flow](how-to-develop-an-evaluation-flow.md)
 - [Tune prompts using variants](how-to-tune-prompts-using-variants.md)
-- [Deploy a flow](how-to-deploy-for-real-time-inference.md)
+- [Deploy a flow](how-to-deploy-for-real-time-inference.md)