Commit fff7721

Merge pull request #249786 from ManoharLakkoju-MSFT/patch-69
(AzureCXP) fixes MicrosoftDocs/azure-docs#114094
2 parents 23149ef + 49256c3 commit fff7721

File tree

1 file changed: +2, −2 lines


articles/machine-learning/prompt-flow/how-to-bulk-test-evaluate-flow.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -146,7 +146,7 @@ In Prompt flow, we provide multiple built-in evaluation methods to help you meas
 | Classification Accuracy Evaluation | Accuracy | Measures the performance of a classification system by comparing its outputs to ground truth. | No | prediction, ground truth | in the range [0, 1]. |
 | QnA Relevance Scores Pairwise Evaluation | Score, win/lose | Assesses the quality of answers generated by a question answering system. It involves assigning relevance scores to each answer based on how well it matches the user question, comparing different answers to a baseline answer, and aggregating the results to produce metrics such as averaged win rates and relevance scores. | Yes | question, answer (no ground truth or context) | Score: 0-100, win/lose: 1/0 |
 | QnA Groundedness Evaluation | Groundedness | Measures how grounded the model's predicted answers are in the input source. Even if an LLM's responses are true, they are ungrounded if they can't be verified against the source. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
-| QnA Ada Similarity Evaluation | Similarity | Measures similarity between user-provided ground truth answers and the model predicted answer. | Yes | question, answer, ground truth (context not needed) | in the range [0, 1]. |
+| QnA GPT Similarity Evaluation | GPT Similarity | Measures similarity between user-provided ground truth answers and the model predicted answer using a GPT model. | Yes | question, answer, ground truth (context not needed) | in the range [0, 1]. |
 | QnA Relevance Evaluation | Relevance | Measures how relevant the model's predicted answers are to the questions asked. | Yes | question, answer, context (no ground truth) | 1 to 5, with 1 being the worst and 5 being the best. |
 | QnA Coherence Evaluation | Coherence | Measures the quality of all sentences in a model's predicted answer and how they fit together naturally. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
 | QnA Fluency Evaluation | Fluency | Measures how grammatically and linguistically correct the model's predicted answer is. | Yes | question, answer (no ground truth or context) | 1 to 5, with 1 being the worst and 5 being the best. |
@@ -174,4 +174,4 @@ In this document, you learned how to run a bulk test and use a built-in evaluati

 - [Develop a customized evaluation flow](how-to-develop-an-evaluation-flow.md)
 - [Tune prompts using variants](how-to-tune-prompts-using-variants.md)
-- [Deploy a flow](how-to-deploy-for-real-time-inference.md)
+- [Deploy a flow](how-to-deploy-for-real-time-inference.md)
```
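As a rough illustration of how an accuracy-style metric such as the Classification Accuracy Evaluation in the table above yields a value in the range [0, 1] (the function below is a hypothetical sketch, not part of the Prompt flow SDK):

```python
def classification_accuracy(predictions, ground_truths):
    """Fraction of predictions that exactly match the ground truth.

    Returns a score in [0, 1], matching the output range described
    for Classification Accuracy Evaluation. This is an illustrative
    sketch, not the actual Prompt flow implementation.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must be the same length")
    matches = sum(p == g for p, g in zip(predictions, ground_truths))
    return matches / len(predictions)

# Three of four predictions match the ground truth labels.
score = classification_accuracy(["yes", "no", "yes", "no"],
                                ["yes", "no", "no", "no"])
print(score)  # 0.75
```

The LLM-assisted metrics in the table (groundedness, relevance, coherence, fluency) are scored differently, by prompting a model to grade each answer on a 1-to-5 scale, so they can't be reduced to a simple comparison like this.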
