articles/ai-studio/concepts/evaluation-approach-gen-ai.md (1 addition, 1 deletion)
@@ -80,7 +80,7 @@ Cheat sheet:
| Purpose | Process | Parameters |
| -----| -----| ----|
| What are you evaluating for? | Identify or build relevant evaluators | - [Quality and performance](./evaluation-metrics-built-in.md?tabs=warning#generation-quality-metrics) ( [Quality and performance sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py))<br> </br> - [Safety and Security](./evaluation-metrics-built-in.md?#risk-and-safety-evaluators) ([Safety and Security sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluatesafetyrisks.py)) <br> </br> - [Custom](../how-to/develop/evaluate-sdk.md#custom-evaluators) ([Custom sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py)) |
| What data should you use? | Upload or generate relevant dataset | - [Generic simulator for measuring Quality and Performance](./concept-synthetic-data.md) ([Generic simulator sample notebook](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/finetune/Llama-notebooks/datagen/synthetic-data-generation.ipynb)) <br></br> - [Adversarial simulator for measuring Safety and Security](../how-to/develop/simulator-interaction-data.md) ([Adversarial simulator sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/simulate_and_evaluate_online_endpoint.ipynb))|
| What resources should conduct the evaluation? | Run evaluation | - Local run <br> </br> - Remote cloud run |
| What resources should conduct the evaluation? | Run evaluation | - Local run <br> </br> - Remote cloud run |
| How did my model/app perform? | Analyze results |[View aggregate scores, view details, score details, compare evaluation runs](../how-to/evaluate-results.md)|
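For a concrete sense of the local-run path in the cheat sheet, here's a minimal sketch of a batch evaluation, assuming the `azure-ai-evaluation` Python package (the linked samples use the earlier promptflow-based API, so exact names can differ by SDK version). The endpoint, key, deployment name, and `data.jsonl` path are placeholders.

```python
# A minimal local evaluation run, assuming the azure-ai-evaluation package.
# The endpoint, key, deployment name, and data path are placeholders.
from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator

model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

result = evaluate(
    data="data.jsonl",  # one query/response pair (or conversation) per line
    evaluators={
        "relevance": RelevanceEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
    },
)

print(result["metrics"])  # aggregate scores across all rows
```

Each evaluator reads the columns it needs from every row of the dataset and contributes per-row scores plus the aggregates you later analyze in the results view.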
**[Retrieval: 1]**

**Definition**: The retrieved context chunks aren't relevant to the query despite any conceptual similarities. There's no overlap between the query and the retrieved information, and no useful chunks appear in the results. They introduce external knowledge that isn't part of the retrieval documents.

**[Retrieval: 2]**

**Definition**: The context chunks are partially relevant to address the query but are mostly irrelevant, and external knowledge or LLM bias starts influencing the context chunks. The most relevant chunks are either missing or placed at the bottom.

**[Retrieval: 3]**

**Definition**: The context chunks contain relevant information to address the query, but the most pertinent chunks are located at the bottom of the list.

**[Retrieval: 4] (Relevant Context Ranked Middle, No External Knowledge Bias and Factual Accuracy Ignored)**

**Definition**: The context chunks fully address the query, but the most relevant chunk is ranked in the middle of the list. No external knowledge is used to influence the ranking of the chunks; the system only relies on the provided context. Factual accuracy remains out of scope for evaluation.

**[Retrieval: 5] (Highly Relevant, Well Ranked, No Bias Introduced)**

**Definition**: The context chunks not only fully address the query, but also surface the most relevant chunks at the top of the list. The retrieval respects the internal context, avoids relying on any outside knowledge, and focuses solely on pulling the most useful content to the forefront, irrespective of the factual correctness of the information.
### AI-assisted: Relevance
@@ -310,15 +310,15 @@ Relevance refers to how effectively a response addresses a question. It assesses
**Ratings:**
**[Relevance: 1] (Irrelevant Response)**

**Definition**: The response is unrelated to the question. It provides information that is off-topic and doesn't attempt to address the question posed.

**[Relevance: 2] (Incorrect Response)**

**Definition**: The response attempts to address the question but includes incorrect information. It provides a response that is factually wrong based on the provided information.

**[Relevance: 3] (Incomplete Response)**

**Definition**: The response addresses the question but omits key details necessary for a full understanding. It provides a partial response that lacks essential information.

**[Relevance: 4] (Complete Response)**

**Definition**: The response fully addresses the question with accurate and complete information. It includes all essential details required for a comprehensive understanding, without adding any extraneous information.

**[Relevance: 5] (Comprehensive Response with Insights)**

**Definition**: The response not only fully and accurately addresses the question but also includes additional relevant insights or elaboration. It might explain the significance, implications, or provide minor inferences that enhance understanding.
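To see where a single response lands on this 1–5 scale without a full dataset run, an individual evaluator can also be called directly on one query/response pair. Again a sketch assuming the `azure-ai-evaluation` package; the example texts are illustrative, and the keys in the returned dictionary vary by SDK version.

```python
# Score one query/response pair on the 1-5 relevance scale.
from azure.ai.evaluation import RelevanceEvaluator

relevance = RelevanceEvaluator(model_config)  # model_config as in the earlier sketch

score = relevance(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent is the most waterproof.",
)
print(score)  # for example: {"relevance": 5.0, ...}
```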
### AI-assisted: Coherence
@@ -339,15 +339,15 @@ Coherence refers to the logical and orderly presentation of ideas in a response,
**Ratings:**
**[Coherence: 1] (Incoherent Response)**

**Definition**: The response lacks coherence entirely. It consists of disjointed words or phrases that don't form complete or meaningful sentences. There's no logical connection to the question, making the response incomprehensible.

**[Coherence: 2] (Poorly Coherent Response)**

**Definition**: The response shows minimal coherence with fragmented sentences and limited connection to the question. It contains some relevant keywords but lacks logical structure and clear relationships between ideas, making the overall message difficult to understand.

**[Coherence: 3] (Partially Coherent Response)**

**Definition**: The response partially addresses the question with some relevant information but exhibits issues in the logical flow and organization of ideas. Connections between sentences might be unclear or abrupt, requiring the reader to infer the links. The response might lack smooth transitions and might present ideas out of order.

**[Coherence: 4] (Coherent Response)**

**Definition**: The response is coherent and effectively addresses the question. Ideas are logically organized with clear connections between sentences and paragraphs. Appropriate transitions are used to guide the reader through the response, which flows smoothly and is easy to follow.

**[Coherence: 5] (Highly Coherent Response)**

**Definition**: The response is exceptionally coherent, demonstrating sophisticated organization and flow. Ideas are presented in a logical and seamless manner, with excellent use of transitional phrases and cohesive devices. The connections between concepts are clear and enhance the reader's understanding. The response thoroughly addresses the question with clarity and precision.
### AI-assisted: Fluency
@@ -454,9 +454,7 @@ This rating value should always be an integer between 1 and 5. So the rating pro
| When to use it? | The recommended scenario is Natural Language Processing (NLP) tasks. It addresses limitations of other metrics like BLEU by considering synonyms, stemming, and paraphrasing. METEOR score considers synonyms and word stems to more accurately capture meaning and language variations. In addition to machine translation and text summarization, paraphrase detection is a recommended use case for the METEOR score.|
| What does it need as input? | Response, Ground Truth |
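As an illustration of what the metric computes (not necessarily the code path the built-in evaluator uses), a METEOR score between a response and its ground truth can be reproduced with the open-source NLTK implementation:

```python
# Compute a METEOR score between a response and its ground truth with NLTK.
# Requires: pip install nltk, plus the WordNet data used for synonym matching.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # one-time download of synonym data

# meteor_score expects a list of tokenized references and one tokenized hypothesis.
ground_truth = "The Alpine Explorer Tent is the most waterproof tent".split()
response = "The most waterproof tent is the Alpine Explorer Tent".split()

print(meteor_score([ground_truth], response))
```

Because METEOR matches on stems and synonyms rather than exact n-grams, the reordered wording above still scores well, which is exactly the limitation of BLEU that the table calls out.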
## Supported data format
Azure AI Foundry allows you to easily evaluate simple query and response pairs or complex, single/multi-turn conversations where you ground the generative AI model in your specific data (also known as Retrieval Augmented Generation or RAG). Currently, we support the following data formats.
@@ -475,11 +473,11 @@ Users pose single queries or prompts, and a generative AI model is employed to i
Users engage in conversational interactions, either through a series of multiple user and assistant turns or in a single exchange. The generative AI model, equipped with retrieval mechanisms, generates responses and can access and incorporate information from external sources, such as documents. The Retrieval Augmented Generation (RAG) model enhances the quality and relevance of responses by using external documents and knowledge, which can be injected into the conversation dataset in the supported format.
A conversation is a Python dictionary with a list of messages, each of which includes content, role, and optionally context. The following is an example of a two-turn conversation.
The test set follows this data format:
```jsonl
{"conversation": {"messages": [ { "content": "Which tent is the most waterproof?", "role": "user" }, { "content": "The Alpine Explorer Tent is the most waterproof", "role": "assistant", "context": "From our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight." }, { "content": "How much does it cost?", "role": "user" }, { "content": "The Alpine Explorer Tent is $120.", "role": "assistant", "context": null } ] }}
```