In the rapidly evolving landscape of artificial intelligence, the integration of Generative AI Operations (GenAIOps) is transforming how organizations develop and deploy AI applications. As businesses increasingly rely on AI to enhance decision-making, improve customer experiences, and drive innovation, the importance of a robust evaluation framework can't be overstated. Evaluation is an essential component of the generative AI lifecycle to build confidence and trust in AI-centric applications. If not designed carefully, these applications can produce outputs that are fabricated and ungrounded in context, irrelevant or incoherent, resulting in poor customer experiences, or worse, perpetuate societal stereotypes, promote misinformation, expose organizations to malicious attacks, or cause a wide range of other negative impacts.
Evaluators are helpful tools to assess the frequency and severity of content risks or undesirable behavior in AI responses. Performing iterative, systematic evaluations with the right evaluators can help teams measure and address potential response quality, safety, or security concerns throughout the AI development lifecycle, from initial model selection through post-production monitoring.

## Evaluation within the GenAI Ops Lifecycle
:::image type="content" source="../media/evaluations/lifecycle.png" alt-text="Diagram of enterprise GenAIOps lifecycle, showing model selection, building an AI application, and operationalizing." lightbox="../media/evaluations/lifecycle.png":::
By understanding and implementing effective evaluation strategies at each stage, organizations can ensure their AI solutions not only meet initial expectations but also adapt and thrive in real-world environments. Let's dive into how evaluation fits into the three critical stages of the AI lifecycle:
## Base model selection
Key considerations at this stage might include:
- **Accuracy/quality**: How well does the model generate relevant and coherent responses?
- **Performance on specific tasks**: Can the model handle the type of prompts and content required for your use case? How is its latency and cost?
- **Bias and ethical considerations**: Does the model produce any outputs that might perpetuate or promote harmful stereotypes?
- **Risk and safety**: Are there any risks of the model generating unsafe or malicious content?
You can explore [Azure AI Foundry benchmarks](./model-benchmarks.md) to evaluate and compare models on publicly available datasets, while also regenerating benchmark results on your own data. Alternatively, you can evaluate one of many base generative AI models with the Azure AI Evaluation SDK, as demonstrated in the [Evaluate model endpoints sample](https://github.com/Azure-Samples/azureai-samples/blob/main/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb).
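As a rough illustration of that SDK-based approach, the sketch below compares two candidate Azure OpenAI deployments on a couple of test prompts using quality evaluators from the `azure-ai-evaluation` package. The endpoint, deployment names, and prompts are placeholders, and exact result-key names can vary slightly between SDK versions, so treat this as a starting point rather than the linked sample itself.

```python
import os

from azure.ai.evaluation import CoherenceEvaluator, RelevanceEvaluator
from openai import AzureOpenAI

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
api_key = os.environ["AZURE_OPENAI_API_KEY"]

# Judge model used by the quality evaluators (an Azure OpenAI deployment you own).
model_config = {
    "azure_endpoint": endpoint,
    "api_key": api_key,
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment name
}
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)

client = AzureOpenAI(azure_endpoint=endpoint, api_key=api_key, api_version="2024-06-01")

test_queries = [
    "Summarize the key benefits of retrieval-augmented generation.",
    "Explain what groundedness means for a chatbot answer.",
]
candidate_deployments = ["gpt-4o-mini", "gpt-35-turbo"]  # hypothetical candidates

for deployment in candidate_deployments:
    relevance_scores, coherence_scores = [], []
    for query in test_queries:
        completion = client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": query}],
        )
        response = completion.choices[0].message.content
        # Each evaluator returns a dict of scores for one query/response pair
        # (result key names can vary slightly between SDK versions).
        relevance_scores.append(relevance(query=query, response=response)["relevance"])
        coherence_scores.append(coherence(query=query, response=response)["coherence"])
    print(
        f"{deployment}: "
        f"relevance={sum(relevance_scores) / len(relevance_scores):.1f}, "
        f"coherence={sum(coherence_scores) / len(coherence_scores):.1f}"
    )
```

In practice you would use a larger, task-representative prompt set and track cost and latency alongside the quality scores.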
After selecting a base model, the next step is to develop an AI application—such as an AI-powered chatbot, a retrieval-augmented generation (RAG) application, an agentic AI application, or any other generative AI tool. Following development, pre-production evaluation begins. Before deploying the application in a production environment, rigorous testing is essential to ensure the model is truly ready for real-world use.
:::image type="content" source="../media/evaluations/evaluation-models-diagram.png" alt-text="Diagram of pre-production evaluation for models and applications with the six steps." lightbox="../media/evaluations/evaluation-models-diagram.png":::
Pre-production evaluation involves:
- **Testing with evaluation datasets**: These datasets simulate realistic user interactions to ensure the AI application performs as expected.
- **Identifying edge cases**: Finding scenarios where the AI application’s response quality might degrade or produce undesirable outputs.
- **Assessing robustness**: Ensuring that the model can handle a range of input variations without significant drops in quality or safety.
- **Measuring key metrics**: Metrics such as response groundedness, relevance, and safety are evaluated to confirm readiness for production (see the sketch after this list).
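To make one of these metrics concrete, here's a minimal sketch that scores a single RAG-style query/response/context triple for groundedness with the Azure AI Evaluation SDK; the judge deployment name and example strings are placeholders.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment name
}

groundedness = GroundednessEvaluator(model_config)

# Score one query/response/context triple from a RAG application.
result = groundedness(
    query="What is the refund window?",
    response="You can request a refund within 30 days of purchase.",
    context="Our policy allows refunds within 30 days of the purchase date.",
)
print(result)  # e.g. a dict containing a 1-5 groundedness score
```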
The pre-production stage acts as a final quality check, reducing the risk of deploying an AI application that doesn't meet the desired performance or safety standards.
- Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data with Azure AI Foundry or [Azure AI Evaluation SDK’s](../how-to/develop/evaluate-sdk.md) supported evaluators, including [generation quality and safety](./evaluation-metrics-built-in.md) or [custom evaluators](../how-to/develop/evaluate-sdk.md#custom-evaluators), and [view results via the Azure AI Foundry portal](../how-to/evaluate-results.md). A minimal sketch of this flow appears after this list.
- Simulators: If you don’t have evaluation data (test data), Azure AI [Evaluation SDK’s simulators](../how-to/develop/simulator-interaction-data.md) can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
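For the bring-your-own-data path, the following is a hedged sketch of a batch run with the SDK's `evaluate()` function, assuming a JSONL test file whose columns already match the evaluator inputs (`query`, `response`, `context`); the file names and judge deployment are placeholders.

```python
import os

from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator, evaluate

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment name
}

# Each line of the (hypothetical) JSONL file is assumed to look like:
# {"query": "...", "response": "...", "context": "..."}
# Because the column names already match the evaluator inputs, no extra
# column mapping is configured here.
result = evaluate(
    data="evaluation_dataset.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
    },
    output_path="evaluation_results.json",
)

print(result["metrics"])  # aggregate scores across the whole dataset
```

The linked SDK documentation covers additional options, such as adding safety evaluators or logging results to your Azure AI project so they appear in the Foundry portal.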
Once satisfactory results are achieved, the AI application can be deployed to production.
After deployment, the AI application enters the post-production evaluation phase, also known as online evaluation or monitoring. At this stage, the model is embedded within a real-world product and responds to actual user queries. Monitoring ensures that the model continues to behave as expected and adapts to any changes in user behavior or content.
- **Ongoing performance tracking**: Regularly measuring the AI application’s responses using key metrics to ensure consistent output quality (see the sketch after this list).
- **Incident response**: Quickly responding to any harmful, unfair, or inappropriate outputs that might arise during real-world use.
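As a rough sketch of what ongoing performance tracking can look like, the snippet below assumes you can export logged query/response pairs to a JSONL file, samples a subset, and re-runs the same relevance evaluator used before release. The file name, sample size, and alert threshold are illustrative only; the managed monitoring linked below can automate this kind of tracking for you.

```python
import json
import os
import random

from azure.ai.evaluation import RelevanceEvaluator

model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4o",  # hypothetical judge deployment name
}
relevance = RelevanceEvaluator(model_config)

# Hypothetical export of logged query/response pairs from the production app.
with open("production_log.jsonl") as f:
    interactions = [json.loads(line) for line in f]

sample = random.sample(interactions, k=min(50, len(interactions)))
scores = [
    relevance(query=item["query"], response=item["response"])["relevance"]
    for item in sample
]
average = sum(scores) / len(scores)

print(f"Average relevance over {len(sample)} sampled interactions: {average:.2f}")
if average < 3.5:  # illustrative alerting threshold
    print("Quality below threshold: investigate recent changes or data drift.")
```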
By [continuously monitoring the AI application’s behavior in production](https://aka.ms/AzureAIMonitoring), you can maintain high-quality user experiences and swiftly address any issues that surface.
## Conclusion
GenAIOps is all about establishing a reliable and repeatable process for managing generative AI applications across their lifecycle. Evaluation plays a vital role at each stage, from base model selection, through pre-production testing, to ongoing post-production monitoring. By systematically measuring and addressing risks and refining AI systems at every step, teams can build generative AI solutions that are not only powerful but also trustworthy and safe for real-world use.
Cheat sheet:
| Question | Step | Resources |
| --- | --- | --- |
| What are you evaluating for? | Identify or build relevant evaluators | - [Quality and performance](./evaluation-metrics-built-in.md?tabs=warning#generation-quality-metrics) ([Quality and performance sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py))<br> </br> - [Safety and Security](./evaluation-metrics-built-in.md?tabs=warning#risk-and-safety-metrics) ([Safety and Security sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluatesafetyrisks.py)) <br> </br> - [Custom](../how-to/develop/evaluate-sdk.md#custom-evaluators) ([Custom sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py); see the custom evaluator sketch after this table) |
| What data should you use? | Upload or generate relevant dataset | - [Generic simulator for measuring Quality and Performance](./concept-synthetic-data.md) ([Generic simulator sample notebook](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/finetune/Llama-notebooks/datagen/synthetic-data-generation.ipynb)) <br></br> - [Adversarial simulator for measuring Safety and Security](../how-to/develop/simulator-interaction-data.md) ([Adversarial simulator sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/simulate_and_evaluate_online_endpoint.ipynb)) |
| What resources should conduct the evaluation? | Run evaluation | - Local run <br> </br> - Remote cloud run |
| How did my model/app perform? | Analyze results | [View aggregate scores, view details, score details, compare evaluation runs](../how-to/evaluate-results.md) |
| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results didn't align with human feedback, adjust your evaluator. <br></br> - If evaluation results aligned with human feedback but didn't meet quality/safety thresholds, apply targeted mitigations. |
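To illustrate the Custom row above: with the Azure AI Evaluation SDK, a code-based custom evaluator can be as simple as a callable that accepts named inputs and returns a dictionary of scores, which you then pass to `evaluate()` alongside or instead of the built-in evaluators. The answer-length metric and file names below are purely illustrative.

```python
from azure.ai.evaluation import evaluate


class AnswerLengthEvaluator:
    """Scores a response by its word count, as a stand-in for your own logic."""

    def __call__(self, *, response: str, **kwargs):
        return {"answer_length": len(response.split())}


# The custom evaluator runs side by side with (or instead of) built-in ones.
result = evaluate(
    data="evaluation_dataset.jsonl",  # hypothetical test data
    evaluators={"answer_length": AnswerLengthEvaluator()},
    output_path="custom_eval_results.json",
)
print(result["metrics"])
```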