Commit 25c13ba

aroclinx committed
1 parent 34c57f1 commit 25c13ba

File tree: 1 file changed, +11 -11 lines

articles/ai-studio/concepts/evaluation-approach-gen-ai.md

Lines changed: 11 additions & 11 deletions
@@ -18,13 +18,13 @@ author: lgayhardt
 
 [!INCLUDE [feature-preview](../includes/feature-preview.md)]
 
-In the rapidly evolving landscape of artificial intelligence, the integration of Generative AI Operations (GenAIOps) is transforming how organizations develop and deploy AI applications. As businesses increasingly rely on AI to enhance decision-making, improve customer experiences, and drive innovation, the importance of a robust evaluation framework cannot be overstated. Evaluation is an essential component of the generative AI lifecycle to build confidence and trust in AI-centric applications. If not designed carefully, these applications can produce outputs that are fabricated and ungrounded in context, irrelevant or incoherent, resulting in poor customer experiences, or worse, perpetuate societal stereotypes, promote misinformation, expose organizations to malicious attacks, or a wide range of other negative impacts.
+In the rapidly evolving landscape of artificial intelligence, the integration of Generative AI Operations (GenAIOps) is transforming how organizations develop and deploy AI applications. As businesses increasingly rely on AI to enhance decision-making, improve customer experiences, and drive innovation, the importance of a robust evaluation framework can't be overstated. Evaluation is an essential component of the generative AI lifecycle to build confidence and trust in AI-centric applications. If not designed carefully, these applications can produce outputs that are fabricated and ungrounded in context, irrelevant or incoherent, resulting in poor customer experiences, or worse, perpetuate societal stereotypes, promote misinformation, expose organizations to malicious attacks, or a wide range of other negative impacts.
 
-Evaluators are helpful tools to assess the frequency and severity of content risks or undesirable behavior in AI responses. Performing iterative, systematic evaluations with the right evaluators can help teams measure and address potential response quality, safety, or security concerns throughout the AI development lifecycle, from initial model selection through post-production monitoring. Evaluation within the GenAI Ops Lifecycle
+Evaluators are helpful tools to assess the frequency and severity of content risks or undesirable behavior in AI responses. Performing iterative, systematic evaluations with the right evaluators can help teams measure and address potential response quality, safety, or security concerns throughout the AI development lifecycle, from initial model selection through post-production monitoring. Evaluation within the GenAI Ops Lifecycle production.
 
 :::image type="content" source="../media/evaluations/lifecycle.png" alt-text="Diagram of enterprise GenAIOps lifecycle, showing model selection, building an AI application, and operationalizing." lightbox="../media/evaluations/lifecycle.png":::
 
-production. By understanding and implementing effective evaluation strategies at each stage, organizations can ensure their AI solutions not only meet initial expectations but also adapt and thrive in real-world environments. Let's dive into how evaluation fits into the three critical stages of the AI lifecycle
+By understanding and implementing effective evaluation strategies at each stage, organizations can ensure their AI solutions not only meet initial expectations but also adapt and thrive in real-world environments. Let's dive into how evaluation fits into the three critical stages of the AI lifecycle
 
 ## Base model selection
 
@@ -34,7 +34,7 @@ Key considerations at this stage might include:
 
 - **Accuracy/quality**: How well does the model generate relevant and coherent responses?
 - **Performance on specific tasks**: Can the model handle the type of prompts and content required for your use case? How is its latency and cost?
-- **Bias and ethical considerations**: Does the model produce any outputs that may perpetuate or promote harmful stereotypes?
+- **Bias and ethical considerations**: Does the model produce any outputs that might perpetuate or promote harmful stereotypes?
 - **Risk and safety**: Are there any risks of the model generating unsafe or malicious content?
 
 You can explore [Azure AI Foundry benchmarks](./model-benchmarks.md)to evaluate and compare models on publicly available datasets, while also regenerating benchmark results on your own data. Alternatively, you can evaluate one of many base generative AI models via Azure AI Evaluation SDK as demonstrated, see [Evaluate model endpoints sample](https://github.com/Azure-Samples/azureai-samples/blob/main/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb).
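
The paragraph above points to the Azure AI Evaluation SDK for scoring candidate base models on your own prompts. A rough, non-authoritative sketch of that kind of spot check, assuming the `azure-ai-evaluation` Python package and placeholder endpoint, key, and deployment names:

```python
# Sketch only: spot-check a candidate base model's responses with two of the
# SDK's AI-assisted quality evaluators. Assumes `pip install azure-ai-evaluation`
# and an Azure OpenAI deployment to act as the judge; all names are placeholders.
from azure.ai.evaluation import CoherenceEvaluator, RelevanceEvaluator

# Configuration for the judge model (not the candidate model under test).
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<judge-deployment-name>",
}

relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)

# One prompt/response pair produced by the candidate base model.
query = "What is the capital of France?"
response = "Paris is the capital of France."

print(relevance(query=query, response=response))
print(coherence(query=query, response=response))
```

Each call returns a small dictionary of scores; the exact key names and score scales depend on the SDK version, so treat the printed output as illustrative rather than guaranteed.
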
@@ -43,16 +43,16 @@ You can explore [Azure AI Foundry benchmarks](./model-benchmarks.md)to evaluate
 
 After selecting a base model, the next step is to develop an AI application—such as an AI-powered chatbot, a retrieval-augmented generation (RAG) application, an agentic AI application, or any other generative AI tool. Following development, pre-production evaluation begins. Before deploying the application in a production environment, rigorous testing is essential to ensure the model is truly ready for real-world use.
 
-:::image type="content" source="../media/evaluations/evaluation-models-diagram.png" alt-text="Diagram of pre-production evaluation for models and applications with the 6 steps." lightbox="../media/evaluations/evaluation-models-diagram.png ":::
+:::image type="content" source="../media/evaluations/evaluation-models-diagram.png" alt-text="Diagram of pre-production evaluation for models and applications with the six steps." lightbox="../media/evaluations/evaluation-models-diagram.png ":::
 
 Pre-production evaluation involves:
 
 - **Testing with evaluation datasets**: These datasets simulate realistic user interactions to ensure the AI application performs as expected.
-- **Identifying edge cases**: Finding scenarios where the AI application’s response quality may degrade or produce undesirable outputs.
+- **Identifying edge cases**: Finding scenarios where the AI application’s response quality might degrade or produce undesirable outputs.
 - **Assessing robustness**: Ensuring that the model can handle a range of input variations without significant drops in quality or safety.
 - **Measuring key metrics**: Metrics such as response groundedness, relevance, and safety are evaluated to confirm readiness for production.
 
-The pre-production stage acts as a final quality check, reducing the risk of deploying an AI application that does not meet the desired performance or safety standards.
+The pre-production stage acts as a final quality check, reducing the risk of deploying an AI application that doesn't meet the desired performance or safety standards.
 
 - Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data with Azure AI Foundry or [Azure AI Evaluation SDK’s](../how-to/develop/evaluate-sdk.md) supported evaluators, including [generation quality, safety,](./evaluation-metrics-built-in.md) or [custom evaluators](../how-to/develop/evaluate-sdk.md#custom-evaluators), and [view results via the Azure AI Foundry portal](../how-to/evaluate-results.md).
 - Simulators: If you don’t have evaluation data (test data), Azure AI [Evaluation SDK’s simulators](..//how-to/develop/simulator-interaction-data.md) can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
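
The "Bring your own data" bullet above maps to the SDK's `evaluate()` entry point. A minimal sketch of that pre-production run, assuming a hypothetical `test_data.jsonl` whose rows contain `query`, `context`, and `response` fields, plus the same placeholder configuration values as before:

```python
# Sketch only: batch-evaluate your own test data with built-in quality and
# safety evaluators. Assumes `azure-ai-evaluation`, `azure-identity`, and a
# hypothetical test_data.jsonl; all resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    ViolenceEvaluator,
)

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<judge-deployment-name>",
}

# AI-assisted safety evaluators are backed by an Azure AI project.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

result = evaluate(
    data="test_data.jsonl",  # one JSON record per line: query, context, response
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),
        "relevance": RelevanceEvaluator(model_config),
        "violence": ViolenceEvaluator(
            azure_ai_project=azure_ai_project,
            credential=DefaultAzureCredential(),
        ),
    },
    output_path="./evaluation_results.json",
)

print(result["metrics"])  # aggregate scores across the dataset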
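
Passing an `azure_ai_project` argument to `evaluate()` is also how runs can be logged to the Azure AI Foundry portal for the result views linked above; check the SDK reference for the exact parameters your version supports.

The "Simulators" bullet covers the case where no test data exists yet. A sketch of generating adversarial queries, assuming the simulator interface shipped in recent `azure-ai-evaluation` releases; the callback is a stub standing in for your application, and the call signatures should be verified against your installed version:

```python
# Sketch only: generate adversarial test conversations against a stubbed app
# callback, then persist them for later evaluation. The simulator interface and
# callback contract below follow the SDK's documented pattern as an assumption.
import asyncio
import json

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

async def app_callback(messages, stream=False, session_state=None, context=None):
    # Replace this stub with a call into your real AI application.
    last_user_message = messages["messages"][-1]["content"]
    messages["messages"].append(
        {"role": "assistant", "content": f"Stub answer to: {last_user_message}"}
    )
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }

async def main():
    simulator = AdversarialSimulator(
        azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
    )
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=app_callback,
        max_simulation_results=5,
    )
    # Write one simulated conversation per line for the evaluators to consume.
    with open("adversarial_data.jsonl", "w") as f:
        for conversation in outputs:
            f.write(json.dumps(conversation) + "\n")

asyncio.run(main())
```

The resulting file can then be fed to the same `evaluate()` call shown above, typically with the risk and safety evaluators.
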
@@ -68,13 +68,13 @@ Once satisfactory results are achieved, the AI application can be deployed to pr
 After deployment, the AI application enters the post-production evaluation phase, also known as online evaluation or monitoring. At this stage, the model is embedded within a real-world product and responds to actual user queries. Monitoring ensures that the model continues to behave as expected and adapts to any changes in user behavior or content.
 
 - **Ongoing performance tracking**: Regularly measuring AI application’s response using key metrics to ensure consistent output quality.
-- **Incident response**: Quickly responding to any harmful, unfair, or inappropriate outputs that may arise during real-world use.
+- **Incident response**: Quickly responding to any harmful, unfair, or inappropriate outputs that might arise during real-world use.
 
 By [continuously monitoring the AI application’s behavior in production](https://aka.ms/AzureAIMonitoring), you can maintain high-quality user experiences and swiftly address any issues that surface.
 
 ## Conclusion
 
-GenAIOps is all about establishing a reliable and repeatable process for managing generative AI applications across their lifecycle. Evaluation plays a vital role at each stage, from base model selection, through pre-production testing, to ongoing post-production monitoring. By systematically measuring and addressing risks and refining AI systems at every step, teams can build generative AI solutions that are not only powerful but also trustworthy and safe for real-world use.
+GenAIOps is all about establishing a reliable and repeatable process for managing generative AI applications across their lifecycle. Evaluation plays a vital role at each stage, from base model selection, through pre-production testing, to ongoing post-production monitoring. By systematically measuring and addressing risks and refining AI systems at every step, teams can build generative AI solutions that aren't only powerful but also trustworthy and safe for real-world use.
 
 Cheat sheet:
 
@@ -83,8 +83,8 @@ Cheat sheet:
 | What are you evaluating for? | Identify or build relevant evaluators | - [Quality and performance](./evaluation-metrics-built-in.md?tabs=warning#generation-quality-metrics) ( [Quality and performance sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py))<br> </br> - [Safety and Security](./evaluation-metrics-built-in.md?tabs=warning#risk-and-safety-metrics) ([Safety and Security sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluatesafetyrisks.py)) <br> </br> - [Custom](../how-to/develop/evaluate-sdk.md#custom-evaluators) ([Custom sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py)) |
 | What data should you use? | Upload or generate relevant dataset | [Generic simulator for measuring Quality and Performance](./concept-synthetic-data.md) ([Generic simulator sample notebook](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/finetune/Llama-notebooks/datagen/synthetic-data-generation.ipynb)) <br></br> - [Adversarial simulator for measuring Safety and Security](../how-to/develop/simulator-interaction-data.md) ([Adversarial simulator sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/simulate_and_evaluate_online_endpoint.ipynb))|
 | What resources should conduct the evaluation? | Run evaluation | - Local run <br> </br> - Remote cloud run |
-| How did my model/app perform? | Analyze results | [View aggregate scores, view details, score details, compare eval runs](..//how-to/evaluate-results.md) |
-| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results did not align to human feedback, adjust your evaluator. <br></br> - If evaluation results aligned to human feedback but did not meet quality/safety thresholds, apply targeted mitigations. |
+| How did my model/app perform? | Analyze results | [View aggregate scores, view details, score details, compare evaluation runs](..//how-to/evaluate-results.md) |
+| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results didn't align to human feedback, adjust your evaluator. <br></br> - If evaluation results aligned to human feedback but didn't meet quality/safety thresholds, apply targeted mitigations. |
 
 ## Related content
 
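
The cheat sheet also lists custom evaluators and "adjust your evaluator" as a mitigation. In the Azure AI Evaluation SDK, a custom evaluator can simply be a callable that returns a dictionary of scores; the class below is an illustrative sketch (the evaluator name and thresholds are invented for the example):

```python
# Sketch only: a code-based custom evaluator. Any callable returning a dict of
# numeric scores can be passed to evaluate(...) alongside the built-in evaluators.
class AnswerLengthEvaluator:
    """Flags responses that are too short or too long for the product's UX."""

    def __init__(self, min_words: int = 5, max_words: int = 200):
        self.min_words = min_words
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs) -> dict:
        word_count = len(response.split())
        within_bounds = self.min_words <= word_count <= self.max_words
        return {
            "answer_word_count": float(word_count),
            "answer_length_ok": 1.0 if within_bounds else 0.0,
        }

# Score a single row directly, or register it via
# evaluators={"answer_length": AnswerLengthEvaluator()} in evaluate(...).
print(AnswerLengthEvaluator()(response="Paris is the capital of France."))
```
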