You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In today's AI-driven world, Generative AI Operations (GenAIOps) is revolutionizing how organizations build and deploy intelligent systems. As companies increasingly use AI to transform decision-making, enhance customer experiences, and fuel innovation, one element stands paramount: robust evaluation frameworks. Evaluation isn't just a checkpoint. It's the foundation of trust in AI applications. Without rigorous assessment, AI systems can produce content that's:
20
+
In today's AI-driven world, Generative AI Operations (GenAIOps) is revolutionizing how organizations build and deploy intelligent systems. As companies increasingly use AI to transform decisionmaking, enhance customer experiences, and fuel innovation, one element stands paramount: robust evaluation frameworks. Evaluation isn't just a checkpoint. It's the foundation of trust in AI applications. Without rigorous assessment, AI systems can produce content that's:
21
21
22
22
- Fabricated or ungrounded in reality
23
23
- Irrelevant or incoherent to user needs
@@ -115,7 +115,7 @@ GenAIOps uses the following three stages.
115
115
116
116
### Base model selection
117
117
118
-
Before building your application, you need to select the right foundation. This initial evaluation helps you compare different models based on:
118
+
Before you build your application, select the right foundation. This initial evaluation helps you compare different models based on:
119
119
120
120
- Quality and accuracy: How relevant and coherent are the model's responses?
121
121
- Task performance: Does the model handle your specific use cases efficiently?
@@ -126,7 +126,7 @@ Before building your application, you need to select the right foundation. This
126
126
127
127
### Pre-production evaluation
128
128
129
-
After you select a base model, the next step is to develop an AI application, such as an AI-powered chatbot, a retrieval-augmented generation (RAG) application, an agentic AI application, or any other generative AI tool. When development is complete, *pre-production evaluation* begins. Before you deploy to a production environment, thorough testing is essential to ensure the model is ready for real-world use.
129
+
After you select a base model, the next step is to develop an AI application, such as an AI-powered chatbot, a retrieval-augmented generation (RAG) application, an agentic AI application, or any other generative AI tool. When development is complete, *pre-production evaluation* begins. Before you deploy to a production environment, thorough testing is essential to ensure that the model is ready for real-world use.
130
130
131
131
Pre-production evaluation involves:
132
132
@@ -141,27 +141,30 @@ The pre-production stage acts as a final quality check, reducing the risk of dep
141
141
142
142
Evaluation Tools and Approaches:
143
143
144
-
- Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data with supported evaluators, including generation quality, safety, or custom evaluators. View results by using the Azure AI Foundry portal. Use Azure AI Foundry’s evaluation wizard or [Azure AI Evaluation SDK’s](../how-to/develop/evaluate-sdk.md) supported evaluators, including generation quality, safety, or [custom evaluators](./evaluation-evaluators/custom-evaluators.md). [View results by using the Azure AI Foundry portal](../how-to/evaluate-results.md).
145
-
- Simulators and AI red teaming agent (preview): If you don’t have evaluation data (test data), [Azure AI Evaluation SDK’s simulators](..//how-to/develop/simulator-interaction-data.md) can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
144
+
-**Bring your own data**: You can evaluate your AI applications in pre-production using your own evaluation data with supported evaluators, including generation quality, safety, or custom evaluators. View results by using the Azure AI Foundry portal.
145
+
146
+
Use Azure AI Foundry’s evaluation wizard or [Azure AI Evaluation SDK’s](../how-to/develop/evaluate-sdk.md) supported evaluators, including generation quality, safety, or [custom evaluators](./evaluation-evaluators/custom-evaluators.md). [View results by using the Azure AI Foundry portal](../how-to/evaluate-results.md).
147
+
148
+
-**Simulators and AI red teaming agent (preview)**: If you don’t have evaluation data or test data, [Azure AI Evaluation SDK’s simulators](..//how-to/develop/simulator-interaction-data.md) can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
146
149
147
150
-[Adversarial simulators](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation) inject static queries that mimic potential safety risks or security attacks or attempted jailbreaks. The simulators help identify limitations to prepare the model for unexpected conditions.
148
151
-[Context-appropriate simulators](../how-to/develop/simulator-interaction-data.md#generate-synthetic-data-and-simulate-non-adversarial-tasks) generate typical, relevant conversations you might expect from users to test quality of responses. With context-appropriate simulators, you can assess metrics such as groundedness, relevance, coherence, and fluency of generated responses.
149
152
-[AI red teaming agent (preview)](../how-to/develop/run-scans-ai-red-teaming-agent.md) simulates complex adversarial attacks against your AI system using a broad range of safety and security attacks. It uses Microsoft’s open framework for Python Risk Identification Tool (PyRIT).
150
153
151
154
Automated scans using the AI red teaming agent enhance pre-production risk assessment by systematically testing AI applications for risks. This process involves simulated attack scenarios to identify weaknesses in model responses before real-world deployment.
152
155
153
-
By running AI red teaming scans, you can detect and mitigate potential safety issues before deployment. We recommend this tool to be used with human-in-the-loop processes such as conventional AI red teaming probing to help accelerate risk identification and aid in the assessment by a human expert.
156
+
By running AI red teaming scans, you can detect and mitigate potential safety issues before deployment. We recommend that you use this tool along with human-in-the-loop processes, such as conventional AI red teaming probing, to help accelerate risk identification and aid in the assessment by a human expert.
154
157
155
158
Alternatively, you can also use [evaluation functionality](../how-to/evaluate-generative-ai-app.md) in the Azure AI Foundry portal for testing your generative AI applications.
156
159
157
-
After you achieve satisfactory results, you can deploy the AI application to production.
160
+
After you get satisfactory results, you can deploy the AI application to production.
158
161
159
162
### Post-production monitoring
160
163
161
164
After deployment, continuous monitoring ensures your AI application maintains quality in real-world conditions.
162
165
163
-
- Performance tracking: Regular measurement of key metrics.
164
-
- Incident response: Swift action when harmful or inappropriate outputs occur.
166
+
-**Performance tracking**: Regular measurement of key metrics.
167
+
-**Incident response**: Swift action when harmful or inappropriate outputs occur.
165
168
166
169
Effective monitoring helps maintain user trust and allows for rapid issue resolution.
0 commit comments