You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: articles/ai-studio/concepts/evaluation-approach-gen-ai.md
+7-7Lines changed: 7 additions & 7 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -37,7 +37,7 @@ Key considerations at this stage might include:
37
37
-**Bias and ethical considerations**: Does the model produce any outputs that may perpetuate or promote harmful stereotypes?
38
38
-**Risk and safety**: Are there any risks of the model generating unsafe or malicious content?
39
39
40
-
You can explore [Azure AI Foundry benchmarks](aka.ms/azureaibenchmarks)to evaluate and compare models on publicly available datasets, while also regenerating benchmark results on your own data. Alternatively, you can evaluate one of many base generative AI models via Azure AI Evaluation SDK as demonstrated, see [Evaluate model endpoints sample](https://github.com/Azure-Samples/azureai-samples/blob/main/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb).
40
+
You can explore [Azure AI Foundry benchmarks](../model-benchmarks.md)to evaluate and compare models on publicly available datasets, while also regenerating benchmark results on your own data. Alternatively, you can evaluate one of many base generative AI models via Azure AI Evaluation SDK as demonstrated, see [Evaluate model endpoints sample](https://github.com/Azure-Samples/azureai-samples/blob/main/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb).
The pre-production stage acts as a final quality check, reducing the risk of deploying an AI application that does not meet the desired performance or safety standards.
56
56
57
-
- Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data with Azure AI Foundry or [Azure AI Evaluation SDK’s](aka.ms/azureaievalsSDK) supported evaluators, including g[eneration quality, safety,](aka.ms/evaluationmetrics) or [custom evaluators](aka.ms/customevaluators), and [view results via the Azure AI Foundry portal](aka.ms/AzureAIStudioEvaluationUI).
58
-
- Simulators: If you don’t have evaluation data (test data), Azure AI [Evaluation SDK’s simulators](aka.ms/AzureAIStudioSimulator) can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
59
-
- The [adversarial simulator](ka.ms/adversarialsimulator) injects queries that mimic potential security threats or attempt jailbreaks, helping identify limitations and preparing the model for unexpected conditions.
60
-
-[Context-appropriate simulators](aka.ms/topicrelatedsimulator) generate typical, relevant conversations you’d expect from users to test quality of responses.
57
+
- Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data with Azure AI Foundry or [Azure AI Evaluation SDK’s](../how-to/develop/evaluate-sdk) supported evaluators, including [generation quality, safety,](..evaluation-metrics-built-in) or [custom evaluators](../how-to/develop/evaluate-sdk.md#custom-evaluators), and [view results via the Azure AI Foundry portal](../how-to/evaluate-results.md).
58
+
- Simulators: If you don’t have evaluation data (test data), Azure AI [Evaluation SDK’s simulators](..//how-to/develop/simulator-interaction-data.md) can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
59
+
- The [adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation) injects queries that mimic potential security threats or attempt jailbreaks, helping identify limitations and preparing the model for unexpected conditions.
60
+
-[Context-appropriate simulators](../how-to/develop/simulator-interaction-data.md#generate-synthetic-data-and-simulate-non-adversarial-tasks) generate typical, relevant conversations you’d expect from users to test quality of responses.
61
61
62
-
Alternatively, you can also use [Azure AI Foundry’s evaluation widget](aka.ms/azureaievalsUIWidget) for testing your generative AI applications.
62
+
Alternatively, you can also use [Azure AI Foundry’s evaluation widget](../how-to/evaluate-generative-ai-app.md) for testing your generative AI applications.
63
63
64
64
Once satisfactory results are achieved, the AI application can be deployed to production.
65
65
@@ -84,7 +84,7 @@ Cheat sheet:
84
84
| What data should you use? | Upload or generate relevant dataset |[Generic simulator for measuring Quality and Performance](./concept-synthetic-data.md) ( [Generic simulator sample notebook|(https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/finetune/Llama-notebooks/datagen/synthetic-data-generation.ipynb)] </br> - Adversarial simulator for measuring Safety and Security [Adversarial simulator Docs](../how-to/develop/simulator-interaction-data.md) (Adversarial simulator sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/simulate_and_evaluate_online_endpoint.ipynb)] ) |
85
85
| What resources should conduct the evaluation? | Run evaluation | - Local run </br> - Remote cloud run |
86
86
| How did my model/app perform? | Analyze results |[View aggregate scores, view details, score details, compare eval runs](..//how-to/evaluate-results.md)]|
87
-
| How can I improve? | Make changes to model, app, or evaluators - If evaluation results did not align to human feedback, adjust your evaluator. </br> - If evaluation results aligned to human feedback but did not meet quality/safety thresholds, apply targeted mitigations. |
87
+
| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results did not align to human feedback, adjust your evaluator. </br> - If evaluation results aligned to human feedback but did not meet quality/safety thresholds, apply targeted mitigations. |
0 commit comments