
Commit 8ce97bc

committed: fixes
1 parent bf3d217 commit 8ce97bc

File tree

1 file changed, +7 -7 lines changed


articles/ai-studio/concepts/evaluation-approach-gen-ai.md

Lines changed: 7 additions & 7 deletions
@@ -37,7 +37,7 @@ Key considerations at this stage might include:
 - **Bias and ethical considerations**: Does the model produce any outputs that may perpetuate or promote harmful stereotypes?
 - **Risk and safety**: Are there any risks of the model generating unsafe or malicious content?

-You can explore [Azure AI Foundry benchmarks](../model-benchmarks.md)to evaluate and compare models on publicly available datasets, while also regenerating benchmark results on your own data. Alternatively, you can evaluate one of many base generative AI models via Azure AI Evaluation SDK as demonstrated, see [Evaluate model endpoints sample](https://github.com/Azure-Samples/azureai-samples/blob/main/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb).
+You can explore [Azure AI Foundry benchmarks](./model-benchmarks.md)to evaluate and compare models on publicly available datasets, while also regenerating benchmark results on your own data. Alternatively, you can evaluate one of many base generative AI models via Azure AI Evaluation SDK as demonstrated, see [Evaluate model endpoints sample](https://github.com/Azure-Samples/azureai-samples/blob/main/scenarios/evaluate/evaluate_endpoints/evaluate_endpoints.ipynb).

 ## Pre-production evaluation
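The "Evaluate model endpoints" flow linked in this hunk amounts to wrapping a deployed model in a callable that the evaluation harness can score row by row. A minimal sketch, with the endpoint call stubbed out so it runs without a deployment; the field names `query` and `response` are illustrative assumptions, not the sample's exact schema:

```python
# Hedged sketch: azure-ai-evaluation can score a live endpoint when you hand
# evaluate() a callable "target". The endpoint call below is a stub standing
# in for a real deployment; URL, auth, and payload shape are all assumptions.

def query_endpoint(query: str) -> dict:
    # A real implementation would POST to your deployed model endpoint here.
    # Stubbed so the sketch is self-contained and runnable.
    return {"response": f"(model answer to: {query})"}

def target(query: str) -> dict:
    """Callable suitable for passing as a target to the evaluation run.

    It must return the columns the chosen evaluators expect; here a single
    'response' field is assumed.
    """
    answer = query_endpoint(query)
    return {"response": answer["response"]}

print(target("What is Azure AI Foundry?"))
```

In a real run, `target` would be handed to the SDK's `evaluate()` together with a dataset of queries, and each row's response would be scored by the configured evaluators.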

@@ -54,7 +54,7 @@ Pre-production evaluation involves:

 The pre-production stage acts as a final quality check, reducing the risk of deploying an AI application that does not meet the desired performance or safety standards.

-- Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data with Azure AI Foundry or [Azure AI Evaluation SDK’s](../how-to/develop/evaluate-sdk) supported evaluators, including [generation quality, safety,](..evaluation-metrics-built-in) or [custom evaluators](../how-to/develop/evaluate-sdk.md#custom-evaluators), and [view results via the Azure AI Foundry portal](../how-to/evaluate-results.md).
+- Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data with Azure AI Foundry or [Azure AI Evaluation SDK’s](../how-to/develop/evaluate-sdk.md) supported evaluators, including [generation quality, safety,](..evaluation-metrics-built-in) or [custom evaluators](../how-to/develop/evaluate-sdk.md#custom-evaluators), and [view results via the Azure AI Foundry portal](../how-to/evaluate-results.md).
 - Simulators: If you don’t have evaluation data (test data), Azure AI [Evaluation SDK’s simulators](..//how-to/develop/simulator-interaction-data.md) can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
 - The [adversarial simulator](../how-to/develop/simulator-interaction-data.md#generate-adversarial-simulations-for-safety-evaluation) injects queries that mimic potential security threats or attempt jailbreaks, helping identify limitations and preparing the model for unexpected conditions.
 - [Context-appropriate simulators](../how-to/develop/simulator-interaction-data.md#generate-synthetic-data-and-simulate-non-adversarial-tasks) generate typical, relevant conversations you’d expect from users to test quality of responses.
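The "custom evaluators" mentioned in this hunk are, in the SDK's model, plain Python callables: the harness invokes them once per data row and collects the returned score dictionary. A minimal sketch that runs with no Azure dependency; the evaluator name and word-budget metric are illustrative assumptions, not built-in metrics:

```python
# Hedged sketch of a custom evaluator: a callable taking named dataset fields
# and returning a dict of scores. The metric itself (word-count budget) is an
# invented example, chosen only because it is simple and deterministic.

class AnswerLengthEvaluator:
    """Scores whether a response stays within a target word budget."""

    def __init__(self, max_words: int = 100):
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs) -> dict:
        # The harness is assumed to pass each row's fields as keyword args.
        words = len(response.split())
        return {
            "answer_length": words,
            "within_budget": 1.0 if words <= self.max_words else 0.0,
        }

evaluator = AnswerLengthEvaluator(max_words=5)
print(evaluator(response="Azure AI Foundry evaluates generative models."))
```

Such a callable would be registered alongside the built-in quality and safety evaluators when configuring an evaluation run.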
@@ -80,11 +80,11 @@ Cheat sheet:

 | Purpose | Process | Parameters |
 | -----| -----| ----|
-| What are you evaluating for? | Identify or build relevant evaluators | - [Quality and performance](./evaluation-metrics-built-in.md?tabs=warning#generation-quality-metrics) ( [Quality and performance sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py)) </br> - [Safety and Security](./evaluation-metrics-built-in.md?tabs=warning#risk-and-safety-metrics)) ([Safety and Security sample notebook]((https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluatesafetyrisks.py))) </br> [Custom](../how-to/develop/evaluate-sdk.md#custom-evaluators) ([Custom sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py))] |
-| What data should you use? | Upload or generate relevant dataset | [Generic simulator for measuring Quality and Performance](./concept-synthetic-data.md) ( [Generic simulator sample notebook|(https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/finetune/Llama-notebooks/datagen/synthetic-data-generation.ipynb)] </br> - Adversarial simulator for measuring Safety and Security [Adversarial simulator Docs](../how-to/develop/simulator-interaction-data.md) (Adversarial simulator sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/simulate_and_evaluate_online_endpoint.ipynb)] ) |
-| What resources should conduct the evaluation? | Run evaluation | - Local run </br> - Remote cloud run |
-| How did my model/app perform? | Analyze results | [View aggregate scores, view details, score details, compare eval runs](..//how-to/evaluate-results.md)] |
-| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results did not align to human feedback, adjust your evaluator. </br> - If evaluation results aligned to human feedback but did not meet quality/safety thresholds, apply targeted mitigations. |
+| What are you evaluating for? | Identify or build relevant evaluators | - [Quality and performance](./evaluation-metrics-built-in.md?tabs=warning#generation-quality-metrics) ( [Quality and performance sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py))<br> </br> - [Safety and Security](./evaluation-metrics-built-in.md?tabs=warning#risk-and-safety-metrics) ([Safety and Security sample notebook]((https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluatesafetyrisks.py))) <br> </br> - [Custom](../how-to/develop/evaluate-sdk.md#custom-evaluators) ([Custom sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/evaluate.py)) |
+| What data should you use? | Upload or generate relevant dataset | [Generic simulator for measuring Quality and Performance](./concept-synthetic-data.md) ([Generic simulator sample notebook](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/finetune/Llama-notebooks/datagen/synthetic-data-generation.ipynb)) <br></br> - [Adversarial simulator for measuring Safety and Security](../how-to/develop/simulator-interaction-data.md) ([Adversarial simulator sample notebook](https://github.com/Azure-Samples/rag-data-openai-python-promptflow/blob/main/src/evaluation/simulate_and_evaluate_online_endpoint.ipynb))|
+| What resources should conduct the evaluation? | Run evaluation | - Local run <br> </br> - Remote cloud run |
+| How did my model/app perform? | Analyze results | [View aggregate scores, view details, score details, compare eval runs](..//how-to/evaluate-results.md) |
+| How can I improve? | Make changes to model, app, or evaluators | - If evaluation results did not align to human feedback, adjust your evaluator. <br></br> - If evaluation results aligned to human feedback but did not meet quality/safety thresholds, apply targeted mitigations. |

 ## Related content
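The "Analyze results" row in the cheat sheet above boils down to rolling per-row evaluator scores up into aggregate metrics, the kind of summary shown per evaluation run in the portal. A small sketch of that roll-up in plain Python; the row schema and metric names are illustrative assumptions, not the SDK's exact output format:

```python
# Hedged sketch: aggregating row-level evaluation results into per-evaluator
# mean scores. Field names ("relevance", "groundedness") are examples only.
from statistics import mean

rows = [
    {"relevance": 4, "groundedness": 5},
    {"relevance": 3, "groundedness": 4},
    {"relevance": 5, "groundedness": 5},
]

def aggregate(rows: list) -> dict:
    """Mean score per evaluator across all rows, keyed '<metric>.mean'."""
    return {f"{key}.mean": mean(r[key] for r in rows) for key in rows[0]}

print(aggregate(rows))
```

Comparing these aggregates across evaluation runs is what makes the "compare eval runs" view in the portal useful for spotting regressions after a model or prompt change.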
