# Azure AI Evaluation client library for Python

Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. An application's generations are quantitatively measured with mathematical metrics as well as AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.

Use the Azure AI Evaluation SDK to:
- Evaluate existing data from generative AI applications
- Evaluate generative AI applications
- Evaluate by generating mathematical, AI-assisted quality, and safety metrics

The Azure AI Evaluation SDK provides the following to evaluate generative AI applications:
- [Evaluators][evaluators] - Generate scores individually or when used together with the `evaluate` API.
- [Evaluate API][evaluate_api] - Python API to evaluate a dataset or an application using built-in or custom evaluators.

[Source code][source_code]
| [Package (PyPI)][evaluation_pypi]
| [API reference documentation][evaluation_ref_docs]
| [Product documentation][product_documentation]
| [Samples][evaluation_samples]

## Getting started

### Prerequisites

- Python 3.8 or later is required to use this package.
- (Optional) An [Azure AI Project][ai_project] or [Azure OpenAI][azure_openai] resource is required to use AI-assisted evaluators.

### Install the package

Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:

```bash
pip install azure-ai-evaluation
```

If you want to track results in [AI Studio][ai_studio], install the `remote` extra:

```bash
pip install azure-ai-evaluation[remote]
```

## Key concepts

### Evaluators

Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.

#### Built-in evaluators

Built-in evaluators are out-of-the-box evaluators provided by Microsoft:

| Category | Evaluator class |
|----------|-----------------|
| [Performance and quality][performance_and_quality_evaluators] (AI-assisted) | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
| [Performance and quality][performance_and_quality_evaluators] (NLP) | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator` |
| [Risk and safety][risk_and_safety_evaluators] (AI-assisted) | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator` |
| [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator` |

For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].

The following snippet shows how to run built-in evaluators locally on single rows of data:

```python
import os

from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

# NLP BLEU score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

# AI-assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

# AI-assisted safety evaluator
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

violence_evaluator = ViolenceEvaluator(azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)
```

#### Custom evaluators

Built-in evaluators are a great way to start evaluating your application's generations out of the box. However, you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

```python
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class-based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        score = any(word in response for word in self._blocklist)
        return {"score": score}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

result = response_length("The capital of Japan is Tokyo.")
result = blocklist_evaluator(response="The capital of Japan is Tokyo.")
```
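
A custom evaluator can also return several named scores at once by returning a dictionary with one key per metric, so each key can then be reported as its own metric. The sketch below is hypothetical (the class name, word list, and length limit are illustrative and not part of the SDK):

```python
# Hypothetical custom evaluator that returns multiple metrics in one dictionary.
class ResponseQualityEvaluator:
    def __init__(self, blocklist, max_length=500):
        self._blocklist = blocklist
        self._max_length = max_length

    def __call__(self, *, response: str, **kwargs):
        return {
            "length": len(response),
            "within_length_limit": len(response) <= self._max_length,
            "contains_blocked_word": any(word in response for word in self._blocklist),
        }

quality_evaluator = ResponseQualityEvaluator(blocklist=["bad", "worst", "terrible"])
result = quality_evaluator(response="The capital of Japan is Tokyo.")
# {'length': 30, 'within_length_limit': True, 'contains_blocked_word': False}
```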

### Evaluate API

The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate a dataset or a generative AI application's responses.

#### Evaluate existing dataset

```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${data.response}"
            }
        }
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    azure_ai_project=azure_ai_project,
    # Optionally provide an output path to dump a JSON file with the metric summary, row-level data, and the studio URL
    output_path="./evaluation_results.json"
)
```
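
The `data.jsonl` file is expected to be in JSON Lines format, one JSON object per line, with fields that match the column mapping above. As a minimal sketch (the field values below are made up for illustration), such a file could be produced like this:

```python
import json

# Hypothetical rows whose fields match the column mapping above.
rows = [
    {
        "queries": "What is the capital of Japan?",
        "ground_truth": "The capital of Japan is Tokyo.",
        "response": "Tokyo is the capital of Japan.",
    },
    {
        "queries": "What is the capital of France?",
        "ground_truth": "The capital of France is Paris.",
        "response": "Paris.",
    },
]

# JSON Lines format: one JSON object per line.
with open("data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```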
For more details, refer to [Evaluate on test dataset using evaluate()][evaluate_dataset].

#### Evaluate generative AI application

```python
from azure.ai.evaluation import evaluate
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_evaluator
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
```
The above code snippet refers to the askwiki application in this [sample][evaluate_app].

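If you are not using that sample, the `target` can be any callable that takes a row's input fields and returns a dictionary of outputs; those keys are what the `${outputs.*}` references in the column mapping above point to. A minimal, hypothetical stand-in for `askwiki` (not the actual sample code) could look like this:

```python
# Hypothetical stand-in for the askwiki target. A real target would call your
# application, for example a retrieval-augmented generation pipeline.
# The parameter name is assumed to match the corresponding column in data.jsonl.
def askwiki(query: str) -> dict:
    retrieved_context = "Tokyo is the capital and largest city of Japan."
    answer = "The capital of Japan is Tokyo."
    # The returned keys are what "${outputs.context}" and "${outputs.response}"
    # in the column mapping above refer to.
    return {"context": retrieved_context, "response": answer}
```

Each row of `data.jsonl` is passed to the target, and the target's outputs are then available to the evaluators alongside the original data columns.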

For more details, refer to [Evaluate on a target][evaluate_target].

### Simulator

## Examples

In the following section you will find examples of:
- [Evaluate an application][evaluate_app]
- [Evaluate different models][evaluate_models]
- [Custom evaluators][custom_evaluators]

More examples can be found [here][evaluate_samples].

## Troubleshooting

### General

Please refer to [troubleshooting][evaluation_tsg] for common issues.

### Logging

[code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
[coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
[coc_contact]: mailto:[email protected]
[evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
[evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
[evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
[evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
[evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_app
[evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
[ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
[ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
[azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
[evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_endpoints
[custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_custom
[evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
[evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
[performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
[risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
[composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators