
Commit a770908

azure-ai-evaluation README update (Azure#38416)

* Updating README
* Review feedback
* Fixing evaluate API appearing twice
* Adding remote extra installation instructions
* Fixing missing link
* Review feedback
* Fixing broken link and spell check errors
* Fixing links
* Fixing warnings from doc build
1 parent 6d1fb48 commit a770908

sdk/evaluation/azure-ai-evaluation/README.md

Lines changed: 160 additions & 59 deletions
@@ -1,107 +1,182 @@
# Azure AI Evaluation client library for Python

Use the Azure AI Evaluation SDK to assess the performance of your generative AI applications. Application outputs are measured quantitatively with mathematical metrics and with AI-assisted quality and safety metrics. Metrics are defined as `evaluators`; built-in or custom evaluators can provide comprehensive insights into an application's capabilities and limitations.

Use the Azure AI Evaluation SDK to:

- Evaluate existing data from generative AI applications
- Evaluate generative AI applications
- Evaluate by generating mathematical, AI-assisted quality and safety metrics

The Azure AI Evaluation SDK provides the following to evaluate generative AI applications:

- [Evaluators][evaluators] - Generate scores individually or when used together with the `evaluate` API.
- [Evaluate API][evaluate_api] - Python API to evaluate a dataset or application using built-in or custom evaluators.

[Source code][source_code]
| [Package (PyPI)][evaluation_pypi]
| [API reference documentation][evaluation_ref_docs]
| [Product documentation][product_documentation]
| [Samples][evaluation_samples]

## Getting started

### Prerequisites

- Python 3.8 or later is required to use this package.
- [Optional] You must have an [Azure AI Project][ai_project] or [Azure OpenAI][azure_openai] resource to use AI-assisted evaluators.

### Install the package

Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:

```bash
pip install azure-ai-evaluation
```

If you want to track results in [AI Studio][ai_studio], install the `remote` extra:

```bash
pip install azure-ai-evaluation[remote]
```

## Key concepts

### Evaluators

Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.

#### Built-in evaluators

Built-in evaluators are out-of-the-box evaluators provided by Microsoft:

| Category | Evaluator class |
|-----------|------------------------------------------------------------------------------------------------------------------------------------|
| [Performance and quality][performance_and_quality_evaluators] (AI-assisted) | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
| [Performance and quality][performance_and_quality_evaluators] (NLP) | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator` |
| [Risk and safety][risk_and_safety_evaluators] (AI-assisted) | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator` |
| [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator` |

For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].

```python
import os

from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

# NLP bleu score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

# AI assisted safety evaluator
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

violence_evaluator = ViolenceEvaluator(azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)
```
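
Evaluators return their results as a dictionary of scores. The exact keys and values differ per evaluator (and may change between versions); the following is an illustrative sketch of inspecting a result, with example output shown in comments.

```python
from pprint import pprint

# Inspect the last result from the safety evaluator above.
# Keys and values shown are illustrative only.
pprint(result)
# {'violence': 'Very low',
#  'violence_reason': "The system's response is a straightforward factual response "
#                     'to a geography question. There is no violent content or '
#                     'language present.',
#  'violence_score': 0}
```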

#### Custom evaluators

Built-in evaluators are great out of the box to start evaluating your application's generations. However, you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

```python
# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        score = any([word in response for word in self._blocklist])
        return {"score": score}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad", "worst", "terrible"])

result = response_length("The capital of Japan is Tokyo.")
result = blocklist_evaluator(response="The capital of Japan is Tokyo.")
```

### Evaluate API

The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate a generative AI application's responses.

#### Evaluate existing dataset

```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}",
                "ground_truth": "${data.ground_truth}",
                "response": "${data.response}"
            }
        }
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    azure_ai_project=azure_ai_project,
    # Optionally provide an output path to dump a JSON file with the metric summary, row-level data, and the studio URL
    output_path="./evaluation_results.json"
)
```
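
The `data.jsonl` file is expected to contain one JSON object per line, with keys matching the columns referenced in the column mapping. For example, rows shaped like the following (hypothetical data, assuming `queries`, `ground_truth`, and `response` columns):

```jsonl
{"queries": "What is the capital of Japan?", "ground_truth": "The capital of Japan is Tokyo.", "response": "Tokyo is the capital of Japan."}
{"queries": "Which tent is the most waterproof?", "ground_truth": "The Alpine Explorer Tent is the most waterproof.", "response": "The Alpine Explorer Tent is the most waterproof."}
```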

For more details, refer to [Evaluate on test dataset using evaluate()][evaluate_dataset].

#### Evaluate generative AI application

```python
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_evaluator
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
```

The above code snippet refers to the askwiki application in this [sample][evaluate_app].

For more details, refer to [Evaluate on a target][evaluate_target].
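
The `target` passed to `evaluate` is a callable that is invoked for each row of the input data and returns a dictionary; the returned keys can then be referenced as `${outputs.*}` in the column mapping. Below is a minimal, hypothetical sketch of what an askwiki-like target could look like (the real application lives in the sample linked above):

```python
# Hypothetical stand-in for the askwiki target used above.
def askwiki(query: str) -> dict:
    # A real application would retrieve context (for example, from Wikipedia)
    # and generate a grounded answer to the query here.
    context = "Tokyo is the capital and most populous city of Japan."
    response = f"Answer to '{query}': {context}"
    return {"context": context, "response": response}
```
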
### Simulator

@@ -422,11 +497,21 @@ outputs = asyncio.run(

print(outputs)
```

## Examples

In the following section you will find examples of:

- [Evaluate an application][evaluate_app]
- [Evaluate different models][evaluate_models]
- [Custom Evaluators][custom_evaluators]

More examples can be found [here][evaluate_samples].

## Troubleshooting

### General

Please refer to [troubleshooting][evaluation_tsg] for common issues.

### Logging

@@ -471,3 +556,19 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
[code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
[coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
[coc_contact]: mailto:[email protected]
[evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
[evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
[evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
[evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
[evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_app
[evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
[ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
[ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
[azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
[evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_endpoints
[custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/evaluate_custom
[evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
[evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
[performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
[risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
[composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators
