Commit 314e2f7

Add evaluate flow as well
1 parent 8849656 commit 314e2f7

13 files changed (+333 −45 lines)

docs/evaluation.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Evaluating the RAG answer quality

Follow these steps to evaluate the quality of the answers generated by the RAG flow.

* [Deploy a GPT-4 model](#deploy-a-gpt-4-model)
* [Set up the evaluation environment](#set-up-the-evaluation-environment)
* [Generate ground truth data](#generate-ground-truth-data)
* [Run bulk evaluation](#run-bulk-evaluation)
* [Review the evaluation results](#review-the-evaluation-results)
* [Run bulk evaluation on a PR](#run-bulk-evaluation-on-a-pr)

## Deploy a GPT-4 model

1. Run this command to tell `azd` to deploy a GPT-4 level model for evaluation:

    ```shell
    azd env set USE_EVAL true
    ```

2. Set the capacity to the highest value available to you, so that the evaluation runs quickly:

    ```shell
    azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY 100
    ```

    By default, that will provision a `gpt-4o` model, version `2024-08-06`. To change those settings, set the azd environment variables `AZURE_OPENAI_EVAL_MODEL` and `AZURE_OPENAI_EVAL_MODEL_VERSION` to the desired values.

3. Then, run the following command to provision the model:

    ```shell
    azd provision
    ```

## Set up the evaluation environment

Install all the dependencies for the evaluation script by running the following command:

```bash
pip install -r evals/requirements.txt
```

## Generate ground truth data

Modify the search terms and tasks in `evals/generate_config.json` to match your domain.

Generate ground truth data by running the following command:

```bash
python evals/generate_ground_truth.py
```

After running that script, review the generated data in `evals/ground_truth.jsonl`, removing any question/answer pairs that don't look like realistic user input.

## Run bulk evaluation

Review the configuration in `evals/evaluate_config.json` to ensure that everything is correctly set up. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.

By default, the evaluation script will evaluate every question in the ground truth data.
Run the evaluation script with the following command:

```bash
python evals/evaluate.py
```

## Review the evaluation results

The evaluation script writes a summary of the evaluation results to the `evals/results` directory.

You can see a summary of results across all evaluation runs by running the following command:

```bash
python -m evaltools summary evals/results
```

Compare answers across runs by running the following command:

```bash
python -m evaltools diff evals/results/baseline/
```

## Run bulk evaluation on a PR

To run the evaluation on the changes in a PR, add a `/evaluate` comment to the PR. This triggers the evaluation workflow, which runs the evaluation on the PR changes and posts the results to the PR.
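The review step in "Generate ground truth data" above can also be partially automated. Below is a minimal sketch, assuming the `{"question": ..., "truth": ...}` JSONL shape that `evals/generate_ground_truth.py` writes; the sample data and the crude word-count filter are hypothetical stand-ins for a human review pass, not part of the repo.

```python
import json
from pathlib import Path

# Illustrative sample in the ground_truth.jsonl format produced by
# evals/generate_ground_truth.py (one {"question", "truth"} object per line).
sample = [
    {"question": "How do I enroll in benefits?", "truth": "Visit the benefits portal ..."},
    {"question": "asdf asdf", "truth": "..."},  # not realistic user input; should be dropped
]
path = Path("ground_truth_sample.jsonl")
with path.open("w") as f:
    for pair in sample:
        f.write(json.dumps(pair) + "\n")

# Keep only pairs whose question looks like a real user query. A length
# check is a blunt heuristic; a manual review pass is still recommended.
kept = []
with path.open() as f:
    for line in f:
        pair = json.loads(line)
        if len(pair["question"].split()) >= 4:
            kept.append(pair)

print(len(kept))
```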

evals/evaluate.py

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
```python
import argparse
import logging
import os
from pathlib import Path

from azure.identity import AzureDeveloperCliCredential
from dotenv_azd import load_azd_env
from evaltools.eval.evaluate import run_evaluate_from_config
from rich.logging import RichHandler

logger = logging.getLogger("ragapp")


def get_openai_config():
    azure_endpoint = f"https://{os.getenv('AZURE_OPENAI_SERVICE')}.openai.azure.com"
    azure_deployment = os.environ["AZURE_OPENAI_EVAL_DEPLOYMENT"]
    openai_config = {"azure_endpoint": azure_endpoint, "azure_deployment": azure_deployment}
    # azure-ai-evaluation will call DefaultAzureCredential behind the scenes,
    # so we must be logged in to Azure CLI with the correct tenant
    return openai_config


def get_azure_credential():
    AZURE_TENANT_ID = os.getenv("AZURE_TENANT_ID")
    if AZURE_TENANT_ID:
        logger.info("Setting up Azure credential using AzureDeveloperCliCredential with tenant_id %s", AZURE_TENANT_ID)
        azure_credential = AzureDeveloperCliCredential(tenant_id=AZURE_TENANT_ID, process_timeout=60)
    else:
        logger.info("Setting up Azure credential using AzureDeveloperCliCredential for home tenant")
        azure_credential = AzureDeveloperCliCredential(process_timeout=60)
    return azure_credential


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO, format="%(message)s", datefmt="[%X]", handlers=[RichHandler(rich_tracebacks=True)]
    )
    load_azd_env()

    parser = argparse.ArgumentParser(description="Run evaluation with OpenAI configuration.")
    parser.add_argument("--targeturl", type=str, help="Specify the target URL.")
    parser.add_argument("--resultsdir", type=Path, help="Specify the results directory.")
    parser.add_argument("--numquestions", type=int, help="Specify the number of questions.")

    args = parser.parse_args()

    openai_config = get_openai_config()

    run_evaluate_from_config(
        working_dir=Path(__file__).parent,
        config_path="evaluate_config.json",
        num_questions=args.numquestions,
        target_url=args.targeturl,
        results_dir=args.resultsdir,
        openai_config=openai_config,
        model=os.environ["AZURE_OPENAI_EVAL_MODEL"],
        azure_credential=get_azure_credential(),
    )
```
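The tenant-selection branch in `get_azure_credential` can be modeled as a pure function over the environment, which makes its behavior easy to verify without touching Azure. This is an illustrative sketch only; the function name and the kwargs-dict shape are not part of the script, which constructs `AzureDeveloperCliCredential` directly.

```python
# Sketch of the logic: when AZURE_TENANT_ID is set, the credential is pinned
# to that tenant; otherwise the home tenant of the azd login is used.
def credential_kwargs(env: dict) -> dict:
    kwargs = {"process_timeout": 60}
    tenant_id = env.get("AZURE_TENANT_ID")
    if tenant_id:
        kwargs["tenant_id"] = tenant_id
    return kwargs

print(credential_kwargs({}))
print(credential_kwargs({"AZURE_TENANT_ID": "00000000-0000-0000-0000-000000000000"}))
```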

evals/evaluate_config.json

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
```json
{
    "testdata_path": "ground_truth.jsonl",
    "results_dir": "results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "gpt_coherence", "answer_length", "latency"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
        "overrides": {
            "top": 3,
            "temperature": 0.3,
            "minimum_reranker_score": 0,
            "minimum_search_score": 0,
            "retrieval_mode": "hybrid",
            "semantic_ranker": true,
            "semantic_captions": false,
            "suggest_followup_questions": false,
            "use_oid_security_filter": false,
            "use_groups_security_filter": false,
            "vector_fields": ["embedding"],
            "use_gpt4v": false,
            "gpt4v_input": "textAndImages",
            "seed": 1
        }
    },
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text"
}
```
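The two `*_jmespath` settings tell the evaluator where to find the answer text and the retrieval context in each `/chat` response. For dotted paths like these, a JMESPath lookup behaves like a chain of dictionary accesses; the sketch below mimics that with a plain dot-path helper. The response shape is an assumption inferred from the two expressions, and real JMESPath supports far more than this.

```python
from functools import reduce

# Minimal dot-path lookup that mimics what these simple JMESPath
# expressions do (not a JMESPath implementation).
def extract(path: str, data: dict):
    return reduce(lambda d, key: d[key], path.split("."), data)

# A response shaped like the app's /chat output, as implied by the config
# (field values here are made up for illustration).
response = {
    "message": {"content": "Contoso offers a PPO and an HMO plan."},
    "context": {"data_points": {"text": ["Benefits_Overview.pdf#page=2: ..."]}},
}

answer = extract("message.content", response)
context = extract("context.data_points.text", response)
print(answer)
```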

evals/generate_config.json

Lines changed: 35 additions & 31 deletions
@@ -1,33 +1,37 @@

This hunk is a whitespace-only re-indentation of the JSON; the resulting file content is:

```json
{
    "num_per_task": 2,
    "simulations": [
        {
            "search_term": "benefits",
            "tasks": [
                "I am a new employee and I want to learn about benefits.",
                "I am a new employee and I want to enroll in benefits.",
                "I am a new parent and I want to learn about benefits."
            ]
        },
        {
            "search_term": "policies",
            "tasks": [
                "I am a new employee and I want to learn about policies.",
                "I am a new employee and I want to learn about the dress code.",
                "I am a new employee and I want to learn about the vacation policy."
            ]
        },
        {
            "search_term": "payroll",
            "tasks": [
                "I am a new employee and I want to learn about payroll.",
                "I am a new employee and I want to learn about direct deposit.",
                "I am a new employee and I want to learn about pay stubs."
            ]
        },
        {
            "search_term": "careers",
            "tasks": [
                "I am a new employee and I want to learn about career opportunities.",
                "I am a new employee and I want to learn about the promotion process.",
                "I am a new employee and I want to learn about the training program."
            ]
        }
    ]
}
```
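Assuming `num_per_task` means the number of QA pairs generated per task (an assumption worth confirming against `evals/generate_ground_truth.py`), the expected size of the generated ground truth set follows directly from this config:

```python
# Mirrors the shape of evals/generate_config.json: 4 search terms, each with
# 3 tasks (task text elided here), and num_per_task = 2.
tasks_per_term = {"benefits": 3, "policies": 3, "payroll": 3, "careers": 3}
num_per_task = 2

total_tasks = sum(tasks_per_term.values())
expected_pairs = total_tasks * num_per_task
print(expected_pairs)  # 12 tasks x 2 pairs per task
```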

evals/generate_ground_truth.py

Lines changed: 2 additions & 14 deletions
```diff
@@ -7,13 +7,12 @@

 import requests
 from azure.ai.evaluation.simulator import Simulator
-from azure.identity import AzureDeveloperCliCredential, get_bearer_token_provider
+from azure.identity import AzureDeveloperCliCredential
 from azure.search.documents import SearchClient
 from azure.search.documents.models import (
     QueryType,
 )
 from dotenv_azd import load_azd_env
-from openai import AzureOpenAI

 logger = logging.getLogger("evals")

@@ -72,17 +71,6 @@ def get_simulator() -> Simulator:
     return simulator


-def get_openai_client():
-    azure_credential = get_azure_credential()
-    token_provider = get_bearer_token_provider(azure_credential, "https://cognitiveservices.azure.com/.default")
-    openai_client = AzureOpenAI(
-        api_version=os.getenv("AZURE_OPENAI_API_VERSION") or "2024-06-01",
-        azure_endpoint=f"https://{os.getenv('AZURE_OPENAI_SERVICE')}.openai.azure.com",
-        azure_ad_token_provider=token_provider,
-    )
-    return openai_client
-
-
 def get_azure_credential():
     AZURE_TENANT_ID = os.getenv("AZURE_TENANT_ID")
     if AZURE_TENANT_ID:
@@ -135,7 +123,7 @@ async def generate_ground_truth(azure_credential, simulations: list[dict], num_p
     qa_pairs = []
     for output in outputs:
         qa_pairs.append({"question": output["messages"][0]["content"], "truth": output["messages"][1]["content"]})
-    with open(CURRENT_DIR / "ground_truth_singleturn.jsonl", "a") as f:
+    with open(CURRENT_DIR / "ground_truth.jsonl", "a") as f:
         for qa_pair in qa_pairs:
             f.write(json.dumps(qa_pair) + "\n")

```
File renamed without changes.

evals/requirements.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -1,3 +1,4 @@
 dotenv-azd==0.2.0
 azure-ai-evaluation==1.0.1
 rich
+git+https://github.com/Azure-Samples/ai-rag-chat-evaluator
```

evals/results/baseline/config.json

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
```json
{
    "testdata_path": "ground_truth.jsonl",
    "results_dir": "results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "gpt_coherence", "answer_length", "latency"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
        "overrides": {
            "top": 3,
            "temperature": 0.3,
            "minimum_reranker_score": 0,
            "minimum_search_score": 0,
            "retrieval_mode": "hybrid",
            "semantic_ranker": true,
            "semantic_captions": false,
            "suggest_followup_questions": false,
            "use_oid_security_filter": false,
            "use_groups_security_filter": false,
            "vector_fields": ["embedding"],
            "use_gpt4v": false,
            "gpt4v_input": "textAndImages",
            "seed": 1
        }
    },
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text"
}
```
