
Commit 56f627b

pamelafox authored and dfl-aeb committed
Make it easy to run evaluation directly from this repo (Azure-Samples#2233)

* Updating docs
* Update requirements.txt
* Update diagram
* Add typing extensions explicitly
* Adding ground truth generation
* Add evaluate flow as well
* Add RAGAS
* Add RAGAS
* Remove simulator
* Improvements to RAGAS code
* More logging, save knowledge graph after transforms
* Update baseline, add citations matched metric, use separate venv for eval
* Update the requirements to latest tag
* Logger fixes
1 parent 315d6a6 · commit 56f627b

14 files changed: +837,902 −1 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -111,6 +111,7 @@ celerybeat.pid
 # Environments
 .env
 .venv
+.evalenv
 env/
 venv/
 ENV/

docs/evaluation.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Evaluating the RAG answer quality

Follow these steps to evaluate the quality of the answers generated by the RAG flow.

* [Deploy an evaluation model](#deploy-an-evaluation-model)
* [Set up the evaluation environment](#set-up-the-evaluation-environment)
* [Generate ground truth data](#generate-ground-truth-data)
* [Run bulk evaluation](#run-bulk-evaluation)
* [Review the evaluation results](#review-the-evaluation-results)
* [Run bulk evaluation on a PR](#run-bulk-evaluation-on-a-pr)

## Deploy an evaluation model

1. Run this command to tell `azd` to deploy a GPT-4 level model for evaluation:

    ```shell
    azd env set USE_EVAL true
    ```

2. Set the capacity to the highest possible value to ensure that the evaluation runs relatively quickly. Even with a high capacity, it can take a long time to generate ground truth data and run bulk evaluations.

    ```shell
    azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY 100
    ```

    By default, that will provision a `gpt-4o` model, version `2024-08-06`. To change those settings, set the azd environment variables `AZURE_OPENAI_EVAL_MODEL` and `AZURE_OPENAI_EVAL_MODEL_VERSION` to the desired values, as shown in the example after these steps.

3. Then, run the following command to provision the model:

    ```shell
    azd provision
    ```
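For example, these commands set the evaluation model and version explicitly (the values shown are the defaults described above, used here only for illustration):

```shell
azd env set AZURE_OPENAI_EVAL_MODEL gpt-4o
azd env set AZURE_OPENAI_EVAL_MODEL_VERSION 2024-08-06
```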
## Set up the evaluation environment

Make a new Python virtual environment and activate it. This is currently required due to incompatibilities between the dependencies of the evaluation script and the main project.

```bash
python -m venv .evalenv
```

```bash
source .evalenv/bin/activate
```
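On Windows, the activation script lives under `Scripts` instead. Assuming the same environment name:

```shell
.evalenv\Scripts\activate
```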
Install all the dependencies for the evaluation script by running the following command:

```bash
pip install -r evals/requirements.txt
```

## Generate ground truth data

Modify the search terms and tasks in `evals/generate_config.json` to match your domain.

Generate ground truth data by running the following command:

```bash
python evals/generate_ground_truth.py --numquestions=200 --numsearchdocs=1000
```

The options are:

* `numquestions`: The number of questions to generate. We suggest at least 200.
* `numsearchdocs`: The number of documents (chunks) to retrieve from your search index. You can leave off this option to fetch all documents, but that will significantly increase the time it takes to generate ground truth data. You may want to start with a subset.
* `kgfile`: An existing RAGAS knowledge graph JSON file, usually `ground_truth_kg.json`. You may want to specify this if you already created a knowledge graph and just want to tweak the question generation steps.
* `groundtruthfile`: The file to write the generated ground truth answers. By default, this is `evals/ground_truth.jsonl`.
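For example, to regenerate questions from a previously saved knowledge graph (assuming `ground_truth_kg.json` already exists in `evals/`):

```bash
python evals/generate_ground_truth.py --numquestions=200 --kgfile=ground_truth_kg.json
```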
🕰️ This may take a long time, possibly several hours, depending on the size of the search index.

After running that script, review the generated data in `evals/ground_truth.jsonl`, removing any question/answer pairs that don't seem like realistic user input.
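Each line of that file is a JSON object with `question` and `truth` fields, where the truth ends with the expected citation(s). A hypothetical example (not real data):

```json
{"question": "What does the Northwind Health Plus plan cover?", "truth": "Northwind Health Plus covers emergency services and mental health care. [Benefit_Options.pdf#page=2]"}
```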
## Run bulk evaluation

Review the configuration in `evals/evaluate_config.json` to ensure that everything is correctly set up. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.

By default, the evaluation script will evaluate every question in the ground truth data. Run it with the following command:

```bash
python evals/evaluate.py
```

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions. You can specify the `--numquestions` argument for a test run on a subset of the questions.
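For instance, a quick smoke test on two questions, writing to a throwaway results directory (the directory name is just an example):

```bash
python evals/evaluate.py --numquestions 2 --resultsdir evals/results/smoke_test
```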
## Review the evaluation results

The evaluation script will output a summary of the evaluation results inside the `evals/results` directory.

You can see a summary of results across all evaluation runs by running the following command:

```bash
python -m evaltools summary evals/results
```

Compare answers across runs by running the following command:

```bash
python -m evaltools diff evals/results/baseline/
```

## Run bulk evaluation on a PR

To run the evaluation on the changes in a PR, add a `/evaluate` comment to the PR. This triggers the evaluation workflow, which runs the evaluation on the PR changes and posts the results to the PR.

evals/evaluate.py

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
import argparse
import logging
import os
import re
from pathlib import Path

from azure.identity import AzureDeveloperCliCredential
from dotenv_azd import load_azd_env
from evaltools.eval.evaluate import run_evaluate_from_config
from evaltools.eval.evaluate_metrics import register_metric
from evaltools.eval.evaluate_metrics.base_metric import BaseMetric
from rich.logging import RichHandler

logger = logging.getLogger("ragapp")


class CitationsMatchedMetric(BaseMetric):
    METRIC_NAME = "citations_matched"

    @classmethod
    def evaluator_fn(cls, **kwargs):
        def citations_matched(*, response, ground_truth, **kwargs):
            if response is None:
                logger.warning("Received response of None, can't compute citations_matched metric. Setting to -1.")
                return {cls.METRIC_NAME: -1}
            # Extract citations like [filename.pdf] or [filename.pdf#page=2] from both texts
            truth_citations = set(re.findall(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", ground_truth))
            response_citations = set(re.findall(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))
            # Compute the fraction of ground truth citations present in the response
            num_citations = len(truth_citations)
            if num_citations == 0:
                # Guard against division by zero when the ground truth has no citations;
                # -1 marks the metric as not computable and is excluded from aggregate stats
                return {cls.METRIC_NAME: -1}
            num_matched_citations = len(truth_citations.intersection(response_citations))
            return {cls.METRIC_NAME: num_matched_citations / num_citations}

        return citations_matched

    @classmethod
    def get_aggregate_stats(cls, df):
        df = df[df[cls.METRIC_NAME] != -1]
        return {
            "total": int(df[cls.METRIC_NAME].sum()),
            "rate": round(df[cls.METRIC_NAME].mean(), 2),
        }


def get_openai_config():
    azure_endpoint = f"https://{os.getenv('AZURE_OPENAI_SERVICE')}.openai.azure.com"
    azure_deployment = os.environ["AZURE_OPENAI_EVAL_DEPLOYMENT"]
    openai_config = {"azure_endpoint": azure_endpoint, "azure_deployment": azure_deployment}
    # azure-ai-evaluation will call DefaultAzureCredential behind the scenes,
    # so we must be logged in to Azure CLI with the correct tenant
    return openai_config


def get_azure_credential():
    AZURE_TENANT_ID = os.getenv("AZURE_TENANT_ID")
    if AZURE_TENANT_ID:
        logger.info("Setting up Azure credential using AzureDeveloperCliCredential with tenant_id %s", AZURE_TENANT_ID)
        azure_credential = AzureDeveloperCliCredential(tenant_id=AZURE_TENANT_ID, process_timeout=60)
    else:
        logger.info("Setting up Azure credential using AzureDeveloperCliCredential for home tenant")
        azure_credential = AzureDeveloperCliCredential(process_timeout=60)
    return azure_credential


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.WARNING, format="%(message)s", datefmt="[%X]", handlers=[RichHandler(rich_tracebacks=True)]
    )
    logger.setLevel(logging.INFO)
    logging.getLogger("evaltools").setLevel(logging.INFO)
    load_azd_env()

    parser = argparse.ArgumentParser(description="Run evaluation with OpenAI configuration.")
    parser.add_argument("--targeturl", type=str, help="Specify the target URL.")
    parser.add_argument("--resultsdir", type=Path, help="Specify the results directory.")
    parser.add_argument("--numquestions", type=int, help="Specify the number of questions.")

    args = parser.parse_args()

    openai_config = get_openai_config()

    register_metric(CitationsMatchedMetric)
    run_evaluate_from_config(
        working_dir=Path(__file__).parent,
        config_path="evaluate_config.json",
        num_questions=args.numquestions,
        target_url=args.targeturl,
        results_dir=args.resultsdir,
        openai_config=openai_config,
        model=os.environ["AZURE_OPENAI_EVAL_MODEL"],
        azure_credential=get_azure_credential(),
    )
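As a standalone illustration (not part of the commit), here is what that citation regex captures, assuming citations use filenames with 3-4 character extensions and an optional page fragment:

```python
import re

CITATION_PATTERN = r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]"

# re.findall returns one (filename stem, last page fragment) tuple per citation
print(re.findall(CITATION_PATTERN, "See [Benefit_Options.pdf#page=2] and [handbook.txt]."))
# -> [('Benefit_Options', '#page=2'), ('handbook', '')]
```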

evals/evaluate_config.json

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
{
    "testdata_path": "ground_truth.jsonl",
    "results_dir": "results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
        "overrides": {
            "top": 3,
            "temperature": 0.3,
            "minimum_reranker_score": 0,
            "minimum_search_score": 0,
            "retrieval_mode": "hybrid",
            "semantic_ranker": true,
            "semantic_captions": false,
            "suggest_followup_questions": false,
            "use_oid_security_filter": false,
            "use_groups_security_filter": false,
            "vector_fields": [
                "embedding"
            ],
            "use_gpt4v": false,
            "gpt4v_input": "textAndImages",
            "seed": 1
        }
    },
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text"
}
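The two jmespath expressions at the bottom assume the `/chat` endpoint returns a JSON body shaped roughly like the following (values are illustrative):

```json
{
    "message": {"content": "Answer text with a citation like [Benefit_Options.pdf#page=2]"},
    "context": {"data_points": {"text": ["Benefit_Options.pdf#page=2: source chunk text used as context"]}}
}
```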

evals/generate_ground_truth.py

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
import argparse
import json
import logging
import os
import pathlib
import re

from azure.identity import AzureDeveloperCliCredential, get_bearer_token_provider
from azure.search.documents import SearchClient
from dotenv_azd import load_azd_env
from langchain_core.documents import Document as LCDocument
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator
from ragas.testset.graph import KnowledgeGraph, Node, NodeType
from ragas.testset.transforms import apply_transforms, default_transforms
from rich.logging import RichHandler

logger = logging.getLogger("ragapp")

root_dir = pathlib.Path(__file__).parent


def get_azure_credential():
    AZURE_TENANT_ID = os.getenv("AZURE_TENANT_ID")
    if AZURE_TENANT_ID:
        logger.info("Setting up Azure credential using AzureDeveloperCliCredential with tenant_id %s", AZURE_TENANT_ID)
        azure_credential = AzureDeveloperCliCredential(tenant_id=AZURE_TENANT_ID, process_timeout=60)
    else:
        logger.info("Setting up Azure credential using AzureDeveloperCliCredential for home tenant")
        azure_credential = AzureDeveloperCliCredential(process_timeout=60)
    return azure_credential


def get_search_documents(azure_credential, num_search_documents=None) -> list[dict]:
    search_client = SearchClient(
        endpoint=f"https://{os.getenv('AZURE_SEARCH_SERVICE')}.search.windows.net",
        index_name=os.getenv("AZURE_SEARCH_INDEX"),
        credential=azure_credential,
    )
    all_documents = []
    if num_search_documents is None:
        logger.info("Fetching all document chunks from Azure AI Search")
        num_search_documents = 100000
    else:
        logger.info("Fetching %d document chunks from Azure AI Search", num_search_documents)
    response = search_client.search(search_text="*", top=num_search_documents).by_page()
    for page in response:
        page = list(page)
        all_documents.extend(page)
    return all_documents


def generate_ground_truth_ragas(num_questions=200, num_search_documents=None, kg_file=None):
    azure_credential = get_azure_credential()
    azure_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION") or "2024-06-01"
    azure_endpoint = f"https://{os.getenv('AZURE_OPENAI_SERVICE')}.openai.azure.com"
    azure_ad_token_provider = get_bearer_token_provider(
        azure_credential, "https://cognitiveservices.azure.com/.default"
    )
    generator_llm = LangchainLLMWrapper(
        AzureChatOpenAI(
            openai_api_version=azure_openai_api_version,
            azure_endpoint=azure_endpoint,
            azure_ad_token_provider=azure_ad_token_provider,
            azure_deployment=os.getenv("AZURE_OPENAI_EVAL_DEPLOYMENT"),
            model=os.environ["AZURE_OPENAI_EVAL_MODEL"],
            validate_base_url=False,
        )
    )

    # init the embeddings for answer_relevancy, answer_correctness and answer_similarity
    generator_embeddings = LangchainEmbeddingsWrapper(
        AzureOpenAIEmbeddings(
            openai_api_version=azure_openai_api_version,
            azure_endpoint=azure_endpoint,
            azure_ad_token_provider=azure_ad_token_provider,
            azure_deployment=os.getenv("AZURE_OPENAI_EMB_DEPLOYMENT"),
            model=os.environ["AZURE_OPENAI_EMB_MODEL_NAME"],
        )
    )

    # Load or create the knowledge graph
    if kg_file:
        full_path_to_kg = root_dir / kg_file
        if not os.path.exists(full_path_to_kg):
            raise FileNotFoundError(f"Knowledge graph file {full_path_to_kg} not found.")
        logger.info("Loading existing knowledge graph from %s", full_path_to_kg)
        kg = KnowledgeGraph.load(full_path_to_kg)
    else:
        # Make a knowledge graph from Azure AI Search documents
        search_docs = get_search_documents(azure_credential, num_search_documents)

        logger.info("Creating a RAGAS knowledge graph based off of %d search documents", len(search_docs))
        nodes = []
        for doc in search_docs:
            content = doc["content"]
            citation = doc["sourcepage"]
            node = Node(
                type=NodeType.DOCUMENT,
                properties={
                    "page_content": f"[[{citation}]]: {content}",
                    "document_metadata": {"citation": citation},
                },
            )
            nodes.append(node)

        kg = KnowledgeGraph(nodes=nodes)

        logger.info("Using RAGAS to apply transforms to knowledge graph")
        transforms = default_transforms(
            documents=[LCDocument(page_content=doc["content"]) for doc in search_docs],
            llm=generator_llm,
            embedding_model=generator_embeddings,
        )
        apply_transforms(kg, transforms)

        kg.save(root_dir / "ground_truth_kg.json")

    logger.info("Using RAGAS knowledge graph to generate %d questions", num_questions)
    generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, knowledge_graph=kg)
    dataset = generator.generate(testset_size=num_questions, with_debugging_logs=True)

    qa_pairs = []
    for sample in dataset.samples:
        question = sample.eval_sample.user_input
        truth = sample.eval_sample.reference
        # Grab the citation in square brackets from the reference_contexts and add it to the truth
        citations = []
        for context in sample.eval_sample.reference_contexts:
            match = re.search(r"\[\[(.*?)\]\]", context)
            if match:
                citation = match.group(1)
                citations.append(f"[{citation}]")
        truth += " " + " ".join(citations)
        qa_pairs.append({"question": question, "truth": truth})

    with open(root_dir / "ground_truth.jsonl", "a") as f:
        logger.info("Writing %d QA pairs to %s", len(qa_pairs), f.name)
        for qa_pair in qa_pairs:
            f.write(json.dumps(qa_pair) + "\n")


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.WARNING, format="%(message)s", datefmt="[%X]", handlers=[RichHandler(rich_tracebacks=True)]
    )
    logger.setLevel(logging.INFO)
    load_azd_env()

    parser = argparse.ArgumentParser(description="Generate ground truth data using AI Search index and RAGAS.")
    parser.add_argument("--numsearchdocs", type=int, help="Specify the number of search results to fetch")
    parser.add_argument("--numquestions", type=int, help="Specify the number of questions to generate.", default=200)
    parser.add_argument("--kgfile", type=str, help="Specify the path to an existing knowledge graph file")

    args = parser.parse_args()

    generate_ground_truth_ragas(
        num_search_documents=args.numsearchdocs, num_questions=args.numquestions, kg_file=args.kgfile
    )
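A small sketch (hypothetical data, standalone from the script) of how the `[[citation]]` prefix embedded in each node's `page_content` is later recovered and appended to the ground truth answer as a `[citation]`:

```python
import re

# A hypothetical search chunk with the fields the script reads
doc = {"content": "Employees may enroll during open enrollment.", "sourcepage": "Benefit_Options.pdf#page=2"}

# The script embeds the citation in double brackets in the node content...
page_content = f"[[{doc['sourcepage']}]]: {doc['content']}"

# ...and later extracts it from a reference context to append to the truth
match = re.search(r"\[\[(.*?)\]\]", page_content)
print(f"[{match.group(1)}]")  # [Benefit_Options.pdf#page=2]
```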
