
Commit 25b06de

Merge pull request #54 from Azure-Samples/dontknowness
Added citationmatch metric and tools for evaluating "I don't know" answers
2 parents 94c874b + 03b9efd commit 25b06de

12 files changed: +258 −11 lines changed


.vscode/settings.json

Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+{
+    "python.testing.pytestArgs": [
+        "scripts"
+    ],
+    "python.testing.unittestEnabled": false,
+    "python.testing.pytestEnabled": true
+}
```

README.md

Lines changed: 61 additions & 0 deletions

@@ -275,3 +275,64 @@ and the GPT metrics below each answer.

![Screenshot of CLI tool for comparing a question with 2 answers](docs/screenshot_compare.png)

Use the buttons at the bottom to navigate to the next question or quit the tool.

## Measuring the app's ability to say "I don't know"

The evaluation flow described above focuses on evaluating a model's answers for a set of questions that *could* be answered by the data. But what about all the questions that can't be answered by the data? Does your model know how to say "I don't know"? The GPT models are trained to try to be helpful, so their tendency is to always give some sort of answer, especially for questions whose answers were in their training data. If you want to ensure your app can say "I don't know" when it should, you need to evaluate it on a different set of questions with a different metric.

### Generating ground truth data for answer-less questions

For this evaluation, our ground truth data needs to be a set of questions that should provoke an "I don't know" response from the app, because their answers are not in the data. There are several categories of such questions:

* **Unknowable**: Questions that are related to the sources but not actually answered in them (and not public knowledge).
* **Uncitable**: Questions whose answers are well known to the LLM from its training data, but are not in the sources. There are two flavors of these:
  * **Related**: Similar topics to the sources, so the LLM will be particularly tempted to think the sources know the answer.
  * **Unrelated**: Completely unrelated to the sources, so the LLM shouldn't be as tempted to think the sources know the answer.
* **Nonsensical**: Questions that are non-questions, that a human would scratch their head at and ask for clarification.

You can write these questions manually, but it's also possible to generate them using a generator script in this repo,
assuming you already have ground truth data with answerable questions.

```shell
python -m scripts generate_dontknows --input=example_input/qa.jsonl --output=example_input/qa_dontknows.jsonl --numquestions=45
```

That script sends the current questions to the configured GPT-4 model along with prompts to generate questions of each kind.
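
Each line of the output file is a JSON object with a `question` and a placeholder `truth` that records which prompt generated it (see the `scripts/generate.py` changes below). A generated line might look roughly like this — the question text here is invented for illustration:

```json
{"question": "How many vacation days do part-time contractors receive?", "truth": "Generated from this prompt: Given these questions, suggest 12 questions that are very related but are not directly answerable by the same sources. Do not simply ask for other examples of the same thing - your question should be standalone."}
```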

When it's done, you should review and curate the resulting ground truth data. Pay special attention to the "unknowable" questions at the top of the file, since you may decide that some of those are actually knowable, and you may want to reword or rewrite them entirely.

### Running an evaluation for answer-less questions

This repo contains a custom GPT metric called "dontknowness" that rates answers from 1-5, where 1 is "answered the question completely with no uncertainty" and 5 is "said it didn't know and attempted no answer". The goal is for all answers to be rated 4 or 5.

Here's an example configuration JSON that requests that metric, referencing the new ground truth data and a new output folder:

```json
{
    "testdata_path": "example_input/qa_dontknows.jsonl",
    "results_dir": "example_results_dontknows/baseline",
    "requested_metrics": ["dontknowness", "answer_length", "latency", "has_citation"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
    }
}
```

We recommend a separate output folder, as you'll likely want to make multiple runs and easily compare between those runs using the [review tools](#viewing-the-results).

Run the evaluation like this:

```shell
python -m scripts evaluate --config=example_config_dontknows.json
```

The results will be stored in the `results_dir` folder, and can be reviewed using the [review tools](#viewing-the-results).

### Improving the app's ability to say "I don't know"

If the app is not saying "I don't know" often enough, you can use the `diff` tool to compare the answers for the "dontknows" questions across runs, and see whether the answers are improving. Changes you can try:

* Adjust the prompt to encourage the model to say "I don't know" more often. Remove anything in the prompt that might be distracting or overly encouraging it to answer.
* Try using GPT-4 instead of GPT-3.5. The results will be slower (see the latency column) but it may be more likely to say "I don't know" when it should.
* Adjust the temperature of the model used by your app.
* Add an additional LLM step in your app after generating the answer, to have the LLM rate its own confidence that the answer is found in the sources. If the confidence is low, the app should say "I don't know" (see the sketch after this list).
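
A minimal sketch of that last idea, assuming your app already has an OpenAI client plus the generated `answer` and retrieved `sources` in hand. The prompt wording, threshold, and function name are all illustrative and not part of this repo:

```python
def check_answer_confidence(openai_client, model: str, answer: str, sources: str) -> int:
    """Ask the LLM to rate, from 1 to 5, how confident it is that the answer is supported by the sources."""
    response = openai_client.chat.completions.create(
        model=model,
        temperature=0.0,  # we want a consistent rating, not a creative one
        messages=[
            {
                "role": "user",
                "content": (
                    "Rate from 1 to 5 how confident you are that the answer below is fully supported "
                    "by the sources below. Reply with a single integer only.\n\n"
                    f"Answer: {answer}\n\nSources: {sources}"
                ),
            }
        ],
    )
    try:
        return int(response.choices[0].message.content.strip())
    except ValueError:
        return 1  # treat an unparseable rating as low confidence


# In the app's answer flow (illustrative):
# if check_answer_confidence(client, "gpt-4", answer, sources) <= 2:
#     answer = "I don't know, I couldn't find that in my sources."
```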

review_tools/summary_app.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -49,7 +49,9 @@ def __init__(self, results_dir: Path) -> None:
             metric_counts[metric_name] = metric_counts.get(metric_name, 0) + 1

         # Only show metrics that have shown up at least twice across runs
-        shared_metric_names = [metric_name for metric_name, count in metric_counts.items() if count > 1]
+        shared_metric_names = [
+            metric_name for metric_name, count in metric_counts.items() if count > 1 or len(run_summaries) == 1
+        ]
         shared_metric_stats = {metric_name: set() for metric_name in shared_metric_names}

         # Now figure out what stat to show about each metric
```

scripts/cli.py

Lines changed: 15 additions & 1 deletion

```diff
@@ -6,7 +6,7 @@

 from . import service_setup
 from .evaluate import run_evaluate_from_config
-from .generate import generate_test_qa_data
+from .generate import generate_dontknows_qa_data, generate_test_qa_data

 app = typer.Typer(pretty_exceptions_enable=False)

@@ -46,5 +46,19 @@ def generate(
     )


+@app.command()
+def generate_dontknows(
+    input: Path = typer.Option(exists=True, dir_okay=False, file_okay=True),
+    output: Path = typer.Option(exists=False, dir_okay=False, file_okay=True),
+    numquestions: int = typer.Option(help="Number of questions to generate", default=40),
+):
+    generate_dontknows_qa_data(
+        openai_config=service_setup.get_openai_config(),
+        num_questions_total=numquestions,
+        input_file=Path.cwd() / input,
+        output_file=Path.cwd() / output,
+    )
+
+
 def cli():
     app()
```

scripts/evaluate.py

Lines changed: 4 additions & 3 deletions

```diff
@@ -13,7 +13,7 @@
 logger = logging.getLogger("scripts")


-def send_question_to_target(question: str, target_url: str, parameters: dict = {}, raise_error=False):
+def send_question_to_target(question: str, truth: str, target_url: str, parameters: dict = {}, raise_error=False):
     headers = {"Content-Type": "application/json"}
     body = {
         "messages": [{"content": question, "role": "user"}],
@@ -45,6 +45,7 @@ def send_question_to_target(question: str, target_url: str, parameters: dict = {
             raise e
         return {
             "question": question,
+            "truth": truth,
             "answer": str(e),
             "context": str(e),
             "latency": -1,
@@ -78,7 +79,7 @@ def run_evaluation(
     logger.info("Sending a test question to the target to ensure it is running...")
     try:
         target_data = send_question_to_target(
-            "What information is in your knowledge base?", target_url, target_parameters, raise_error=True
+            "What information is in your knowledge base?", "So much", target_url, target_parameters, raise_error=True
         )
         logger.info(
             'Successfully received response from target: "question": "%s", "answer": "%s", "context": "%s"',
@@ -104,7 +105,7 @@

     # Wrap the target function so that it can be called with a single argument
     async def wrap_target(question: str, truth: str):
-        return send_question_to_target(question, target_url, target_parameters)
+        return send_question_to_target(question, truth, target_url, target_parameters)

     logger.info("Starting evaluation...")
     for metric in requested_metrics:
```

scripts/evaluate_metrics/__init__.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,6 +1,6 @@
 from .builtin_metrics import BuiltinCoherenceMetric, BuiltinGroundednessMetric, BuiltinRelevanceMetric
 from .code_metrics import AnswerLengthMetric, HasCitationMetric, LatencyMetric
-from .prompt_metrics import CoherenceMetric, GroundednessMetric, RelevanceMetric
+from .prompt_metrics import CoherenceMetric, DontKnownessMetric, GroundednessMetric, RelevanceMetric

 metrics = [
     BuiltinCoherenceMetric,
@@ -9,6 +9,7 @@
     CoherenceMetric,
     RelevanceMetric,
     GroundednessMetric,
+    DontKnownessMetric,
     LatencyMetric,
     AnswerLengthMetric,
     HasCitationMetric,
```

scripts/evaluate_metrics/code_metrics.py

Lines changed: 23 additions & 0 deletions

```diff
@@ -42,6 +42,29 @@ def get_aggregate_stats(cls, df):
         }


+class CitationMatchMetric(BaseMetric):
+
+    METRIC_NAME = "citation_match"
+
+    @classmethod
+    def get_metric(cls):
+        def citation_match(*, data, **kwargs):
+            # Return true if all citations in the truth are present in the answer
+            truth_citations = set(re.findall(r"\[[^\]]+\]", data["truth"]))
+            answer_citations = set(re.findall(r"\[[^\]]+\]", data["answer"]))
+            citation_match = truth_citations.issubset(answer_citations)
+            return {"citation_match": citation_match}
+
+        return citation_match
+
+    @classmethod
+    def get_aggregate_stats(cls, df):
+        return {
+            "total": int(df[cls.METRIC_NAME].sum()),
+            "rate": round(df[cls.METRIC_NAME].mean(), 2),
+        }
+
+
 class LatencyMetric(BaseMetric):

     METRIC_NAME = "latency"
```
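
To make the matching rule concrete, here is a tiny standalone illustration of the same regex and subset check; the file names are made up and this snippet is not part of the repo:

```python
import re

# Hypothetical ground truth and app answer, each citing sources in square brackets
truth = "Remote work is allowed two days per week [employee_handbook-3.pdf]."
answer = "Per the handbook, you may work remotely twice a week [employee_handbook-3.pdf][benefits.pdf]."

truth_citations = set(re.findall(r"\[[^\]]+\]", truth))    # {'[employee_handbook-3.pdf]'}
answer_citations = set(re.findall(r"\[[^\]]+\]", answer))  # {'[employee_handbook-3.pdf]', '[benefits.pdf]'}

# True, because every citation in the truth also appears in the answer
print(truth_citations.issubset(answer_citations))
```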

scripts/evaluate_metrics/prompt_metrics.py

Lines changed: 5 additions & 0 deletions

```diff
@@ -31,3 +31,8 @@ class CoherenceMetric(CustomRatingMetric):
 class GroundednessMetric(CustomRatingMetric):

     METRIC_NAME = "groundedness"
+
+
+class DontKnownessMetric(CustomRatingMetric):
+
+    METRIC_NAME = "dontknowness"
```
Lines changed: 40 additions & 0 deletions

```diff
@@ -0,0 +1,40 @@
+System:
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+
+User:
+The "I don't know"-ness metric is a measure of how much an answer conveys the lack of knowledge or uncertainty, which is useful for making sure a chatbot for a particular domain doesn't answer outside that domain. Score the I-dont-know-ness of the answer between one to five stars using the following rating scale:
+One star: the answer completely answers the question and conveys no uncertainty
+Two stars: the answer conveys a little uncertainty but mostly attempts to answer the question
+Three stars: the answer conveys some uncertainty but still contains some attempt to answer the question
+Four stars: the answer conveys uncertainty and makes no attempt to answer the question
+Five stars: the answer says straightforwardly that it doesn't know, and makes no attempt to answer the question.
+
+This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
+
+question: What are the main goals of Perseverance Mars rover mission?
+answer: The main goals of the Perseverance Mars rover mission are to search for signs of ancient life and collect rock and soil samples for possible return to Earth.
+stars: 1
+
+question: What field did Marie Curie excel in?
+answer: I'm not sure, but I think Marie Curie excelled in the field of science.
+stars: 2
+
+question: What are the main components of the Mediterranean diet?
+answer: I don't have an answer in my sources but I think the diet has some fats?
+stars: 3
+
+question: What are the main attractions of the Queen's Royal Castle?
+answer: I'm not certain. Perhaps try rephrasing the question?
+stars: 4
+
+question: Where were The Beatles formed?
+answer: I'm sorry, I don't know, that answer is not in my sources.
+stars: 5
+
+question: {{question}}
+answer: {{answer}}
+stars:
+
+Your response must include following fields and should be in json format:
+score: Number of stars based on definition above
+reason: Reason why the score was given
```
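
Given that instruction, the rating model is expected to reply with a JSON object containing those two fields; an illustrative (not actual) response might look like:

```json
{"score": 5, "reason": "The answer states plainly that it does not know and makes no attempt to answer."}
```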

scripts/generate.py

Lines changed: 76 additions & 0 deletions

```diff
@@ -1,10 +1,14 @@
 import json
 import logging
+import math
+import random
 from pathlib import Path

 from azure.ai.generative.synthetic.qa import QADataGenerator, QAType
 from azure.search.documents import SearchClient

+from . import service_setup
+
 logger = logging.getLogger("scripts")


@@ -50,3 +54,75 @@ def generate_test_qa_data(
     with open(output_file, "w", encoding="utf-8") as f:
         for item in qa:
             f.write(json.dumps(item) + "\n")
+
+
+def generate_based_on_questions(openai_client, model: str, qa: list, num_questions: int, prompt: str):
+    existing_questions = ""
+    if qa:
+        qa = random.sample(qa, len(qa))  # Shuffle questions for some randomness
+        existing_questions = "\n".join([item["question"] for item in qa])
+
+    gpt_response = openai_client.chat.completions.create(
+        model=model,
+        messages=[
+            {
+                "role": "user",
+                "content": f"{prompt} Only generate {num_questions} TOTAL. Separate each question by a new line. \n{existing_questions}",  # noqa: E501
+            }
+        ],
+        n=1,
+        max_tokens=num_questions * 50,
+        temperature=0.3,
+    )
+
+    qa = []
+    for message in gpt_response.choices[0].message.content.split("\n")[0:num_questions]:
+        qa.append({"question": message, "truth": f"Generated from this prompt: {prompt}"})
+    return qa
+
+
+def generate_dontknows_qa_data(openai_config: dict, num_questions_total: int, input_file: Path, output_file: Path):
+    logger.info("Generating off-topic questions based on %s", input_file)
+    with open(input_file, encoding="utf-8") as f:
+        qa = [json.loads(line) for line in f.readlines()]
+
+    openai_client = service_setup.get_openai_client(openai_config)
+    dontknows_qa = []
+    num_questions_each = math.ceil(num_questions_total / 4)
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa,
+        num_questions_each,
+        f"Given these questions, suggest {num_questions_each} questions that are very related but are not directly answerable by the same sources. Do not simply ask for other examples of the same thing - your question should be standalone.",  # noqa: E501
+    )
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa,
+        num_questions_each,
+        f"Given these questions, suggest {num_questions_each} questions with similar keywords that are about publicly known facts.",  # noqa: E501
+    )
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa,
+        num_questions_each,
+        f"Given these questions, suggest {num_questions_each} questions that are not related to these topics at all but have well known answers.",  # noqa: E501
+    )
+    remaining = num_questions_total - len(dontknows_qa)
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa=None,
+        num_questions=remaining,
+        prompt=f"Suggest {remaining} questions that are nonsensical, and would result in confusion if you asked it.",  # noqa: E501
+    )
+
+    logger.info("Writing %d off-topic questions to %s", len(dontknows_qa), output_file)
+    directory = Path(output_file).parent
+    if not directory.exists():
+        directory.mkdir(parents=True)
+    with open(output_file, "w", encoding="utf-8") as f:
+        for item in dontknows_qa:
+            f.write(json.dumps(item) + "\n")
```
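
With the `--numquestions=45` used in the README example, the split across the four question categories works out like this (a quick check of the arithmetic above):

```python
import math

num_questions_total = 45                                   # value used in the README example
num_questions_each = math.ceil(num_questions_total / 4)    # 12 questions for each of the three prompted categories
remaining = num_questions_total - 3 * num_questions_each   # 9 nonsensical questions fill out the total,
                                                           # assuming each call returns the full 12
print(num_questions_each, remaining)                       # 12 9
```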
