
Commit 25b06de

Merge pull request #54 from Azure-Samples/dontknowness
Added citationmatch metric and tools for evaluating "I don't know" answers
2 parents 94c874b + 03b9efd commit 25b06de

12 files changed: +258 −11 lines changed


.vscode/settings.json

Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+{
+    "python.testing.pytestArgs": [
+        "scripts"
+    ],
+    "python.testing.unittestEnabled": false,
+    "python.testing.pytestEnabled": true
+}
```

README.md

Lines changed: 61 additions & 0 deletions

@@ -275,3 +275,64 @@ and the GPT metrics below each answer.

![Screenshot of CLI tool for comparing a question with 2 answers](docs/screenshot_compare.png)

Use the buttons at the bottom to navigate to the next question or quit the tool.

## Measuring the app's ability to say "I don't know"

The evaluation flow described above focuses on evaluating a model's answers for a set of questions that *could* be answered by the data. But what about all the questions that can't be answered by the data? Does your model know how to say "I don't know"? The GPT models are trained to try to be helpful, so their tendency is to always give some sort of answer, especially for questions whose answers were in their training data. If you want to ensure your app can say "I don't know" when it should, you need to evaluate it on a different set of questions with a different metric.

### Generating ground truth data for answer-less questions

For this evaluation, our ground truth data needs to be a set of questions that should provoke an "I don't know" response from the app, because their answers are not in the data. There are several categories of such questions:

* **Unknowable**: Questions that are related to the sources but not actually answered in them (and not public knowledge).
* **Uncitable**: Questions whose answers are well known to the LLM from its training data, but are not in the sources. There are two flavors of these:
  * **Related**: Similar topics to the sources, so the LLM will be particularly tempted to think the sources know the answer.
  * **Unrelated**: Completely unrelated to the sources, so the LLM shouldn't be as tempted to think the sources know the answer.
* **Nonsensical**: Questions that are non-questions, that a human would scratch their head at and ask for clarification.

You can write these questions manually, but it's also possible to generate them using a generator script in this repo,
assuming you already have ground truth data with answerable questions.

```shell
python -m scripts generate_dontknows --input=example_input/qa.jsonl --output=example_input/qa_dontknows.jsonl --numquestions=45
```

That script sends the current questions to the configured GPT-4 model along with prompts to generate questions of each kind.
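
Each line of the output file is a JSON object with a `question` and a placeholder `truth` that records which prompt generated it (see the `scripts/generate.py` changes below). A generated line might look roughly like this — the question text here is invented for illustration:

```json
{"question": "How many vacation days do part-time contractors receive?", "truth": "Generated from this prompt: Given these questions, suggest 12 questions that are very related but are not directly answerable by the same sources. Do not simply ask for other examples of the same thing - your question should be standalone."}
```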

When it's done, you should review and curate the resulting ground truth data. Pay special attention to the "unknowable" questions at the top of the file, since you may decide that some of those are actually knowable, and you may want to reword or rewrite them entirely.

### Running an evaluation for answer-less questions

This repo contains a custom GPT metric called "dontknowness" that rates answers from 1-5, where 1 is "answered the question completely with no uncertainty" and 5 is "said it didn't know and attempted no answer". The goal is for all answers to be rated 4 or 5.

Here's an example configuration JSON that requests that metric, referencing the new ground truth data and a new output folder:

```json
{
    "testdata_path": "example_input/qa_dontknows.jsonl",
    "results_dir": "example_results_dontknows/baseline",
    "requested_metrics": ["dontknowness", "answer_length", "latency", "has_citation"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
    }
}
```

We recommend a separate output folder, as you'll likely want to make multiple runs and easily compare between those runs using the [review tools](#viewing-the-results).

Run the evaluation like this:

```shell
python -m scripts evaluate --config=example_config_dontknows.json
```

The results will be stored in the `results_dir` folder, and can be reviewed using the [review tools](#viewing-the-results).

### Improving the app's ability to say "I don't know"

If the app is not saying "I don't know" often enough, you can use the `diff` tool to compare the answers for the "dontknows" questions across runs, and see whether the answers are improving. Changes you can try:

* Adjust the prompt to encourage the model to say "I don't know" more often. Remove anything in the prompt that might be distracting or overly encouraging it to answer.
* Try using GPT-4 instead of GPT-3.5. The results will be slower (see the latency column) but it may be more likely to say "I don't know" when it should.
* Adjust the temperature of the model used by your app.
* Add an additional LLM step in your app after generating the answer, to have the LLM rate its own confidence that the answer is found in the sources. If the confidence is low, the app should say "I don't know" (see the sketch after this list).
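
A minimal sketch of that last idea, assuming your app already has an OpenAI client plus the generated `answer` and retrieved `sources` in hand. The prompt wording, threshold, and function name are all illustrative and not part of this repo:

```python
def check_answer_confidence(openai_client, model: str, answer: str, sources: str) -> int:
    """Ask the LLM to rate, from 1 to 5, how confident it is that the answer is supported by the sources."""
    response = openai_client.chat.completions.create(
        model=model,
        temperature=0.0,  # we want a consistent rating, not a creative one
        messages=[
            {
                "role": "user",
                "content": (
                    "Rate from 1 to 5 how confident you are that the answer below is fully supported "
                    "by the sources below. Reply with a single integer only.\n\n"
                    f"Answer: {answer}\n\nSources: {sources}"
                ),
            }
        ],
    )
    try:
        return int(response.choices[0].message.content.strip())
    except ValueError:
        return 1  # treat an unparseable rating as low confidence


# In the app's answer flow (illustrative):
# if check_answer_confidence(client, "gpt-4", answer, sources) <= 2:
#     answer = "I don't know, I couldn't find that in my sources."
```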

review_tools/summary_app.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -49,7 +49,9 @@ def __init__(self, results_dir: Path) -> None:
             metric_counts[metric_name] = metric_counts.get(metric_name, 0) + 1

         # Only show metrics that have shown up at least twice across runs
-        shared_metric_names = [metric_name for metric_name, count in metric_counts.items() if count > 1]
+        shared_metric_names = [
+            metric_name for metric_name, count in metric_counts.items() if count > 1 or len(run_summaries) == 1
+        ]
         shared_metric_stats = {metric_name: set() for metric_name in shared_metric_names}

         # Now figure out what stat to show about each metric
```

scripts/cli.py

Lines changed: 15 additions & 1 deletion

```diff
@@ -6,7 +6,7 @@

 from . import service_setup
 from .evaluate import run_evaluate_from_config
-from .generate import generate_test_qa_data
+from .generate import generate_dontknows_qa_data, generate_test_qa_data

 app = typer.Typer(pretty_exceptions_enable=False)

@@ -46,5 +46,19 @@ def generate(
     )


+@app.command()
+def generate_dontknows(
+    input: Path = typer.Option(exists=True, dir_okay=False, file_okay=True),
+    output: Path = typer.Option(exists=False, dir_okay=False, file_okay=True),
+    numquestions: int = typer.Option(help="Number of questions to generate", default=40),
+):
+    generate_dontknows_qa_data(
+        openai_config=service_setup.get_openai_config(),
+        num_questions_total=numquestions,
+        input_file=Path.cwd() / input,
+        output_file=Path.cwd() / output,
+    )
+
+
 def cli():
     app()
```

scripts/evaluate.py

Lines changed: 4 additions & 3 deletions

```diff
@@ -13,7 +13,7 @@
 logger = logging.getLogger("scripts")


-def send_question_to_target(question: str, target_url: str, parameters: dict = {}, raise_error=False):
+def send_question_to_target(question: str, truth: str, target_url: str, parameters: dict = {}, raise_error=False):
     headers = {"Content-Type": "application/json"}
     body = {
         "messages": [{"content": question, "role": "user"}],
@@ -45,6 +45,7 @@ def send_question_to_target(question: str, target_url: str, parameters: dict = {
             raise e
         return {
             "question": question,
+            "truth": truth,
             "answer": str(e),
             "context": str(e),
             "latency": -1,
@@ -78,7 +79,7 @@ def run_evaluation(
     logger.info("Sending a test question to the target to ensure it is running...")
     try:
         target_data = send_question_to_target(
-            "What information is in your knowledge base?", target_url, target_parameters, raise_error=True
+            "What information is in your knowledge base?", "So much", target_url, target_parameters, raise_error=True
         )
         logger.info(
             'Successfully received response from target: "question": "%s", "answer": "%s", "context": "%s"',
@@ -104,7 +105,7 @@

     # Wrap the target function so that it can be called with a single argument
     async def wrap_target(question: str, truth: str):
-        return send_question_to_target(question, target_url, target_parameters)
+        return send_question_to_target(question, truth, target_url, target_parameters)

     logger.info("Starting evaluation...")
     for metric in requested_metrics:
```

scripts/evaluate_metrics/__init__.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,6 +1,6 @@
 from .builtin_metrics import BuiltinCoherenceMetric, BuiltinGroundednessMetric, BuiltinRelevanceMetric
 from .code_metrics import AnswerLengthMetric, HasCitationMetric, LatencyMetric
-from .prompt_metrics import CoherenceMetric, GroundednessMetric, RelevanceMetric
+from .prompt_metrics import CoherenceMetric, DontKnownessMetric, GroundednessMetric, RelevanceMetric

 metrics = [
     BuiltinCoherenceMetric,
@@ -9,6 +9,7 @@
     CoherenceMetric,
     RelevanceMetric,
     GroundednessMetric,
+    DontKnownessMetric,
     LatencyMetric,
     AnswerLengthMetric,
     HasCitationMetric,
```

scripts/evaluate_metrics/code_metrics.py

Lines changed: 23 additions & 0 deletions

```diff
@@ -42,6 +42,29 @@ def get_aggregate_stats(cls, df):
         }


+class CitationMatchMetric(BaseMetric):
+
+    METRIC_NAME = "citation_match"
+
+    @classmethod
+    def get_metric(cls):
+        def citation_match(*, data, **kwargs):
+            # Return true if all citations in the truth are present in the answer
+            truth_citations = set(re.findall(r"\[[^\]]+\]", data["truth"]))
+            answer_citations = set(re.findall(r"\[[^\]]+\]", data["answer"]))
+            citation_match = truth_citations.issubset(answer_citations)
+            return {"citation_match": citation_match}
+
+        return citation_match
+
+    @classmethod
+    def get_aggregate_stats(cls, df):
+        return {
+            "total": int(df[cls.METRIC_NAME].sum()),
+            "rate": round(df[cls.METRIC_NAME].mean(), 2),
+        }
+
+
 class LatencyMetric(BaseMetric):

     METRIC_NAME = "latency"
```
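
To make the matching rule concrete, here is a tiny standalone illustration of the same regex and subset check; the file names are made up and this snippet is not part of the repo:

```python
import re

# Hypothetical ground truth and app answer, each citing sources in square brackets
truth = "Remote work is allowed two days per week [employee_handbook-3.pdf]."
answer = "Per the handbook, you may work remotely twice a week [employee_handbook-3.pdf][benefits.pdf]."

truth_citations = set(re.findall(r"\[[^\]]+\]", truth))    # {'[employee_handbook-3.pdf]'}
answer_citations = set(re.findall(r"\[[^\]]+\]", answer))  # {'[employee_handbook-3.pdf]', '[benefits.pdf]'}

# True, because every citation in the truth also appears in the answer
print(truth_citations.issubset(answer_citations))
```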

scripts/evaluate_metrics/prompt_metrics.py

Lines changed: 5 additions & 0 deletions

```diff
@@ -31,3 +31,8 @@ class CoherenceMetric(CustomRatingMetric):
 class GroundednessMetric(CustomRatingMetric):

     METRIC_NAME = "groundedness"
+
+
+class DontKnownessMetric(CustomRatingMetric):
+
+    METRIC_NAME = "dontknowness"
```
Lines changed: 40 additions & 0 deletions

```diff
@@ -0,0 +1,40 @@
+System:
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+
+User:
+The "I don't know"-ness metric is a measure of how much an answer conveys the lack of knowledge or uncertainty, which is useful for making sure a chatbot for a particular domain doesn't answer outside that domain. Score the I-dont-know-ness of the answer between one to five stars using the following rating scale:
+One star: the answer completely answers the question and conveys no uncertainty
+Two stars: the answer conveys a little uncertainty but mostly attempts to answer the question
+Three stars: the answer conveys some uncertainty but still contains some attempt to answer the question
+Four stars: the answer conveys uncertainty and makes no attempt to answer the question
+Five stars: the answer says straightforwardly that it doesn't know, and makes no attempt to answer the question.
+
+This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5.
+
+question: What are the main goals of Perseverance Mars rover mission?
+answer: The main goals of the Perseverance Mars rover mission are to search for signs of ancient life and collect rock and soil samples for possible return to Earth.
+stars: 1
+
+question: What field did Marie Curie excel in?
+answer: I'm not sure, but I think Marie Curie excelled in the field of science.
+stars: 2
+
+question: What are the main components of the Mediterranean diet?
+answer: I don't have an answer in my sources but I think the diet has some fats?
+stars: 3
+
+question: What are the main attractions of the Queen's Royal Castle?
+answer: I'm not certain. Perhaps try rephrasing the question?
+stars: 4
+
+question: Where were The Beatles formed?
+answer: I'm sorry, I don't know, that answer is not in my sources.
+stars: 5
+
+question: {{question}}
+answer: {{answer}}
+stars:
+
+Your response must include following fields and should be in json format:
+score: Number of stars based on definition above
+reason: Reason why the score was given
```
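
Given that instruction, the rating model is expected to reply with a JSON object containing those two fields; an illustrative (not actual) response might look like:

```json
{"score": 5, "reason": "The answer states plainly that it does not know and makes no attempt to answer."}
```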

scripts/generate.py

Lines changed: 76 additions & 0 deletions

```diff
@@ -1,10 +1,14 @@
 import json
 import logging
+import math
+import random
 from pathlib import Path

 from azure.ai.generative.synthetic.qa import QADataGenerator, QAType
 from azure.search.documents import SearchClient

+from . import service_setup
+
 logger = logging.getLogger("scripts")


@@ -50,3 +54,75 @@ def generate_test_qa_data(
     with open(output_file, "w", encoding="utf-8") as f:
         for item in qa:
             f.write(json.dumps(item) + "\n")
+
+
+def generate_based_on_questions(openai_client, model: str, qa: list, num_questions: int, prompt: str):
+    existing_questions = ""
+    if qa:
+        qa = random.sample(qa, len(qa))  # Shuffle questions for some randomness
+        existing_questions = "\n".join([item["question"] for item in qa])
+
+    gpt_response = openai_client.chat.completions.create(
+        model=model,
+        messages=[
+            {
+                "role": "user",
+                "content": f"{prompt} Only generate {num_questions} TOTAL. Separate each question by a new line. \n{existing_questions}",  # noqa: E501
+            }
+        ],
+        n=1,
+        max_tokens=num_questions * 50,
+        temperature=0.3,
+    )
+
+    qa = []
+    for message in gpt_response.choices[0].message.content.split("\n")[0:num_questions]:
+        qa.append({"question": message, "truth": f"Generated from this prompt: {prompt}"})
+    return qa
+
+
+def generate_dontknows_qa_data(openai_config: dict, num_questions_total: int, input_file: Path, output_file: Path):
+    logger.info("Generating off-topic questions based on %s", input_file)
+    with open(input_file, encoding="utf-8") as f:
+        qa = [json.loads(line) for line in f.readlines()]
+
+    openai_client = service_setup.get_openai_client(openai_config)
+    dontknows_qa = []
+    num_questions_each = math.ceil(num_questions_total / 4)
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa,
+        num_questions_each,
+        f"Given these questions, suggest {num_questions_each} questions that are very related but are not directly answerable by the same sources. Do not simply ask for other examples of the same thing - your question should be standalone.",  # noqa: E501
+    )
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa,
+        num_questions_each,
+        f"Given these questions, suggest {num_questions_each} questions with similar keywords that are about publicly known facts.",  # noqa: E501
+    )
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa,
+        num_questions_each,
+        f"Given these questions, suggest {num_questions_each} questions that are not related to these topics at all but have well known answers.",  # noqa: E501
+    )
+    remaining = num_questions_total - len(dontknows_qa)
+    dontknows_qa += generate_based_on_questions(
+        openai_client,
+        openai_config["model"],
+        qa=None,
+        num_questions=remaining,
+        prompt=f"Suggest {remaining} questions that are nonsensical, and would result in confusion if you asked it.",  # noqa: E501
+    )
+
+    logger.info("Writing %d off-topic questions to %s", len(dontknows_qa), output_file)
+    directory = Path(output_file).parent
+    if not directory.exists():
+        directory.mkdir(parents=True)
+    with open(output_file, "w", encoding="utf-8") as f:
+        for item in dontknows_qa:
+            f.write(json.dumps(item) + "\n")
```
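
With the `--numquestions=45` used in the README example, the split across the four question categories works out like this (a quick check of the arithmetic above):

```python
import math

num_questions_total = 45                                   # value used in the README example
num_questions_each = math.ceil(num_questions_total / 4)    # 12 questions for each of the three prompted categories
remaining = num_questions_total - 3 * num_questions_each   # 9 nonsensical questions fill out the total,
                                                           # assuming each call returns the full 12
print(num_questions_each, remaining)                       # 12 9
```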
