Review the generated data in `evals/ground_truth.jsonl` after running that script, removing any question/answer pairs that don't seem like realistic user input.
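
If you prefer to scan the file programmatically rather than opening it in an editor, the minimal sketch below prints each pair for review. The `question` and `truth` field names are assumptions and may not match the generated file exactly.

```python
import json

# Minimal review sketch: print each generated question/answer pair so that
# unrealistic ones can be spotted and removed by hand.
# NOTE: the "question" and "truth" field names are assumptions and may differ
# from what the generation script actually writes.
with open("evals/ground_truth.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        print(f"--- entry {line_number} ---")
        print("Q:", record.get("question"))
        print("A:", record.get("truth"))
```
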
## Run bulk evaluation

Review the configuration in `evals/eval_config.json` to ensure that everything is set up correctly. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.
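
As a quick sanity check before launching a long evaluation run, you can print the configuration that will be used. The sketch below assumes a `requested_metrics` key, following the ai-rag-chat-evaluator convention; the actual key names in this repository's config may differ.

```python
import json

# Quick sanity check of the evaluation configuration before starting a long run.
# NOTE: the "requested_metrics" key name follows the ai-rag-chat-evaluator
# convention and is an assumption here; adjust it if this config differs.
with open("evals/eval_config.json", encoding="utf-8") as f:
    config = json.load(f)

print("Requested metrics:", config.get("requested_metrics", "<not set>"))
print(json.dumps(config, indent=2))
```
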
For more details about how to run the chat API locally, see [Local Development with IntelliJ](local-development-intellij.md#running-the-spring-boot-chat-api-locally).

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions, the TPM capacity of the evaluation model, and the number of GPT metrics requested.

> [!IMPORTANT]
> Ground truth data is generated using a knowledge graph created from the same search index used by the RAG flow. The approach is based on the [RAGAS evaluation framework](https://docs.ragas.io/en/stable/). To learn more about the data generation approach, see [Testset Generation for RAG](https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/).

## Review the evaluation results

The evaluation script will output a summary of the evaluation results inside the `evals/results` directory.

The evaluation uses the following default metrics (as configured in `evaluate_config.json`), with results available in the `summary.json` file:

* **gpt_groundedness**: Measures how well the answer is grounded in the retrieved context. Returns a pass rate and mean rating (1-5 scale).
* **gpt_relevance**: Evaluates the relevance of the answer to the user's question. Returns a pass rate and mean rating (1-5 scale).
* **answer_length**: Tracks the length of generated answers in characters (mean, max, min values).
* **latency**: Measures response time in seconds for each question (mean, max, min values).
* **citations_matched**: Counts how many answers include properly matched citations from the source documents.
* **any_citation**: Tracks whether answers include any citations at all.

> [!IMPORTANT]
> **gpt_groundedness** and **gpt_relevance** are built-in metrics provided by the [Azure AI Evaluation SDK](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk).
>
> **answer_length**, **latency**, **citations_matched** and **any_citation** are custom metrics defined in [evaluate.py](../../evals/evaluate.py) or taken from the [ai-rag-chat-evaluator project](https://github.com/Azure-Samples/ai-rag-chat-evaluator/blob/main/src/evaltools/eval/evaluate_metrics/code_metrics.py).
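
If you want to compare runs programmatically rather than opening each file, the sketch below walks the run folders under `evals/results` and prints the contents of each `summary.json`; the exact structure of that file (metric name mapped to values such as a pass rate or a mean rating) is an assumption based on the metric descriptions above.

```python
import json
from pathlib import Path

# Sketch: print the metrics recorded in summary.json for every evaluation run
# found under evals/results.
# NOTE: the structure of summary.json (metric name mapped to values such as a
# pass rate or a mean rating) is assumed from the metric descriptions above
# and may differ slightly in practice.
results_dir = Path("evals/results")
for summary_path in sorted(results_dir.glob("*/summary.json")):
    print(f"=== {summary_path.parent.name} ===")
    with summary_path.open(encoding="utf-8") as f:
        summary = json.load(f)
    for metric_name, values in summary.items():
        print(f"  {metric_name}: {values}")
```
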
You can see a summary of results across all evaluation runs by running the following command: