
Commit a946e98

Question generation includes additional prompt instructions
Signed-off-by: Bill Murdock <[email protected]>
1 parent 4ad8ffd commit a946e98

File tree

4 files changed: +28918 additions, -887 deletions

notebooks/evaluation/README.md

Lines changed: 120 additions & 0 deletions

# README: RAG System Evaluation Framework

This repository provides a set of tools to evaluate and compare the performance of Retrieval-Augmented Generation (RAG) systems. Specifically, these notebooks demonstrate a framework for:

1. **Synthesizing a question-answer dataset** from a source document.
2. **Evaluating two RAG pipelines** (Llama Stack and LlamaIndex) on the generated dataset.
3. **Analyzing the results** using statistical methods to determine significance.

The primary goal is to offer a reproducible methodology for comparing RAG system performance on a given knowledge base.

## Table of Contents

- [Project Structure](#project-structure)
- [Getting Started](#getting-started)
- [Summary of Findings](#summary-of-findings)
- [Detailed Results](#detailed-results)
- [Key Limitations of this Study](#key-limitations-of-this-study)
- [Further Observations](#further-observations)

## Project Structure

This directory includes the following components:

* **Jupyter Notebooks**:
  * [`make-sample-questions.ipynb`](./make-sample-questions.ipynb): Generates a dataset of sample questions and reference answers from a source document.
  * [`evaluate-using-sample-questions-lls-vs-li.ipynb`](./evaluate-using-sample-questions-lls-vs-li.ipynb): Runs Llama Stack and LlamaIndex RAG pipelines on the generated questions, evaluates their responses using the Ragas framework, and performs statistical significance testing with SciPy.
* **Supporting Code**:
  * [`evaluation_utilities.py`](./evaluation_utilities.py): Utility functions and helper code for the evaluation notebooks.
* **Sample Data**:
  * [`qna-ibm-2024-2250-2239.json`](./qna-ibm-2024-2250-2239.json): A Q&A dataset generated from the IBM 2024 annual report without special instructions.
  * [`qna-ibm-2024b-2220-2196.json`](./qna-ibm-2024b-2220-2196.json): A Q&A dataset generated from the same report, but using the default special instructions in the notebook to produce more diverse questions.
  * **Note on filenames**: The numbers in the JSON filenames (`{configured_questions}-{final_question_count}`) may not perfectly match the final counts in the file due to de-duplication steps.
* **Configuration**:
  * [`requirements.txt`](./requirements.txt): A list of Python libraries required to run the notebooks.
  * [`run.yaml`](./run.yaml): A configuration file for the Llama Stack server.

## Getting Started

Follow these steps to reproduce the evaluation.

### 1. Install Dependencies

Install all the necessary Python libraries using pip:

```bash
pip install -r requirements.txt
```

### 2. Start the Llama Stack Server

The evaluation notebook requires a running Llama Stack server. Start it from your command line using the provided configuration:

```bash
llama stack run run.yaml --image-type venv
```
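
Optionally, you can confirm the server is reachable from Python before opening the notebooks. The snippet below is a minimal sketch using the `llama-stack-client` package; the port `8321` is an assumption based on common Llama Stack defaults, so adjust `base_url` to match whatever your `run.yaml` configures.

```python
# Optional connectivity check (illustrative sketch, not part of the notebooks).
# Assumes the llama-stack-client package is installed and the server listens on
# port 8321; adjust base_url to match the port configured in run.yaml.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
for model in client.models.list():
    print(model.identifier)
```

If this prints the models you expect, the evaluation notebook should be able to connect with the same settings.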

### 3. Run the Notebooks

1. **(Optional)** Run `make-sample-questions.ipynb` if you want to generate your own question-answer dataset from a new document.
2. Run `evaluate-using-sample-questions-lls-vs-li.ipynb` to execute the comparison between Llama Stack and LlamaIndex using one of the sample `.json` files.

> **Note on Scale**: Both notebooks are configured by default to run on a small number of questions for quick results. Instructions are included within the notebooks on how to adjust the configuration to run on the full datasets.
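
Before a full run, it can also help to peek at the dataset you plan to use. The sketch below only assumes the sample file parses as JSON; the record shape and field names it looks for (`questions`, `reference_answer`) are illustrative guesses, not the documented schema of these files.

```python
# Illustrative sketch: inspect one of the sample Q&A datasets.
# Only assumes the file parses as JSON; the key and field names below are
# guesses for illustration, not the documented schema of these files.
import json

with open("qna-ibm-2024-2250-2239.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Handle either a top-level list of records or a dict wrapping one (hypothetical key).
records = data if isinstance(data, list) else data.get("questions", [])
print(f"Loaded {len(records)} records")

# Count records that appear to carry a reference answer (hypothetical field name).
with_answers = sum(1 for r in records if isinstance(r, dict) and r.get("reference_answer"))
print(f"Records with a populated 'reference_answer' field: {with_answers}")
```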

## Summary of Findings

Across both datasets, our results show:

* **Higher Accuracy for Llama Stack**: Llama Stack consistently achieved a small but statistically significant advantage in accuracy metrics (`nv_accuracy` and `domain_specific_rubrics`) for questions that had reference answers.
* **Superior Handling of Unanswerable Questions**: Llama Stack demonstrated a much stronger ability to correctly identify and refuse to answer questions that were designed to be unanswerable based on the source document. A higher "Percent Unanswered" score is better in this context.

We hypothesize these differences may stem from variations in model prompting, document chunking strategies, or text processing between the two frameworks.

## Detailed Results

The tables below summarize the performance metrics from our full runs. All p-values are less than `0.05`, indicating the observed differences are statistically significant.
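
For readers who want to reproduce this kind of comparison, here is a minimal sketch of a paired significance test with SciPy on per-question scores. The score arrays are made-up placeholders, and the notebook may use a different test or statistic; this only illustrates the general approach.

```python
# Illustrative sketch of a paired permutation test on per-question scores.
# The arrays below are made-up placeholders, not values from the actual runs.
import numpy as np
from scipy import stats

# Hypothetical per-question nv_accuracy scores for the same questions under each system.
llama_stack_scores = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])
llama_index_scores = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])

def mean_difference(x, y):
    # Test statistic: difference in mean score between the two systems.
    return np.mean(x) - np.mean(y)

result = stats.permutation_test(
    (llama_stack_scores, llama_index_scores),
    mean_difference,
    permutation_type="samples",  # paired design: permute labels within each question
    n_resamples=5000,
    alternative="two-sided",
)
print(f"mean difference: {result.statistic:.4f}, p-value: {result.pvalue:.4f}")
```

The same pattern works for the "Percent Unanswered" comparison if each refusal is encoded as 1 and each attempted answer as 0.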

### Dataset 1: `qna-ibm-2024-2250-2239.json`

| Metric (Higher is Better) | Llama Stack (`gpt-3.5-turbo`) | LlamaIndex (`gpt-3.5-turbo`) | p-value | Conclusion |
| :--- | :---: | :---: | :---: | :--- |
| **Questions with Answers (1479)** | | | | |
| `nv_accuracy` | 0.5046 | 0.4696 | 0.0002 | Advantage for Llama Stack |
| `domain_specific_rubrics` (score out of 5) | 3.9757 | 3.9033 | 0.0310 | Advantage for Llama Stack |
| **Questions without Answers (760)** | | | | |
| `Percent Unanswered` | **23.95%** | 8.42% | 0.0002 | Advantage for Llama Stack |

### Dataset 2: `qna-ibm-2024b-2220-2196.json`

| Metric (Higher is Better) | Llama Stack (`gpt-3.5-turbo`) | LlamaIndex (`gpt-3.5-turbo`) | p-value | Conclusion |
| :--- | :---: | :---: | :---: | :--- |
| **Questions with Answers (1402)** | | | | |
| `nv_accuracy` | 0.4918 | 0.4358 | 0.0002 | Advantage for Llama Stack |
| `domain_specific_rubrics` (score out of 5) | 3.9073 | 3.7582 | 0.0002 | Advantage for Llama Stack |
| **Questions without Answers (794)** | | | | |
| `Percent Unanswered` | **31.74%** | 7.68% | 0.0002 | Advantage for Llama Stack |

## Key Limitations of this Study

While these results are informative, it is crucial to consider their limitations:

1. **Single Dataset**: This evaluation uses only one document. Performance could vary significantly with different data types, topics, or multiple documents.
2. **Synthetic Questions**: Questions generated by an LLM may not perfectly represent the questions real users would ask. While we used prompt engineering to increase diversity, it is not a substitute for real-world query logs.
3. **Imperfect Ground Truth**: Our reference answers were generated by a powerful RAG system (using `gpt-4o`), not by humans. This introduces noise into the evaluation, though we assume it affects both systems equally.
4. **Assumption on Unanswerable Questions**: We assume that if our reference RAG system doesn't answer a question, it is truly unanswerable. This assumption may be flawed and could contribute to the low scores for refusing to answer.
5. **Potential for Framework Bias**: Since the reference RAG system was built with LlamaIndex, it could theoretically introduce a bias in favor of LlamaIndex. However, the results show Llama Stack outperforming it, suggesting any such bias is likely minimal.
6. **Evaluation Metric Imperfections**: The Ragas metrics and the `gpt-4o` model used to power them are not perfect. This is another source of potential noise.
7. **Custom Metric Validity**: The custom prompt used to determine whether a question was answered has not been rigorously validated, though it appears to function well upon casual inspection (a hypothetical sketch of this kind of check follows this list).
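
To make limitation 7 concrete, the sketch below shows the general shape of an LLM-based "was this question answered?" check. The prompt wording, the use of the `openai` client, and the model name are all illustrative assumptions; they are not the actual prompt or client code used in the evaluation notebook.

```python
# Illustrative sketch of an LLM-based "did the system answer?" check.
# The prompt, client, and model below are assumptions for illustration only;
# they are not the actual prompt or code used in the evaluation notebook.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def appears_answered(question: str, response: str, model: str = "gpt-4o") -> bool:
    prompt = (
        "A question was posed to a question-answering system.\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Did the response attempt to answer the question, rather than refuse or "
        "say the answer is unknown? Reply with exactly YES or NO."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```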

## Further Observations

A key takeaway is that the **absolute performance of both RAG systems is quite low** in this challenging evaluation. Accuracy hovers around 50%, and the ability to correctly ignore unanswerable questions is even lower.

We believe this is partly due to the limitations mentioned above, but also because our question generation method produces a more difficult and diverse set of questions than standard benchmarks. Future work should validate whether these challenging synthetic questions are more representative of the difficulties a RAG system would face in a real-world deployment.
