Commit ce0f7c1

Update prompts and evals for multimodal
1 parent 9e0c81d commit ce0f7c1

23 files changed, with 517 additions and 44 deletions.

app/backend/approaches/prompts/ask_answer_question.prompty

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ In-network deductibles are $500 for employee and $1000 for family [info1.txt] an
 
 user:
 {{ user_query }}
-{% if image_sources is defined %}{% for image_source in image_sources %}
+{% if image_sources %}{% for image_source in image_sources %}
 ![Image]({{image_source}})
 {% endfor %}{% endif %}
 {% if text_sources is defined %}Sources:{% for text_source in text_sources %}

app/backend/approaches/prompts/chat_answer_question.prompty

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ Make sure the last question ends with ">>".
 
 user:
 {{ user_query }}
-{% if image_sources is defined %}{% for image_source in image_sources %}
+{% if image_sources %}{% for image_source in image_sources %}
 ![Image]({{image_source}})
 {% endfor %}{% endif %}
 {% if text_sources is defined %}
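
The switch from `is defined` to a plain truthiness check in both prompts can be illustrated with a plain-Python analogy. This is only a sketch, not real Jinja2 rendering: the function names are hypothetical, and it assumes the caller may pass `image_sources` as `None` or as an empty list, which the diff itself does not state.

```python
def render_images_defined_check(context):
    """Mimics `{% if image_sources is defined %}`: only checks that the key exists."""
    if "image_sources" in context:
        # Iterating over None raises TypeError, as a for loop over None would.
        return [f"![Image]({src})" for src in context["image_sources"]]
    return []


def render_images_truthy_check(context):
    """Mimics `{% if image_sources %}`: requires a non-empty, non-None value."""
    image_sources = context.get("image_sources")
    if image_sources:
        return [f"![Image]({src})" for src in image_sources]
    return []


# A value that is defined but None passes the old check and then fails in the loop.
try:
    render_images_defined_check({"image_sources": None})
except TypeError as error:
    print("defined check failed:", error)

# The truthiness check skips the block for None, an empty list, or a missing key.
print(render_images_truthy_check({"image_sources": None}))
print(render_images_truthy_check({"image_sources": []}))
print(render_images_truthy_check({"image_sources": ["figure4_1.png"]}))
```

Under these assumptions, the truthiness check covers the missing, `None`, and empty cases with a single condition.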

docs/evaluation.md

Lines changed: 10 additions & 4 deletions
@@ -72,7 +72,7 @@ Review the generated data in `evals/ground_truth.jsonl` after running that scrip
 
 ## Run bulk evaluation
 
-Review the configuration in `evals/eval_config.json` to ensure that everything is correctly setup. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.
+Review the configuration in `evals/evaluate_config.json` to ensure that everything is correctly setup. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.
 
 By default, the evaluation script will evaluate every question in the ground truth data.
 Run the evaluation script by running the following command:
@@ -84,10 +84,10 @@ python evals/evaluate.py
 The options are:
 
 * `numquestions`: The number of questions to evaluate. By default, this is all questions in the ground truth data.
-* `resultsdir`: The directory to write the evaluation results. By default, this is a timestamped folder in `evals/results`. This option can also be specified in `eval_config.json`.
-* `targeturl`: The URL of the running application to evaluate. By default, this is `http://localhost:50505`. This option can also be specified in `eval_config.json`.
+* `resultsdir`: The directory to write the evaluation results. By default, this is a timestamped folder in `evals/results`. This option can also be specified in `evaluate_config.json`.
+* `targeturl`: The URL of the running application to evaluate. By default, this is `http://localhost:50505`. This option can also be specified in `evaluate_config.json`.
 
-🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions, and the TPM capacity of the evaluation model, and the number of GPT metrics requested.
+🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions, the TPM capacity of the evaluation model, and the number of LLM-based metrics requested.
 
 ## Review the evaluation results
 
@@ -118,3 +118,9 @@ This repository includes a GitHub Action workflow `evaluate.yaml` that can be us
 In order for the workflow to run successfully, you must first set up [continuous integration](./azd.md#github-actions) for the repository.
 
 To run the evaluation on the changes in a PR, a repository member can post a `/evaluate` comment to the PR. This will trigger the evaluation workflow to run the evaluation on the PR changes and will post the results to the PR.
+
+## Evaluate multimodal RAG answers
+
+The repository also includes an `evaluate_config_multimodal.json` file specifically for evaluating multimodal RAG answers. This configuration uses a different ground truth file, `ground_truth_multimodal.jsonl`, which includes questions based on the sample data that require both text and image sources to answer.
+
+Note that the "groundedness" evaluator is not reliable for multimodal RAG, since it does not currently incorporate the image sources. We still include it in the metrics, but the more reliable metrics are "relevance" and "citations matched".

evals/evaluate.py

Lines changed: 26 additions & 4 deletions
@@ -13,6 +13,28 @@
 
 logger = logging.getLogger("ragapp")
 
+# Regex pattern to match citations of the forms:
+#   [Document Name.pdf#page=7]
+#   [Document Name.pdf#page=4(figure4_1.png)]
+# and supports multiple document extensions such as:
+#   pdf, html/htm, doc/docx, ppt/pptx, xls/xlsx, csv, txt, json,
+#   images: jpg/jpeg, png, bmp (listed as BPM in doc), tiff/tif, heif/heiff
+# Optional components:
+#   #page=\d+   -> page anchor (primarily for paged docs like PDFs)
+#   ( ... )     -> figure/image or sub-resource reference (e.g., (figure4_1.png))
+# Explanation of pattern components:
+#   \[                  - Opening bracket
+#   [^\]]+?\.           - Non-greedy match of any chars up to a dot before extension
+#   (?:pdf|html?|docx?|pptx?|xlsx?|csv|txt|json|jpe?g|png|bmp|tiff?|heiff?|heif)
+#                       - Allowed file extensions (documents and images)
+#   (?:#page=\d+)?      - Optional page reference
+#   (?:\([^()\]]+\))?   - Optional parenthetical (figure/image reference)
+#   \]                  - Closing bracket
+CITATION_REGEX = re.compile(
+    r"\[[^\]]+?\.(?:pdf|html?|docx?|pptx?|xlsx?|csv|txt|json|jpe?g|png|bmp|tiff?|heiff?|heif)(?:#page=\d+)?(?:\([^()\]]+\))?\]",
+    re.IGNORECASE,
+)
+
 
 class AnyCitationMetric(BaseMetric):
     METRIC_NAME = "any_citation"
@@ -23,7 +45,7 @@ def any_citation(*, response, **kwargs):
             if response is None:
                 logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")
                 return {cls.METRIC_NAME: -1}
-            return {cls.METRIC_NAME: bool(re.search(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))}
+            return {cls.METRIC_NAME: bool(CITATION_REGEX.search(response))}
 
         return any_citation
 
@@ -45,9 +67,9 @@ def citations_matched(*, response, ground_truth, **kwargs):
             if response is None:
                 logger.warning("Received response of None, can't compute citation_match metric. Setting to -1.")
                 return {cls.METRIC_NAME: -1}
-            # Return true if all citations in the truth are present in the response
-            truth_citations = set(re.findall(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", ground_truth))
-            response_citations = set(re.findall(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))
+            # Extract full citation tokens from ground truth and response
+            truth_citations = set(CITATION_REGEX.findall(ground_truth or ""))
+            response_citations = set(CITATION_REGEX.findall(response or ""))
             # Count the percentage of citations that are present in the response
             num_citations = len(truth_citations)
             num_matched_citations = len(truth_citations.intersection(response_citations))
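
To see why the pattern had to change, the sketch below runs the old and new expressions (both copied from the diff above) against a figure-style citation taken from the ground truth data. The `matched_fraction` helper is hypothetical: the diff does not show how the metric is finally returned, so the division and the handling of an empty ground truth are assumptions made only for illustration.

```python
import re

# Old pattern, copied from the removed lines of the diff.
OLD_REGEX = re.compile(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]")

# New pattern, copied from the added lines of the diff.
CITATION_REGEX = re.compile(
    r"\[[^\]]+?\.(?:pdf|html?|docx?|pptx?|xlsx?|csv|txt|json|jpe?g|png|bmp|tiff?|heiff?|heif)(?:#page=\d+)?(?:\([^()\]]+\))?\]",
    re.IGNORECASE,
)

PAGE_CITATION = "[Financial Market Analysis Report 2023.pdf#page=6]"
FIGURE_CITATION = "[Financial Market Analysis Report 2023.pdf#page=6(figure6_1.png)]"


def matched_fraction(response, ground_truth):
    """Hypothetical helper: share of ground-truth citations found in the response."""
    truth_citations = set(CITATION_REGEX.findall(ground_truth or ""))
    response_citations = set(CITATION_REGEX.findall(response or ""))
    if not truth_citations:
        return None  # assumption: nothing to match against
    return len(truth_citations & response_citations) / len(truth_citations)


truth = f"Gold was the most stable {PAGE_CITATION}{FIGURE_CITATION}."
answer = f"Gold varied the least {PAGE_CITATION}."

# The old pattern cannot match a citation that ends with a parenthesized figure name.
print(OLD_REGEX.search(FIGURE_CITATION))  # None

# The new pattern has no capturing groups, so findall returns whole citation strings.
print(CITATION_REGEX.findall(truth))

# Only one of the two ground-truth citations appears in the answer.
print(matched_fraction(answer, truth))  # 0.5
```

Because the new pattern returns whole citation tokens, a page citation and a figure citation on the same page count as two distinct items when the sets are compared.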

evals/evaluate_config.json

Lines changed: 4 additions & 3 deletions
@@ -19,9 +19,10 @@
             "suggest_followup_questions": false,
             "use_oid_security_filter": false,
             "use_groups_security_filter": false,
-            "vector_fields": "textEmbeddingOnly",
-            "use_gpt4v": false,
-            "gpt4v_input": "textAndImages",
+            "search_text_embeddings": true,
+            "search_image_embeddings": true,
+            "send_text_sources": true,
+            "send_image_sources": true,
             "language": "en",
             "use_agentic_retrieval": false,
             "seed": 1
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+{
+    "testdata_path": "ground_truth_multimodal.jsonl",
+    "results_dir": "results_multimodal/experiment<TIMESTAMP>",
+    "requested_metrics": ["gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
+    "target_url": "http://localhost:50505/chat",
+    "target_parameters": {
+        "overrides": {
+            "top": 3,
+            "max_subqueries": 10,
+            "results_merge_strategy": "interleaved",
+            "temperature": 0.3,
+            "minimum_reranker_score": 0,
+            "minimum_search_score": 0,
+            "retrieval_mode": "hybrid",
+            "semantic_ranker": true,
+            "semantic_captions": false,
+            "query_rewriting": false,
+            "reasoning_effort": "minimal",
+            "suggest_followup_questions": false,
+            "use_oid_security_filter": false,
+            "use_groups_security_filter": false,
+            "search_text_embeddings": true,
+            "search_image_embeddings": true,
+            "send_text_sources": true,
+            "send_image_sources": true,
+            "language": "en",
+            "use_agentic_retrieval": false,
+            "seed": 1
+        }
+    },
+    "target_response_answer_jmespath": "message.content",
+    "target_response_context_jmespath": "context.data_points.text"
+}
Lines changed: 10 additions & 31 deletions
@@ -1,31 +1,10 @@
-{"question": "How closely do the S&P 500 and NASDAQ move together?",
-"truth": "The S&P 500 and NASDAQ move very closely together, with a correlation coefficient of 0.95, indicating a strong positive relationship between the two indices [Financial Market Analysis Report 2023.pdf#page=7]."
-}
-{"question": "Which commodity—oil, gold, or wheat—was the most stable over the last decade?",
-"truth": "Over the last decade, gold was the most stable commodity compared to oil and wheat. The annual percentage changes for gold mostly stayed within a smaller range, while oil showed significant fluctuations including a large negative change in 2014 and a large positive peak in 2021. Wheat also varied but less than oil and more than gold [Financial Market Analysis Report 2023.pdf#page=6][Financial Market Analysis Report 2023.pdf#page=6(figure6_1.png)]."
-}
-{"question": "Do cryptocurrencies like Bitcoin or Ethereum show stronger ties to stocks or commodities?",
-"truth": "Cryptocurrencies like Bitcoin and Ethereum show stronger ties to stocks than to commodities. The correlation values between Bitcoin and stock indices are 0.3 with the S&P 500 and 0.4 with NASDAQ, while for Ethereum, the correlations are 0.35 with the S&P 500 and 0.45 with NASDAQ. In contrast, the correlations with commodities like Oil are lower (0.2 for Bitcoin and 0.25 for Ethereum), and correlations with Gold are slightly negative (-0.1 for Bitcoin and -0.05 for Ethereum) [Financial Market Analysis Report 2023.pdf#page=7]."
-}
-{"question": "Around what level did the S&P 500 reach its highest point before declining in 2021?",
-"truth": "The S&P 500 reached its highest point just above the 4500 level before declining in 2021 [Financial Market Analysis Report 2023.pdf#page=4][Financial Market Analysis Report 2023.pdf#page=4(figure4_1.png)]."
-}
-{"question": "In which month of 2023 did Bitcoin nearly hit 45,000?",
-"truth": "Bitcoin nearly hit 45,000 in December 2023, as shown by the blue line reaching close to 45,000 on the graph for that month [Financial Market Analysis Report 2023.pdf#page=5(figure5_1.png)]."
-}
-{
-"question": "Which year saw oil prices fall the most, and by roughly how much did they drop?",
-"truth": "The year that saw oil prices fall the most was 2020, with a drop of roughly 20% as shown by the blue bar extending to about -20% on the horizontal bar chart of annual percentage changes for Oil from 2014 to 2022 [Financial Market Analysis Report 2023.pdf#page=6(figure6_1.png)]."
-}
-{"question": "What was the approximate inflation rate in 2022?",
-"truth": "The approximate inflation rate in 2022 was near 3.4% according to the orange line in the inflation data on the graph showing trends from 2018 to 2023 [Financial Market Analysis Report 2023.pdf#page=8(figure8_1.png)]."
-}
-{"question": "By 2028, to what relative value are oil prices projected to move compared to their 2024 baseline of 100?",
-"truth" :"Oil prices are projected to decline to about 90 by 2028, relative to their 2024 baseline of 100. [Financial Market Analysis Report 2023.pdf#page=9(figure9_1.png)]."
-}
-{"question": "What approximate value did the S&P 500 fall to at its lowest point between 2018 and 2022?",
-"truth": "The S&P 500 fell in 2018 to an approximate value of around 2600 at its lowest point between 2018 and 2022, as shown by the graph depicting the 5-Year Trend of the S&P 500 Index [Financial Market Analysis Report 2023.pdf#page=4(figure4_1.png)]."
-}
-{"question": "Around what value did Ethereum finish the year at in 2023?",
-"truth": "Ethereum finished the year 2023 at a value around 2200, as indicated by the orange line on the price fluctuations graph for the last 12 months [Financial Market Analysis Report 2023.pdf#page=5][Financial Market Analysis Report 2023.pdf#page=5(figure5_1.png)][Financial Market Analysis Report 2023.pdf#page=5(figure5_2.png)]."
-}
+{"question": "Which commodity—oil, gold, or wheat—was the most stable over the last decade?", "truth": "Over the last decade, gold was the most stable commodity compared to oil and wheat. The annual percentage changes for gold mostly stayed within a smaller range, while oil showed significant fluctuations including a large negative change in 2014 and a large positive peak in 2021. Wheat also varied but less than oil and more than gold [Financial Market Analysis Report 2023.pdf#page=6][Financial Market Analysis Report 2023.pdf#page=6(figure6_1.png)]."}
+{"question": "Do cryptocurrencies like Bitcoin or Ethereum show stronger ties to stocks or commodities?", "truth": "Cryptocurrencies like Bitcoin and Ethereum show stronger ties to stocks than to commodities. The correlation values between Bitcoin and stock indices are 0.3 with the S&P 500 and 0.4 with NASDAQ, while for Ethereum, the correlations are 0.35 with the S&P 500 and 0.45 with NASDAQ. In contrast, the correlations with commodities like Oil are lower (0.2 for Bitcoin and 0.25 for Ethereum), and correlations with Gold are slightly negative (-0.1 for Bitcoin and -0.05 for Ethereum) [Financial Market Analysis Report 2023.pdf#page=7]."}
+{"question": "Around what level did the S&P 500 reach its highest point before declining in 2021?", "truth": "The S&P 500 reached its highest point just above the 4500 level before declining in 2021 [Financial Market Analysis Report 2023.pdf#page=4][Financial Market Analysis Report 2023.pdf#page=4(figure4_1.png)]."}
+{"question": "In which month of 2023 did Bitcoin nearly hit 45,000?", "truth": "Bitcoin nearly hit 45,000 in December 2023, as shown by the blue line reaching close to 45,000 on the graph for that month [Financial Market Analysis Report 2023.pdf#page=5(figure5_1.png)]."}
+{"question": "Which year saw oil prices fall the most, and by roughly how much did they drop?", "truth": "The year that saw oil prices fall the most was 2020, with a drop of roughly 20% as shown by the blue bar extending to about -20% on the horizontal bar chart of annual percentage changes for Oil from 2014 to 2022 [Financial Market Analysis Report 2023.pdf#page=6(figure6_1.png)]."}
+{"question": "What was the approximate inflation rate in 2022?", "truth": "The approximate inflation rate in 2022 was near 3.4% according to the orange line in the inflation data on the graph showing trends from 2018 to 2023 [Financial Market Analysis Report 2023.pdf#page=8(figure8_1.png)]."}
+{"question": "By 2028, to what relative value are oil prices projected to move compared to their 2024 baseline of 100?", "truth": "Oil prices are projected to decline to about 90 by 2028, relative to their 2024 baseline of 100. [Financial Market Analysis Report 2023.pdf#page=9(figure9_1.png)]."}
+{"question": "What approximate value did the S&P 500 fall to at its lowest point between 2018 and 2022?", "truth": "The S&P 500 fell in 2018 to an approximate value of around 2600 at its lowest point between 2018 and 2022, as shown by the graph depicting the 5-Year Trend of the S&P 500 Index [Financial Market Analysis Report 2023.pdf#page=4(figure4_1.png)]."}
+{"question": "Around what value did Ethereum finish the year at in 2023?", "truth": "Ethereum finished the year 2023 at a value around 2200, as indicated by the orange line on the price fluctuations graph for the last 12 months [Financial Market Analysis Report 2023.pdf#page=5][Financial Market Analysis Report 2023.pdf#page=5(figure5_1.png)][Financial Market Analysis Report 2023.pdf#page=5(figure5_2.png)]."}
+{"question": "What was the approximate GDP growth rate in 2021?", "truth": "The approximate GDP growth rate in 2021 was about 4.5% according to the line graph showing trends from 2018 to 2023 [Financial Market Analysis Report 2023.pdf#page=8(figure8_1.png)]."}
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+{
+    "testdata_path": "ground_truth_multimodal.jsonl",
+    "results_dir": "results_multimodal/baseline-ask",
+    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
+    "target_url": "http://localhost:50505/ask",
+    "target_parameters": {
+        "overrides": {
+            "top": 3,
+            "max_subqueries": 10,
+            "results_merge_strategy": "interleaved",
+            "temperature": 0.3,
+            "minimum_reranker_score": 0,
+            "minimum_search_score": 0,
+            "retrieval_mode": "hybrid",
+            "semantic_ranker": true,
+            "semantic_captions": false,
+            "query_rewriting": false,
+            "reasoning_effort": "minimal",
+            "suggest_followup_questions": false,
+            "use_oid_security_filter": false,
+            "use_groups_security_filter": false,
+            "search_text_embeddings": true,
+            "search_image_embeddings": true,
+            "send_text_sources": true,
+            "send_image_sources": true,
+            "language": "en",
+            "use_agentic_retrieval": false,
+            "seed": 1
+        }
+    },
+    "target_response_answer_jmespath": "message.content",
+    "target_response_context_jmespath": "context.data_points.text"
+}
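
The two 33-line configurations added in this commit look nearly identical, so a quick comparison helps confirm where they differ. The sketch below copies only their top-level fields from the diff; the identical `overrides` block is left out for brevity. The variable names `chat_config` and `ask_config` and the helper `differing_keys` are hypothetical labels chosen for this illustration, since the page does not show the file names.

```python
# Top-level fields copied from the two new config files in this commit.
# The "target_parameters.overrides" block is identical in both and is omitted here.
chat_config = {
    "testdata_path": "ground_truth_multimodal.jsonl",
    "results_dir": "results_multimodal/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
    "target_url": "http://localhost:50505/chat",
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text",
}

ask_config = {
    "testdata_path": "ground_truth_multimodal.jsonl",
    "results_dir": "results_multimodal/baseline-ask",
    "requested_metrics": [
        "gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation",
    ],
    "target_url": "http://localhost:50505/ask",
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text",
}


def differing_keys(first, second):
    """Return the sorted keys whose values differ between two config dicts."""
    return sorted(key for key in first.keys() | second.keys() if first.get(key) != second.get(key))


print(differing_keys(chat_config, ask_config))
# ['requested_metrics', 'results_dir', 'target_url']

# The only metric present in one config and not the other.
print(set(ask_config["requested_metrics"]) - set(chat_config["requested_metrics"]))
# {'gpt_groundedness'}
```

The comparison shows that the two files differ only in the target endpoint, the results directory, and whether `gpt_groundedness` is requested.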
