
Conversation

pamelafox (Collaborator) commented on Sep 4, 2025

Purpose

This PR cleans up the chat and ask prompts to be more consistent with each other, and makes sure they both include full instructions about how to format citations, along with the list of allowed citations for the current question.

I've updated the snapshot tests and run evaluations on the new prompts. As part of that, I improved support for evaluating multimodal RAG and evaluated several multimodal options against a new ground truth data set based on the sample PPT deck about financial data.

| metric | stat | baseline | baseline-ask | no-image-embeddings | no-image-sources |
| --- | --- | --- | --- | --- | --- |
| gpt_groundedness | mean_rating | 4.2 | 4.1 | 4.1 | 4.9 |
| gpt_groundedness | pass_rate | 0.7 | 0.6 | 0.7 | 1.0 |
| gpt_relevance | mean_rating | 4.4 | 4.1 | 4.5 | 4.0 |
| gpt_relevance | pass_rate | 1.0 | 0.9 | 1.0 | 0.9 |
| answer_length | mean | 279.4 | 254.5 | 277.9 | 281.5 |
| latency | mean | 4.31 | 4.83 | 3.82 | 2.88 |
| citations_matched | rate | 0.97 | 0.83 | 0.88 | 0.23 |
| any_citation | rate | 1.0 | 1.0 | 1.0 | 1.0 |

The groundedness metric is lower for the configurations that use image sources because the groundedness evaluator does not support image inputs; citations_matched is the more reliable metric here.

Prompt and Configuration Updates

  • Refined prompt templates in app/backend/approaches/prompts/ask_answer_question.prompty and chat_answer_question.prompty to clarify citation requirements, improve language handling, and ensure sources are referenced in a consistent format for both text and images. [1] [2]

Evaluation Enhancements

  • Updated the configuration in evals/evaluate_config.json to match the latest default parameters that the application sends.
  • Added a comprehensive regex (CITATION_REGEX) in evals/evaluate.py to robustly match citations for both text and image sources, supporting multiple file types and citation formats. This improves the accuracy of citation detection in the evaluation metrics (a hedged sketch follows this list).
  • Updated the citation metrics (any_citation and citations_matched) in evals/evaluate.py to use the new regex, ensuring more reliable extraction and comparison of citations between ground truth and model responses. [1] [2]
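
To make the citation-matching idea concrete, here is a minimal sketch of how such a regex and the two metrics could fit together. The pattern, the file-extension list, and the fraction-based scoring are assumptions for illustration; the actual CITATION_REGEX and metric code in evals/evaluate.py may differ.

```python
import re

# Hedged sketch of a citation pattern, assuming citations appear as bracketed
# source names such as [report.pdf#page=2] or [chart.png]. The real
# CITATION_REGEX may allow different extensions and citation formats.
CITATION_REGEX = re.compile(
    r"\[([^\]]+?\.(?:pdf|pptx|docx|txt|md|png|jpg|jpeg)(?:#page=\d+)?)\]",
    re.IGNORECASE,
)

def any_citation(response: str) -> bool:
    """True if the response contains at least one bracketed source citation."""
    return bool(CITATION_REGEX.search(response))

def citations_matched(response: str, ground_truth: str) -> float:
    """Fraction of ground-truth citations that also appear in the response."""
    truth_citations = set(CITATION_REGEX.findall(ground_truth))
    if not truth_citations:
        return 0.0
    response_citations = set(CITATION_REGEX.findall(response))
    return len(truth_citations & response_citations) / len(truth_citations)
```

For example, under this sketch, citations_matched("Revenue rose [report.pdf#page=2].", "See [report.pdf#page=2] and [chart.png].") would return 0.5, since one of the two ground-truth citations appears in the response.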

Multimodal Evaluation Support

  • Added a new ground truth file, evals/ground_truth_multimodal.jsonl, containing questions and answers that require both text and image sources for evaluation, enabling thorough testing of multimodal RAG capabilities (an illustrative record is sketched after this list).
  • Introduced evals/evaluate_config_multimodal.json and a sample baseline config for multimodal evaluation, specifying relevant metrics and parameters for experiments involving both text and images. [1] [2]
  • Updated documentation (docs/evaluation.md) to explain how to run and interpret multimodal RAG evaluations, including caveats about groundedness metrics for multimodal answers.
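
For illustration, a record in the new ground truth file might look like the sketch below. The field names ("question" and "truth") and the cited source are assumptions for illustration, not a confirmed description of the schema of evals/ground_truth_multimodal.jsonl.

```python
import json

# Hypothetical record shape for the multimodal ground truth file; the real
# schema may use different field names, and the cited source is made up.
example_record = {
    "question": "What trend does the revenue chart in the financial deck show?",
    "truth": "Revenue grew quarter over quarter [sample_financial_deck.pptx].",
}
print(json.dumps(example_record))  # one JSON object per line in the .jsonl file
```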

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[X] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

Copilot AI (Contributor) left a comment
Pull Request Overview

This PR improves consistency and quality of the chat and ask prompt templates by making them more uniform and adding explicit citation guidance. The changes update prompt text, fix tests to match new prompts, and include improvements for evaluating multimodal RAG capabilities.

  • Updates prompt templates in both chat and ask approaches to use more consistent language and formatting
  • Adds explicit citation guidance and possible citations list to prompt templates
  • Enhances evaluation framework with improved citation matching and multimodal evaluation support

Reviewed Changes

Copilot reviewed 90 out of 91 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/test_app.py | Updates test assertion to match new prompt text starting phrase |
| tests/snapshots/* | Updates all snapshot test files to reflect new prompt content |
| scripts/pretty_print_jsonl.py | Adds new utility script for pretty-printing JSONL files (sketch below) |
| evals/results_multimodal/* | Adds new multimodal evaluation results and configuration files |
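
As a rough illustration of what such a utility does, here is a minimal sketch of a JSONL pretty-printer; the actual scripts/pretty_print_jsonl.py may take different arguments and produce different output.

```python
import json
import sys

def pretty_print_jsonl(path: str) -> None:
    """Print each JSON object in a JSONL file with indentation for readability."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                print(json.dumps(json.loads(line), indent=2, ensure_ascii=False))

if __name__ == "__main__":
    # Hypothetical usage: python scripts/pretty_print_jsonl.py evals/ground_truth_multimodal.jsonl
    pretty_print_jsonl(sys.argv[1])
```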

A review comment was left on the following prompt excerpt:

> If you cannot answer using the sources below, say you don't know. Use below example to answer.
> Assistant helps the company employees with their questions about internal documents. Be brief in your answers.
> Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below.
> You CANNOT ask clarifying questions to the user, since the user will have no way to reply.
Collaborator commented:
gpt-5 can do negations right

pamelafox (Collaborator, Author) replied:
(Discussed offline as well). There's been some talk of avoiding negations in prompting LLMs, but I haven't seen any strong evidence that this is a big issue. I also checked the evaluation results for the "ask" approach and I don't see any follow-up questions in those results.

pamelafox merged commit d08040f into Azure-Samples:main on Sep 5, 2025 (44 of 45 checks passed).
pamelafox deleted the answersources branch on September 5, 2025 at 05:55.