
Conversation

pamelafox (Collaborator) commented on Sep 4, 2025

Purpose

This PR cleans up the chat and ask prompts to be more consistent with each other, and makes sure they both include full instructions about how to format citations, along with the list of allowed citations for the current question.

I've updated the snapshot tests and run evaluations on the new prompts. As part of that, I improved support for evaluating multimodal RAG and evaluated several multimodal options against a new ground truth data set based on the sample PPT deck about financial data.

| metric | stat | baseline | baseline-ask | no-image-embeddings | no-image-sources |
| --- | --- | --- | --- | --- | --- |
| gpt_groundedness | mean_rating | 4.2 | 4.1 | 4.1 | 4.9 |
| gpt_groundedness | pass_rate | 0.7 | 0.6 | 0.7 | 1.0 |
| gpt_relevance | mean_rating | 4.4 | 4.1 | 4.5 | 4.0 |
| gpt_relevance | pass_rate | 1.0 | 0.9 | 1.0 | 0.9 |
| answer_length | mean | 279.4 | 254.5 | 277.9 | 281.5 |
| latency | mean | 4.31 | 4.83 | 3.82 | 2.88 |
| citations_matched | rate | 0.97 | 0.83 | 0.88 | 0.23 |
| any_citation | rate | 1.0 | 1.0 | 1.0 | 1.0 |

The groundedness metric is lower for the configurations that use image sources because the groundedness evaluator does not support image inputs; citations_matched is the more reliable metric here.

Prompt and Configuration Updates

  • Refined prompt templates in app/backend/approaches/prompts/ask_answer_question.prompty and chat_answer_question.prompty to clarify citation requirements, improve language handling, and ensure sources are referenced in a consistent format for both text and images. [1] [2]

Evaluation Enhancements

  • Updated the configuration in evals/evaluate_config.json to match the latest default parameters that the application sends.
  • Added a comprehensive regex (CITATION_REGEX) in evals/evaluate.py to robustly match citations for both text and image sources, supporting multiple file types and citation formats. This improves the accuracy of citation detection in the evaluation metrics (a hedged sketch follows this list).
  • Updated the citation metrics (any_citation and citations_matched) in evals/evaluate.py to use the new regex, ensuring more reliable extraction and comparison of citations between ground truth and model responses. [1] [2]
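
To make the citation-matching idea concrete, here is a minimal sketch of how such a regex and the two metrics could fit together. The pattern, the file-extension list, and the fraction-based scoring are assumptions for illustration; the actual CITATION_REGEX and metric code in evals/evaluate.py may differ.

```python
import re

# Hedged sketch of a citation pattern, assuming citations appear as bracketed
# source names such as [report.pdf#page=2] or [chart.png]. The real
# CITATION_REGEX may allow different extensions and citation formats.
CITATION_REGEX = re.compile(
    r"\[([^\]]+?\.(?:pdf|pptx|docx|txt|md|png|jpg|jpeg)(?:#page=\d+)?)\]",
    re.IGNORECASE,
)

def any_citation(response: str) -> bool:
    """True if the response contains at least one bracketed source citation."""
    return bool(CITATION_REGEX.search(response))

def citations_matched(response: str, ground_truth: str) -> float:
    """Fraction of ground-truth citations that also appear in the response."""
    truth_citations = set(CITATION_REGEX.findall(ground_truth))
    if not truth_citations:
        return 0.0
    response_citations = set(CITATION_REGEX.findall(response))
    return len(truth_citations & response_citations) / len(truth_citations)
```

For example, under this sketch, citations_matched("Revenue rose [report.pdf#page=2].", "See [report.pdf#page=2] and [chart.png].") would return 0.5, since one of the two ground-truth citations appears in the response.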

Multimodal Evaluation Support

  • Added a new ground truth file, evals/ground_truth_multimodal.jsonl, containing questions and answers that require both text and image sources for evaluation, enabling thorough testing of multimodal RAG capabilities (an illustrative record is sketched after this list).
  • Introduced evals/evaluate_config_multimodal.json and a sample baseline config for multimodal evaluation, specifying relevant metrics and parameters for experiments involving both text and images. [1] [2]
  • Updated documentation (docs/evaluation.md) to explain how to run and interpret multimodal RAG evaluations, including caveats about groundedness metrics for multimodal answers.
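
For illustration, a record in the new ground truth file might look like the sketch below. The field names ("question" and "truth") and the cited source are assumptions for illustration, not a confirmed description of the schema of evals/ground_truth_multimodal.jsonl.

```python
import json

# Hypothetical record shape for the multimodal ground truth file; the real
# schema may use different field names, and the cited source is made up.
example_record = {
    "question": "What trend does the revenue chart in the financial deck show?",
    "truth": "Revenue grew quarter over quarter [sample_financial_deck.pptx].",
}
print(json.dumps(example_record))  # one JSON object per line in the .jsonl file
```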

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[X] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

Copilot AI (Contributor) left a comment
Pull Request Overview

This PR improves consistency and quality of the chat and ask prompt templates by making them more uniform and adding explicit citation guidance. The changes update prompt text, fix tests to match new prompts, and include improvements for evaluating multimodal RAG capabilities.

  • Updates prompt templates in both chat and ask approaches to use more consistent language and formatting
  • Adds explicit citation guidance and possible citations list to prompt templates
  • Enhances evaluation framework with improved citation matching and multimodal evaluation support

Reviewed Changes

Copilot reviewed 90 out of 91 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/test_app.py | Updates test assertion to match new prompt text starting phrase |
| tests/snapshots/* | Updates all snapshot test files to reflect new prompt content |
| scripts/pretty_print_jsonl.py | Adds new utility script for pretty-printing JSONL files (sketch below) |
| evals/results_multimodal/* | Adds new multimodal evaluation results and configuration files |
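
As a rough illustration of what such a utility does, here is a minimal sketch of a JSONL pretty-printer; the actual scripts/pretty_print_jsonl.py may take different arguments and produce different output.

```python
import json
import sys

def pretty_print_jsonl(path: str) -> None:
    """Print each JSON object in a JSONL file with indentation for readability."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                print(json.dumps(json.loads(line), indent=2, ensure_ascii=False))

if __name__ == "__main__":
    # Hypothetical usage: python scripts/pretty_print_jsonl.py evals/ground_truth_multimodal.jsonl
    pretty_print_jsonl(sys.argv[1])
```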

A review comment was left on the following prompt excerpt:

> If you cannot answer using the sources below, say you don't know. Use below example to answer.
> Assistant helps the company employees with their questions about internal documents. Be brief in your answers.
> Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below.
> You CANNOT ask clarifying questions to the user, since the user will have no way to reply.
Collaborator commented:
gpt-5 can do negations right

pamelafox (Collaborator, Author) replied:
(Discussed offline as well). There's been some talk of avoiding negations in prompting LLMs, but I haven't seen any strong evidence that this is a big issue. I also checked the evaluation results for the "ask" approach and I don't see any follow-up questions in those results.

pamelafox merged commit d08040f into Azure-Samples:main on Sep 5, 2025 (44 of 45 checks passed).
pamelafox deleted the answersources branch on September 5, 2025 at 05:55.