Update chat/ask prompts for improved consistency, run multimodal evals #2709
Conversation
Pull Request Overview
This PR improves the consistency and quality of the chat and ask prompt templates and adds explicit citation guidance. The changes update the prompt text, fix tests to match the new prompts, and improve support for evaluating multimodal RAG capabilities.
- Updates prompt templates in both chat and ask approaches to use more consistent language and formatting
- Adds explicit citation guidance and possible citations list to prompt templates
- Enhances evaluation framework with improved citation matching and multimodal evaluation support
Reviewed Changes
Copilot reviewed 90 out of 91 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/test_app.py | Updates the test assertion to match the new prompt's starting phrase |
| tests/snapshots/* | Updates all snapshot test files to reflect the new prompt content |
| scripts/pretty_print_jsonl.py | Adds a new utility script for pretty-printing JSONL files (see the sketch after this table) |
| evals/results_multimodal/* | Adds new multimodal evaluation results and configuration files |
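For context on the new utility, here is a minimal sketch of what a JSONL pretty-printer can look like, assuming it takes a file path on the command line and re-emits each record as indented JSON; the argument handling and output format are assumptions and may differ from the actual `scripts/pretty_print_jsonl.py`.

```python
# Hypothetical sketch of a JSONL pretty-printer; the real scripts/pretty_print_jsonl.py
# may handle arguments and formatting differently.
import json
import sys


def pretty_print_jsonl(path: str) -> None:
    """Read a JSONL file and print each record as indented JSON."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            print(json.dumps(record, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    pretty_print_jsonl(sys.argv[1])
```

Invocation would look something like `python scripts/pretty_print_jsonl.py evals/ground_truth_multimodal.jsonl`, though the exact interface is an assumption.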
> If you cannot answer using the sources below, say you don't know. Use below example to answer.
> Assistant helps the company employees with their questions about internal documents. Be brief in your answers.
> Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. Do not generate answers that don't use the sources below.
> You CANNOT ask clarifying questions to the user, since the user will have no way to reply.
gpt-5 can do negations right
(Discussed offline as well). There's been some talk of avoiding negations in prompting LLMs, but I haven't seen any strong evidence that this is a big issue. I also checked the evaluation results for the "ask" approach and I don't see any follow-up questions in those results.
Purpose
This PR cleans up the chat and ask prompts to be more consistent with each other, and makes sure they both include full instructions about how to format citations, along with the list of allowed citations for the current question.
I've updated the snapshot tests and run evaluations on the new prompts. As part of that, I improved support for evaluating multimodal RAG and evaluated various multimodal options on a new ground truth data set based on the sample PPT deck about financial data.
The groundedness metric is low due to the groundedness evaluator's lack of support for image sources. The citations_matched metric is the more reliable one.
Prompt and Configuration Updates
- Updated `app/backend/approaches/prompts/ask_answer_question.prompty` and `chat_answer_question.prompty` to clarify citation requirements, improve language handling, and ensure sources are referenced in a consistent format for both text and images.

Evaluation Enhancements
- Updated `evals/evaluate_config.json` to the latest default parameters that the application sends.
- Added a new regex (`CITATION_REGEX`) in `evals/evaluate.py` to robustly match citations for both text and image sources, supporting multiple file types and citation formats. This improves the accuracy of citation detection in the evaluation metrics (a simplified sketch follows this list).
- Updated the citation metrics (`any_citation` and `citations_matched`) in `evals/evaluate.py` to use the new regex, ensuring more reliable extraction and comparison of citations between ground truth and model responses.
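To illustrate the idea, here is a minimal sketch of a bracket-style citation matcher: it extracts bracketed source references (including page anchors and image file types) and scores how many ground-truth citations appear in a response. The pattern, helper name, and scoring logic are assumptions for illustration, not the actual code in `evals/evaluate.py`.

```python
# Hypothetical sketch of citation matching; the real CITATION_REGEX and metric
# implementations in evals/evaluate.py may differ.
import re

# Match bracketed citations such as [report.pdf#page=3] or [diagram.png],
# covering common text and image file extensions.
CITATION_REGEX = re.compile(
    r"\[([^\[\]]+?\.(?:pdf|docx|pptx|txt|md|png|jpg|jpeg)(?:#page=\d+)?)\]",
    re.IGNORECASE,
)


def citations_matched(ground_truth: str, response: str) -> float:
    """Fraction of ground-truth citations that also appear in the response."""
    truth_citations = set(CITATION_REGEX.findall(ground_truth))
    response_citations = set(CITATION_REGEX.findall(response))
    if not truth_citations:
        return 0.0
    return len(truth_citations & response_citations) / len(truth_citations)


# Example: one of two ground-truth citations appears in the response -> 0.5
print(citations_matched(
    "See [deck.pptx#page=2] and [notes.txt].",
    "The answer is based on [deck.pptx#page=2].",
))
```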
Multimodal Evaluation Support
- Added a new ground truth data set, `evals/ground_truth_multimodal.jsonl`, containing questions and answers that require both text and image sources for evaluation, enabling thorough testing of multimodal RAG capabilities (an example record shape is sketched after this list).
- Added `evals/evaluate_config_multimodal.json` and a sample baseline config for multimodal evaluation, specifying relevant metrics and parameters for experiments involving both text and images.
- Updated the documentation (`docs/evaluation.md`) to explain how to run and interpret multimodal RAG evaluations, including caveats about groundedness metrics for multimodal answers.
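As an illustration of the data shape, the snippet below constructs one hypothetical ground-truth record; the field names, question, answer, and citation format are assumptions for illustration and are not taken from `evals/ground_truth_multimodal.jsonl`.

```python
# Hypothetical example of the shape a ground-truth record might take; the actual
# schema and contents of evals/ground_truth_multimodal.jsonl may differ.
import json

example_record = {
    "question": "What trend does the revenue chart in the financial deck show?",
    "truth": "Revenue grows year over year, as shown in the chart on the slide. [sample-financial-deck.pptx#page=2]",
}

# Each line of a .jsonl file is one JSON object serialized on a single line.
print(json.dumps(example_record))
```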
Does this introduce a breaking change?
When developers merge from main and run the server, `azd up`, or `azd deploy`, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial,
which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
- The current tests all pass (`python -m pytest`).
- I ran `python -m pytest --cov` to verify 100% coverage of added lines.
- I ran `python -m mypy` to check for type errors.
- I ran `ruff` and `black` manually on my code.