Update chat/ask prompts for improved consistency, run multimodal evals #2709
Merged
@@ -0,0 +1,33 @@
{
    "testdata_path": "ground_truth_multimodal.jsonl",
    "results_dir": "results_multimodal/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
        "overrides": {
            "top": 3,
            "max_subqueries": 10,
            "results_merge_strategy": "interleaved",
            "temperature": 0.3,
            "minimum_reranker_score": 0,
            "minimum_search_score": 0,
            "retrieval_mode": "hybrid",
            "semantic_ranker": true,
            "semantic_captions": false,
            "query_rewriting": false,
            "reasoning_effort": "minimal",
            "suggest_followup_questions": false,
            "use_oid_security_filter": false,
            "use_groups_security_filter": false,
            "search_text_embeddings": true,
            "search_image_embeddings": true,
            "send_text_sources": true,
            "send_image_sources": true,
            "language": "en",
            "use_agentic_retrieval": false,
            "seed": 1
        }
    },
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text"
}
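The two `*_jmespath` settings in the config above name where the evaluator should find the answer and the retrieved context in the target's JSON response. As a rough illustration, the sketch below resolves those two paths against a made-up response. It assumes plain dotted key paths only; `extract` is a hypothetical stand-in written with the standard library, not the JMESPath library itself, and `sample_response` is invented to match the shape the paths imply.

```python
from typing import Any


def extract(path: str, payload: dict) -> Any:
    """Follow a dotted key path through nested dicts; return None if a key is missing.

    This covers only the simple `a.b.c` subset of JMESPath used in the config.
    """
    current: Any = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical response body, shaped only to match the two configured paths.
sample_response = {
    "message": {"content": "Gold was the most stable [report.pdf#page=6]."},
    "context": {"data_points": {"text": ["report.pdf#page=6: Gold ..."]}},
}

answer = extract("message.content", sample_response)
context = extract("context.data_points.text", sample_response)
```

If the target's response shape changes, only the two path strings in the config would need to change, which is presumably why they are configurable.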
@@ -0,0 +1,10 @@
{"question": "Which commodity—oil, gold, or wheat—was the most stable over the last decade?", "truth": "Over the last decade, gold was the most stable commodity compared to oil and wheat. The annual percentage changes for gold mostly stayed within a smaller range, while oil showed significant fluctuations including a large negative change in 2014 and a large positive peak in 2021. Wheat also varied but less than oil and more than gold [Financial Market Analysis Report 2023.pdf#page=6][Financial Market Analysis Report 2023.pdf#page=6(figure6_1.png)]."}
{"question": "Do cryptocurrencies like Bitcoin or Ethereum show stronger ties to stocks or commodities?", "truth": "Cryptocurrencies like Bitcoin and Ethereum show stronger ties to stocks than to commodities. The correlation values between Bitcoin and stock indices are 0.3 with the S&P 500 and 0.4 with NASDAQ, while for Ethereum, the correlations are 0.35 with the S&P 500 and 0.45 with NASDAQ. In contrast, the correlations with commodities like Oil are lower (0.2 for Bitcoin and 0.25 for Ethereum), and correlations with Gold are slightly negative (-0.1 for Bitcoin and -0.05 for Ethereum) [Financial Market Analysis Report 2023.pdf#page=7]."}
{"question": "Around what level did the S&P 500 reach its highest point before declining in 2021?", "truth": "The S&P 500 reached its highest point just above the 4500 level before declining in 2021 [Financial Market Analysis Report 2023.pdf#page=4][Financial Market Analysis Report 2023.pdf#page=4(figure4_1.png)]."}
{"question": "In which month of 2023 did Bitcoin nearly hit 45,000?", "truth": "Bitcoin nearly hit 45,000 in December 2023, as shown by the blue line reaching close to 45,000 on the graph for that month [Financial Market Analysis Report 2023.pdf#page=5(figure5_1.png)]."}
{"question": "Which year saw oil prices fall the most, and by roughly how much did they drop?", "truth": "The year that saw oil prices fall the most was 2020, with a drop of roughly 20% as shown by the blue bar extending to about -20% on the horizontal bar chart of annual percentage changes for Oil from 2014 to 2022 [Financial Market Analysis Report 2023.pdf#page=6(figure6_1.png)]."}
{"question": "What was the approximate inflation rate in 2022?", "truth": "The approximate inflation rate in 2022 was near 3.4% according to the orange line in the inflation data on the graph showing trends from 2018 to 2023 [Financial Market Analysis Report 2023.pdf#page=8(figure8_1.png)]."}
{"question": "By 2028, to what relative value are oil prices projected to move compared to their 2024 baseline of 100?", "truth": "Oil prices are projected to decline to about 90 by 2028, relative to their 2024 baseline of 100. [Financial Market Analysis Report 2023.pdf#page=9(figure9_1.png)]."}
{"question": "What approximate value did the S&P 500 fall to at its lowest point between 2018 and 2022?", "truth": "The S&P 500 fell in 2018 to an approximate value of around 2600 at its lowest point between 2018 and 2022, as shown by the graph depicting the 5-Year Trend of the S&P 500 Index [Financial Market Analysis Report 2023.pdf#page=4(figure4_1.png)]."}
{"question": "Around what value did Ethereum finish the year at in 2023?", "truth": "Ethereum finished the year 2023 at a value around 2200, as indicated by the orange line on the price fluctuations graph for the last 12 months [Financial Market Analysis Report 2023.pdf#page=5][Financial Market Analysis Report 2023.pdf#page=5(figure5_1.png)][Financial Market Analysis Report 2023.pdf#page=5(figure5_2.png)]."}
{"question": "What was the approximate GDP growth rate in 2021?", "truth": "The approximate GDP growth rate in 2021 was about 4.5% according to the line graph showing trends from 2018 to 2023 [Financial Market Analysis Report 2023.pdf#page=8(figure8_1.png)]."}
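Each ground-truth record above ends its `truth` text with bracketed citations, some pointing at a page and some at a figure image on that page. The sketch below shows one plausible way to pull those citations out of a record and compare them with an answer. The diff does not show how the evaluator actually computes `citations_matched`, so `citation_overlap` is an assumption for illustration, as are the function names and the example answer string.

```python
import json
import re

# Any text between one pair of square brackets is treated as a citation.
CITATION_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_citations(text: str) -> set[str]:
    """Return the set of bracketed citation strings found in the text."""
    return set(CITATION_PATTERN.findall(text))


def citation_overlap(truth: str, answer: str) -> float:
    """Fraction of the truth's citations that also appear in the answer (assumed metric)."""
    truth_citations = extract_citations(truth)
    if not truth_citations:
        return 0.0
    return len(truth_citations & extract_citations(answer)) / len(truth_citations)


# Last record from the ground-truth file above, parsed as one JSONL line.
line = '{"question": "What was the approximate GDP growth rate in 2021?", "truth": "The approximate GDP growth rate in 2021 was about 4.5% according to the line graph showing trends from 2018 to 2023 [Financial Market Analysis Report 2023.pdf#page=8(figure8_1.png)]."}'
record = json.loads(line)
truth_citations = extract_citations(record["truth"])

# Hypothetical model answer that cites the same figure.
example_answer = "About 4.5% [Financial Market Analysis Report 2023.pdf#page=8(figure8_1.png)]."
overlap = citation_overlap(record["truth"], example_answer)
```

Under this reading, a page citation and a figure citation on the same page count as different citations, so an answer citing only `#page=5` would match one of the three citations in the Ethereum record.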
@@ -0,0 +1,33 @@
{
    "testdata_path": "ground_truth.jsonl",
    "results_dir": "results/baseline-ask",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
    "target_url": "http://localhost:50505/ask",
    "target_parameters": {
        "overrides": {
            "top": 3,
            "max_subqueries": 10,
            "results_merge_strategy": "interleaved",
            "temperature": 0.3,
            "minimum_reranker_score": 0,
            "minimum_search_score": 0,
            "retrieval_mode": "hybrid",
            "semantic_ranker": true,
            "semantic_captions": false,
            "query_rewriting": false,
            "reasoning_effort": "minimal",
            "suggest_followup_questions": false,
            "use_oid_security_filter": false,
            "use_groups_security_filter": false,
            "search_text_embeddings": true,
            "search_image_embeddings": true,
            "send_text_sources": true,
            "send_image_sources": true,
            "language": "en",
            "use_agentic_retrieval": false,
            "seed": 1
        }
    },
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text"
}
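The two evaluation configs in this diff are nearly identical, so it can help to see exactly which top-level settings differ. The sketch below compares excerpts of the two files; `changed_keys` is a hypothetical helper, and the `overrides` block is left out of both excerpts because it reads the same in both hunks.

```python
from typing import Any


def changed_keys(a: dict[str, Any], b: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each key whose value differs to its (a, b) pair; a missing key shows as None."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Top-level settings excerpted from the multimodal config in this diff.
multimodal = {
    "testdata_path": "ground_truth_multimodal.jsonl",
    "results_dir": "results_multimodal/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
    "target_url": "http://localhost:50505/chat",
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text",
}

# Top-level settings excerpted from the "ask" baseline config in this diff.
ask_baseline = {
    "testdata_path": "ground_truth.jsonl",
    "results_dir": "results/baseline-ask",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
    "target_url": "http://localhost:50505/ask",
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text",
}

differences = changed_keys(multimodal, ask_baseline)
extra_metrics = set(ask_baseline["requested_metrics"]) - set(multimodal["requested_metrics"])
```

Here the differing keys are the test data, the results directory, the target endpoint, and the metric list, where the "ask" baseline additionally requests `gpt_groundedness`.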
gpt-5 can do negations right
(Discussed offline as well.) There's been some talk of avoiding negations when prompting LLMs, but I haven't seen any strong evidence that this is a big issue. I also checked the evaluation results for the "ask" approach and didn't see any follow-up questions in them.