Add notebook: Evaluating AI search engines with the judges library #257
Conversation
"judges is an open-source library to use and create LLM-as-a-Judge evaluators. It provides a set of curated, research-backed..."
"...a collection of real-world Google queries...as our benchmark for comparing..."
"...which only includes human evaluated answers and their corresponding queries for correctness, clarity, and completeness."
"...or from Google Colab secrets, in which case, uncomment the relevant code examples below."
Maybe clarify MTBenchChatBotResponseQuality is also a "grader" type of judge (not really clear right now). It can say something like "Response Quality Evaluation Grader"
Hey Stephen! Thanks for the prompt feedback. I have incorporated your comments and added the notebook to
stevhliu left a comment:
Cool, thanks! Once @merveenoyan has had a chance to review, we can merge :)
- [Multi-agent RAG System 🤖🤝🤖](multiagent_rag_system)
- [Multimodal RAG with ColQwen2, Reranker, and Quantized VLMs on Consumer GPUs](multimodal_rag_using_document_retrieval_and_reranker_and_vlms)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
Sorry I wasn't clear, we should keep the most recent ones (towards the bottom) and remove the one on top (Multimodal Retrieval-Augmented Generation (RAG) with Document Retrieval (ColPali) and Vision Language Models (VLMs))
Modified it. Thanks for pointing that out.
@merveenoyan Happy New Year! Have you had the chance to take a look?
We use the Natural Questions dataset -- a collection of real-world Google queries and corresponding Wikipedia articles -- as our benchmark for comparing the quality of different AI search engines, as follows:
this sentence is a bit too long and hard to follow, can we simplify it?
nit: open-source* (there's an s at the end)
How about this?
We use the Natural Questions dataset as our benchmark for comparing the quality of different AI search engines. Natural Questions is a collection of real-world Google queries and corresponding Wikipedia articles. We'll walk through the following process:
Can you explain why you picked this instead of local serving? We often do local serving with open-source models in the open-source cookbook.
We picked this instead of local serving because it's a bit more lightweight and users don't need to have a machine available with local serving set up to get started. We'd love to add support for that in judges though in the future!
Added a sentence to explain.
merveenoyan left a comment:
Left very minor nits, we can merge afterwards!
Sorry for the delay, I was off!
@merveenoyan 👋🏼 thanks for reviewing! I'm going to shepherd this PR the rest of the way from our team. Will respond to your comments above and make updates + open a new PR if that's ok.
Description
This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.
What is judges?
judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across various dimensions such as correctness, quality, and harmfulness. It supports both classifier judges, which return boolean judgments, and grader judges, which return scores.
The library also provides an integration with litellm, allowing access to most open- and closed-source models and providers.
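To make the classifier-style evaluation concrete, here is a minimal, self-contained sketch of the LLM-as-a-Judge pattern. The class and field names here are illustrative rather than the judges library's actual API, and `StubJudge` uses a keyword heuristic where a real judge would prompt an LLM (e.g. through litellm, with a model string like `"ollama/llama3"` for a local model or a hosted provider's model name).

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    score: object   # bool for classifier judges, a number for grader judges
    reasoning: str

class StubJudge:
    """Hypothetical judge: a real implementation would send the prompt to an
    LLM instead of using the simple substring heuristic below."""
    def __init__(self, model: str):
        self.model = model  # litellm-style model identifier (assumption)

    def judge(self, input: str, output: str, expected: str) -> Judgment:
        # Classifier-style judgment: does the output contain the expected answer?
        correct = expected.lower() in output.lower()
        reason = "expected answer found" if correct else "expected answer missing"
        return Judgment(score=correct, reasoning=reason)

judge = StubJudge(model="ollama/llama3")
verdict = judge.judge(
    input="Who wrote The Hobbit?",
    output="The Hobbit was written by J.R.R. Tolkien.",
    expected="J.R.R. Tolkien",
)
print(verdict.score)  # True
```

The same call shape (query in, answer out, a structured judgment back) is what lets one judge be reused across different search engines' outputs.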
What This Notebook Does
Open-Source Tools & Resources
Why This Notebook?
This notebook provides a practical example of using judges with an open-source model (LLaMA 3) to evaluate real-world AI outputs. It highlights the library's flexibility, ease of integration with litellm, and usefulness for benchmarking AI systems in a transparent, reproducible manner.
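As a sketch of the benchmarking step, the scores a grader judge returns per query can be aggregated into a simple per-engine ranking. The engines below match the ones discussed here, but the score values are made up for illustration; they are not results from the notebook.

```python
from statistics import mean

# Hypothetical grader scores (one per benchmark query) for each search engine.
scores = {
    "Gemini": [8, 7, 9],
    "Perplexity": [9, 8, 8],
    "Exa": [7, 7, 8],
}

# Rank engines by mean judge score -- the simplest aggregation for a benchmark.
ranking = sorted(scores, key=lambda engine: mean(scores[engine]), reverse=True)
for engine in ranking:
    print(f"{engine}: {mean(scores[engine]):.2f}")
```

In practice one would also report per-dimension averages (correctness, clarity, completeness) rather than a single mean, since engines can trade these off differently.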