Add notebook for "Evaluating AI Search Engines with the judges Library" (#270)
Conversation
I wish you hadn't made this into a new PR; it's harder to track our comments and follow-up changes. Can you re-open your former PR and commit the files there instead so we can see the changes clearly?
👋 @merveenoyan sorry about that! All of the commits from that PR are the same in this one except the most recent one. James won't be able to finish up that PR for us, so I needed to make a new one to ensure it gets the attention it needs — please let me know how else I can help make this smoother. I'm happy to copy over the comments from the previous PR as well if that helps! Otherwise, I think the only other option would be to open a PR *on top* of the other one, but you would need to merge as a repo owner since the PR was made by James and not me.
merveenoyan
left a comment
I just left some nits, otherwise looks good! @stevhliu should review too
It may be easier to consume this content in table form:
| Judge | What | Why | Source | When to use |
|---|---|---|---|---|
| PollMultihopCorrectness | | | | |
| PrometheusAbsoluteCoarseCorrectness | | | | |
| MTBenchChatBotResponseQuality | | | | |
stevhliu
left a comment
Thanks, just a few more comments and then we can merge! 🤗
notebooks/en/index.md
Outdated
- [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library)
I'd put this notebook at the top of the list since it's the most recent, and then remove "Fine-tuning SmolVLM with TRL on a consumer GPU" to keep the list tidy
Think this should be cleaned up now!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Description
This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.
This PR is a continuation of #257 -- shepherding the PR across!
What is judges?
judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across various dimensions such as correctness, quality, and harmfulness. It supports both classifier judges, which return categorical verdicts (e.g. correct/incorrect), and grader judges, which return scored evaluations.
The library also provides an integration with litellm, allowing access to most open- and closed-source models and providers.
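As a rough sketch of that pattern (class and method names follow the `judges` README; the model string is any litellm identifier, and the Ollama-served Llama 3 shown here is just an assumed example):

```python
# Sketch of the judges + litellm pattern. The call is wrapped in a
# function because executing it requires the `judges` package and model
# credentials; names are taken from the library's README and should be
# checked against the installed version.

def judge_answer(question: str, answer: str, expected: str):
    from judges.classifiers import PollMultihopCorrectness

    # Any litellm model identifier works here, e.g. "gpt-4o-mini"
    # or a locally served open model such as "ollama/llama3".
    judge = PollMultihopCorrectness(model="ollama/llama3")
    judgment = judge.judge(input=question, output=answer, expected=expected)
    return judgment.score, judgment.reasoning
```

Swapping providers is then just a matter of changing the `model` string, which is what makes the litellm integration convenient for comparing open- and closed-source judges.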
What This Notebook Does
Open-Source Tools & Resources
Why This Notebook?
This notebook provides a practical example of using judges with an open-source model (LLaMA 3) to evaluate real-world AI outputs. It highlights the library's flexibility, ease of integration with litellm, and usefulness for benchmarking AI systems in a transparent, reproducible manner.
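To make the comparison step concrete, here is a minimal, self-contained sketch of how per-engine correctness scores might be aggregated. The `Judgment` dataclass and `accuracy` helper are hypothetical stand-ins for the library's judgment objects, and the verdicts are mock data so the example runs offline:

```python
# Mock aggregation of LLM-as-a-judge verdicts across search engines.
# In the real notebook these judgments would come from a `judges`
# evaluator; here we hand-craft them to illustrate the scoring step.

from dataclasses import dataclass

@dataclass
class Judgment:
    score: bool      # True if the judge deemed the answer correct
    reasoning: str   # the judge's explanation

def accuracy(judgments):
    """Fraction of answers the judge marked correct."""
    return sum(j.score for j in judgments) / len(judgments)

# One mock verdict per evaluated query, for three engines:
results = {
    "gemini":     [Judgment(True, "matches"), Judgment(False, "off-topic")],
    "perplexity": [Judgment(True, "matches"), Judgment(True, "matches")],
    "exa":        [Judgment(False, "wrong"),  Judgment(True, "matches")],
}

scores = {engine: accuracy(js) for engine, js in results.items()}
print(scores)  # {'gemini': 0.5, 'perplexity': 1.0, 'exa': 0.5}
```

The same loop scales to any number of queries and engines, which is what makes the benchmark reproducible: only the judge model and the query set need to be fixed.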