Commit 0da6fd5

Update README.md
1 parent f7b3201 commit 0da6fd5

1 file changed: 1 addition, 3 deletions

README.md

Lines changed: 1 addition & 3 deletions
@@ -35,9 +35,7 @@ To ensure reliability, the JudgeIt framework has been rigorously tested against
 For RAG evaluation, this process involved building a dataset of thousands of real-life Q&A pairs in an enterprise setting, then collecting golden answers, RAG answers, and human evaluations of the similarity between the RAG and golden answers. Using Meta’s Llama-3-70b as an LLM Judge, JudgeIt consistently achieved F1 scores above 90% across different RAG pipeline evaluations, compared against human evaluations on 20+ enterprise Q&A tasks.
 
 #### Query-Rewrite Evaluation Reliability Metrics (2 Turn)
-
-![Uploading Screenshot 2024-10-11 at 12.51.04 AM.png…]()
-
+<img width="709" alt="Screenshot 2024-10-11 at 12 51 04 AM" src="https://github.com/user-attachments/assets/67d5dff9-82e5-45eb-979a-54079511032c">
 
 
 For multi-turn evaluation, this process involved building a dataset of user queries, conversation memory (the previous question and previous answer), golden rewritten queries, generated rewritten queries, and human evaluations of the similarity between the generated rewritten queries and the golden rewritten queries. Using Meta’s Llama-3-70b as an LLM Judge, JudgeIt achieved near-100% precision.
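
For context on the agreement metrics quoted in the diff above, here is a minimal sketch of how an LLM judge's binary similarity verdicts can be scored against human evaluations with F1 and precision. This is not part of the JudgeIt codebase; the label values are made up for illustration.

```python
# Hypothetical sketch: scoring an LLM judge's binary similarity verdicts
# against human evaluations, treating the human labels as ground truth.
# Not part of the JudgeIt codebase; the data below is illustrative only.
from sklearn.metrics import f1_score, precision_score

# 1 = "generated answer/query matches the golden one", 0 = "does not match"
human_labels = [1, 0, 1, 1, 0, 1, 0, 1]  # ground truth from human evaluators
judge_labels = [1, 0, 1, 0, 0, 1, 0, 1]  # verdicts from the LLM judge (e.g., Llama-3-70b)

print(f"F1 vs. human:        {f1_score(human_labels, judge_labels):.2f}")
print(f"Precision vs. human: {precision_score(human_labels, judge_labels):.2f}")
```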
