
Commit 14482a5

Adding Custom LLM as a Judge (#291)
* Updating Evaluator Notebooks
* Signing off
* Added Llama 3.3 Nemotron Super 49B examples
* Custom LLM as a Judge Notebook

Signed-off-by: Chris Alexiuk <[email protected]>
Co-authored-by: Chris Alexiuk <[email protected]>
1 parent d6ef188 commit 14482a5

File tree

4 files changed, +1105 -0 lines changed

Lines changed: 67 additions & 0 deletions

# Custom LLM-as-a-Judge Implementation

This repository demonstrates how to use Custom LLM-as-a-Judge through the NeMo Evaluator Microservice to evaluate LLM outputs. The example evaluates medical consultation summaries, using the Llama 3.1 70B model for generation and OpenAI's GPT-4.1 as the judge.

## Overview

The implementation evaluates medical consultation summaries on two key metrics:

- **Completeness**: How well the summary captures all critical information (rated 1-5)
- **Correctness**: How accurate the summary is, without false information (rated 1-5)

## Prerequisites

- NeMo Microservices setup including:
  - NeMo Evaluator
  - NeMo Data Store
  - NeMo Entity Store
- API keys (used as shown in the sketch below) for:
  - OpenAI (for the judge LLM)
  - NVIDIA build.nvidia.com (for the target model)
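
As a rough sketch of how these credentials might be wired up in code (the environment-variable names and the use of build.nvidia.com's OpenAI-compatible endpoint here are illustrative assumptions, not something this repo prescribes):

```python
import os

from openai import OpenAI

# Judge LLM: OpenAI's GPT-4.1 (env var name is an assumption)
judge_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Target model access: build.nvidia.com exposes an OpenAI-compatible
# endpoint; the URL and env var name are assumptions for this sketch
target_client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)
```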

## Project Structure

The project uses a JSONL file containing synthetic medical consultation data with the following structure:

```json
{
  "ID": "C012",
  "content": "Date: 2025-04-12\nChief Complaint (CC): ...",
  "summary": "New Clinical Problem: ..."
}
```
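
If you want to peek at the data locally, the `datasets` package listed under Dependencies can read the file directly (the file name below is a placeholder):

```python
from datasets import load_dataset

# File name is a placeholder; point this at the repo's JSONL file.
ds = load_dataset("json", data_files="medical_consultations.jsonl", split="train")

print(ds.column_names)       # expected: ['ID', 'content', 'summary']
print(ds[0]["summary"][:80]) # first few characters of one summary
```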

## Key Components

1. **Judge LLM Configuration**
   - Uses GPT-4.1 as the judge
   - Custom prompt templates for evaluating completeness and correctness
   - Regex-based score extraction (see the sketch after this list)

2. **Target Model Configuration**
   - Uses Llama 3.1 70B for generating summaries
   - Configured through NVIDIA's build.nvidia.com API

3. **Evaluation Process**
   - Generates summaries using the target model
   - Judges the summaries using the judge LLM
   - Aggregates scores for both metrics
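
The notebook drives this loop through the Evaluator service itself; purely as a standalone sketch of the judging pattern described in item 1 (the prompt wording, model ID, and regex below are illustrative assumptions, not the service's actual configuration), a single completeness judgment could look like:

```python
import re

from openai import OpenAI

judge_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt template; the notebook defines its own wording.
COMPLETENESS_PROMPT = """You are grading a medical consultation summary.

Consultation:
{content}

Summary:
{summary}

Rate how completely the summary captures all critical information,
on a scale of 1-5. Answer in the form: SCORE: <number>"""


def judge_completeness(content: str, summary: str) -> int:
    """Ask the judge LLM for a 1-5 completeness score and parse it with a regex."""
    response = judge_client.chat.completions.create(
        model="gpt-4.1",  # judge model; exact model ID may differ
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": COMPLETENESS_PROMPT.format(content=content, summary=summary),
        }],
    )
    match = re.search(r"SCORE:\s*([1-5])", response.choices[0].message.content)
    if match is None:
        raise ValueError("Judge reply did not contain a parsable score")
    return int(match.group(1))
```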

## Results

The evaluation provides scores on a scale of 1-5 for both completeness and correctness, with detailed statistics including:

- Mean scores
- Total count of evaluations
- Sum of scores
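
As a trivial illustration of what that aggregation amounts to (the scores below are made-up examples, not real results):

```python
completeness_scores = [4, 5, 3, 5, 4]  # example judge outputs

stats = {
    "count": len(completeness_scores),
    "sum": sum(completeness_scores),
    "mean": sum(completeness_scores) / len(completeness_scores),
}
print(stats)  # {'count': 5, 'sum': 21, 'mean': 4.2}
```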

## Dependencies

See `pyproject.toml` for a complete list of dependencies. Key requirements include:

- `datasets>=3.5.0`
- `huggingface-hub>=0.30.2`
- `openai>=1.76.0`
- `transformers>=4.36.0`

You can run `uv sync` to produce the required `.venv`!

## Documentation

For more detailed information about Custom LLM-as-a-Judge evaluation, refer to the [official NeMo documentation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-custom.html#evaluation-with-llm-as-a-judge).
