# Custom LLM-as-a-Judge Implementation

This repository demonstrates how to use Custom LLM-as-a-Judge evaluation through the NeMo Evaluator microservice to score LLM outputs. The example evaluates medical consultation summaries, using the Llama 3.1 70B model for generation and OpenAI's GPT-4.1 as the judge.

## Overview

The implementation evaluates medical consultation summaries on two key metrics:
- **Completeness**: How well the summary captures all critical information (rated 1-5)
- **Correctness**: Whether the summary is accurate and free of fabricated information (rated 1-5)
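
The judge scores each metric from a prompt template that embeds the consultation note and the candidate summary. The sketch below is illustrative only: the template text, the `SCORE:` answer format, and the `build_prompt` helper are assumptions for this example, not the exact template shipped in the evaluation config.

```python
# Illustrative completeness rubric for the judge LLM. The wording and the
# "SCORE: <n>" answer format are hypothetical; the real template lives in
# the evaluation configuration.
COMPLETENESS_PROMPT = """You are an expert medical reviewer.

Consultation note:
{content}

Candidate summary:
{summary}

Rate how completely the summary captures all critical information from
the note, on a scale of 1 (missing most of it) to 5 (fully complete).
Respond with a single line of the form: SCORE: <1-5>
"""

def build_prompt(record: dict) -> str:
    """Fill the template with one record from the JSONL dataset."""
    return COMPLETENESS_PROMPT.format(content=record["content"],
                                      summary=record["summary"])
```

Asking for a fixed `SCORE: <n>` line is what makes the regex-based score extraction (described under Key Components) reliable.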

## Prerequisites

- NeMo Microservices setup including:
  - NeMo Evaluator
  - NeMo Data Store
  - NeMo Entity Store
- API keys for:
  - OpenAI (for the judge LLM)
  - NVIDIA build.nvidia.com (for the target model)

## Project Structure

The project uses a JSONL file containing synthetic medical consultation data with the following structure:
```json
{
  "ID": "C012",
  "content": "Date: 2025-04-12\nChief Complaint (CC): ...",
  "summary": "New Clinical Problem: ..."
}
```
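
JSONL stores one JSON object per line, so the dataset can be read with nothing but the standard library. A minimal loader, assuming the three fields shown above:

```python
import json

def load_records(path: str) -> list[dict]:
    """Read a JSONL file: one JSON object per line, blank lines skipped."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records
```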

## Key Components

1. **Judge LLM Configuration**
   - Uses GPT-4.1 as the judge
   - Custom prompt templates for evaluating completeness and correctness
   - Regex-based score extraction

2. **Target Model Configuration**
   - Uses Llama 3.1 70B for generating summaries
   - Configured through NVIDIA's build.nvidia.com API

3. **Evaluation Process**
   - Generates summaries using the target model
   - Judges the summaries using the judge LLM
   - Aggregates scores for both metrics

## Results

The evaluation provides scores on a scale of 1-5 for both completeness and correctness, with detailed statistics including:
- Mean scores
- Total count of evaluations
- Sum of scores
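
The three statistics above can be computed from the list of extracted scores in a few lines; this helper is a sketch of that aggregation, not part of the repository:

```python
def aggregate(scores: list[int]) -> dict:
    """Summarize per-sample judge scores: count, sum, and mean."""
    return {
        "count": len(scores),
        "sum": sum(scores),
        "mean": sum(scores) / len(scores) if scores else 0.0,
    }
```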

## Dependencies

See `pyproject.toml` for the complete list of dependencies. Key requirements include:
- datasets>=3.5.0
- huggingface-hub>=0.30.2
- openai>=1.76.0
- transformers>=4.36.0

Run `uv sync` to create the required `.venv`.

## Documentation

For more detailed information about Custom LLM-as-a-Judge evaluation, refer to the [official NeMo documentation](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-custom.html#evaluation-with-llm-as-a-judge).