Commit d6da327: Add evaluation advisor demo (#82)

Signed-off-by: Christian Tzolov <[email protected]>
1 parent 57add81

File tree

11 files changed: +1197 -0 lines changed
Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
/mvnw text eol=lf
*.cmd text eol=crlf
Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
HELP.md
target/
.mvn/wrapper/maven-wrapper.jar
!**/src/main/**/target/
!**/src/test/**/target/

### STS ###
.apt_generated
.classpath
.factorypath
.project
.settings
.springBeans
.sts4-cache

### IntelliJ IDEA ###
.idea
*.iws
*.iml
*.ipr

### NetBeans ###
/nbproject/private/
/nbbuild/
/dist/
/nbdist/
/.nb-gradle/
build/
!**/src/main/**/build/
!**/src/test/**/build/

### VS Code ###
.vscode/
Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
wrapperVersion=3.3.4
distributionType=only-script
distributionUrl=https://repo.maven.apache.org/maven2/org/apache/maven/apache-maven/3.9.11/apache-maven-3.9.11-bin.zip
Lines changed: 197 additions & 0 deletions

@@ -0,0 +1,197 @@
# Spring AI LLM-as-a-Judge with Recursive Advisors Demo

This project demonstrates how to implement **LLM-as-a-Judge** evaluation patterns using Spring AI's **Recursive Advisors**. It showcases automated quality assessment and self-refinement capabilities that enable AI systems to evaluate and improve their own responses iteratively.

## Overview

The demo implements a `SelfRefineEvaluationAdvisor` that:
- Evaluates AI responses using a dedicated judge model
- Uses a 4-point scoring system (1 = terrible, 4 = excellent)
- Automatically retries failed responses with constructive feedback
- Demonstrates bias mitigation by using separate models for generation and evaluation
- Shows how recursive patterns enable self-improving AI systems
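The generate-judge-retry loop at the heart of this advisor can be sketched in plain Java, independent of Spring AI. The `refine` method, `Verdict` record, and toy generator/judge below are illustrative stand-ins, not the project's actual API:

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Library-free sketch of the self-refine loop: generate, judge, retry with feedback.
public class SelfRefineLoop {

    // Hypothetical judge verdict: a rating on the 1-4 scale plus textual feedback.
    public record Verdict(int rating, String feedback) {}

    public static String refine(
            BiFunction<String, String, String> generate, // (question, feedback) -> answer
            Function<String, Verdict> judge,             // answer -> verdict
            String question, int successRating, int maxAttempts) {

        String feedback = "";
        String answer = "";
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            answer = generate.apply(question, feedback);
            Verdict verdict = judge.apply(answer);
            if (verdict.rating() >= successRating) {
                return answer; // good enough: stop retrying
            }
            feedback = verdict.feedback(); // feed the critique into the next attempt
        }
        return answer; // best effort after max attempts
    }

    public static void main(String[] args) {
        // Toy models: the generator only produces a good answer once it has seen feedback.
        String result = refine(
            (q, fb) -> fb.isEmpty() ? "-255°C in Paris" : "15°C and sunny in Paris",
            a -> a.contains("-255") ? new Verdict(1, "Temperature is physically impossible")
                                    : new Verdict(4, "Realistic weather data"),
            "What is the current weather in Paris?", 4, 5);
        System.out.println(result); // prints "15°C and sunny in Paris"
    }
}
```

In the real advisor the generator and judge are backed by separate chat models; here they are lambdas so the control flow is visible in isolation.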
## LLM-as-a-Judge Pattern

LLM-as-a-Judge is an evaluation technique in which Large Language Models assess the quality of outputs generated by other models. This approach:

- **Scales evaluation** without human intervention
- **Aligns with human judgment**, with reported agreement rates of up to roughly 85%
- **Provides structured feedback** for iterative improvement
- **Enables automated quality control** in production systems

### Evaluation Criteria

The advisor evaluates responses based on:
- **Relevance** - Direct addressing of the user's question
- **Completeness** - Coverage of all aspects in the question
- **Accuracy** - Correctness of information provided
- **Clarity** - Readability and coherence of the response
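One way to turn criteria like these into a judge prompt is to embed them in a fixed rubric template. The template below is a hypothetical illustration of that idea, not the advisor's actual prompt:

```java
// Hypothetical rubric builder: embeds the four criteria and the 1-4 scale
// into a judge prompt. Not the advisor's real template.
public class JudgePrompt {

    static final String TEMPLATE = """
            You are an impartial judge. Rate the ASSISTANT ANSWER to the USER QUESTION
            on a 1-4 scale (1=terrible, 4=excellent) against these criteria:
            - Relevance: does it directly address the question?
            - Completeness: does it cover all aspects of the question?
            - Accuracy: is the information correct?
            - Clarity: is it readable and coherent?
            Respond with RATING, EVALUATION and FEEDBACK lines.

            USER QUESTION: %s
            ASSISTANT ANSWER: %s
            """;

    public static String render(String question, String answer) {
        return TEMPLATE.formatted(question, answer);
    }
}
```

Asking the judge for a fixed RATING/EVALUATION/FEEDBACK shape is what makes the verdict parseable for the retry loop.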
## Project Structure

```
src/main/java/com/example/advisor/
├── EvaluationAdvisorDemoApplication.java   # Main application with configuration
├── SelfRefineEvaluationAdvisor.java        # Recursive advisor implementation
└── spring-ai-llm-as-judge-blog-post.md     # Detailed technical article
```

## Key Components

### SelfRefineEvaluationAdvisor

The core recursive advisor that implements the evaluation loop:

```java
SelfRefineEvaluationAdvisor.builder()
    .chatClientBuilder(ChatClient.builder(judgeModel)) // Separate judge model
    .maxRepeatAttempts(15)                             // Maximum retry attempts
    .successRating(4)                                  // Minimum acceptable rating
    .order(0)                                          // High priority in chain
    .build()
```

**Features:**
- **Recursive evaluation**: Uses `callAdvisorChain.copy(this).nextCall()` for iterative improvement
- **Structured feedback**: Returns `EvaluationResponse(rating, evaluation, feedback)`
- **Smart skip logic**: Avoids evaluating tool calls and non-textual responses
- **Bias mitigation**: Uses a separate ChatClient instance for evaluation
- **Configurable thresholds**: Customizable success ratings and retry limits
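The `copy(this).nextCall()` idiom re-enters the advisor chain starting from the current advisor, so every retry runs the whole downstream pipeline again. A library-free sketch of that recursion, using simplified stand-in interfaces (the real Spring AI advisor types differ):

```java
import java.util.List;

// Simplified stand-ins for an advisor chain; the real Spring AI interfaces differ.
public class RecursiveChainSketch {

    interface Advisor { String call(String request, Chain chain); }

    // A chain tracks a position in the advisor list; copy(from) restarts it at a given advisor.
    record Chain(List<Advisor> advisors, int index) {
        String nextCall(String request) {
            return advisors.get(index).call(request, new Chain(advisors, index + 1));
        }
        Chain copy(Advisor from) {
            return new Chain(advisors, advisors.indexOf(from));
        }
    }

    // Evaluating advisor: re-invokes the chain from itself until the response passes.
    static class EvaluatingAdvisor implements Advisor {
        int attempts = 0;
        public String call(String request, Chain chain) {
            attempts++;
            String response = chain.nextCall(request);
            if (!response.contains("ok") && attempts < 3) {
                return chain.copy(this).nextCall(request + " (retry)"); // recursive re-entry
            }
            return response;
        }
    }

    // Terminal "model" advisor: fails on the first call, succeeds on the second.
    static class ModelAdvisor implements Advisor {
        int calls = 0;
        public String call(String request, Chain chain) {
            calls++;
            return calls < 2 ? "bad response" : "ok response";
        }
    }

    public static void main(String[] args) {
        Chain chain = new Chain(List.of(new EvaluatingAdvisor(), new ModelAdvisor()), 0);
        System.out.println(chain.nextCall("question")); // prints "ok response"
    }
}
```

The key point is that the retry is not a private loop around the model call: it replays the chain from the evaluator down, so logging, tool execution, and any other downstream advisors all run again.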
### Demo Application

The application demonstrates a real-world scenario:
- **Primary Model**: Anthropic Claude for response generation
- **Judge Model**: Ollama (local model) for evaluation. The [Judge Arena](https://huggingface.co/blog/arena-atla) compares LLM judge models.
- **Tool Integration**: Weather tool with intentionally variable responses
- **Logging**: Complete request/response observability

## Prerequisites

- Java 17+
- Maven 3.6+
- API access to Anthropic Claude
- Ollama running locally (for the judge model)
## Setup

### 1. Install Ollama

```bash
# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve

# Pull a suitable judge model
ollama pull avcodes/flowaicom-flow-judge:q4
```

### 2. Configure API Keys

Set your Anthropic API key as an environment variable:

```bash
export ANTHROPIC_API_KEY=your_anthropic_api_key
```

Or add to `src/main/resources/application.properties`:

```properties
spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}
spring.ai.ollama.model-name=avcodes/flowaicom-flow-judge:q4
spring.ai.chat.client.enabled=false
```

### 3. Run the Application

```bash
mvn spring-boot:run
```
## Expected Behavior

The application will:

1. **Generate Response**: Ask Claude about the weather in Paris
2. **Tool Execution**: Call the weather tool (returns random temperature)
3. **Evaluate Response**: Judge model scores the response (1-4 scale)
4. **Retry if Needed**: If rating < 4, retry with feedback
5. **Log Progress**: Show all attempts and evaluations
6. **Return Final**: Best response after evaluation passes or max attempts reached

### Sample Output

```
REQUEST: [{"role":"user","content":"What is current weather in Paris?"}]

>>> Tool Call responseTemp: -255
Evaluation failed on attempt 1, evaluation: The response contains unrealistic temperature data, feedback: The temperature of -255°C is physically impossible and indicates a data error.

>>> Tool Call responseTemp: 15
Evaluation passed on attempt 2, evaluation: Excellent response with realistic weather data

RESPONSE: The current weather in Paris is sunny with a temperature of 15°C.
```
## Configuration Options

### Advisor Configuration

```java
SelfRefineEvaluationAdvisor.builder()
    .successRating(3)                       // Minimum rating (1-4)
    .maxRepeatAttempts(5)                   // Maximum retries
    .order(0)                               // Execution order
    .skipEvaluationPredicate((request, response) ->
        response.chatResponse().hasToolCalls()) // Skip conditions
    .promptTemplate(customTemplate)         // Custom evaluation prompt
    .build()
```

### Model Selection

For optimal results:
- **Generation Model**: High-quality models (GPT-4, Claude, Gemini)
- **Judge Model**: Dedicated evaluation models from the [Judge Arena Leaderboard](https://huggingface.co/spaces/AtlaAI/judge-arena)
- **Bias Mitigation**: Always use different models for generation and evaluation
## Production Considerations

### Performance Optimization

- Set a reasonable `maxRepeatAttempts` (3-5) to balance quality and latency
- Use faster judge models for high-throughput scenarios
- Implement caching for repeated evaluations
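The caching point can be as simple as memoizing judge verdicts by the question/answer pair, so an identical response is never scored twice. A minimal sketch, with illustrative types that are not part of the demo's code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal evaluation cache: memoizes judge verdicts keyed by question+answer,
// so identical responses are never re-scored. Types here are illustrative.
public class EvaluationCache {

    public record Verdict(int rating, String feedback) {}

    private final Map<String, Verdict> cache = new ConcurrentHashMap<>();

    // Returns the cached verdict if present; otherwise invokes the judge once and stores it.
    public Verdict evaluate(String question, String answer, Function<String, Verdict> judge) {
        return cache.computeIfAbsent(question + "\u0000" + answer,
                key -> judge.apply(answer));
    }

    public int size() {
        return cache.size();
    }
}
```

`computeIfAbsent` on a `ConcurrentHashMap` also prevents concurrent callers from judging the same pair twice; in production you would likely add a size bound or TTL eviction.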
### Error Handling

- Configure appropriate fallback strategies
- Set up alerting for persistent evaluation failures
- Monitor token usage and API quotas

## Related Examples

- [Recursive Advisor Demo](../recursive-advisor-demo) - Basic recursive patterns
- [Spring AI Advisors Documentation](https://docs.spring.io/spring-ai/reference/api/advisors.html)
- [ChatClient API Guide](https://docs.spring.io/spring-ai/reference/api/chatclient.html)

## Learn More

📖 **Blog Post**: [Building LLM Evaluation Systems with Spring AI](./spring-ai-llm-as-judge-blog-post.md)

🔬 **Research Papers**:
- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
- [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
- [LLMs-as-Judges: A Comprehensive Survey](https://arxiv.org/abs/2412.05579)

🏆 **Judge Models**: [Judge Arena Leaderboard](https://huggingface.co/spaces/AtlaAI/judge-arena)

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](../../LICENSE) file for details.
