spring-projects
diff --git a/advisors/evaluation-recursive-advisor-demo/.gitattributes b/advisors/evaluation-recursive-advisor-demo/.gitattributes
diff --git a/advisors/evaluation-recursive-advisor-demo/.gitignore b/advisors/evaluation-recursive-advisor-demo/.gitignore
diff --git a/advisors/evaluation-recursive-advisor-demo/.mvn/wrapper/maven-wrapper.properties b/advisors/evaluation-recursive-advisor-demo/.mvn/wrapper/maven-wrapper.properties
diff --git a/advisors/evaluation-recursive-advisor-demo/README.md b/advisors/evaluation-recursive-advisor-demo/README.md
@@ -0,0 +1,2 @@
+/mvnw text eol=lf
+*.cmd text eol=crlf
@@ -0,0 +1,33 @@
+HELP.md
+target/
+.mvn/wrapper/maven-wrapper.jar
+!**/src/main/**/target/
+!**/src/test/**/target/
+
+### STS ###
+.apt_generated
+.classpath
+.factorypath
+.project
+.settings
+.springBeans
+.sts4-cache
+
+### IntelliJ IDEA ###
+.idea
+*.iws
+*.iml
+*.ipr
+
+### NetBeans ###
+/nbproject/private/
+/nbbuild/
+/dist/
+/nbdist/
+/.nb-gradle/
+build/
+!**/src/main/**/build/
+!**/src/test/**/build/
+
+### VS Code ###
+.vscode/
@@ -0,0 +1,3 @@
+wrapperVersion=3.3.4
+distributionType=only-script
+distributionUrl=https://repo.maven.apache.org/maven2/org/apache/maven/apache-maven/3.9.11/apache-maven-3.9.11-bin.zip
@@ -0,0 +1,197 @@
+# Spring AI LLM-as-a-Judge with Recursive Advisors Demo
+
+This project demonstrates how to implement **LLM-as-a-Judge** evaluation patterns using Spring AI's **Recursive Advisors**. It showcases automated quality assessment and self-refinement capabilities that enable AI systems to evaluate and improve their own responses iteratively.
+
+## Overview
+
+The demo implements a `SelfRefineEvaluationAdvisor` that:
+- Evaluates AI responses using a dedicated judge model
+- Uses a 4-point scoring system (1=terrible, 4=excellent)
+- Automatically retries failed responses with constructive feedback
+- Demonstrates bias mitigation by using separate models for generation and evaluation
+- Shows how recursive patterns enable self-improving AI systems
+
+## LLM-as-a-Judge Pattern
+
+LLM-as-a-Judge is an evaluation technique where Large Language Models assess the quality of outputs generated by other models. This approach:
+
+- **Scales evaluation** without human intervention
+- **Aligns with human judgment**, with reported agreement rates of up to 85%
+- **Provides structured feedback** for iterative improvement
+- **Enables automated quality control** in production systems
+
+### Evaluation Criteria
+
+The advisor evaluates responses based on:
+- **Relevance** - Direct addressing of the user's question
+- **Completeness** - Coverage of all aspects in the question
+- **Accuracy** - Correctness of information provided
+- **Clarity** - Readability and coherence of the response
+
+## Project Structure
+
+```
+src/main/java/com/example/advisor/
+├── EvaluationAdvisorDemoApplication.java    # Main application with configuration
+├── SelfRefineEvaluationAdvisor.java         # Recursive advisor implementation
+└── spring-ai-llm-as-judge-blog-post.md     # Detailed technical article
+```
+
+## Key Components
+
+### SelfRefineEvaluationAdvisor
+
+The core recursive advisor that implements the evaluation loop:
+
+```java
+SelfRefineEvaluationAdvisor.builder()
+    .chatClientBuilder(ChatClient.builder(judgeModel))  // Separate judge model
+    .maxRepeatAttempts(15)                              // Maximum retry attempts
+    .successRating(4)                                   // Minimum acceptable rating
+    .order(0)                                          // High priority in chain
+    .build()
+```
+
+**Features:**
+- **Recursive evaluation**: Uses `callAdvisorChain.copy(this).nextCall()` for iterative improvement
+- **Structured feedback**: Returns `EvaluationResponse(rating, evaluation, feedback)`
+- **Smart skip logic**: Avoids evaluating tool calls and non-textual responses
+- **Bias mitigation**: Uses separate ChatClient instance for evaluation
+- **Configurable thresholds**: Customizable success ratings and retry limits
+
+### Demo Application
+
+The application demonstrates a real-world scenario:
+- **Primary Model**: Anthropic Claude for response generation
+- **Judge Model**: A local model served by Ollama for evaluation. The [Judge Arena](https://huggingface.co/blog/arena-atla) compares LLM judge models.
+- **Tool Integration**: Weather tool with intentionally variable responses
+- **Logging**: Complete request/response observability
+
+## Prerequisites
+
+- Java 17+
+- Maven 3.6+
+- API access to Anthropic Claude
+- Ollama running locally (for the judge model)
+
+## Setup
+
+### 1. Install Ollama
+
+```bash
+# Install Ollama (macOS)
+brew install ollama
+
+# Start Ollama service
+ollama serve
+
+# Pull a suitable judge model
+ollama pull avcodes/flowaicom-flow-judge:q4
+```
+
+### 2. Configure API Keys
+
+Set your Anthropic API key as an environment variable:
+
+```bash
+export ANTHROPIC_API_KEY=your_anthropic_api_key
+```
+
+Then reference it, together with the judge model settings, in `src/main/resources/application.properties`:
+
+```properties
+spring.ai.anthropic.api-key=${ANTHROPIC_API_KEY}
+
+spring.ai.ollama.model-name=avcodes/flowaicom-flow-judge:q4
+
+spring.ai.chat.client.enabled=false
+
+```
+
+### 3. Run the Application
+
+```bash
+mvn spring-boot:run
+```
+
+## Expected Behavior
+
+The application will:
+
+1. **Generate Response**: Ask Claude about the weather in Paris
+2. **Tool Execution**: Call the weather tool (returns random temperature)
+3. **Evaluate Response**: Judge model scores the response (1-4 scale)
+4. **Retry if Needed**: If rating < 4, retry with feedback
+5. **Log Progress**: Show all attempts and evaluations
+6. **Return Final**: Best response after evaluation passes or max attempts reached
+
+### Sample Output
+
+```
+REQUEST: [{"role":"user","content":"What is current weather in Paris?"}]
+
+>>> Tool Call responseTemp: -255
+Evaluation failed on attempt 1, evaluation: The response contains unrealistic temperature data, feedback: The temperature of -255°C is physically impossible and indicates a data error.
+ 
+>>> Tool Call responseTemp: 15  
+Evaluation passed on attempt 2, evaluation: Excellent response with realistic weather data
+
+RESPONSE: The current weather in Paris is sunny with a temperature of 15°C.
+```
+
+## Configuration Options
+
+### Advisor Configuration
+
+```java
+SelfRefineEvaluationAdvisor.builder()
+    .successRating(3)           // Minimum rating (1-4)
+    .maxRepeatAttempts(5)       // Maximum retries
+    .order(0)                   // Execution order
+    .skipEvaluationPredicate((request, response) -> 
+        response.chatResponse().hasToolCalls())  // Skip conditions
+    .promptTemplate(customTemplate)  // Custom evaluation prompt
+    .build()
+```
+
+### Model Selection
+
+For optimal results:
+- **Generation Model**: High-quality models (GPT-4, Claude, Gemini)
+- **Judge Model**: Dedicated evaluation models from [Judge Arena Leaderboard](https://huggingface.co/spaces/AtlaAI/judge-arena)
+- **Bias Mitigation**: Always use different models for generation and evaluation
+
+
+## Production Considerations
+
+### Performance Optimization
+- Set reasonable `maxRepeatAttempts` (3-5) to balance quality and latency
+- Use faster judge models for high-throughput scenarios
+- Implement caching for repeated evaluations
+
+### Error Handling
+- Configure appropriate fallback strategies
+- Set up alerting for persistent evaluation failures
+- Monitor token usage and API quotas
+
+
+## Related Examples
+
+- [Recursive Advisor Demo](../recursive-advisor-demo) - Basic recursive patterns
+- [Spring AI Advisors Documentation](https://docs.spring.io/spring-ai/reference/api/advisors.html)
+- [ChatClient API Guide](https://docs.spring.io/spring-ai/reference/api/chatclient.html)
+
+## Learn More
+
+📖 **Blog Post**: [Building LLM Evaluation Systems with Spring AI](./spring-ai-llm-as-judge-blog-post.md)
+
+🔬 **Research Papers**:
+- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
+- [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
+- [LLMs-as-Judges: A Comprehensive Survey](https://arxiv.org/abs/2412.05579)
+
+🏆 **Judge Models**: [Judge Arena Leaderboard](https://huggingface.co/spaces/AtlaAI/judge-arena)
+
+## License
+
+This project is licensed under the Apache License 2.0 - see the [LICENSE](../../LICENSE) file for details.
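Review note: the evaluate-and-retry loop that the README describes under "Key Components" and "Expected Behavior" can be illustrated without any Spring AI dependency. The following is a minimal sketch under stated assumptions, not the `SelfRefineEvaluationAdvisor` from this change. The class `SelfRefineSketch`, the `refine` method, the `Result` record, and the `maxAttempts` parameter (counted here as total generation attempts) are hypothetical names introduced only for illustration. Only the `Evaluation(rating, evaluation, feedback)` shape and the 1-4 scale come from the README.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical, framework-free sketch of a self-refine loop; not the demo's actual code.
final class SelfRefineSketch {

    // Mirrors the README's EvaluationResponse(rating, evaluation, feedback).
    record Evaluation(int rating, String evaluation, String feedback) {}

    // Last generated answer, how many attempts were made, and whether the judge accepted it.
    record Result(String answer, int attempts, boolean passed) {}

    static Result refine(String question,
                         Function<String, String> generator,
                         BiFunction<String, String, Evaluation> judge,
                         int successRating,
                         int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        String prompt = question;
        String answer = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            answer = generator.apply(prompt);
            // The judge sees the original question, not the augmented prompt.
            Evaluation eval = judge.apply(question, answer);
            if (eval.rating() >= successRating) {
                return new Result(answer, attempt, true);
            }
            // Append the judge's feedback so the next attempt can address it.
            prompt = question + "\nPrevious answer was rated " + eval.rating()
                    + "/4. Feedback: " + eval.feedback();
        }
        // Attempts exhausted: hand back the last answer, flagged as not passed.
        return new Result(answer, maxAttempts, false);
    }

    public static void main(String[] args) {
        // Stand-in for the weather tool: an implausible reading first, then a plausible one.
        int[] temps = {-255, 15};
        int[] calls = {0};
        Function<String, String> generator = prompt ->
                "The temperature in Paris is "
                        + temps[Math.min(calls[0]++, temps.length - 1)] + " C.";

        // Stand-in for the judge model: rejects the implausible reading.
        BiFunction<String, String, Evaluation> judge = (question, answer) ->
                answer.contains("-255")
                        ? new Evaluation(1, "Unrealistic temperature", "Use a plausible value")
                        : new Evaluation(4, "Plausible weather data", "");

        Result result = refine("What is the current weather in Paris?", generator, judge, 4, 5);
        System.out.println("passed=" + result.passed()
                + " attempts=" + result.attempts()
                + " answer=" + result.answer());
    }
}
```

In this sketch the generator and the judge are separate functions, which loosely corresponds to the README's advice to use different models for generation and evaluation. A real advisor would delegate both calls to chat models and would also need the skip logic for tool calls that the README mentions.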