
Commit db7a7f2

Authored by Samraj Moorjani
Add documentation for judge alignment with GEPA (#20213)
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
1 parent 43d91e5 commit db7a7f2

File tree: 2 files changed, +72 −0 lines changed


docs/docs/genai/eval-monitor/scorers/llm-judge/alignment.mdx

Lines changed: 1 addition & 0 deletions
```diff
@@ -184,6 +184,7 @@ else:
 MLflow provides multiple alignment optimizers:
 
 - [**SIMBA**](/genai/eval-monitor/scorers/llm-judge/simba) (default) - Uses DSPy's SIMBA algorithm for prompt optimization
+- [**GEPA**](/genai/eval-monitor/scorers/llm-judge/gepa) - Uses DSPy's GEPA algorithm with LLM-driven reflection for iterative refinement
 - [**MemAlign**](/genai/eval-monitor/scorers/llm-judge/memalign) (experimental) - Uses a dual-memory system for fast and cheap few-shot alignment
 
 You can also [create custom optimizers](/genai/eval-monitor/scorers/llm-judge/custom-optimizers) to implement domain-specific alignment strategies.
```
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# GEPA Alignment Optimizer

MLflow provides the **GEPA alignment optimizer** using [DSPy's implementation of GEPA](https://dspy.ai/api/optimizers/GEPA/overview/) (Genetic-Pareto). GEPA uses LLM-driven reflection to analyze execution traces and iteratively propose improved judge instructions based on human feedback.
## Requirements

For alignment to work:

- Traces must contain human assessments (labels) with the same name as the judge
- Natural-language feedback (rationales) is highly recommended and improves alignment quality
- A minimum of 10 traces with human assessments is required
- A mix of positive and negative labels is recommended
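These requirements can be checked programmatically before kicking off an alignment run. The sketch below is illustrative only: it models traces as plain dicts rather than real MLflow `Trace` objects, and the helper names (`human_label`, `check_alignment_ready`) are hypothetical, not part of the MLflow API.

```python
# Illustrative pre-flight check before judge alignment.
# Traces are modeled as plain dicts -- a simplified stand-in for
# MLflow Trace objects, not the real trace schema.
JUDGE_NAME = "politeness"
MIN_TRACES = 10  # GEPA alignment requires at least 10 labeled traces


def human_label(trace: dict, judge_name: str):
    """Return the human assessment value matching the judge name, if any."""
    for assessment in trace.get("assessments", []):
        if assessment["name"] == judge_name and assessment["source"] == "HUMAN":
            return assessment["value"]
    return None


def check_alignment_ready(traces: list, judge_name: str) -> bool:
    """True if there are enough labeled traces and the labels are mixed."""
    labels = [v for t in traces if (v := human_label(t, judge_name)) is not None]
    enough = len(labels) >= MIN_TRACES
    mixed = len(set(labels)) > 1  # both positive and negative labels present
    return enough and mixed
```

Running a check like this up front avoids spending an optimization budget on a dataset that cannot produce a meaningful alignment.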
## Basic Usage

See the [make_judge documentation](/genai/eval-monitor/scorers/llm-judge/make-judge) for details on creating judges.

```python
import mlflow
from mlflow.genai.judges import make_judge
from mlflow.genai.judges.optimizers import GEPAAlignmentOptimizer

# Create a judge that evaluates the politeness of chatbot responses
judge = make_judge(
    name="politeness",
    instructions=(
        "Given a user question, evaluate if the chatbot's response is polite and respectful. "
        "Consider the tone, language, and context of the response.\n\n"
        "Question: {{ inputs }}\n"
        "Response: {{ outputs }}"
    ),
    feedback_value_type=bool,
    model="openai:/gpt-5-mini",
)

# Fetch traces that carry human feedback for this judge
traces_with_feedback = mlflow.search_traces(return_type="list")

# Align the judge's instructions to the human feedback using GEPA
optimizer = GEPAAlignmentOptimizer(
    model="openai:/gpt-5-mini",
    max_metric_calls=100,
)
aligned_judge = judge.align(traces_with_feedback, optimizer)
```
## Parameters

| Parameter          | Type   | Default | Description                                                                                                       |
| ------------------ | ------ | ------- | ----------------------------------------------------------------------------------------------------------------- |
| `model`            | `str`  | `None`  | Model used for reflection. If `None`, uses the default model.                                                      |
| `max_metric_calls` | `int`  | `None`  | Maximum evaluation calls during optimization. If `None`, automatically set to 4x the number of training examples.  |
| `gepa_kwargs`      | `dict` | `None`  | Additional keyword arguments passed directly to `dspy.GEPA()` for advanced configuration.                          |
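To make the default budget concrete: when `max_metric_calls` is `None`, the optimizer sizes its evaluation budget from the training set. A minimal sketch of that arithmetic follows; the helper name is illustrative, not an MLflow function.

```python
def default_max_metric_calls(num_training_examples: int) -> int:
    """Default GEPA evaluation budget: 4x the number of labeled training traces."""
    return 4 * num_training_examples


# 25 labeled traces -> default budget of 100 evaluation calls,
# the same value passed explicitly as max_metric_calls=100 in Basic Usage
budget = default_max_metric_calls(25)
```

Each evaluation call invokes the judge model, so this number is a direct lever on cost: lower it for cheap smoke tests, raise it when you can afford a more thorough search.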
## When to Use GEPA
GEPA is particularly effective when:

- **Complex evaluation criteria**: Your judge needs to understand nuanced, context-dependent quality standards
- **Rich textual feedback**: Human reviewers provide detailed explanations for their assessments
- **Iterative refinement**: You want the optimizer to learn from failures and propose targeted improvements

For simpler alignment tasks, consider using the default [SIMBA optimizer](/genai/eval-monitor/scorers/llm-judge/simba).
## Debugging
To debug the optimization process, enable DEBUG logging for the GEPA optimizer module:

```python
import logging

# Configure a handler first so that DEBUG records are actually printed
logging.basicConfig()
logging.getLogger("mlflow.genai.judges.optimizers.gepa").setLevel(logging.DEBUG)

aligned_judge = judge.align(traces_with_feedback, optimizer)
```
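Long optimization runs can produce a lot of DEBUG output, and routing it to a file makes it easier to review afterwards. This is plain standard-library `logging`, not an MLflow-specific feature, and the log filename below is just an example.

```python
import logging

# Route the GEPA optimizer's DEBUG records to a file for later inspection
logger = logging.getLogger("mlflow.genai.judges.optimizers.gepa")
logger.setLevel(logging.DEBUG)

handler = logging.FileHandler("gepa_alignment.log")  # example filename
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.debug("GEPA alignment debug logging enabled")
```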
