# GEPA Alignment Optimizer

MLflow provides the **GEPA alignment optimizer**, built on [DSPy's implementation of GEPA](https://dspy.ai/api/optimizers/GEPA/overview/) (Genetic-Pareto). GEPA uses LLM-driven reflection to analyze execution traces and iteratively propose improved judge instructions based on human feedback.

## Requirements

For alignment to work:

- Traces must contain human assessments (labels) with the same name as the judge (one way to log these is sketched after this list)
- Natural language feedback (rationale) is highly recommended for better alignment
- A minimum of 10 traces with human assessments is required
- A mix of positive and negative labels is recommended

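A minimal sketch of logging such human labels via `mlflow.log_feedback`, assuming a judge named `politeness` as in the example below; the reviewer ID, label values, and rationales are placeholders standing in for real human review:

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Pull recent traces from the current experiment.
traces = mlflow.search_traces(return_type="list")

for trace in traces:
    # The feedback name must match the judge name for alignment to use it.
    mlflow.log_feedback(
        trace_id=trace.info.trace_id,
        name="politeness",
        value=True,  # or False for a negative label
        rationale="Courteous tone; addresses the user's question directly.",
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="reviewer@example.com",
        ),
    )
```
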
## Basic Usage

See the [make_judge documentation](/genai/eval-monitor/scorers/llm-judge/make-judge) for details on creating judges.

```python
import mlflow
from mlflow.genai.judges import make_judge
from mlflow.genai.judges.optimizers import GEPAAlignmentOptimizer

judge = make_judge(
    name="politeness",
    instructions=(
        "Given a user question, evaluate if the chatbot's response is polite and respectful. "
        "Consider the tone, language, and context of the response.\n\n"
        "Question: {{ inputs }}\n"
        "Response: {{ outputs }}"
    ),
    feedback_value_type=bool,
    model="openai:/gpt-5-mini",
)

traces_with_feedback = mlflow.search_traces(return_type="list")

optimizer = GEPAAlignmentOptimizer(
    model="openai:/gpt-5-mini",
    max_metric_calls=100,
)
aligned_judge = judge.align(traces_with_feedback, optimizer)
```

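Once aligned, the judge is called like any other judge created with `make_judge`. A short sketch, assuming the judge is invoked with the same `inputs`/`outputs` template variables and returns a `Feedback` object (the question and response values are illustrative):

```python
# Score a single interaction with the aligned judge.
feedback = aligned_judge(
    inputs={"question": "How do I reset my password?"},
    outputs={"response": "Happy to help! Use the 'Forgot password' link on the sign-in page."},
)
print(feedback.value)  # True or False, matching feedback_value_type=bool
print(feedback.rationale)  # the judge's explanation for the label
```
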
## Parameters

| Parameter          | Type   | Default | Description                                                                                                        |
| ------------------ | ------ | ------- | ------------------------------------------------------------------------------------------------------------------ |
| `model`            | `str`  | `None`  | Model used for reflection. If `None`, uses the default model.                                                       |
| `max_metric_calls` | `int`  | `None`  | Maximum evaluation calls during optimization. If `None`, automatically set to 4x the number of training examples.   |
| `gepa_kwargs`      | `dict` | `None`  | Additional keyword arguments passed directly to `dspy.GEPA()` for advanced configuration.                           |

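As a sketch of advanced configuration, `gepa_kwargs` is forwarded to `dspy.GEPA()`; the option name below comes from DSPy's GEPA API and should be treated as an assumption to verify against the DSPy documentation for your installed version:

```python
# Hypothetical advanced configuration forwarded to dspy.GEPA().
optimizer = GEPAAlignmentOptimizer(
    model="openai:/gpt-5-mini",
    max_metric_calls=200,
    gepa_kwargs={
        # How many examples GEPA reflects on per proposal step.
        "reflection_minibatch_size": 5,
    },
)
aligned_judge = judge.align(traces_with_feedback, optimizer)
```
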
## When to Use GEPA

GEPA is particularly effective when:

- **Complex evaluation criteria**: Your judge needs to understand nuanced, context-dependent quality standards
- **Rich textual feedback**: Human reviewers provide detailed explanations for their assessments
- **Iterative refinement**: You want the optimizer to learn from failures and propose targeted improvements

For simpler alignment tasks, consider using the default [SIMBA optimizer](/genai/eval-monitor/scorers/llm-judge/simba).

## Debugging

To debug the optimization process, enable DEBUG logging:

```python
import logging

logging.getLogger("mlflow.genai.judges.optimizers.gepa").setLevel(logging.DEBUG)
aligned_judge = judge.align(traces_with_feedback, optimizer)
```