
Integrate GEPA into Curator #709

Open
agdhruv wants to merge 9 commits into bespokelabsai:main from agdhruv:feature/gepa-ai

Conversation

agdhruv commented Jan 27, 2026

This PR introduces an integration between Curator and GEPA to enable automated prompt optimization.

Key Changes

The PR implements a new "optimizer" block that allows users to refine their LLM prompts through an evolutionary search process rather than manual trial and error.

  • GEPA Integration: Adds gepa as an optional dependency in pyproject.toml.
  • CuratorAdapter Implementation: Provides a bridge (CuratorAdapter) that maps Curator's LLM classes to GEPA’s optimization loop. It handles:
    • Seed Extraction: Automatically pulls initial system prompts and templates from existing classes.
    • Evaluation: Runs candidate prompts against datasets and captures "trajectories" (inputs, outputs, and scores) for feedback.
    • Refinement: Uses GEPA's reflection mechanism to propose targeted prompt improvements based on score feedback.
  • Structured Evaluation: Defines EvaluationResult to pair numerical scores with natural language feedback, which helps the "Reflection LLM" understand how to improve the prompt (a rough sketch follows this list).
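
To make the evaluation contract concrete, here is a minimal sketch of roughly how a candidate prompt could be evaluated and turned into trajectories; the field, method, and argument names are illustrative assumptions, not the exact code in this PR:

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """Pairs a numerical score with natural-language feedback for the reflection LLM."""
    score: float   # e.g., fraction of constraints satisfied, in [0, 1]
    feedback: str  # explanation of what passed or failed

def evaluate_candidate(llm, candidate_system_prompt, dataset, metric):
    """Run one candidate prompt over the dataset and collect trajectories.

    `llm.clone(...)` and the `metric(row, output) -> EvaluationResult`
    signature are assumptions made for this sketch.
    """
    candidate = llm.clone(system_prompt=candidate_system_prompt)
    trajectories, scores = [], []
    for row in dataset:
        output = candidate(row)  # generate with the candidate prompt
        result = metric(row, output)
        scores.append(result.score)
        trajectories.append({
            "input": row,
            "output": output,
            "score": result.score,
            "feedback": result.feedback,
        })
    avg_score = sum(scores) / max(len(scores), 1)
    return avg_score, trajectories
```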

Example Workflow

The PR includes a comprehensive example (gepa_example.py) demonstrating how to optimize a math word problem generator (a rough sketch of the generator/judge classes follows this list):

  1. Generator: A curator.LLM class with a basic "seed" prompt.
  2. Judge: A second LLM that acts as a grader, scoring clarity and correctness.
  3. Optimization: GEPA runs multiple iterations, evolving the generator's prompt until it reaches the highest possible score from the judge.
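
For readers unfamiliar with Curator's class-based API, the generator/judge pairing looks roughly like the sketch below; class names, prompts, and the judge's scoring scheme are illustrative assumptions, and the actual code lives in gepa_example.py:

```python
from bespokelabs import curator

class MathProblemGenerator(curator.LLM):
    """Generator whose seed system prompt GEPA will evolve."""

    def prompt(self, input: dict) -> str:
        return f"Write a math word problem about {input['topic']}."

    def parse(self, input: dict, response: str) -> dict:
        return {"topic": input["topic"], "problem": response}

class MathProblemJudge(curator.LLM):
    """Judge that grades a generated problem for clarity and correctness."""

    def prompt(self, input: dict) -> str:
        return (
            "Rate the following math word problem for clarity and correctness "
            f"on a scale from 0 to 1. Reply with only the number.\n\n{input['problem']}"
        )

    def parse(self, input: dict, response: str) -> dict:
        return {**input, "score": float(response.strip())}

# GEPA's loop: propose a new system prompt for the generator, regenerate the
# problems, score them with the judge, and keep the best-scoring candidates.
```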

Other Approaches Considered

  • DSPy Integration: I initially attempted to use the version of GEPA included in DSPy. While functional, the integration felt clunky and introduced unnecessary indirection in the call chain (Curator LLM → DSPy LM → GEPA). This led to a misalignment where we were optimizing a DSPy-specific prompt rather than the native Curator prompt. So, I moved toward a direct GEPA implementation for better transparency and control.

Points for Discussion

  • Model Naming Conventions: The reflection_lm required by gepa.compile() follows GEPA's model naming schema, which may differ from Curator's. How do we deal with this?
  • Use Case Validation: While technically functional, the value of prompt optimization for dataset generation is unclear because the task is non-verifiable. The current math problems example is a proof of concept; I'm open to suggestions for more robust, real-world use cases where evolutionary optimization provides a clear advantage.
  • User Expertise Assumptions: The current implementation assumes the end-user is familiar with GEPA's architecture and hyperparameter tuning.

Given the design decisions outlined above, this implementation should be seen as a first step. My goal is to provide a functional foundation so the team can experiment with the workflow. This will help us determine the most intuitive API to expose to users in the final release.


Note

Medium Risk
Introduces a new execution path that repeatedly calls LLMs and temporarily forces CURATOR_DISABLE_CACHE, plus a new LLM.clone() construction path; incorrect cloning or cache toggling could cause subtle behavior differences or performance issues.

Overview
Adds an optional GEPA integration to Curator by introducing CuratorAdapter (curator.blocks.gepa) that lets GEPA optimize a Curator LLM’s system_prompt, run candidate prompts against a dataset, score via a user-provided metric, and optionally emit reflection trajectories.

Extends LLM with a clone() method (and stores backend/backend_params) so the adapter can create per-candidate LLM instances; also updates curator.blocks exports to be conditional when the optimizer extra (new optional gepa dependency) is not installed. Includes new examples/optimizer/* showing GEPA optimization for an LLM-judged math problem generator and for code generation scored by test execution + static style checks.
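
As a rough sketch of what the risky pieces involve (names and behavior here are assumptions; the authoritative logic is in the PR diff): the optimizer needs fresh model calls for every candidate, so the cache is bypassed for the duration of a run, and each candidate gets its own LLM instance rebuilt from stored constructor arguments.

```python
import copy
import os
from contextlib import contextmanager

@contextmanager
def curator_cache_disabled():
    """Temporarily set CURATOR_DISABLE_CACHE so each candidate prompt hits the
    model instead of returning cached completions, then restore the old value."""
    previous = os.environ.get("CURATOR_DISABLE_CACHE")
    os.environ["CURATOR_DISABLE_CACHE"] = "true"  # assumed value; any truthy flag
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("CURATOR_DISABLE_CACHE", None)
        else:
            os.environ["CURATOR_DISABLE_CACHE"] = previous

def clone_llm(llm, new_system_prompt):
    """Illustrative stand-in for LLM.clone(): rebuild the instance from the
    stored backend/backend_params with a different system prompt."""
    params = copy.deepcopy(llm.backend_params)  # stored by the PR for this purpose
    new_llm = type(llm)(backend=llm.backend, backend_params=params)  # assumed constructor args
    new_llm.system_prompt = new_system_prompt
    return new_llm
```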

Written by Cursor Bugbot for commit 4c7e800.

shreyaspimpalgaonkar (Contributor) commented:

Thanks for the PR @agdhruv. Can you share some sample results here?

agdhruv (Author) commented Feb 1, 2026

Modifications from original PR

  • Improved user-facing API: simply takes the LLM instance and clones it internally to optimize the system prompt (usage sketched below).
  • Optimize only system prompt: we don't optimize the user prompt since that usually contains named variable components that GEPA might remove or rename.
  • Added a significant new example: see below.
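
A hedged sketch of driving the optimizer with the revised API; the keyword arguments to CuratorAdapter, the metric signature, and the result shape are assumptions, while the module path and gepa.compile() come from the discussion above:

```python
import gepa
from curator.blocks.gepa import CuratorAdapter

# `generator` is the curator.LLM instance whose system prompt should be improved;
# the adapter clones it internally for every candidate prompt.
adapter = CuratorAdapter(
    llm=generator,       # assumed keyword name
    dataset=train_rows,  # rows each candidate is evaluated on
    metric=score_fn,     # assumed: returns an EvaluationResult (score + feedback)
)

result = gepa.compile(           # entry point referenced earlier in this thread
    adapter=adapter,
    reflection_lm="gpt-4o",      # follows GEPA's model-naming scheme, not Curator's
    max_iterations=10,           # illustrative budget
)
print(result.best_candidate)     # assumed attribute holding the optimized system prompt
```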

Results

Added a new example, examples/optimizer/code_generation.py, which mimics generating a dataset of Python solutions to programming problems. For high-quality data, the generated solutions must satisfy certain constraints (e.g., pass the test cases, be emitted as a single markdown code block, include a docstring and type hints, and contain no print statements). The example shows how to optimize the system prompt to meet those constraints.
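
The constraint-checking metric in that example could look roughly like the sketch below (simplified; the helper name, exact checks, and weighting are assumptions, and the real version lives in examples/optimizer/code_generation.py):

```python
import re
import types

def score_solution(code_markdown: str, test_cases: list[dict]) -> tuple[float, str]:
    """Return (score in [0, 1], feedback string) for one generated solution."""
    checks, feedback = [], []

    # Constraint: exactly one markdown code block.
    blocks = re.findall(r"`{3}(?:python)?\n(.*?)`{3}", code_markdown, re.DOTALL)
    checks.append(len(blocks) == 1)
    if len(blocks) != 1:
        feedback.append("Output must contain exactly one markdown code block.")
    code = blocks[0] if blocks else ""

    # Constraints: docstring, type hints, no print statements (crude string checks).
    checks.append('"""' in code)
    checks.append("->" in code)
    checks.append("print(" not in code)

    # Constraint: the solution passes its test cases.
    passed = 0
    namespace: dict = {}
    try:
        exec(code, namespace)  # run the candidate solution
        fn = next(v for v in namespace.values() if isinstance(v, types.FunctionType))
        for case in test_cases:
            if fn(*case["input"]) == case["expected"]:
                passed += 1
    except Exception as exc:
        feedback.append(f"Execution error: {exc}")
    checks.append(passed == len(test_cases))
    if passed < len(test_cases):
        feedback.append(f"{passed}/{len(test_cases)} test cases passed.")

    return sum(checks) / len(checks), " ".join(feedback) or "All checks passed."
```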

The optimization progress is shown below.

| Iteration | Val Score (Avg) | Key Improvements |
| --- | --- | --- |
| 0 (Seed) | 0.000 | Initial "You are a Python programmer" prompt. |
| 1 | 0.500 | Adds explicit output format, function contract, and tokenization rules. |
| 2 | 0.875 | Refined rules for palindromes, two-sum logic, type hints, and docstrings. |

Note

The optimization achieved a significant jump from a baseline score of 0.0 to 0.875 in just two iterations. It would probably go even higher, but I didn't run GEPA for longer to limit development cost.

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


{"input": ([(1, 2), (3, 4), (5, 6)],), "expected": [(1, 2), (3, 4), (5, 6)]},
{"input": ([(1, 10), (2, 3), (4, 5)],), "expected": [(1, 10)]},
{"input": ([(5, 7), (1, 3), (2, 4)],), "expected": [(1, 4), (5, 7)]},
],

JSON serialization loses tuple types in merge_intervals tests

Medium Severity

The merge_intervals test cases use tuples for interval pairs (e.g., expected: [(1, 6), (8, 10)]), but serialize_test_cases converts them via json.dumps, which turns all tuples into JSON arrays. When deserialize_test_cases uses json.loads, these become Python lists ([[1, 6], [8, 10]]). Since (1, 6) != [1, 6] in Python, the result == test["expected"] comparison in run_tests always fails for correct tuple-returning implementations, causing 5 of 6 merge_intervals test cases to always score zero.
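
The round trip is easy to reproduce in isolation:

```python
import json

expected = [(1, 6), (8, 10)]                      # tuples, as written in the test cases
round_tripped = json.loads(json.dumps(expected))  # JSON has no tuple type
print(round_tripped)                              # [[1, 6], [8, 10]]
print(expected == round_tripped)                  # False: a correct tuple-returning
                                                  # implementation can never match
```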


return_completions_object=self.return_completions_object,
)
self.backend = backend
self.backend_params = backend_params

Mutated backend_params stored, polluting clone's state

Low Severity

self.backend_params = backend_params is assigned after _RequestProcessorFactory.create() mutates the backend_params dict in place (adding model, generation_params, and return_completions_object keys). The clone method then deep-copies this polluted dict. While currently harmless since the factory overwrites these keys, the stored backend_params doesn't reflect the user's original input, and any future change to the factory's mutation behavior could break clone.
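
A minimal reproduction of the pattern in plain Python (the factory and parameter names here are stand-ins, not the Curator code):

```python
import copy

def create_processor(backend_params: dict) -> None:
    """Stand-in for a factory that mutates the caller's dict in place."""
    backend_params["model"] = "resolved-model-name"
    backend_params["return_completions_object"] = False

user_params = {"max_requests_per_minute": 100}   # what the user actually passed
create_processor(user_params)

stored = user_params                  # assigned only after the factory call, so already polluted
clone_params = copy.deepcopy(stored)  # a clone now inherits the factory-added keys
print(clone_params)
# {'max_requests_per_minute': 100, 'model': 'resolved-model-name', 'return_completions_object': False}
```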


agdhruv (Author) commented Feb 9, 2026

@shreyaspimpalgaonkar thoughts?
