
Integrate GEPA into Curator #709

Open
agdhruv wants to merge 9 commits into bespokelabsai:main from agdhruv:feature/gepa-ai

Conversation

agdhruv commented Jan 27, 2026

This PR introduces an integration between Curator and GEPA to enable automated prompt optimization.

Key Changes

The PR implements a new "optimizer" block that allows users to refine their LLM prompts through an evolutionary search process rather than manual trial and error.

  • GEPA Integration: Adds gepa as an optional dependency in pyproject.toml.
  • CuratorAdapter Implementation: Provides a bridge (CuratorAdapter) that maps Curator's LLM classes to GEPA’s optimization loop. It handles:
    • Seed Extraction: Automatically pulls initial system prompts and templates from existing classes.
    • Evaluation: Runs candidate prompts against datasets and captures "trajectories" (inputs, outputs, and scores) for feedback.
    • Refinement: Uses GEPA's reflection mechanism to propose targeted prompt improvements based on score feedback.
  • Structured Evaluation: Defines EvaluationResult to pair numerical scores with natural language feedback, which helps the "Reflection LLM" understand how to improve the prompt (a rough sketch follows this list).
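
To make the evaluation contract concrete, here is a minimal sketch of roughly how a candidate prompt could be evaluated and turned into trajectories; the field, method, and argument names are illustrative assumptions, not the exact code in this PR:

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """Pairs a numerical score with natural-language feedback for the reflection LLM."""
    score: float   # e.g., fraction of constraints satisfied, in [0, 1]
    feedback: str  # explanation of what passed or failed

def evaluate_candidate(llm, candidate_system_prompt, dataset, metric):
    """Run one candidate prompt over the dataset and collect trajectories.

    `llm.clone(...)` and the `metric(row, output) -> EvaluationResult`
    signature are assumptions made for this sketch.
    """
    candidate = llm.clone(system_prompt=candidate_system_prompt)
    trajectories, scores = [], []
    for row in dataset:
        output = candidate(row)  # generate with the candidate prompt
        result = metric(row, output)
        scores.append(result.score)
        trajectories.append({
            "input": row,
            "output": output,
            "score": result.score,
            "feedback": result.feedback,
        })
    avg_score = sum(scores) / max(len(scores), 1)
    return avg_score, trajectories
```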

Example Workflow

The PR includes a comprehensive example (gepa_example.py) demonstrating how to optimize a math word problem generator (a rough sketch of the generator/judge classes follows this list):

  1. Generator: A curator.LLM class with a basic "seed" prompt.
  2. Judge: A second LLM that acts as a grader, scoring clarity and correctness.
  3. Optimization: GEPA runs multiple iterations, evolving the generator's prompt until it reaches the highest possible score from the judge.
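
For readers unfamiliar with Curator's class-based API, the generator/judge pairing looks roughly like the sketch below; class names, prompts, and the judge's scoring scheme are illustrative assumptions, and the actual code lives in gepa_example.py:

```python
from bespokelabs import curator

class MathProblemGenerator(curator.LLM):
    """Generator whose seed system prompt GEPA will evolve."""

    def prompt(self, input: dict) -> str:
        return f"Write a math word problem about {input['topic']}."

    def parse(self, input: dict, response: str) -> dict:
        return {"topic": input["topic"], "problem": response}

class MathProblemJudge(curator.LLM):
    """Judge that grades a generated problem for clarity and correctness."""

    def prompt(self, input: dict) -> str:
        return (
            "Rate the following math word problem for clarity and correctness "
            f"on a scale from 0 to 1. Reply with only the number.\n\n{input['problem']}"
        )

    def parse(self, input: dict, response: str) -> dict:
        return {**input, "score": float(response.strip())}

# GEPA's loop: propose a new system prompt for the generator, regenerate the
# problems, score them with the judge, and keep the best-scoring candidates.
```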

Other Approaches Considered

  • DSPy Integration: I initially attempted to use the version of GEPA included in DSPy. While functional, the integration felt clunky and introduced unnecessary indirection in the call chain (Curator LLM → DSPy LM → GEPA). This led to a misalignment where we were optimizing a DSPy-specific prompt rather than the native Curator prompt. So, I moved toward a direct GEPA implementation for better transparency and control.

Points for Discussion

  • Model Naming Conventions: The reflection_lm required by gepa.compile() follows GEPA's model naming schema, which may differ from Curator's. How do we deal with this?
  • Use Case Validation: While technically functional, the value of prompt optimization for dataset generation is unclear because the task is non-verifiable. The current math problems example is a proof of concept; I'm open to suggestions for more robust, real-world use cases where evolutionary optimization provides a clear advantage.
  • User Expertise Assumptions: The current implementation assumes the end-user is familiar with GEPA's architecture and hyperparameter tuning.

Given the design decisions outlined above, this implementation should be seen as a first step. My goal is to provide a functional foundation so the team can experiment with the workflow. This will help us determine the most intuitive API to expose to users in the final release.


Note

Medium Risk
Introduces a new execution path that repeatedly calls LLMs and temporarily forces CURATOR_DISABLE_CACHE, plus a new LLM.clone() construction path; incorrect cloning or cache toggling could cause subtle behavior differences or performance issues.

Overview
Adds an optional GEPA integration to Curator by introducing CuratorAdapter (curator.blocks.gepa) that lets GEPA optimize a Curator LLM’s system_prompt, run candidate prompts against a dataset, score via a user-provided metric, and optionally emit reflection trajectories.

Extends LLM with a clone() method (and stores backend/backend_params) so the adapter can create per-candidate LLM instances; also updates curator.blocks exports to be conditional when the optimizer extra (new optional gepa dependency) is not installed. Includes new examples/optimizer/* showing GEPA optimization for an LLM-judged math problem generator and for code generation scored by test execution + static style checks.
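
As a rough sketch of what the risky pieces involve (names and behavior here are assumptions; the authoritative logic is in the PR diff): the optimizer needs fresh model calls for every candidate, so the cache is bypassed for the duration of a run, and each candidate gets its own LLM instance rebuilt from stored constructor arguments.

```python
import copy
import os
from contextlib import contextmanager

@contextmanager
def curator_cache_disabled():
    """Temporarily set CURATOR_DISABLE_CACHE so each candidate prompt hits the
    model instead of returning cached completions, then restore the old value."""
    previous = os.environ.get("CURATOR_DISABLE_CACHE")
    os.environ["CURATOR_DISABLE_CACHE"] = "true"  # assumed value; any truthy flag
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop("CURATOR_DISABLE_CACHE", None)
        else:
            os.environ["CURATOR_DISABLE_CACHE"] = previous

def clone_llm(llm, new_system_prompt):
    """Illustrative stand-in for LLM.clone(): rebuild the instance from the
    stored backend/backend_params with a different system prompt."""
    params = copy.deepcopy(llm.backend_params)  # stored by the PR for this purpose
    new_llm = type(llm)(backend=llm.backend, backend_params=params)  # assumed constructor args
    new_llm.system_prompt = new_system_prompt
    return new_llm
```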

Written by Cursor Bugbot for commit 4c7e800.

shreyaspimpalgaonkar (Contributor) commented:

Thanks for the PR @agdhruv. Can you share some sample results here?

agdhruv (Author) commented Feb 1, 2026

Modifications from original PR

  • Improved user-facing API: simply takes the LLM instance and clones it internally to optimize the system prompt (usage sketched below).
  • Optimize only system prompt: we don't optimize the user prompt since that usually contains named variable components that GEPA might remove or rename.
  • Added a significant new example: see below.
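
A hedged sketch of driving the optimizer with the revised API; the keyword arguments to CuratorAdapter, the metric signature, and the result shape are assumptions, while the module path and gepa.compile() come from the discussion above:

```python
import gepa
from curator.blocks.gepa import CuratorAdapter

# `generator` is the curator.LLM instance whose system prompt should be improved;
# the adapter clones it internally for every candidate prompt.
adapter = CuratorAdapter(
    llm=generator,       # assumed keyword name
    dataset=train_rows,  # rows each candidate is evaluated on
    metric=score_fn,     # assumed: returns an EvaluationResult (score + feedback)
)

result = gepa.compile(           # entry point referenced earlier in this thread
    adapter=adapter,
    reflection_lm="gpt-4o",      # follows GEPA's model-naming scheme, not Curator's
    max_iterations=10,           # illustrative budget
)
print(result.best_candidate)     # assumed attribute holding the optimized system prompt
```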

Results

Added a new example, examples/optimizer/code_generation.py, which mimics generating a dataset of Python solutions to programming problems. For high-quality data, the generated solutions must satisfy certain constraints (e.g., pass the test cases, be emitted as a single markdown code block, include a docstring and type hints, and contain no print statements). The example shows how to optimize the system prompt to meet those constraints.
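
The constraint-checking metric in that example could look roughly like the sketch below (simplified; the helper name, exact checks, and weighting are assumptions, and the real version lives in examples/optimizer/code_generation.py):

```python
import re
import types

def score_solution(code_markdown: str, test_cases: list[dict]) -> tuple[float, str]:
    """Return (score in [0, 1], feedback string) for one generated solution."""
    checks, feedback = [], []

    # Constraint: exactly one markdown code block.
    blocks = re.findall(r"`{3}(?:python)?\n(.*?)`{3}", code_markdown, re.DOTALL)
    checks.append(len(blocks) == 1)
    if len(blocks) != 1:
        feedback.append("Output must contain exactly one markdown code block.")
    code = blocks[0] if blocks else ""

    # Constraints: docstring, type hints, no print statements (crude string checks).
    checks.append('"""' in code)
    checks.append("->" in code)
    checks.append("print(" not in code)

    # Constraint: the solution passes its test cases.
    passed = 0
    namespace: dict = {}
    try:
        exec(code, namespace)  # run the candidate solution
        fn = next(v for v in namespace.values() if isinstance(v, types.FunctionType))
        for case in test_cases:
            if fn(*case["input"]) == case["expected"]:
                passed += 1
    except Exception as exc:
        feedback.append(f"Execution error: {exc}")
    checks.append(passed == len(test_cases))
    if passed < len(test_cases):
        feedback.append(f"{passed}/{len(test_cases)} test cases passed.")

    return sum(checks) / len(checks), " ".join(feedback) or "All checks passed."
```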

The optimization progress is shown below.

| Iteration | Val Score (Avg) | Key Improvements |
| --- | --- | --- |
| 0 (Seed) | 0.000 | Initial "You are a Python programmer" prompt. |
| 1 | 0.500 | Adds explicit output format, function contract, and tokenization rules. |
| 2 | 0.875 | Refined rules for palindromes, two-sum logic, type hints, and docstrings. |

Note

The optimization achieved a significant jump from a baseline score of 0.0 to 0.875 in just two iterations. It would probably go even higher, but I didn't run GEPA for longer to limit development cost.

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


{"input": ([(1, 2), (3, 4), (5, 6)],), "expected": [(1, 2), (3, 4), (5, 6)]},
{"input": ([(1, 10), (2, 3), (4, 5)],), "expected": [(1, 10)]},
{"input": ([(5, 7), (1, 3), (2, 4)],), "expected": [(1, 4), (5, 7)]},
],

JSON serialization loses tuple types in merge_intervals tests

Medium Severity

The merge_intervals test cases use tuples for interval pairs (e.g., expected: [(1, 6), (8, 10)]), but serialize_test_cases converts them via json.dumps, which turns all tuples into JSON arrays. When deserialize_test_cases uses json.loads, these become Python lists ([[1, 6], [8, 10]]). Since (1, 6) != [1, 6] in Python, the result == test["expected"] comparison in run_tests always fails for correct tuple-returning implementations, causing 5 of 6 merge_intervals test cases to always score zero.
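
The round trip is easy to reproduce in isolation:

```python
import json

expected = [(1, 6), (8, 10)]                      # tuples, as written in the test cases
round_tripped = json.loads(json.dumps(expected))  # JSON has no tuple type
print(round_tripped)                              # [[1, 6], [8, 10]]
print(expected == round_tripped)                  # False: a correct tuple-returning
                                                  # implementation can never match
```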


return_completions_object=self.return_completions_object,
)
self.backend = backend
self.backend_params = backend_params

Mutated backend_params stored, polluting clone's state

Low Severity

self.backend_params = backend_params is assigned after _RequestProcessorFactory.create() mutates the backend_params dict in place (adding model, generation_params, and return_completions_object keys). The clone method then deep-copies this polluted dict. While currently harmless since the factory overwrites these keys, the stored backend_params doesn't reflect the user's original input, and any future change to the factory's mutation behavior could break clone.
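
A minimal reproduction of the pattern in plain Python (the factory and parameter names here are stand-ins, not the Curator code):

```python
import copy

def create_processor(backend_params: dict) -> None:
    """Stand-in for a factory that mutates the caller's dict in place."""
    backend_params["model"] = "resolved-model-name"
    backend_params["return_completions_object"] = False

user_params = {"max_requests_per_minute": 100}   # what the user actually passed
create_processor(user_params)

stored = user_params                  # assigned only after the factory call, so already polluted
clone_params = copy.deepcopy(stored)  # a clone now inherits the factory-added keys
print(clone_params)
# {'max_requests_per_minute': 100, 'model': 'resolved-model-name', 'return_completions_object': False}
```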


agdhruv (Author) commented Feb 9, 2026

@shreyaspimpalgaonkar thoughts?
