
feat: gepa algorithm#502

Draft
lspinheiro wants to merge 8 commits into microsoft:main from lspinheiro:feat/lpinheiro/gepa-algorithm

Conversation


@lspinheiro lspinheiro commented Mar 10, 2026

Description

Agent Lightning already supports weight-level training (VERL) and beam-search prompt optimization (APO). This PR adds GEPA — an evolutionary prompt optimizer that fills an important gap: fast, inference-only prompt improvement that tracks per-example performance via a Pareto frontier.

Unlike APO's beam search, GEPA evolves prompt candidates through reflective mutations: it examines execution traces, identifies where prompts fall short on specific examples, and proposes targeted improvements. This makes it particularly effective for tasks with diverse failure modes, where a single "best prompt" metric can hide regressions on individual cases.
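The per-example Pareto tracking described above can be sketched in a few lines. This is a minimal illustration of the idea, not GEPA's actual implementation; the function name and data layout are made up for the example:

```python
def pareto_front(scores: dict[str, list[float]]) -> set[str]:
    """Keep every candidate that achieves the best score on at least one example.

    scores maps a candidate id to its per-example scores. A candidate survives
    if it ties the best score on some individual example, so an improvement on
    one case cannot silently erase a win on another.
    """
    n = len(next(iter(scores.values())))
    best = [max(s[i] for s in scores.values()) for i in range(n)]
    return {c for c, s in scores.items() if any(s[i] == best[i] for i in range(n))}

# "avg" has the best mean score but is never the per-example best,
# so it falls off the frontier while the two specialists survive.
front = pareto_front({"a": [1.0, 0.0], "b": [0.0, 1.0], "avg": [0.6, 0.6]})
print(sorted(front))  # -> ['a', 'b']
```

This is exactly why a frontier beats a single aggregate score: the averaged candidate looks strongest overall, yet keeping only it would regress on both individual examples.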

What's included

  1. Algorithm integration (agentlightning/algorithm/gepa/) — GEPA plugs into the standard Trainer workflow just like APO and VERL. It bridges GEPA's synchronous optimizer with AGL's async store, converts between AGL resources and GEPA candidates, and supports W&B experiment tracking out of the box.
  2. Room-booking example (examples/gepa/) — a complete walkthrough: a tool-calling agent picks meeting rooms, an LLM judge scores the choices, and GEPA optimizes the prompt across 57 scenarios. The example supports Azure Entra ID, Azure API key, and plain OpenAI as backends (via the --provider flag or the LLM_PROVIDER env var), so contributors without Azure access can run it with just an OpenAI key.
  3. Test suite (tests/algorithm/gepa/) — covers config, resource codec, adapter, interface lifecycle, and callbacks.
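The sync-to-async bridge from point 1 can be sketched with the standard library alone. Everything here is a stand-in — AsyncStore, sync_optimizer, and the metric plumbing are illustrative, not the real AGL or GEPA interfaces — but the technique (run the blocking optimizer in a worker thread, and schedule store coroutines back onto the main event loop) is the general pattern:

```python
import asyncio

class AsyncStore:
    """Stand-in for an async store that collects rollouts for a candidate."""
    async def evaluate(self, candidate: str) -> int:
        await asyncio.sleep(0.01)  # simulate async rollout collection
        return len(candidate)      # toy score: longer prompt wins

def sync_optimizer(metric, candidates):
    # A synchronous optimizer loop that knows nothing about asyncio;
    # it just calls a plain blocking metric function.
    return max(candidates, key=metric)

async def main() -> str:
    store = AsyncStore()
    loop = asyncio.get_running_loop()

    def metric(candidate: str) -> int:
        # Bridge: from the worker thread, schedule the coroutine on the
        # main loop and block until its result is ready.
        fut = asyncio.run_coroutine_threadsafe(store.evaluate(candidate), loop)
        return fut.result()

    # Run the blocking optimizer off the loop so the loop stays free
    # to service the store's coroutines.
    return await asyncio.to_thread(sync_optimizer, metric, ["a", "bb", "ccc"])

print(asyncio.run(main()))  # -> ccc
```

The key design point is that the event loop must never be blocked by the optimizer itself; moving the optimizer into a thread keeps the loop available to execute the coroutines that `run_coroutine_threadsafe` submits.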

How GEPA compares to the existing algorithms

GEPA is a good fit when you want to improve prompts without touching model weights, and you care about not regressing on specific inputs while improving overall performance.
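To make the hidden-regression point concrete, here is a hypothetical check (names invented for illustration) contrasting an aggregate metric with a per-example comparison:

```python
def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def regressed_examples(old: list[float], new: list[float]) -> list[int]:
    # Indices where the new prompt scores strictly worse than the old one.
    return [i for i, (o, n) in enumerate(zip(old, new)) if n < o]

old = [0.9, 0.2, 0.3]
new = [0.5, 0.8, 0.9]
print(mean(new) > mean(old))         # True: the aggregate says "better"
print(regressed_examples(old, new))  # [0]: but example 0 got worse
```

A single "best prompt" score would accept this change outright; per-example tracking surfaces the regression on example 0 so it can be weighed explicitly.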

Example W&B logs

agent-lightning/examples/gepa$ uv run python room_selector_gepa.py --wandb
GEPA Optimization:  86%|███████████████████████████████████████████████████████████████████████████████████▍             | 215/250 [08:58<01:27,  2.51s/rollouts]
wandb: 
wandb: Run history:
wandb:      agl/best_candidate_valset_score ▁▁▁██████
wandb:                   agl/num_candidates ▁▁▁▃▃▃▃▆█
wandb:                 agl/pareto_front_agg ▁▁▁▆▆▆▆▇█
wandb:               agl/total_metric_calls ▁▂▂▄▄▅▅▇█
wandb:       base_program_full_valset_score ▁
wandb:            base_program_val_coverage ▁
wandb: best_program_as_per_agg_score_valset ▁▁▁
wandb:                 best_score_on_valset ▁▁▁
wandb:                best_valset_agg_score ▁▁▁
wandb:                            iteration ▁▂▃▃▄▅▆▆▇█
wandb:                                  +10 ...
wandb: 
wandb: Run summary:
wandb:      agl/best_candidate_valset_score 0.39655
wandb:                   agl/num_candidates 4
wandb:                 agl/pareto_front_agg 0.55172
wandb:                agl/proposal_accepted True
wandb:               agl/total_metric_calls 260
wandb:       base_program_full_valset_score 0.17759
wandb:            base_program_val_coverage 29
wandb: best_program_as_per_agg_score_valset 1
wandb:                 best_score_on_valset 0.39655
wandb:                best_valset_agg_score 0.39655
wandb:                                  +12 ...
wandb: 

