Source: Google TTD-DR Paper, OptILLM implementation, AI-Engineering-101 tutorial
Evidence Tier: B (Community implementations with production validation)
The Self-Evolution Algorithm treats complex tasks as a diffusion process with iterative refinement. Instead of single-pass generation, it:
- Spawns multiple candidates with diverse configurations
- Iteratively refines each through judge feedback
- Synthesizes best elements through crossover
SDD Phase: Cross-phase (enhances Tasks and Implement phases)
"Self-evolution uses a 'Judge' to critique components (Plan, Search, Answer) and loop until quality improves." — Google Deep Researcher
| Scenario | Why Self-Evolution Helps |
|---|---|
| Complex research synthesis | Multiple perspectives, iterative depth |
| High-stakes decisions | Reduced single-point-of-failure risk |
| Creative problem-solving | Diversity sampling explores solution space |
| Quality-critical output | Judge loop ensures threshold met |
| Scenario | Why Not |
|---|---|
| Simple factual lookups | Overhead exceeds benefit |
| Time-critical responses | Parallel candidates add latency |
| Low-stakes tasks | Single-pass sufficient |
| Token-constrained contexts | 3x+ token usage |
Spawn N candidates (typically 3) with varying configurations to explore different solution spaces:
| Candidate | Style | Focus |
|---|---|---|
| Conservative | Evidence-heavy, low-risk | Established patterns, proven approaches |
| Balanced | Pragmatic, trade-off aware | Best practices, practical solutions |
| Creative | Exploratory, novel angles | Innovation potential, alternative perspectives |
Implementation Principle: Diversity comes from different prompt styles, not from model parameters (Claude Code does not expose temperature control).
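As a minimal sketch, the three styles can be expressed as prompt templates dispatched in parallel; the `CANDIDATE_STYLES` mapping and `build_candidate_prompts` helper below are illustrative names, not part of the TTD-DR paper or the OptILLM API:

```python
# Illustrative sketch: diversity comes from prompt wording, not sampling parameters.
CANDIDATE_STYLES = {
    "conservative": (
        "Analyze {topic} with a CONSERVATIVE approach: focus on established "
        "patterns, prioritize evidence-based claims, favor low-risk recommendations."
    ),
    "balanced": (
        "Analyze {topic} with a BALANCED approach: weigh trade-offs explicitly, "
        "apply best practices pragmatically, give actionable recommendations."
    ),
    "creative": (
        "Analyze {topic} with a CREATIVE approach: explore novel angles, "
        "challenge assumptions, consider unconventional solutions."
    ),
}

def build_candidate_prompts(topic: str) -> dict[str, str]:
    """Return one differentiated prompt per candidate style for parallel dispatch."""
    return {name: tmpl.format(topic=topic) for name, tmpl in CANDIDATE_STYLES.items()}
```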
For each candidate, iterate M times (typically 3):
┌─────────────────────────────────────────────────────┐
│                   REFINEMENT LOOP                   │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     │
│  │ GENERATE │ ──► │ EVALUATE │ ──► │  REVISE  │     │
│  │   O_n    │     │ (Judge)  │     │ O_{n+1}  │     │
│  └──────────┘     └──────────┘     └──────────┘     │
│       │                                  │          │
│       └───────────── LOOP ◄──────────────┘          │
│                                                     │
│  Terminate when:                                    │
│  - Quality > 0.9 (high confidence)                  │
│  - Improvement < 0.03 with Quality > 0.7            │
│  - Max iterations reached                           │
│                                                     │
└─────────────────────────────────────────────────────┘
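A minimal sketch of one candidate's refinement loop, assuming `generate`, `judge`, and `revise` are caller-supplied functions that wrap LLM calls (these names and signatures are illustrative, not an actual API):

```python
from typing import Callable

Generate = Callable[[str], str]             # prompt -> initial draft
Judge = Callable[[str], tuple[float, str]]  # draft -> (quality score, feedback)
Revise = Callable[[str, str], str]          # (draft, feedback) -> improved draft

def refine(prompt: str, generate: Generate, judge: Judge, revise: Revise,
           max_iters: int = 3) -> tuple[str, float]:
    """Generate -> evaluate -> revise until a termination condition fires."""
    draft = generate(prompt)
    score, feedback = judge(draft)
    for _ in range(max_iters - 1):
        if score > 0.9:                          # high-confidence quality reached
            break
        revised = revise(draft, feedback)
        new_score, feedback = judge(revised)
        improvement = new_score - score
        draft, score = revised, new_score
        if improvement < 0.03 and score > 0.7:   # diminishing returns
            break
    return draft, score
```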
The Judge evaluates each output against criteria:
| Dimension | Question |
|---|---|
| Completeness | Does it address all aspects of the query? |
| Accuracy | Are claims supported by evidence? |
| Depth | Is analysis sufficiently comprehensive? |
| Coherence | Is there logical flow and organization? |
The Judge provides:
- Score (0.0-1.0): Normalized quality measure
- Feedback: Specific critique for improvement
- Issues: Top 3 problems to address
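One plausible shape for that structured critique, shown here as a small dataclass; the field names and `passes` helper are assumptions, not OptILLM's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class JudgeResult:
    """Structured critique the Judge returns for one candidate draft (illustrative)."""
    score: float                                     # normalized quality in [0.0, 1.0]
    feedback: str                                    # specific critique to act on next pass
    issues: list[str] = field(default_factory=list)  # top 3 problems to address

    def passes(self, threshold: float = 0.9) -> bool:
        """True when the draft clears the quality threshold."""
        return self.score >= threshold
```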
After all candidates are refined:
- Identify Strengths: What unique value does each candidate provide?
- Note Agreement: Where candidates agree = high confidence
- Reconcile Conflicts: Favor higher-scored candidates
- Merge: Produce superior output combining best elements
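A sketch of score-weighted crossover: candidates are ranked by Judge score so the synthesizer favors higher-scoring drafts when they conflict. The function and prompt wording below are assumptions for illustration:

```python
def crossover_prompt(candidates: dict[str, tuple[str, float]]) -> str:
    """Build a synthesis prompt from {name: (final draft, judge score)}."""
    # Rank by score so conflicts are resolved in favor of stronger candidates.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1][1], reverse=True)
    sections = [
        f"### Candidate: {name} (score {score:.2f})\n{draft}"
        for name, (draft, score) in ranked
    ]
    return (
        "Merge the candidate analyses below into a single superior answer. "
        "Preserve each candidate's unique strengths, treat points of agreement "
        "as high-confidence, and resolve conflicts in favor of higher-scored "
        "candidates.\n\n" + "\n\n".join(sections)
    )
```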
Spawn 3 candidates simultaneously using the Task tool:
# In your prompt or skill:
## Phase 1: Multi-Candidate Initialization
Launch 3 subagents IN PARALLEL with a single message containing
multiple Task tool calls:
[Task 1: Conservative Candidate]
- subagent_type: "general-purpose"
- prompt: "Analyze [TOPIC] with a CONSERVATIVE approach:
  - Focus on established patterns and proven solutions
  - Prioritize evidence-based claims
  - Favor low-risk recommendations
  - Cite authoritative sources"

[Task 2: Balanced Candidate]
- subagent_type: "general-purpose"
- prompt: "Analyze [TOPIC] with a BALANCED approach:
  - Consider trade-offs explicitly
  - Apply best practices pragmatically
  - Acknowledge alternatives
  - Provide actionable recommendations"

[Task 3: Creative Candidate]
- subagent_type: "general-purpose"
- prompt: "Analyze [TOPIC] with a CREATIVE approach:
  - Explore novel angles and alternatives
  - Challenge assumptions
  - Consider innovative solutions
  - Think beyond conventional patterns"

For repeatable methodology, create a recursive-analyst skill that:
- Spawns parallel candidates via Task tool
- Applies Judge criteria to each result
- Synthesizes through crossover prompt
See: skills/examples/recursive-analyst/SKILL.md
From the OptILLM production implementation:
| Condition | When to Stop |
|---|---|
| Quality Threshold | Completeness > 0.9 |
| Diminishing Returns | Improvement < 0.03 AND Completeness > 0.7 |
| Max Iterations | Reached configured limit (default: 3) |
Principle: Stop iterating when additional refinement provides minimal benefit.
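Expressed as a single stop check (thresholds taken from the table above; the function itself is a sketch, not OptILLM code):

```python
def should_stop(completeness: float, improvement: float,
                iteration: int, max_iterations: int = 3) -> bool:
    """Return True when further refinement is unlikely to pay for its cost."""
    if completeness > 0.9:                           # quality threshold met
        return True
    if improvement < 0.03 and completeness > 0.7:    # diminishing returns
        return True
    return iteration >= max_iterations               # hard iteration cap
```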
The OptILLM implementation scores across 6 dimensions:
- Completeness: Addresses all query aspects
- Accuracy: Information reliability
- Depth: Analysis comprehensiveness
- Coherence: Logical organization
- Citations: Source integration quality
- Improvement: Iteration-to-iteration gains
These map to the repo's existing confidence scoring pattern.
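For example, the six dimension scores could be averaged and bucketed into HIGH/MEDIUM/LOW; the 0.8 and 0.6 cut points below are assumptions, not values taken from OptILLM or the repo:

```python
DIMENSIONS = ("completeness", "accuracy", "depth",
              "coherence", "citations", "improvement")

def confidence_label(scores: dict[str, float]) -> str:
    """Average the six dimension scores and map the result to a confidence label."""
    mean = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    if mean >= 0.8:
        return "HIGH"
    if mean >= 0.6:
        return "MEDIUM"
    return "LOW"
```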
| Pattern | Integration |
|---|---|
| Subagent Orchestration | Extends with Diversity Sampling pattern |
| Confidence Scoring | Judge output maps to HIGH/MEDIUM/LOW |
| UltraThink Analyst | Can trigger recursive-analyst for deep analysis |
| Hypothesis Validator | Crossover synthesis can validate competing claims |
User: "Analyze whether we should migrate from PostgreSQL to DuckDB for analytics"
Recursive Evolution Process:
1. Spawn 3 Candidates (parallel):
   - Conservative: Focus on migration risks, established patterns
   - Balanced: Compare trade-offs, total cost analysis
   - Creative: Explore hybrid architectures, novel approaches
2. Refine Each (3 iterations):
   - Judge evaluates completeness, accuracy, depth
   - Candidates incorporate feedback
   - Track scores for crossover weighting
3. Crossover Synthesis:
   - Conservative contributed: Risk matrix, rollback plan
   - Balanced contributed: Decision framework, timeline
   - Creative contributed: Hybrid Postgres+DuckDB architecture
   - Merged: Comprehensive recommendation with risk-aware implementation plan
DON'T:
- Run self-evolution for simple tasks (overhead exceeds benefit)
- Skip the Judge step (refinement without feedback is just repetition)
- Use identical prompts for candidates (defeats diversity purpose)
- Ignore crossover synthesis (parallel work wasted without merge)
DO:
- Reserve for complex, high-stakes analysis
- Ensure Judge criteria align with task goals
- Differentiate candidate approaches meaningfully
- Weight crossover by candidate scores
Self-Evolution uses significantly more tokens:
| Approach | Token Multiplier |
|---|---|
| Single-pass | 1x |
| 3 candidates (no refinement) | ~3x |
| 3 candidates × 3 iterations | ~9x |
| + Judge evaluations | ~12x |
| + Crossover synthesis | ~13x |
Recommendation: Use for quality-critical tasks where the improvement justifies the cost.
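A back-of-the-envelope estimator for these multipliers; the per-call cost ratios are rough assumptions for budgeting, not measured values:

```python
def token_multiplier(candidates: int = 3, iterations: int = 3,
                     judge_ratio: float = 0.3, crossover_ratio: float = 1.0) -> float:
    """Rough token cost relative to a single-pass answer."""
    drafts = candidates * iterations                  # every iteration produces a draft (~9x)
    judging = candidates * iterations * judge_ratio   # judge calls assumed ~0.3x of a draft
    return drafts + judging + crossover_ratio         # plus one synthesis pass

# Defaults give 9 + 2.7 + 1.0 = 12.7, in line with the ~13x figure above.
```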
- Subagent Orchestration - Parallel execution patterns
- Confidence Scoring - Quality assessment framework
- Long-Running Agent - External artifacts for context bridging
- UltraThink Analyst - Deep analysis methodology
- Google TTD-DR Paper - Test-Time Diffusion Deep Researcher
- OptILLM deep_research - Production implementation
- AI-Engineering-101 - Educational implementation
- Google Self-Evolution Algorithm Video - Saurav Prateek tutorial
Last updated: January 2026