[Bug] MIPROv2 ignores metric_threshold for unshuffled few-shot bootstrap (seed=-1) #9308

@arutamonofu

What happened?

When MIPROv2 is used with the metric_threshold parameter, the threshold is not applied to one of the bootstrap candidate sets: the "unshuffled few-shot" case with seed == -1. As a result, examples whose metric falls below the threshold are incorrectly included as "full traces" during bootstrapping.

Steps to reproduce

  1. Configure MIPROv2 with metric_threshold=1.0
  2. Run optimization with training data whose examples all score below 1.0 (e.g., 0.8)
  3. Observe that bootstrapping still reports full traces even though no example meets the threshold

Example code:

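from dspy.teleprompt import MIPROv2

# your_metric, teacher_lm, student_lm, base_agent, trainset and valset are placeholders.
teleprompter = MIPROv2(
    metric=your_metric,
    prompt_model=teacher_lm,
    task_model=student_lm,
    auto=None,  # assumed necessary because num_candidates and num_trials are set explicitly
    metric_threshold=1.0,  # ← Set threshold to 1.0
    num_candidates=10,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    seed=42,
)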

optimized_agent = teleprompter.compile(
    base_agent,
    trainset=trainset,
    valset=valset,
    num_trials=10,
)

Expected output:

Bootstrapped 0 full traces after X examples...

Actual output (when examples have metric=0.8):

Bootstrapped 2 full traces after 2 examples...
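
For a deterministic reproduction, one option is a metric that scores every prediction 0.8, so that nothing can reach the 1.0 threshold. The sketch below is a hypothetical stand-in for your_metric in the example code; the name constant_metric is not part of DSPy.

```python
def constant_metric(example, prediction, trace=None):
    """Hypothetical metric: score every prediction 0.8, below the 1.0 threshold."""
    return 0.8
```

Passing it as metric=constant_metric should make any bootstrapped full trace a sign that the threshold was skipped.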

Root cause

In dspy/teleprompt/utils.py, the function create_n_fewshot_demo_sets() creates multiple candidate sets with different seeds:

  • seed = -3: zero-shot (no bootstrap)
  • seed = -2: labeled examples only
  • seed = -1: unshuffled bootstrap ← BUG: metric_threshold not passed
  • seed >= 0: shuffled bootstrap (metric_threshold is passed)

The bug is on lines ~382-391:

elif seed == -1:
    # unshuffled few-shot
    program = BootstrapFewShot(
        metric=metric,
        max_errors=max_errors,
        max_bootstrapped_demos=max_bootstrapped_demos,
        max_labeled_demos=max_labeled_demos,
        teacher_settings=teacher_settings,
        max_rounds=max_rounds,
        # ❌ metric_threshold is NOT passed here!
    )
    program2 = program.compile(student, teacher=teacher, trainset=trainset_copy)
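
Why a missing threshold lets a 0.8 example through can be sketched without DSPy. The model below assumes that, when no threshold is set, the bootstrapper treats the raw metric value as a truth value, so any non-zero score counts as success. The function accepts is a hypothetical simplification, not the library's code.

```python
def accepts(metric_val, metric_threshold=None):
    """Simplified model of the assumed acceptance rule for a bootstrapped trace."""
    if metric_threshold is not None:
        return metric_val >= metric_threshold
    # No threshold: the raw metric value is used as a truth value.
    return bool(metric_val)


scores = [0.8, 0.8]
without_threshold = sum(accepts(s) for s in scores)
with_threshold = sum(accepts(s, metric_threshold=1.0) for s in scores)
print(without_threshold, with_threshold)  # 2 0
```

Under that assumption, the two accepted traces match the "Bootstrapped 2 full traces after 2 examples" output reported above.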

Compare this to the correct implementation for seed >= 0 (lines ~394-403):

else:
    # shuffled few-shot
    rng.shuffle(trainset_copy)
    size = rng.randint(min_num_samples, max_bootstrapped_demos)

    teleprompter = BootstrapFewShot(
        metric=metric,
        max_errors=max_errors,
        metric_threshold=metric_threshold,  # ✓ Correctly passed
        max_bootstrapped_demos=size,
        max_labeled_demos=max_labeled_demos,
        teacher_settings=teacher_settings,
        max_rounds=max_rounds,
    )

Expected behavior

The metric_threshold parameter should be passed to all BootstrapFewShot instances, including the unshuffled case (seed == -1).

Suggested fix

Add metric_threshold=metric_threshold to the BootstrapFewShot initialization in the seed == -1 case:

elif seed == -1:
    # unshuffled few-shot
    program = BootstrapFewShot(
        metric=metric,
        max_errors=max_errors,
        metric_threshold=metric_threshold,  # ← ADD THIS LINE
        max_bootstrapped_demos=max_bootstrapped_demos,
        max_labeled_demos=max_labeled_demos,
        teacher_settings=teacher_settings,
        max_rounds=max_rounds,
    )
    program2 = program.compile(student, teacher=teacher, trainset=trainset_copy)
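
A regression check for this fix could assert that every seed which builds a bootstrapper forwards metric_threshold. The sketch below models the seed dispatch in plain Python; bootstrap_kwargs and seeds_missing_threshold are hypothetical helpers, and the patched flag stands in for applying the fix. It is not the real utils.py code.

```python
def bootstrap_kwargs(seed, metric_threshold, patched):
    """Return the kwargs a bootstrapper would receive for this seed, or None."""
    if seed in (-3, -2):
        return None  # zero-shot and labeled-only sets do not bootstrap
    kwargs = {"max_bootstrapped_demos": 4, "max_labeled_demos": 4}
    if seed >= 0 or patched:
        kwargs["metric_threshold"] = metric_threshold
    return kwargs


def seeds_missing_threshold(seeds, patched):
    """List the seeds whose bootstrapper is built without metric_threshold."""
    missing = []
    for seed in seeds:
        kwargs = bootstrap_kwargs(seed, 1.0, patched)
        if kwargs is not None and "metric_threshold" not in kwargs:
            missing.append(seed)
    return missing


seeds = [-3, -2, -1, 0, 1]
print(seeds_missing_threshold(seeds, patched=False))  # [-1]
print(seeds_missing_threshold(seeds, patched=True))   # []
```

A real test would probably replace BootstrapFewShot with a recording stub and inspect the constructor arguments in the same way.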

DSPy version

3.1.3
