
Making RLM production-ready: multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1 recursion, and GEPA compatibility #9289

@rawwerks


Summary

RLM is powerful but currently limited to text-only inputs, has no cost/time guardrails, can't route to different sub-models, is restricted to the Deno sandbox, and crashes GEPA when it hits timeouts or cost limits. This PR set addresses all five gaps.

These changes came out of trying to use RLM + GEPA together on real multimodal tasks. Each fix addresses a failure mode I hit in practice.

Branch: rawwerks/dspy:feat/rlm-multimodal-media-support

The five changes

1. Multimodal media support — Audio/Image inputs for RLM

Problem: RLM can only work with text. If you have an Audio or Image input field, the sandbox just sees a descriptor string — the agent can't actually perceive the content.

Solution: Auto-detect Audio/Image input fields and expose llm_query_with_media(prompt, *media_var_names) in the sandbox. Media objects are held in a registry outside the sandbox; when the agent calls llm_query_with_media("describe this", "image"), the media content is attached as multimodal content parts to the sub-LLM call.

```python
class DescribeImage(dspy.Signature):
    """Describe the contents of the image."""

    image: dspy.Image = dspy.InputField()
    description: str = dspy.OutputField()

rlm = dspy.RLM(DescribeImage)
result = rlm(image=dspy.Image.from_url("https://example.com/photo.jpg"))
```

The agent's sandbox code can then do:

```python
text = llm_query_with_media("Describe what you see in detail", "image")
print(text)
```

2. Multi-model sub-call routing via sub_lms

Problem: Some tasks benefit from using different models for different sub-calls — e.g., a cheap model for initial exploration and a stronger model for verification. RLM only supports a single sub_lm.

Solution: New sub_lms parameter accepts a dict of named LM instances. Sandbox code selects a model with llm_query(prompt, model="name").

```python
rlm = dspy.RLM(
    DescribeImage,
    sub_lm=lm_fast,              # default for most calls
    sub_lms={"strong": lm_pro},  # agent can opt in
)
```

Sandbox code:

```python
draft = llm_query("quick summary of the data")                 # uses fast default
verified = llm_query("verify this carefully", model="strong")  # uses pro
```

3. LocalInterpreter — unsandboxed host-process execution

Problem: The Deno/Pyodide sandbox can't access host Python packages like PIL, soundfile, numpy, scipy, etc. For tasks that combine LLM perception with direct data processing, the sandbox is too restrictive.

Solution: LocalInterpreter implements the CodeInterpreter protocol but executes via exec() in the host process. State persists across iterations, tools are injected into the namespace, and SUBMIT() works identically.

```python
from dspy.primitives.local_interpreter import LocalInterpreter

rlm = dspy.RLM(MySignature, interpreter=LocalInterpreter())
```

Now the agent can import numpy, import PIL, etc. in its generated code.

Security note: This is intentionally unsandboxed. The docstring and class-level documentation make this explicit. It's for local experiments and trusted workloads where the sandbox is the bottleneck.
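The core mechanism can be sketched in a few lines. This is a toy illustration with hypothetical names (`MiniLocalInterpreter`, `_Submit`), not the actual class — the real LocalInterpreter additionally implements the full CodeInterpreter protocol, error handling, and async support:

```python
class _Submit(Exception):
    """Raised internally when agent code calls SUBMIT() with final outputs."""

    def __init__(self, outputs):
        self.outputs = outputs


class MiniLocalInterpreter:
    """Toy exec()-based interpreter: one persistent namespace, tools injected."""

    def __init__(self, tools=None):
        self.namespace = dict(tools or {})       # tool callables visible to agent code
        self.namespace["SUBMIT"] = self._submit  # SUBMIT ends the run with outputs

    def _submit(self, **outputs):
        raise _Submit(outputs)

    def execute(self, code):
        try:
            exec(code, self.namespace)           # runs in the host process, unsandboxed
        except _Submit as s:
            return s.outputs
        return None                              # no SUBMIT yet; state is kept


interp = MiniLocalInterpreter(tools={"llm_query": lambda p: f"(answer to: {p})"})
interp.execute("import math\nx = math.factorial(5)")  # host packages importable
result = interp.execute("SUBMIT(answer=x + 1)")
# → {'answer': 121}
```

State persisting in a single dict is what makes the multi-iteration loop work: each code block the agent writes sees everything earlier blocks defined.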

4. Budget awareness — budget(), max_time, max_cost

Problem: RLM agents have no way to know how much budget remains. They burn through iterations and LLM calls blindly, then either hit hard limits or run up unbounded costs. In optimization loops over many examples, a single runaway example can blow your API budget.

Solution: Three new parameters + a budget() tool:

  • max_time — wall-clock seconds per forward() call. Exceeded → extract fallback (not crash).
  • max_cost — dollar cost per forward() call, tracked via litellm's per-call cost reporting. Also supports BYOK providers where response_cost is 0 but usage.cost_details.upstream_inference_cost carries the real charge.
  • budget() — callable in the sandbox. Returns a human-readable summary of remaining iterations, LLM calls, time, and cost. Warns when any resource drops below 20%.

```python
rlm = dspy.RLM(
    MySignature,
    max_time=120,      # 2 minutes per example
    max_cost=0.50,     # $0.50 per example
    max_llm_calls=25,
)
```

Sandbox code:

```python
print(budget())
# → "Iterations: 8/12 remaining | LLM calls: 18/25 remaining | Time: 52.3s/120.0s remaining | Cost: $0.2100/$0.5000 remaining"
```

When time or cost is exceeded, forward() triggers the extract fallback (attempts to salvage a result from current state) rather than raising an exception. This is critical for optimization — a partial result with trace data is far more useful than a crash.
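The gating amounts to checking limits before each iteration and salvaging instead of raising. A minimal sketch with hypothetical helper names (`run_with_budget`, `step`, `extract_fallback` are illustrative, not the actual forward() internals):

```python
import time


def run_with_budget(step, extract_fallback, max_iters=10, max_time=None, max_cost=None):
    """Run `step` until it returns a result or a budget is exhausted.

    `step` returns (result_or_None, cost_of_step); when a limit is hit,
    `extract_fallback` salvages a partial answer instead of raising.
    """
    start, spent = time.monotonic(), 0.0
    for _ in range(max_iters):
        if max_time is not None and time.monotonic() - start >= max_time:
            return extract_fallback("time limit reached")
        if max_cost is not None and spent >= max_cost:
            return extract_fallback("cost limit reached")
        result, cost = step()
        spent += cost
        if result is not None:
            return result
    return extract_fallback("iteration limit reached")


# A runaway step that never finishes but spends $0.30 per call:
out = run_with_budget(
    step=lambda: (None, 0.30),
    extract_fallback=lambda reason: {"partial": True, "reason": reason},
    max_cost=0.50,
)
# → {'partial': True, 'reason': 'cost limit reached'}
```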

5. GEPA bootstrap_trace resilience for RLM failures

Problem: When RLM hits a timeout, interpreter crash, or cost overrun, it raises an exception that bootstrap_trace_data doesn't catch. The existing except only handles parse failures (AdapterParseError, ValueError). Any other exception type kills the training example entirely — the trace is lost and GEPA can't reflect on what the agent tried before failing.

Solution: Add a broad except Exception handler that preserves the partial trace:

```python
except Exception as e:
    trace = dspy.settings.trace.copy()
    failed_pred = FailedPrediction(completion_text=str(e))
    return failed_pred, trace
```

This is a 6-line change but it's the difference between GEPA learning from failures vs. silently losing all trace data from budget-exceeded examples.

How they work together

```python
import dspy
from dspy.primitives.local_interpreter import LocalInterpreter

lm_fast = dspy.LM("openrouter/google/gemini-3-flash-preview")
lm_pro = dspy.LM("openrouter/google/gemini-3-pro-preview")
dspy.configure(lm=lm_fast)

rlm = dspy.RLM(
    MyMultimodalSignature,
    sub_lms={"pro": lm_pro},         # multi-model routing
    max_time=120, max_cost=0.50,     # budget controls
    max_llm_calls=25,
    interpreter=LocalInterpreter(),  # full Python access
)

# GEPA can optimize this without crashing on budget-exceeded examples:
optimizer = dspy.GEPA(metric=my_metric, max_rounds=5)
optimized = optimizer.compile(rlm, trainset=examples)
```

The RLM agent can:

  1. Call budget() to plan its approach
  2. Use llm_query_with_media() for multimodal perception
  3. Use any installed Python package via LocalInterpreter
  4. Route to stronger models for high-stakes sub-calls
  5. Gracefully degrade via extract fallback when budget runs out
  6. Provide useful trace data to GEPA even on failures

Testing strategy

All new code is covered by offline tests (no API keys, no Deno required). The test suite runs in <1 second.

Unit tests verify each feature in isolation via MockInterpreter and mock LMs:

  • Media field detection, registry construction, reserved name validation
  • sub_lms routing: named model selection, fallback to default, unknown model errors
  • budget() output: iteration/call/time/cost reporting, low-resource warnings
  • max_time and max_cost triggering extract fallback (not crash)
  • BYOK cost tracking: usage.cost_details.upstream_inference_cost aggregation

Integration tests verify the features work together through RLM's forward() loop:

  • RLM + LocalInterpreter end-to-end: forward, state persistence across iterations, tool access (llm_query, budget), stdlib imports, error recovery, async aforward
  • llm_query_with_media content construction: verifies multimodal content parts (text + media) are correctly assembled and sent to the LM, with single/multiple media objects and model routing
  • max_cost mid-run fallback: cost injected into LM history after first iteration triggers extract fallback on the next iteration check
  • Async budget/time/cost: all three budget features work correctly through aforward()

bootstrap_trace resilience tests verify the GEPA compatibility fix:

  • RuntimeError from forward() is captured as FailedPrediction, not propagated
  • Partial trace is preserved when cost overrun kills execution after some work was done
  • KeyboardInterrupt (BaseException) is NOT caught — confirms the broad except Exception doesn't swallow signals
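The distinction that last test pins down — catch Exception, never BaseException — can be illustrated standalone. This is a simplified sketch with a hypothetical `capture_failures` helper, not the actual bootstrap_trace code:

```python
def capture_failures(fn):
    """Run fn; turn any Exception into a failure record, but let
    BaseExceptions like KeyboardInterrupt propagate to the caller."""
    try:
        return {"ok": fn()}
    except Exception as e:           # timeouts, cost overruns, interpreter crashes
        return {"failed": str(e)}    # partial info survives for reflection


def cost_overrun():
    raise RuntimeError("max_cost exceeded")


def interrupt():
    raise KeyboardInterrupt


print(capture_failures(cost_overrun))  # → {'failed': 'max_cost exceeded'}

propagated = False
try:
    capture_failures(interrupt)
except KeyboardInterrupt:
    propagated = True                  # Ctrl-C is not swallowed
print(propagated)                      # → True
```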

Total: 20 new tests across 6 test classes, plus 33 existing unit tests for the LocalInterpreter and media serialization modules.

Implementation notes

  • All changes are additive — no breaking changes to existing RLM behavior
  • Default values preserve current behavior (max_time=None, max_cost=None, sub_lms={}, default PythonInterpreter)
  • llm_query_with_media only appears in the sandbox prompt when media input fields are detected
  • All files pass ruff check and gitleaks detect
