
Making RLM production-ready: multimodal media, budget controls, multi-model routing, LocalInterpreter, depth>1 recursion, and GEPA compatibility #9289

@rawwerks


Summary

RLM is powerful but currently limited to text-only inputs, has no cost/time guardrails, can't route to different sub-models, is restricted to the Deno sandbox, and crashes GEPA when it hits timeouts or cost limits. This PR set addresses all five gaps.

These changes came out of trying to use RLM + GEPA together on real multimodal tasks. Each fix addresses a failure mode I hit in practice.

Branch: rawwerks/dspy:feat/rlm-multimodal-media-support

The five changes

1. Multimodal media support — Audio/Image inputs for RLM

Problem: RLM can only work with text. If you have an Audio or Image input field, the sandbox just sees a descriptor string — the agent can't actually perceive the content.

Solution: Auto-detect Audio/Image input fields and expose llm_query_with_media(prompt, *media_var_names) in the sandbox. Media objects are held in a registry outside the sandbox; when the agent calls llm_query_with_media("describe this", "image"), the media content is attached as multimodal content parts to the sub-LLM call.

```python
class DescribeImage(dspy.Signature):
    """Describe the contents of the image."""

    image: dspy.Image = dspy.InputField()
    description: str = dspy.OutputField()

rlm = dspy.RLM(DescribeImage)
result = rlm(image=dspy.Image.from_url("https://example.com/photo.jpg"))
```

The agent's sandbox code can then do:

```python
text = llm_query_with_media("Describe what you see in detail", "image")
print(text)
```

2. Multi-model sub-call routing via sub_lms

Problem: Some tasks benefit from using different models for different sub-calls — e.g., a cheap model for initial exploration and a stronger model for verification. RLM only supports a single sub_lm.

Solution: New sub_lms parameter accepts a dict of named LM instances. Sandbox code selects a model with llm_query(prompt, model="name").

```python
rlm = dspy.RLM(
    DescribeImage,
    sub_lm=lm_fast,              # default for most calls
    sub_lms={"strong": lm_pro},  # agent can opt in
)
```

Sandbox code:

```python
draft = llm_query("quick summary of the data")                 # uses fast default
verified = llm_query("verify this carefully", model="strong")  # uses pro
```

3. LocalInterpreter — unsandboxed host-process execution

Problem: The Deno/Pyodide sandbox can't access host Python packages like PIL, soundfile, numpy, scipy, etc. For tasks that combine LLM perception with direct data processing, the sandbox is too restrictive.

Solution: LocalInterpreter implements the CodeInterpreter protocol but executes via exec() in the host process. State persists across iterations, tools are injected into the namespace, and SUBMIT() works identically.

```python
from dspy.primitives.local_interpreter import LocalInterpreter

rlm = dspy.RLM(MySignature, interpreter=LocalInterpreter())
```

Now the agent can import numpy, import PIL, etc. in its generated code.

Security note: This is intentionally unsandboxed. The docstring and class-level documentation make this explicit. It's for local experiments and trusted workloads where the sandbox is the bottleneck.
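The core mechanism can be sketched in a few lines. This is a toy illustration with hypothetical names (`MiniLocalInterpreter`, `_Submit`), not the actual class — the real LocalInterpreter additionally implements the full CodeInterpreter protocol, error handling, and async support:

```python
class _Submit(Exception):
    """Raised internally when agent code calls SUBMIT() with final outputs."""

    def __init__(self, outputs):
        self.outputs = outputs


class MiniLocalInterpreter:
    """Toy exec()-based interpreter: one persistent namespace, tools injected."""

    def __init__(self, tools=None):
        self.namespace = dict(tools or {})       # tool callables visible to agent code
        self.namespace["SUBMIT"] = self._submit  # SUBMIT ends the run with outputs

    def _submit(self, **outputs):
        raise _Submit(outputs)

    def execute(self, code):
        try:
            exec(code, self.namespace)           # runs in the host process, unsandboxed
        except _Submit as s:
            return s.outputs
        return None                              # no SUBMIT yet; state is kept


interp = MiniLocalInterpreter(tools={"llm_query": lambda p: f"(answer to: {p})"})
interp.execute("import math\nx = math.factorial(5)")  # host packages importable
result = interp.execute("SUBMIT(answer=x + 1)")
# → {'answer': 121}
```

State persisting in a single dict is what makes the multi-iteration loop work: each code block the agent writes sees everything earlier blocks defined.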

4. Budget awareness — budget(), max_time, max_cost

Problem: RLM agents have no way to know how much budget remains. They burn through iterations and LLM calls blindly, then either hit hard limits or run up unbounded costs. In optimization loops over many examples, a single runaway example can blow your API budget.

Solution: Three new parameters + a budget() tool:

  • max_time — wall-clock seconds per forward() call. Exceeded → extract fallback (not crash).
  • max_cost — dollar cost per forward() call, tracked via litellm's per-call cost reporting. Also supports BYOK providers where response_cost is 0 but usage.cost_details.upstream_inference_cost carries the real charge.
  • budget() — callable in the sandbox. Returns a human-readable summary of remaining iterations, LLM calls, time, and cost. Warns when any resource drops below 20%.

```python
rlm = dspy.RLM(
    MySignature,
    max_time=120,      # 2 minutes per example
    max_cost=0.50,     # $0.50 per example
    max_llm_calls=25,
)
```

Sandbox code:

```python
print(budget())
# → "Iterations: 8/12 remaining | LLM calls: 18/25 remaining | Time: 52.3s/120.0s remaining | Cost: $0.2100/$0.5000 remaining"
```

When time or cost is exceeded, forward() triggers the extract fallback (attempts to salvage a result from current state) rather than raising an exception. This is critical for optimization — a partial result with trace data is far more useful than a crash.
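The gating amounts to checking limits before each iteration and salvaging instead of raising. A minimal sketch with hypothetical helper names (`run_with_budget`, `step`, `extract_fallback` are illustrative, not the actual forward() internals):

```python
import time


def run_with_budget(step, extract_fallback, max_iters=10, max_time=None, max_cost=None):
    """Run `step` until it returns a result or a budget is exhausted.

    `step` returns (result_or_None, cost_of_step); when a limit is hit,
    `extract_fallback` salvages a partial answer instead of raising.
    """
    start, spent = time.monotonic(), 0.0
    for _ in range(max_iters):
        if max_time is not None and time.monotonic() - start >= max_time:
            return extract_fallback("time limit reached")
        if max_cost is not None and spent >= max_cost:
            return extract_fallback("cost limit reached")
        result, cost = step()
        spent += cost
        if result is not None:
            return result
    return extract_fallback("iteration limit reached")


# A runaway step that never finishes but spends $0.30 per call:
out = run_with_budget(
    step=lambda: (None, 0.30),
    extract_fallback=lambda reason: {"partial": True, "reason": reason},
    max_cost=0.50,
)
# → {'partial': True, 'reason': 'cost limit reached'}
```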

5. GEPA bootstrap_trace resilience for RLM failures

Problem: When RLM hits a timeout, interpreter crash, or cost overrun, it raises an exception that bootstrap_trace_data doesn't catch. The existing except only handles parse failures (AdapterParseError, ValueError). Any other exception type kills the training example entirely — the trace is lost and GEPA can't reflect on what the agent tried before failing.

Solution: Add a broad except Exception handler that preserves the partial trace:

```python
except Exception as e:
    trace = dspy.settings.trace.copy()
    failed_pred = FailedPrediction(completion_text=str(e))
    return failed_pred, trace
```

This is a 6-line change but it's the difference between GEPA learning from failures vs. silently losing all trace data from budget-exceeded examples.

How they work together

```python
import dspy
from dspy.primitives.local_interpreter import LocalInterpreter

lm_fast = dspy.LM("openrouter/google/gemini-3-flash-preview")
lm_pro = dspy.LM("openrouter/google/gemini-3-pro-preview")
dspy.configure(lm=lm_fast)

rlm = dspy.RLM(
    MyMultimodalSignature,
    sub_lms={"pro": lm_pro},         # multi-model routing
    max_time=120, max_cost=0.50,     # budget controls
    max_llm_calls=25,
    interpreter=LocalInterpreter(),  # full Python access
)

# GEPA can optimize this without crashing on budget-exceeded examples:
optimizer = dspy.GEPA(metric=my_metric, max_rounds=5)
optimized = optimizer.compile(rlm, trainset=examples)
```

The RLM agent can:

  1. Call budget() to plan its approach
  2. Use llm_query_with_media() for multimodal perception
  3. Use any installed Python package via LocalInterpreter
  4. Route to stronger models for high-stakes sub-calls
  5. Gracefully degrade via extract fallback when budget runs out
  6. Provide useful trace data to GEPA even on failures

Testing strategy

All new code is covered by offline tests (no API keys, no Deno required). The test suite runs in <1 second.

Unit tests verify each feature in isolation via MockInterpreter and mock LMs:

  • Media field detection, registry construction, reserved name validation
  • sub_lms routing: named model selection, fallback to default, unknown model errors
  • budget() output: iteration/call/time/cost reporting, low-resource warnings
  • max_time and max_cost triggering extract fallback (not crash)
  • BYOK cost tracking: usage.cost_details.upstream_inference_cost aggregation

Integration tests verify the features work together through RLM's forward() loop:

  • RLM + LocalInterpreter end-to-end: forward, state persistence across iterations, tool access (llm_query, budget), stdlib imports, error recovery, async aforward
  • llm_query_with_media content construction: verifies multimodal content parts (text + media) are correctly assembled and sent to the LM, with single/multiple media objects and model routing
  • max_cost mid-run fallback: cost injected into LM history after first iteration triggers extract fallback on the next iteration check
  • Async budget/time/cost: all three budget features work correctly through aforward()

bootstrap_trace resilience tests verify the GEPA compatibility fix:

  • RuntimeError from forward() is captured as FailedPrediction, not propagated
  • Partial trace is preserved when cost overrun kills execution after some work was done
  • KeyboardInterrupt (BaseException) is NOT caught — confirms the broad except Exception doesn't swallow signals
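The distinction that last test pins down — catch Exception, never BaseException — can be illustrated standalone. This is a simplified sketch with a hypothetical `capture_failures` helper, not the actual bootstrap_trace code:

```python
def capture_failures(fn):
    """Run fn; turn any Exception into a failure record, but let
    BaseExceptions like KeyboardInterrupt propagate to the caller."""
    try:
        return {"ok": fn()}
    except Exception as e:           # timeouts, cost overruns, interpreter crashes
        return {"failed": str(e)}    # partial info survives for reflection


def cost_overrun():
    raise RuntimeError("max_cost exceeded")


def interrupt():
    raise KeyboardInterrupt


print(capture_failures(cost_overrun))  # → {'failed': 'max_cost exceeded'}

propagated = False
try:
    capture_failures(interrupt)
except KeyboardInterrupt:
    propagated = True                  # Ctrl-C is not swallowed
print(propagated)                      # → True
```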

Total: 20 new tests across 6 test classes, plus 33 existing unit tests for the LocalInterpreter and media serialization modules.

Implementation notes

  • All changes are additive — no breaking changes to existing RLM behavior
  • Default values preserve current behavior (max_time=None, max_cost=None, sub_lms={}, default PythonInterpreter)
  • llm_query_with_media only appears in the sandbox prompt when media input fields are detected
  • All files pass ruff check and gitleaks detect
