# Making RLM production-ready: multimodal media, budget controls, multi-model routing, LocalInterpreter, and GEPA compatibility

## Summary
RLM is powerful but currently limited to text-only inputs, has no cost/time guardrails, can't route to different sub-models, is restricted to the Deno sandbox, and crashes GEPA when it hits timeouts or cost limits. This PR set addresses all five gaps.
These changes came out of trying to use RLM + GEPA together on real multimodal tasks. Each fix addresses a failure mode I hit in practice.
Branch: `rawwerks/dspy:feat/rlm-multimodal-media-support`

## The five changes

### 1. Multimodal media support — Audio/Image inputs for RLM
Problem: RLM can only work with text. If you have an Audio or Image input field, the sandbox just sees a descriptor string — the agent can't actually perceive the content.
Solution: Auto-detect Audio/Image input fields and expose llm_query_with_media(prompt, *media_var_names) in the sandbox. Media objects are held in a registry outside the sandbox; when the agent calls llm_query_with_media("describe this", "image"), the media content is attached as multimodal content parts to the sub-LLM call.
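A minimal sketch of the registry pattern described above — media objects stay outside the sandbox, and a sandbox-visible helper resolves variable names into multimodal content parts for the sub-LM call. All names here (`MediaRegistry`, `make_llm_query_with_media`, the part dict shape) are illustrative, not the actual dspy internals:

```python
class MediaRegistry:
    """Holds media objects outside the sandbox, keyed by input-field name."""

    def __init__(self):
        self._media = {}

    def register(self, name, media_obj):
        self._media[name] = media_obj

    def get(self, name):
        if name not in self._media:
            raise KeyError(f"No media input named {name!r}")
        return self._media[name]


def make_llm_query_with_media(registry, sub_lm_call):
    """Build the sandbox tool; sub_lm_call(parts) sends one sub-LM request."""

    def llm_query_with_media(prompt, *media_var_names):
        # The text prompt becomes the first content part; each named media
        # object is resolved from the registry and appended as its own part.
        parts = [{"type": "text", "text": prompt}]
        for name in media_var_names:
            parts.append({"type": "media", "content": registry.get(name)})
        return sub_lm_call(parts)

    return llm_query_with_media
```

The sandbox only ever sees variable names, never raw media bytes, which is what keeps the media out of the interpreter's reach.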
```python
class DescribeImage(dspy.Signature):
    """Describe the contents of the image."""

    image: dspy.Image = dspy.InputField()
    description: str = dspy.OutputField()

rlm = dspy.RLM(DescribeImage)
result = rlm(image=dspy.Image.from_url("https://example.com/photo.jpg"))
```

The agent's sandbox code can then do:

```python
text = llm_query_with_media("Describe what you see in detail", "image")
print(text)
```

### 2. Multi-model sub-call routing via `sub_lms`
Problem: Some tasks benefit from using different models for different sub-calls — e.g., a cheap model for initial exploration and a stronger model for verification. RLM only supports a single sub_lm.
Solution: New sub_lms parameter accepts a dict of named LM instances. Sandbox code selects a model with llm_query(prompt, model="name").
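The name-to-model resolution inside `llm_query` can be sketched roughly like this (a hypothetical helper, not the actual dspy implementation):

```python
def resolve_sub_lm(sub_lm, sub_lms, model=None):
    """Pick the LM for one sub-call: the default, or a named entry from sub_lms."""
    if model is None:
        return sub_lm  # default path: most calls use the cheap model
    if model not in sub_lms:
        known = ", ".join(sorted(sub_lms)) or "(none)"
        raise ValueError(f"Unknown sub-model {model!r}; known models: {known}")
    return sub_lms[model]
```

Failing fast on an unknown name surfaces typos in the agent's generated code instead of silently falling back to the default model.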
```python
rlm = dspy.RLM(
    DescribeImage,
    sub_lm=lm_fast,              # default for most calls
    sub_lms={"strong": lm_pro},  # agent can opt in
)
```

Sandbox code:

```python
draft = llm_query("quick summary of the data")                 # uses fast default
verified = llm_query("verify this carefully", model="strong")  # uses pro
```

### 3. LocalInterpreter — unsandboxed host-process execution
Problem: The Deno/Pyodide sandbox can't access host Python packages like PIL, soundfile, numpy, scipy, etc. For tasks that combine LLM perception with direct data processing, the sandbox is too restrictive.
Solution: LocalInterpreter implements the CodeInterpreter protocol but executes via exec() in the host process. State persists across iterations, tools are injected into the namespace, and SUBMIT() works identically.
```python
from dspy.primitives.local_interpreter import LocalInterpreter

rlm = dspy.RLM(MySignature, interpreter=LocalInterpreter())
```

Now the agent can `import numpy`, `import PIL`, etc. in its generated code.
Security note: This is intentionally unsandboxed. The docstring and class-level documentation make this explicit. It's for local experiments and trusted workloads where the sandbox is the bottleneck.
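The core mechanics — a persistent `exec()` namespace, injected tools, and a `SUBMIT()` that captures the final answer — can be sketched in miniature (all names here are illustrative, not the actual `LocalInterpreter` source):

```python
import contextlib
import io


class MiniLocalInterpreter:
    """Toy version of the host-process interpreter idea. Unsandboxed on purpose."""

    def __init__(self, tools=None):
        self.namespace = {}              # persists across execute() calls
        self.namespace.update(tools or {})
        self._submitted = None
        self.namespace["SUBMIT"] = self._submit  # final-answer hook

    def _submit(self, **fields):
        self._submitted = fields

    def execute(self, code):
        """Run agent code in the host process, capturing stdout."""
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.namespace)   # no sandbox: full host Python access
        return buf.getvalue(), self._submitted
```

Because the namespace is reused, variables defined in one iteration are visible in the next — the same state-persistence property the sandboxed interpreter provides.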
### 4. Budget awareness — `budget()`, `max_time`, `max_cost`
Problem: RLM agents have no way to know how much budget remains. They burn through iterations and LLM calls blindly, then either hit hard limits or run up unbounded costs. In optimization loops over many examples, a single runaway example can blow your API budget.
Solution: Three new parameters plus a `budget()` tool:

- `max_time` — wall-clock seconds per `forward()` call. Exceeded → extract fallback (not crash).
- `max_cost` — dollar cost per `forward()` call, tracked via litellm's per-call cost reporting. Also supports BYOK providers where `response_cost` is 0 but `usage.cost_details.upstream_inference_cost` carries the real charge.
- `budget()` — callable in the sandbox. Returns a human-readable summary of remaining iterations, LLM calls, time, and cost. Warns when any resource drops below 20%.
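The BYOK fallback path can be sketched as a small accessor. The field names follow the description above; the record shape (a plain dict per call) is an assumption for the sketch:

```python
def call_cost(usage_entry):
    """Extract the dollar cost of one LM call from a per-call usage record.

    Prefers the provider-reported response_cost; when that is 0 (BYOK
    providers), falls back to the upstream inference cost nested under
    usage.cost_details.
    """
    cost = usage_entry.get("response_cost") or 0.0
    if not cost:
        cost = (
            usage_entry.get("usage", {})
            .get("cost_details", {})
            .get("upstream_inference_cost")
            or 0.0
        )
    return cost
```

Summing `call_cost` over the call history gives the running total that `max_cost` is checked against.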
```python
rlm = dspy.RLM(
    MySignature,
    max_time=120,    # 2 minutes per example
    max_cost=0.50,   # $0.50 per example
    max_llm_calls=25,
)
```

Sandbox code:

```python
print(budget())
# → "Iterations: 8/12 remaining | LLM calls: 18/25 remaining | Time: 52.3s/120.0s remaining | Cost: $0.2100/$0.5000 remaining"
```

When time or cost is exceeded, `forward()` triggers the extract fallback (attempts to salvage a result from current state) rather than raising an exception. This is critical for optimization — a partial result with trace data is far more useful than a crash.
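The summary string and the under-20% warning can be sketched as follows (formatting helpers are illustrative; only the output shape mirrors the example above):

```python
def budget_line(name, remaining, limit, fmt="{}"):
    """Render one resource as 'Name: remaining/limit remaining', warning when low."""
    line = f"{name}: {fmt.format(remaining)}/{fmt.format(limit)} remaining"
    if limit and remaining / limit < 0.20:
        line += " [WARNING: low]"
    return line


def budget_summary(state):
    """state maps resource name -> (remaining, limit) pairs."""
    return " | ".join([
        budget_line("Iterations", *state["iterations"]),
        budget_line("LLM calls", *state["llm_calls"]),
        budget_line("Time", *state["time"], fmt="{:.1f}s"),
        budget_line("Cost", *state["cost"], fmt="${:.4f}"),
    ])
```

Putting the warning inline in the same string the agent already reads means no extra tool call is needed to learn that a resource is nearly exhausted.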
### 5. GEPA `bootstrap_trace` resilience for RLM failures

Problem: When RLM hits a timeout, interpreter crash, or cost overrun, it raises an exception that `bootstrap_trace_data` doesn't catch. The existing `except` only handles parse failures (`AdapterParseError`, `ValueError`). Any other exception type kills the training example entirely — the trace is lost and GEPA can't reflect on what the agent tried before failing.
Solution: Add a broad `except Exception` handler that preserves the partial trace:

```python
except Exception as e:
    trace = dspy.settings.trace.copy()
    failed_pred = FailedPrediction(completion_text=str(e))
    return failed_pred, trace
```

This is a 6-line change, but it's the difference between GEPA learning from failures vs. silently losing all trace data from budget-exceeded examples.
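A self-contained sketch of the capture behavior — a broad `except Exception` keeps the partial trace, while `BaseException` subclasses like `KeyboardInterrupt` still propagate. The stub class and helper are illustrative stand-ins for the dspy originals:

```python
from dataclasses import dataclass


@dataclass
class FailedPredictionStub:
    """Stand-in for dspy's FailedPrediction, for illustration only."""
    completion_text: str


def capture(forward, trace):
    """Run one training example; on any Exception, keep the partial trace."""
    try:
        return forward(), list(trace)
    except Exception as e:
        # KeyboardInterrupt/SystemExit are BaseException, not Exception,
        # so Ctrl-C and interpreter shutdown still propagate normally.
        return FailedPredictionStub(completion_text=str(e)), list(trace)
```

The key property is that the trace snapshot taken at failure time is returned alongside the failure marker, so GEPA can still reflect on the steps the agent took before the error.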
## How they work together

```python
import dspy
from dspy.primitives.local_interpreter import LocalInterpreter

lm_fast = dspy.LM("openrouter/google/gemini-3-flash-preview")
lm_pro = dspy.LM("openrouter/google/gemini-3-pro-preview")
dspy.configure(lm=lm_fast)

rlm = dspy.RLM(
    MyMultimodalSignature,
    sub_lms={"pro": lm_pro},         # multi-model routing
    max_time=120, max_cost=0.50,     # budget controls
    max_llm_calls=25,
    interpreter=LocalInterpreter(),  # full Python access
)

# GEPA can optimize this without crashing on budget-exceeded examples:
optimizer = dspy.GEPA(metric=my_metric, max_rounds=5)
optimized = optimizer.compile(rlm, trainset=examples)
```

The RLM agent can:

- Call `budget()` to plan its approach
- Use `llm_query_with_media()` for multimodal perception
- Use any installed Python package via LocalInterpreter
- Route to stronger models for high-stakes sub-calls
- Gracefully degrade via extract fallback when budget runs out
- Provide useful trace data to GEPA even on failures
## Testing strategy
All new code is covered by offline tests (no API keys, no Deno required). The test suite runs in <1 second.
Unit tests verify each feature in isolation via MockInterpreter and mock LMs:

- Media field detection, registry construction, reserved name validation
- `sub_lms` routing: named model selection, fallback to default, unknown model errors
- `budget()` output: iteration/call/time/cost reporting, low-resource warnings
- `max_time` and `max_cost` triggering extract fallback (not crash)
- BYOK cost tracking: `usage.cost_details.upstream_inference_cost` aggregation
Integration tests verify the features work together through RLM's `forward()` loop:

- RLM + LocalInterpreter end-to-end: forward, state persistence across iterations, tool access (`llm_query`, `budget`), stdlib imports, error recovery, async `aforward`
- `llm_query_with_media` content construction: verifies multimodal content parts (text + media) are correctly assembled and sent to the LM, with single/multiple media objects and model routing
- `max_cost` mid-run fallback: cost injected into LM history after the first iteration triggers the extract fallback on the next iteration check
- Async budget/time/cost: all three budget features work correctly through `aforward()`
`bootstrap_trace` resilience tests verify the GEPA compatibility fix:

- `RuntimeError` from `forward()` is captured as `FailedPrediction`, not propagated
- Partial trace is preserved when a cost overrun kills execution after some work was done
- `KeyboardInterrupt` (a `BaseException`) is NOT caught — confirms the broad `except Exception` doesn't swallow signals
Total: 20 new tests across 6 test classes, plus 33 existing unit tests for the LocalInterpreter and media serialization modules.
## Implementation notes
- All changes are additive — no breaking changes to existing RLM behavior
- Default values preserve current behavior (`max_time=None`, `max_cost=None`, `sub_lms={}`, default `PythonInterpreter`)
- `llm_query_with_media` only appears in the sandbox prompt when media input fields are detected
- All files pass `ruff check` and `gitleaks detect`