Quickstart | Pre-Generated Safety Experiences | Generate Your Own Safety Experiences | Extend to New Tools | Safety Benchmark | Citation
ToolShield is a training-free, tool-agnostic defense for AI agents that use MCP tools. Just `pip install toolshield` and a single command guards your coding agent with safety experiences: no API keys, no sandbox setup, no fine-tuning. It reduces attack success rate by 30% on average.
## Quickstart

```bash
pip install toolshield
```

## Pre-Generated Safety Experiences

We ship safety experiences for 6 models across 5 tools, with plug-and-play support for 5 coding agents:
Inject them in one command — no need to know where files are installed:
```bash
# For Claude Code (filesystem example)
toolshield import --exp-file filesystem-mcp.json --agent claude_code

# For Codex (postgres example)
toolshield import --exp-file postgres-mcp.json --agent codex

# For OpenClaw (terminal example)
toolshield import --exp-file terminal-mcp.json --agent openclaw

# For Cursor (playwright example)
toolshield import --exp-file playwright-mcp.json --agent cursor

# For OpenHands (notion example)
toolshield import --exp-file notion-mcp.json --agent openhands
```

Use experiences from a different model with `--model`:
```bash
toolshield import --exp-file filesystem-mcp.json --model gpt-5.2 --agent claude_code
```

Or import all bundled experiences (all 5 tools) in one shot:
```bash
toolshield import --all --agent claude_code
```

You can also import multiple experience files individually:
```bash
toolshield import --exp-file filesystem-mcp.json --agent claude_code
toolshield import --exp-file terminal-mcp.json --agent claude_code
toolshield import --exp-file postgres-mcp.json --agent claude_code
```

See all available bundled experiences:
```bash
toolshield list
```

This appends safety guidelines to your agent's context file (`~/.claude/CLAUDE.md`, `~/.codex/AGENTS.md`, `~/.openclaw/workspace/AGENTS.md`, Cursor's global user rules, or `~/.openhands/microagents/toolshield.md`). To remove them:
```bash
toolshield unload --agent claude_code
```

Available bundled experiences (run `toolshield list` to see all):
| Model | filesystem | postgres | terminal | playwright | notion |
|---|---|---|---|---|---|
| claude-sonnet-4.5 | ✅ | ✅ | ✅ | ✅ | ✅ |
| gpt-5.2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| deepseek-v3.2 | ✅ | ✅ | ✅ | ✅ | ✅ |
| gemini-3-flash-preview | ✅ | ✅ | ✅ | ✅ | ✅ |
| qwen3-coder-plus | ✅ | ✅ | ✅ | ✅ | ✅ |
| seed-1.6 | ✅ | ✅ | ✅ | ✅ | ✅ |
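Conceptually, the import/unload mechanism above amounts to appending and later stripping a marker-delimited block in the agent's context file. A minimal Python sketch, assuming hypothetical marker strings (ToolShield's actual block format and file handling may differ):

```python
from pathlib import Path

# Illustrative markers; not ToolShield's actual delimiters.
BEGIN = "<!-- toolshield:begin -->"
END = "<!-- toolshield:end -->"

def inject(context_file: Path, guidelines: str) -> None:
    """Append a marker-delimited safety-guidelines block to the context file."""
    unload(context_file)  # re-importing replaces the block instead of duplicating it
    with context_file.open("a", encoding="utf-8") as f:
        f.write(f"\n{BEGIN}\n{guidelines}\n{END}\n")

def unload(context_file: Path) -> None:
    """Strip the marker-delimited block, leaving the user's own content intact."""
    if not context_file.exists():
        return
    text = context_file.read_text(encoding="utf-8")
    start, stop = text.find(BEGIN), text.find(END)
    if start != -1 and stop != -1:
        cleaned = text[:start] + text[stop + len(END):]
        context_file.write_text(cleaned.rstrip() + "\n", encoding="utf-8")
```

Keeping the block delimited this way is what lets `unload` remove only ToolShield's additions without touching the rest of the user's context file.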
More plug-and-play experiences for additional tools coming soon (including Toolathlon support)! Have a tool you'd like covered? Open an issue.
## Generate Your Own Safety Experiences

Point ToolShield at any running MCP server to generate custom safety experiences:
```bash
export TOOLSHIELD_MODEL_NAME="anthropic/claude-sonnet-4.5"
export OPENROUTER_API_KEY="your-key"

# Full pipeline: inspect → generate safety tree → test → distill → inject
toolshield \
  --mcp_name postgres \
  --mcp_server http://localhost:9091 \
  --output_path output/postgres \
  --agent codex
```

Or generate without injecting (useful for review):
```bash
toolshield generate \
  --mcp_name postgres \
  --mcp_server http://localhost:9091 \
  --output_path output/postgres
```

Automatically scan localhost for running MCP servers, run the full pipeline for each, and inject the results:
```bash
toolshield auto --agent codex
```

This scans ports 8000-10000 by default (configurable with `--start-port` / `--end-port`).
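The discovery step behind `toolshield auto` can be approximated with the standard library. A minimal sketch of a localhost port scan (the function name and defaults are illustrative; the real command additionally has to verify that each open port actually serves an MCP SSE endpoint):

```python
import socket

def scan_localhost(start_port: int = 8000, end_port: int = 10000,
                   timeout: float = 0.2) -> list:
    """Return ports in [start_port, end_port] with a TCP listener on localhost."""
    open_ports = []
    for port in range(start_port, end_port + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            try:
                s.connect(("127.0.0.1", port))
            except OSError:
                continue  # closed port, refused connection, or timeout
            open_ports.append(port)
    return open_ports
```

A short per-port timeout keeps the full 2000-port sweep fast, since connections to closed localhost ports are refused almost immediately.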
## Extend to New Tools

ToolShield works with any MCP server that has an SSE endpoint:
```bash
toolshield generate \
  --mcp_name your_custom_tool \
  --mcp_server http://localhost:PORT \
  --output_path output/your_custom_tool
```

## Safety Benchmark

We also release MT-AgentRisk, a benchmark of 365 harmful tasks across 5 MCP tools, transformed into multi-turn attack sequences. See `agentrisk/README.md` for the full evaluation setup.
Quick evaluation:
```bash
# 1. Download benchmark tasks
git clone https://huggingface.co/datasets/CHATS-Lab/MT-AgentRisk
cp -r MT-AgentRisk/workspaces/* workspaces/

# 2. Run a single task (requires OpenHands setup — see agentrisk/README.md)
python agentrisk/run_eval.py \
  --task-path workspaces/terminal/multi_turn_tasks/multi-turn_root-remove \
  --agent-llm-config agent \
  --env-llm-config env \
  --outputs-path output/eval \
  --server-hostname localhost
```

Add `--use-experience <path>` to evaluate with the ToolShield defense.
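The headline metric, attack success rate (ASR), is the fraction of benchmark tasks on which the attack succeeds. A sketch of the aggregation, assuming each eval run writes a JSON result file with a boolean `attack_succeeded` field (a hypothetical field name, not necessarily the evaluator's actual output schema):

```python
import json
from pathlib import Path

def attack_success_rate(outputs_dir: str) -> float:
    """ASR = (# tasks where the attack succeeded) / (# tasks evaluated)."""
    results = [json.loads(p.read_text()) for p in Path(outputs_dir).glob("*.json")]
    if not results:
        return 0.0
    return sum(r["attack_succeeded"] for r in results) / len(results)
```

Comparing this number with and without `--use-experience` is how a defense's effect (e.g. the 30% average ASR reduction claimed above) would be measured.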
```
ToolShield/
├── toolshield/      # pip-installable defense package
│   └── experiences/ # bundled safety experiences (6 models × 5 tools)
├── agentrisk/       # evaluation framework (see agentrisk/README.md)
├── workspaces/      # MT-AgentRisk task data (from HuggingFace)
├── docker/          # Dockerfiles and compose
└── scripts/         # experiment reproduction guides
```
We thank the authors of the following projects for their contributions:
## Citation

```bibtex
@misc{li2026unsaferturnsbenchmarkingdefending,
      title={Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents},
      author={Xu Li and Simon Yu and Minzhou Pan and Yiyou Sun and Bo Li and Dawn Song and Xue Lin and Weiyan Shi},
      year={2026},
      eprint={2602.13379},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2602.13379},
}
```

## License

MIT
