Paper Suggestion: AgentLeak
Hi! I'd like to suggest adding AgentLeak to the Benchmarks & Evaluation section (and potentially Security Agents) — it's the first full-stack benchmark for privacy leakage across multi-agent LLM pipelines.
📄 Paper
"AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems"
🔍 Why it fits
AgentLeak benchmarks AI security risks (privacy leakage) in autonomous multi-agent systems — directly relevant to this list's focus on AI for security and security agents:
- 7 leakage channels tracked simultaneously: tool calls, inter-agent messages, RAG queries, code execution, API calls, final outputs, reasoning traces
- 68.8% of privacy leakage occurs in inter-agent messages — invisible to output-only auditing
- 41.7% of leakage is missed by output-only evaluation (standard practice today)
- Tests across AutoGen, LangGraph, CrewAI frameworks
- Evaluates GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1 70B
- Directly addresses security risks of autonomous agents communicating with each other
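The core finding above (output-only auditing misses inter-agent leakage) can be illustrated with a minimal sketch. Everything here is hypothetical: the channel names merely mirror the channels the paper reports, and the trace, PII pattern, and `audit` helper are illustrative, not the paper's actual harness.

```python
import re

# Hypothetical trace of one multi-agent run; channel names are
# illustrative stand-ins for the channels AgentLeak tracks.
trace = {
    "final_output": "Your appointment is confirmed for Tuesday.",
    "inter_agent_messages": [
        "Scheduler -> Biller: patient SSN 123-45-6789, bill to insurer."
    ],
    "tool_calls": ["calendar.create(date='Tuesday')"],
}

# Toy PII detector: matches SSN-shaped strings only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def audit(trace, channels):
    """Flag every requested channel whose content contains an SSN-shaped string."""
    flagged = []
    for name in channels:
        content = trace.get(name, "")
        if isinstance(content, list):
            content = " ".join(content)
        if SSN.search(content):
            flagged.append(name)
    return flagged

# Output-only auditing sees a clean response...
print(audit(trace, ["final_output"]))        # → []
# ...while auditing every channel catches the inter-agent leak.
print(audit(trace, list(trace.keys())))      # → ['inter_agent_messages']
```

The point of the sketch is structural, not the regex: any detector scoped to `final_output` alone is blind to what agents say to each other mid-pipeline, which is exactly the gap the benchmark quantifies.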
This bridges the Benchmarks & Evaluation section (security evaluation of LLM agents) and Security Agents section (understanding vulnerabilities in multi-agent architectures).
Thanks for maintaining this resource!