[Paper Suggestion] AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems — fits Benchmarks & Evaluation + Security Agents #8

@yagobski

Description

Paper Suggestion: AgentLeak

Hi! I'd like to suggest adding AgentLeak to the Benchmarks & Evaluation section (and potentially Security Agents) — it's the first full-stack benchmark for privacy leakage across multi-agent LLM pipelines.

📄 Paper

"AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems"

🔍 Why it fits

AgentLeak benchmarks AI security risks (privacy leakage) in autonomous multi-agent systems — directly relevant to this list's focus on AI for security and security agents:

  • 7 leakage channels tracked simultaneously: tool calls, inter-agent messages, RAG queries, code execution, API calls, final outputs, reasoning traces
  • 68.8% of privacy leakage occurs in inter-agent messages — invisible to output-only auditing
  • 41.7% of leakage is missed by output-only evaluation (standard practice today)
  • Tests across AutoGen, LangGraph, CrewAI frameworks
  • Evaluates GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1 70B
  • Directly addresses security risks of autonomous agents communicating with each other
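The multi-channel idea behind the benchmark can be sketched with a toy scanner (all names, the trace structure, and the regex below are hypothetical illustrations, not AgentLeak's actual code): auditing every channel in an agent trace catches leaks that an output-only audit never sees.

```python
import re

# Toy PII detector (SSN-style pattern); AgentLeak's real detectors are richer.
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical agent trace: events on several of the channels the paper tracks.
trace = [
    {"channel": "inter_agent_message", "content": "Patient SSN is 123-45-6789"},
    {"channel": "tool_call", "content": "lookup(record_id=42)"},
    {"channel": "final_output", "content": "The request was completed."},
]

def leaky_channels(events):
    """Return the set of channels whose content matches the PII pattern."""
    return {e["channel"] for e in events if PII_PATTERN.search(e["content"])}

full_stack = leaky_channels(trace)                                        # audits every channel
output_only = leaky_channels(e for e in trace if e["channel"] == "final_output")

print(full_stack)   # {'inter_agent_message'}
print(output_only)  # set() -- the leak is invisible to output-only auditing
```

Here the leak sits in an inter-agent message, so the output-only audit reports nothing, which is exactly the blind spot the paper's 41.7% figure quantifies.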

This bridges the Benchmarks & Evaluation section (security evaluation of LLM agents) and Security Agents section (understanding vulnerabilities in multi-agent architectures).

Thanks for maintaining this resource!
