comet-ml/adversarial-benchmark-agent

The Babadook Observability Test

Adversarial Benchmark Agent for LLM observability instrumentation tools.

Babadook is a functional multi-step research agent that is intentionally structured to defeat auto-instrumentation. It uses real frameworks and makes real LLM calls, but every integration is hidden behind layers of indirection.

What It Does

Babadook takes a query, searches for information (via TypeScript embeddings), summarizes findings (via LangChain), optionally runs a multi-step reasoning flow (via LangGraph), optionally fact-checks (via CrewAI), and synthesizes a final answer. It can also use Google ADK for grounded search.
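The pipeline order above can be sketched as a trace of steps. The function names below are illustrative placeholders, not Babadook's actual API:

```python
# Hypothetical sketch of the pipeline order; these names are illustrative,
# not the repo's real functions.
def answer(query: str, graph: bool = False, verify: bool = False) -> list[str]:
    trace = []
    trace.append(f"search({query!r})")        # tools/search.ts via subprocess bridge
    trace.append("summarize(results)")        # LangChain
    if graph:
        trace.append("graph_flow(summary)")   # optional LangGraph multi-step flow
    if verify:
        trace.append("crew_verify(summary)")  # optional CrewAI fact-check
    trace.append("synthesize(summary)")       # final answer
    return trace
```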

Anti-Instrumentation Patterns

| # | Pattern | Where | Why It's Hard |
|---|---------|-------|---------------|
| 1 | Dynamic imports (`__import__`, `importlib`) | All providers, all tools | Import scanning finds nothing |
| 2 | Factory/registry with `__init_subclass__` | `core/_registry.py`, `providers/_base.py` | No direct `OpenAI()` or `Anthropic()` calls visible |
| 3 | Mixed Python + TypeScript | `tools/search.ts` called via `tools/search_bridge.py` | LLM calls cross a subprocess boundary |
| 4 | Misleading filenames | `core/agent.py` = data models, `core/models.py` = orchestrator | Static analysis follows the wrong file |
| 5 | Fake instrumentation already present | `tools/_instrument.py` mimics Opik's API | Looks already instrumented but does nothing |
| 6 | Monkey-patching at import time | `core/__init__.py` adds `.run()` to `Agent` class | Behavior changes when you import the package |
| 7 | Client creation inside decorators | `core/middleware.py` `@with_llm()` injects client as kwarg | No module-level client to wrap |
| 8 | Client creation inside descriptors | `core/middleware.py` `_LLMDescriptor.__get__` | Client created on first attribute access |
| 9 | Proxy objects wrapping real clients | `core/_proxy.py` `LLMProxy` with `__getattr__` | Instrumentation can't see through the proxy |
| 10 | Multiple entry points | `cli.py`, `serve.py`, `run.sh` | No single "main" to instrument |
| 11 | Environment-driven provider switching | `providers/_router.py` reads `$PROVIDER` / `$LLM_BACKEND` | Provider unknown until runtime |
| 12 | `cached_property` for client creation | `providers/oai.py` | Client created lazily, not at import |
| 13 | Class-level client cache in method body | `providers/claude.py` | Import + instantiation hidden in method |
| 14 | Async generators yielding intermediate results | `core/models.py`, `tools/summarize.py` | Complex control flow hard to wrap |
| 15 | LangGraph via dynamic import + provider indirection | `tools/graph_flow.py` | 3 layers between code and actual LLM call |
| 16 | Google ADK with runtime config | `tools/adk_agent.py` | Conditional import, `SimpleNamespace` config |
| 17 | CrewAI via `__import__` + `getattr` for all classes | `tools/crew.py` | No scannable CrewAI imports at all |

Frameworks Used (All Hidden)

| Framework | File | How It's Hidden |
|-----------|------|-----------------|
| OpenAI (Python) | `providers/oai.py` | `__import__("openai")` + `cached_property`; no `import openai` anywhere |
| Anthropic | `providers/claude.py` | `__import__("anthropic")` inside method body, class-level cache |
| LangChain | `tools/summarize.py` | `importlib.import_module("langchain_openai")` + `getattr` |
| LangGraph | `tools/graph_flow.py` | `importlib.import_module("langgraph.graph")` + provider registry for LLM |
| Google ADK | `tools/adk_agent.py` | Conditional `__import__("google.adk")`, `SimpleNamespace` config object |
| CrewAI | `tools/crew.py` | `__import__("crewai")` + `getattr` for `Agent`, `Task`, `Crew`, `Process` |
| OpenAI (TypeScript) | `tools/search.ts` | Separate language, invoked as subprocess via `search_bridge.py` |
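The last row is the hardest boundary: the OpenAI call happens in a separate TypeScript process, so Python-side instrumentation never observes it. A sketch of the bridge shape (the command line and JSON contract here are assumptions; this sketch shells out to Python itself so it runs without Node):

```python
# Illustrative sketch of the search_bridge.py idea: the actual LLM work lives
# on the far side of a subprocess boundary. Real code would invoke something
# like ["npx", "tsx", "tools/search.ts", query] -- that command is an assumption.
import json
import subprocess
import sys

def search(query: str) -> dict:
    child = (
        "import json, sys; "
        "print(json.dumps({'query': sys.argv[1], 'results': []}))"
    )
    proc = subprocess.run(
        [sys.executable, "-c", child, query],  # stand-in for the tsx invocation
        capture_output=True, text=True, check=True,
    )
    # the bridge only ever sees serialized JSON, never a client object to wrap
    return json.loads(proc.stdout)
```

Any monkey-patching of the `openai` Python package is useless here; the client exists only inside the child process.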

File Structure

babadook/
├── cli.py                      # CLI entry point (Click)
├── serve.py                    # HTTP entry point (FastAPI)
├── run.sh                      # Shell entry point — calls Python AND TypeScript
├── main.py                     # Decoy entry point
├── core/
│   ├── __init__.py             # Monkey-patches Agent class on import
│   ├── agent.py                # MISLEADING: just Pydantic data models
│   ├── models.py               # MISLEADING: the actual agent orchestrator
│   ├── _registry.py            # Provider registry — dynamic import + factory
│   ├── _proxy.py               # Proxy object that wraps any LLM client
│   └── middleware.py           # Decorators/descriptors that create LLM clients
├── tools/
│   ├── __init__.py
│   ├── search.ts               # TypeScript tool — uses OpenAI for embeddings
│   ├── search_bridge.py        # Python subprocess caller for search.ts
│   ├── summarize.py            # LangChain via importlib
│   ├── graph_flow.py           # LangGraph workflow via dynamic import
│   ├── adk_agent.py            # Google ADK agent — conditional, unusual setup
│   ├── crew.py                 # CrewAI via __import__ + getattr
│   └── _instrument.py          # FAKE instrumentation — mimics Opik but isn't
├── providers/
│   ├── __init__.py
│   ├── _base.py                # Abstract base with __init_subclass__ registration
│   ├── oai.py                  # OpenAI — __import__ + cached_property
│   ├── claude.py               # Anthropic — lazy import in method body
│   └── _router.py              # Env-var-based provider selection
├── package.json
├── tsconfig.json
├── pyproject.toml
└── .env.example

Setup

cp .env.example .env
# Add OPENAI_API_KEY (required), ANTHROPIC_API_KEY (optional), GOOGLE_API_KEY (optional)

uv sync
npm install
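The keys in `.env` authenticate the providers, but which provider actually runs is decided at runtime by `providers/_router.py` from `$PROVIDER` / `$LLM_BACKEND` (pattern 11). A minimal sketch of that routing; the fallback order and default are assumptions:

```python
# Illustrative sketch of env-driven provider switching; the real _router.py
# may differ. PROVIDER and LLM_BACKEND come from the pattern table above.
import os

def pick_provider(default: str = "openai") -> str:
    # nothing in the source names a provider; it is resolved per run
    return os.environ.get("PROVIDER") or os.environ.get("LLM_BACKEND") or default
```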

Usage

# CLI
uv run python cli.py "What is quantum computing?" --provider=openai -v

# With LangGraph multi-step flow
uv run python cli.py "What is CRISPR?" --provider=openai --graph -v

# With CrewAI fact-checking
uv run python cli.py "What is CRISPR?" --provider=anthropic --verify -v

# HTTP server
uv run python serve.py

Testing Instrumentation Tools

Delete README.md before running instrumentation tools — it gives away all the patterns.

rm README.md
# Run your instrumentation tool here
git checkout README.md
