Adversarial Benchmark Agent for LLM observability instrumentation tools.
Babadook is a functional multi-step research agent that is intentionally structured to defeat auto-instrumentation. It uses real frameworks and makes real LLM calls, but every integration is hidden behind layers of indirection.
The agent takes a query, searches for information (via TypeScript embeddings), summarizes the findings (via LangChain), optionally runs a multi-step reasoning flow (via LangGraph), optionally fact-checks the results (via CrewAI), and synthesizes a final answer. It can also use Google ADK for grounded search.
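The flow can be sketched as a single orchestration function. Every name below is a hypothetical stand-in, not the actual Babadook API (which is deliberately obscured):

```python
# Illustrative sketch of the described flow. All functions are hypothetical
# stand-ins; the real code hides each step behind layers of indirection.
def search(query):                 # really: tools/search.ts via subprocess
    return [f"finding about {query}"]

def summarize(findings):           # really: LangChain in tools/summarize.py
    return "; ".join(findings)

def graph_flow(query, summary):    # really: LangGraph in tools/graph_flow.py
    return summary + " (refined)"

def fact_check(summary):           # really: CrewAI in tools/crew.py
    return summary + " (verified)"

def synthesize(query, summary):    # final answer from the selected provider
    return f"{query}: {summary}"

def run_agent(query, use_graph=False, verify=False):
    findings = search(query)
    summary = summarize(findings)
    if use_graph:
        summary = graph_flow(query, summary)
    if verify:
        summary = fact_check(summary)
    return synthesize(query, summary)
```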
| # | Pattern | Where | Why It's Hard |
|---|---|---|---|
| 1 | Dynamic imports (`__import__`, `importlib`) | All providers, all tools | Import scanning finds nothing |
| 2 | Factory/registry with `__init_subclass__` | `core/_registry.py`, `providers/_base.py` | No direct `OpenAI()` or `Anthropic()` calls visible |
| 3 | Mixed Python + TypeScript | `tools/search.ts` called via `tools/search_bridge.py` | LLM calls cross a subprocess boundary |
| 4 | Misleading filenames | `core/agent.py` = data models, `core/models.py` = orchestrator | Static analysis follows the wrong file |
| 5 | Fake instrumentation already present | `tools/_instrument.py` mimics Opik's API | Looks already instrumented but does nothing |
| 6 | Monkey-patching at import time | `core/__init__.py` adds `.run()` to `Agent` class | Behavior changes when you import the package |
| 7 | Client creation inside decorators | `core/middleware.py` `@with_llm()` injects client as kwarg | No module-level client to wrap |
| 8 | Client creation inside descriptors | `core/middleware.py` `_LLMDescriptor.__get__` | Client created on first attribute access |
| 9 | Proxy objects wrapping real clients | `core/_proxy.py` `LLMProxy` with `__getattr__` | Instrumentation can't see through the proxy |
| 10 | Multiple entry points | `cli.py`, `serve.py`, `run.sh` | No single "main" to instrument |
| 11 | Environment-driven provider switching | `providers/_router.py` reads `$PROVIDER` / `$LLM_BACKEND` | Provider unknown until runtime |
| 12 | `cached_property` for client creation | `providers/oai.py` | Client created lazily, not at import |
| 13 | Class-level client cache in method body | `providers/claude.py` | Import + instantiation hidden in method |
| 14 | Async generators yielding intermediate results | `core/models.py`, `tools/summarize.py` | Complex control flow hard to wrap |
| 15 | LangGraph via dynamic import + provider indirection | `tools/graph_flow.py` | 3 layers between code and actual LLM call |
| 16 | Google ADK with runtime config | `tools/adk_agent.py` | Conditional import, `SimpleNamespace` config |
| 17 | CrewAI via `__import__` + `getattr` for all classes | `tools/crew.py` | No scannable CrewAI imports at all |
## Frameworks Used (All Hidden)
| Framework | File | How It's Hidden |
|---|---|---|
| OpenAI (Python) | `providers/oai.py` | `__import__("openai")` + `cached_property` — no `import openai` anywhere |
| Anthropic | `providers/claude.py` | `__import__("anthropic")` inside method body, class-level cache |
| LangChain | `tools/summarize.py` | `importlib.import_module("langchain_openai")` + `getattr` |
| LangGraph | `tools/graph_flow.py` | `importlib.import_module("langgraph.graph")` + provider registry for LLM |
| Google ADK | `tools/adk_agent.py` | Conditional `__import__("google.adk")`, `SimpleNamespace` config object |
| CrewAI | `tools/crew.py` | `__import__("crewai")` + `getattr` for `Agent`, `Task`, `Crew`, `Process` |
| OpenAI (TypeScript) | `tools/search.ts` | Separate language, invoked as subprocess via `search_bridge.py` |
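The lazy-import trick attributed to `providers/oai.py` can be illustrated roughly like this. This is a hedged sketch, with the stdlib `json` module standing in for `openai` so it runs anywhere:

```python
from functools import cached_property

class OAIProvider:
    _module_name = "json"    # the real code would name "openai" here

    @cached_property
    def client(self):
        # Nothing is imported until the first attribute access, and the
        # result is then cached on the instance, so there is no
        # module-level import and no constructor call for a scanner
        # or import hook to latch onto at load time.
        mod = __import__(self._module_name)
        return mod           # the real code would return something like mod.OpenAI()
```

Before the first read of `p.client`, `"client"` is absent from `p.__dict__`; afterwards, `cached_property` has stored the imported module there, and no further import occurs.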
```
babadook/
├── cli.py                 # CLI entry point (Click)
├── serve.py               # HTTP entry point (FastAPI)
├── run.sh                 # Shell entry point — calls Python AND TypeScript
├── main.py                # Decoy entry point
├── core/
│   ├── __init__.py        # Monkey-patches Agent class on import
│   ├── agent.py           # MISLEADING: just Pydantic data models
│   ├── models.py          # MISLEADING: the actual agent orchestrator
│   ├── _registry.py       # Provider registry — dynamic import + factory
│   ├── _proxy.py          # Proxy object that wraps any LLM client
│   └── middleware.py      # Decorators/descriptors that create LLM clients
├── tools/
│   ├── __init__.py
│   ├── search.ts          # TypeScript tool — uses OpenAI for embeddings
│   ├── search_bridge.py   # Python subprocess caller for search.ts
│   ├── summarize.py       # LangChain via importlib
│   ├── graph_flow.py      # LangGraph workflow via dynamic import
│   ├── adk_agent.py       # Google ADK agent — conditional, unusual setup
│   ├── crew.py            # CrewAI via __import__ + getattr
│   └── _instrument.py     # FAKE instrumentation — mimics Opik but isn't
├── providers/
│   ├── __init__.py
│   ├── _base.py           # Abstract base with __init_subclass__ registration
│   ├── oai.py             # OpenAI — __import__ + cached_property
│   ├── claude.py          # Anthropic — lazy import in method body
│   └── _router.py         # Env-var-based provider selection
├── package.json
├── tsconfig.json
├── pyproject.toml
└── .env.example
```
```sh
cp .env.example .env
# Add OPENAI_API_KEY (required), ANTHROPIC_API_KEY (optional), GOOGLE_API_KEY (optional)
uv sync
npm install
```

```sh
# CLI
uv run python cli.py "What is quantum computing?" --provider=openai -v

# With LangGraph multi-step flow
uv run python cli.py "What is CRISPR?" --provider=openai --graph -v

# With CrewAI fact-checking
uv run python cli.py "What is CRISPR?" --provider=anthropic --verify -v

# HTTP server
uv run python serve.py
```

Delete README.md before running instrumentation tools — it gives away all the patterns.

```sh
rm README.md
# Run your instrumentation tool here
git checkout README.md
```