comet-ml/adversarial-benchmark-agent

The Babadook Observability Test

Adversarial Benchmark Agent for LLM observability instrumentation tools.

Babadook is a functional multi-step research agent that is intentionally structured to defeat auto-instrumentation. It uses real frameworks and makes real LLM calls, but every integration is hidden behind layers of indirection.

What It Does

Babadook takes a query, searches for information (via TypeScript embeddings), summarizes findings (via LangChain), optionally runs a multi-step reasoning flow (via LangGraph), optionally fact-checks (via CrewAI), and synthesizes a final answer. It can also use Google ADK for grounded search.
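The pipeline order above can be sketched as a trace of steps. The function names below are illustrative placeholders, not Babadook's actual API:

```python
# Hypothetical sketch of the pipeline order; these names are illustrative,
# not the repo's real functions.
def answer(query: str, graph: bool = False, verify: bool = False) -> list[str]:
    trace = []
    trace.append(f"search({query!r})")        # tools/search.ts via subprocess bridge
    trace.append("summarize(results)")        # LangChain
    if graph:
        trace.append("graph_flow(summary)")   # optional LangGraph multi-step flow
    if verify:
        trace.append("crew_verify(summary)")  # optional CrewAI fact-check
    trace.append("synthesize(summary)")       # final answer
    return trace
```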

Anti-Instrumentation Patterns

| # | Pattern | Where | Why It's Hard |
|---|---------|-------|---------------|
| 1 | Dynamic imports (`__import__`, `importlib`) | All providers, all tools | Import scanning finds nothing |
| 2 | Factory/registry with `__init_subclass__` | `core/_registry.py`, `providers/_base.py` | No direct `OpenAI()` or `Anthropic()` calls visible |
| 3 | Mixed Python + TypeScript | `tools/search.ts` called via `tools/search_bridge.py` | LLM calls cross a subprocess boundary |
| 4 | Misleading filenames | `core/agent.py` = data models, `core/models.py` = orchestrator | Static analysis follows the wrong file |
| 5 | Fake instrumentation already present | `tools/_instrument.py` mimics Opik's API | Looks already instrumented but does nothing |
| 6 | Monkey-patching at import time | `core/__init__.py` adds `.run()` to `Agent` class | Behavior changes when you import the package |
| 7 | Client creation inside decorators | `core/middleware.py` `@with_llm()` injects client as kwarg | No module-level client to wrap |
| 8 | Client creation inside descriptors | `core/middleware.py` `_LLMDescriptor.__get__` | Client created on first attribute access |
| 9 | Proxy objects wrapping real clients | `core/_proxy.py` `LLMProxy` with `__getattr__` | Instrumentation can't see through the proxy |
| 10 | Multiple entry points | `cli.py`, `serve.py`, `run.sh` | No single "main" to instrument |
| 11 | Environment-driven provider switching | `providers/_router.py` reads `$PROVIDER` / `$LLM_BACKEND` | Provider unknown until runtime |
| 12 | `cached_property` for client creation | `providers/oai.py` | Client created lazily, not at import |
| 13 | Class-level client cache in method body | `providers/claude.py` | Import + instantiation hidden in method |
| 14 | Async generators yielding intermediate results | `core/models.py`, `tools/summarize.py` | Complex control flow hard to wrap |
| 15 | LangGraph via dynamic import + provider indirection | `tools/graph_flow.py` | 3 layers between code and actual LLM call |
| 16 | Google ADK with runtime config | `tools/adk_agent.py` | Conditional import, `SimpleNamespace` config |
| 17 | CrewAI via `__import__` + `getattr` for all classes | `tools/crew.py` | No scannable CrewAI imports at all |

Frameworks Used (All Hidden)

| Framework | File | How It's Hidden |
|-----------|------|-----------------|
| OpenAI (Python) | `providers/oai.py` | `__import__("openai")` + `cached_property`; no `import openai` anywhere |
| Anthropic | `providers/claude.py` | `__import__("anthropic")` inside method body, class-level cache |
| LangChain | `tools/summarize.py` | `importlib.import_module("langchain_openai")` + `getattr` |
| LangGraph | `tools/graph_flow.py` | `importlib.import_module("langgraph.graph")` + provider registry for LLM |
| Google ADK | `tools/adk_agent.py` | Conditional `__import__("google.adk")`, `SimpleNamespace` config object |
| CrewAI | `tools/crew.py` | `__import__("crewai")` + `getattr` for `Agent`, `Task`, `Crew`, `Process` |
| OpenAI (TypeScript) | `tools/search.ts` | Separate language, invoked as subprocess via `search_bridge.py` |
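The last row is the hardest boundary: the OpenAI call happens in a separate TypeScript process, so Python-side instrumentation never observes it. A sketch of the bridge shape (the command line and JSON contract here are assumptions; this sketch shells out to Python itself so it runs without Node):

```python
# Illustrative sketch of the search_bridge.py idea: the actual LLM work lives
# on the far side of a subprocess boundary. Real code would invoke something
# like ["npx", "tsx", "tools/search.ts", query] -- that command is an assumption.
import json
import subprocess
import sys

def search(query: str) -> dict:
    child = (
        "import json, sys; "
        "print(json.dumps({'query': sys.argv[1], 'results': []}))"
    )
    proc = subprocess.run(
        [sys.executable, "-c", child, query],  # stand-in for the tsx invocation
        capture_output=True, text=True, check=True,
    )
    # the bridge only ever sees serialized JSON, never a client object to wrap
    return json.loads(proc.stdout)
```

Any monkey-patching of the `openai` Python package is useless here; the client exists only inside the child process.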

File Structure

babadook/
├── cli.py                      # CLI entry point (Click)
├── serve.py                    # HTTP entry point (FastAPI)
├── run.sh                      # Shell entry point — calls Python AND TypeScript
├── main.py                     # Decoy entry point
├── core/
│   ├── __init__.py             # Monkey-patches Agent class on import
│   ├── agent.py                # MISLEADING: just Pydantic data models
│   ├── models.py               # MISLEADING: the actual agent orchestrator
│   ├── _registry.py            # Provider registry — dynamic import + factory
│   ├── _proxy.py               # Proxy object that wraps any LLM client
│   └── middleware.py           # Decorators/descriptors that create LLM clients
├── tools/
│   ├── __init__.py
│   ├── search.ts               # TypeScript tool — uses OpenAI for embeddings
│   ├── search_bridge.py        # Python subprocess caller for search.ts
│   ├── summarize.py            # LangChain via importlib
│   ├── graph_flow.py           # LangGraph workflow via dynamic import
│   ├── adk_agent.py            # Google ADK agent — conditional, unusual setup
│   ├── crew.py                 # CrewAI via __import__ + getattr
│   └── _instrument.py          # FAKE instrumentation — mimics Opik but isn't
├── providers/
│   ├── __init__.py
│   ├── _base.py                # Abstract base with __init_subclass__ registration
│   ├── oai.py                  # OpenAI — __import__ + cached_property
│   ├── claude.py               # Anthropic — lazy import in method body
│   └── _router.py              # Env-var-based provider selection
├── package.json
├── tsconfig.json
├── pyproject.toml
└── .env.example

Setup

cp .env.example .env
# Add OPENAI_API_KEY (required), ANTHROPIC_API_KEY (optional), GOOGLE_API_KEY (optional)

uv sync
npm install
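The keys in `.env` authenticate the providers, but which provider actually runs is decided at runtime by `providers/_router.py` from `$PROVIDER` / `$LLM_BACKEND` (pattern 11). A minimal sketch of that routing; the fallback order and default are assumptions:

```python
# Illustrative sketch of env-driven provider switching; the real _router.py
# may differ. PROVIDER and LLM_BACKEND come from the pattern table above.
import os

def pick_provider(default: str = "openai") -> str:
    # nothing in the source names a provider; it is resolved per run
    return os.environ.get("PROVIDER") or os.environ.get("LLM_BACKEND") or default
```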

Usage

# CLI
uv run python cli.py "What is quantum computing?" --provider=openai -v

# With LangGraph multi-step flow
uv run python cli.py "What is CRISPR?" --provider=openai --graph -v

# With CrewAI fact-checking
uv run python cli.py "What is CRISPR?" --provider=anthropic --verify -v

# HTTP server
uv run python serve.py

Testing Instrumentation Tools

Delete README.md before running instrumentation tools — it gives away all the patterns.

rm README.md
# Run your instrumentation tool here
git checkout README.md
