Lightweight Prometheus metrics exporter for AI agent pipelines. Wraps LLM client SDKs to expose token usage, latency, and tool calls as native Prometheus metrics. No OpenTelemetry or other external infrastructure required.
Supports:

- Anthropic (`anthropic.Anthropic`, `anthropic.AsyncAnthropic`)
- OpenAI and OpenAI-compatible providers (`openai.OpenAI`, `openai.AsyncOpenAI`, OpenRouter, Together, Groq, etc.)
- LangChain and LangGraph (via callback handler; no client wrapping)
```
pip install agentgauge
```

```python
import anthropic
from agentgauge import instrument

client = instrument(anthropic.Anthropic())

# Use client exactly as you would normally
response = client.messages.create(...)

# Metrics available at http://localhost:9464/metrics
```

Async clients are supported with the same API:
```python
import anthropic
from agentgauge import instrument

client = instrument(anthropic.AsyncAnthropic())

response = await client.messages.create(...)

async with client.messages.stream(...) as stream:
    async for event in stream:
        ...
```

Works with OpenAI and any provider using the OpenAI SDK:
```python
from openai import OpenAI
from agentgauge import instrument

# Standard OpenAI
client = instrument(OpenAI())

# Or with an OpenAI-compatible provider (OpenRouter, Together, Groq, etc.)
client = instrument(
    OpenAI(
        api_key="your-api-key",
        base_url="https://your-provider.com/v1",
    )
)

response = client.chat.completions.create(...)

# Metrics available at http://localhost:9464/metrics
```

Async clients are supported with the same API:
```python
from openai import AsyncOpenAI
from agentgauge import instrument

client = instrument(AsyncOpenAI())

response = await client.chat.completions.create(...)

# stream(...) returns an async context manager; don't await it directly
async with client.chat.completions.stream(...) as stream:
    async for chunk in stream:
        ...
```

```
pip install "agentgauge[langchain]" langchain-openai
```

```python
from langchain_openai import ChatOpenAI
from agentgauge import AgentGaugeCallbackHandler

handler = AgentGaugeCallbackHandler()
llm = ChatOpenAI(model="gpt-4o", callbacks=[handler])
response = llm.invoke("Hello!")
```

LangGraph uses the same callback handler as LangChain. Pass it via `RunnableConfig` when invoking the agent so it propagates to all graph nodes, both LLM calls and tool calls. Attaching the handler only to the LLM constructor will miss tool-node callbacks.
```
pip install "agentgauge[langchain]" langchain-openai langgraph
```

```python
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableConfig
from langgraph.prebuilt import create_react_agent
from agentgauge import AgentGaugeCallbackHandler

handler = AgentGaugeCallbackHandler()
llm = ChatOpenAI(model="gpt-4o")
agent = create_react_agent(llm, tools=[...])

config = RunnableConfig(callbacks=[handler])
result = agent.invoke({"messages": [...]}, config=config)
```

| Metric | Type | Labels |
|---|---|---|
| `llm_requests_total` | Counter | `model`, `method`, `status` |
| `llm_request_duration_seconds` | Histogram | `model`, `method` |
| `llm_tokens_total` | Counter | `model`, `token_type` |
| `llm_active_requests` | Gauge | `model` |
| `llm_tool_calls_total` | Counter | `model`, `tool_name` |
| `llm_tool_duration_seconds` | Histogram | `tool_name` |
| `llm_cache_tokens_total` | Counter | `model`, `cache_type` |
Note: `llm_tool_duration_seconds` is only available for LangChain/LangGraph workflows, where tool execution is tracked via callbacks. Direct SDK wrappers (Anthropic, OpenAI) only see tool calls in the LLM response, not the actual tool execution.
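These counters are exposed in the standard Prometheus text exposition format, so they can be inspected without any client library. A minimal sketch using only the standard library; the exposition snippet below is illustrative (the metric name and labels match the table above, but the values and model names are made up):

```python
import re

# Illustrative sample of what a scrape of http://localhost:9464/metrics
# might contain -- values and model names are invented for this example.
exposition = """\
# HELP llm_tokens_total Total tokens consumed
# TYPE llm_tokens_total counter
llm_tokens_total{model="claude-sonnet-4",token_type="input"} 1200.0
llm_tokens_total{model="claude-sonnet-4",token_type="output"} 350.0
llm_tokens_total{model="gpt-4o",token_type="input"} 800.0
"""

# Exposition lines look like: metric_name{label="value",...} sample_value
SAMPLE = re.compile(r'^llm_tokens_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')

# Sum token counts per token_type across all models
totals = {}
for line in exposition.splitlines():
    m = SAMPLE.match(line)
    if not m:
        continue  # skip # HELP / # TYPE comments and other metrics
    labels = dict(pair.split("=", 1) for pair in m.group("labels").split(","))
    token_type = labels["token_type"].strip('"')
    totals[token_type] = totals.get(token_type, 0.0) + float(m.group("value"))

print(totals)  # {'input': 2000.0, 'output': 350.0}
```

In production you would query Prometheus itself (e.g. `sum by (token_type) (llm_tokens_total)`) rather than parse the endpoint by hand; this just shows what the exporter emits.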
| Query | Description |
|---|---|
| `rate(llm_tokens_total[1h])` | Token usage over time |
| `llm_active_requests` | Current active requests |
| `sum(rate(llm_requests_total{status="error"}[5m])) / sum(rate(llm_requests_total[5m]))` | Error rate |
| `histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m]))` | p95 latency |
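The error-rate query above can also drive alerting. A minimal sketch of a Prometheus alerting rule; the 5% threshold, alert name, and group name are illustrative choices, not part of agentgauge:

```yaml
groups:
  - name: agentgauge
    rules:
      - alert: HighLLMErrorRate
        # Same error-rate expression as in the table above
        expr: >
          sum(rate(llm_requests_total{status="error"}[5m]))
            / sum(rate(llm_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM error rate above 5% for 10 minutes"
```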
See `prometheus.yml` for an example configuration.
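For reference, a minimal scrape configuration targeting the exporter's default port looks something like this (the job name is illustrative; check the repository's `prometheus.yml` for the authoritative version):

```yaml
scrape_configs:
  - job_name: agentgauge
    static_configs:
      - targets: ["localhost:9464"]
```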