Commit 6782c76: adds otel kubernetes docs
Signed-off-by: ChrisJBurns <[email protected]>
1 file changed: +282 −0

---
title: Telemetry (metrics and traces)
description:
  How to enable OpenTelemetry (metrics and traces) and Prometheus
  instrumentation for ToolHive MCP servers inside of Kubernetes using the
  ToolHive Operator
---

ToolHive includes built-in instrumentation using OpenTelemetry, which gives you
comprehensive observability for your MCP server interactions. You can export
traces and metrics to popular observability backends like Jaeger, Honeycomb,
Datadog, and Grafana Cloud, or expose Prometheus metrics directly.

## What you can monitor

ToolHive's telemetry captures detailed information about MCP interactions,
including traces, metrics, and performance data. For a comprehensive overview of
the telemetry architecture, metrics collection, and monitoring capabilities, see
the [observability overview](../concepts/observability.md).

## Enable telemetry

You can enable telemetry when deploying an MCP server by specifying telemetry
configuration in the `MCPServer` custom resource.

This example runs the Fetch MCP server and exports traces and metrics to a
deployed instance of the
[OpenTelemetry Collector](https://opentelemetry.io/docs/collector/):

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: gofetch
  namespace: toolhive-system
spec:
  image: ghcr.io/stackloklabs/gofetch/server
  transport: streamable-http
  port: 8080
  targetPort: 8080
  ...
  ...
  telemetry:
    openTelemetry:
      enabled: true
      endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
      serviceName: mcp-fetch-server
      insecure: true
      metrics:
        enabled: true
      tracing:
        enabled: true
        samplingRate: "0.05"
    prometheus:
      enabled: true
```

The `spec.telemetry.openTelemetry.endpoint` is the address of the OpenTelemetry
Collector deployed inside your infrastructure, and the
`spec.telemetry.openTelemetry.serviceName` is the name you can use to identify
your MCP server in your observability stack.

### Export metrics to an OTLP endpoint

To enable ToolHive to export metrics to your OTel collector, set the
`spec.telemetry.openTelemetry.metrics.enabled` field to `true`.
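
For example, a minimal sketch of the telemetry stanza that exports only metrics
(the rest of the `MCPServer` spec is omitted; the values match the example at
the top of this page):

```yaml
telemetry:
  openTelemetry:
    enabled: true
    endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
    serviceName: mcp-fetch-server
    insecure: true
    metrics:
      enabled: true
```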

### Export traces to an OTLP endpoint

To enable ToolHive to export tracing information, set the
`spec.telemetry.openTelemetry.tracing.enabled` field to `true`.

You can also set the sampling rate of your traces by setting the
`spec.telemetry.openTelemetry.tracing.samplingRate` option to a number between 0
and 1.0. By default this is `0.05`, which equates to sampling 5% of all
requests.
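
For example, a sketch that samples every request, which suits development
environments (see the performance considerations section later on this page):

```yaml
telemetry:
  openTelemetry:
    enabled: true
    endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
    serviceName: mcp-fetch-server
    insecure: true
    tracing:
      enabled: true
      # 100% sampling; lower this in production
      samplingRate: "1.0"
```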

:::note

The `spec.telemetry.openTelemetry.endpoint` is provided as a hostname and
optional port, without a scheme or path (e.g., use `api.honeycomb.io` or
`api.honeycomb.io:443`, not `https://api.honeycomb.io`). ToolHive automatically
uses HTTPS unless `spec.telemetry.openTelemetry.insecure` is set to `true`.

:::
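
For illustration, using the Honeycomb endpoint from the note above:

```yaml
# Correct: hostname and optional port, no scheme or path
endpoint: api.honeycomb.io:443

# Incorrect: includes a scheme
# endpoint: https://api.honeycomb.io
```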

By default, the service name is set to `toolhive-mcp-proxy` and the sampling
rate is `0.05` (5%).

:::tip[Recommendation]

Set the `spec.telemetry.openTelemetry.serviceName` field to a meaningful name
for each MCP server. This helps you identify the server in your observability
backend.

:::

### Enable Prometheus metrics

You can expose Prometheus-style metrics at `/metrics` on the main transport port
for local scraping by setting the `spec.telemetry.prometheus.enabled` field to
`true`.
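
A minimal sketch of the corresponding stanza (the rest of the `MCPServer` spec
is omitted):

```yaml
telemetry:
  prometheus:
    enabled: true
```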

To access the metrics, you can use `curl` or any Prometheus-compatible scraper.
The metrics are available at `http://<HOST>:<PORT>/metrics`, where `<HOST>` is
the resolvable address of the ToolHive ProxyRunner service fronting your MCP
server pod and `<PORT>` is the port that the ProxyRunner service exposes for
traffic.

### Dual export

You can export to both an OTLP endpoint and expose Prometheus metrics
simultaneously.

The `MCPServer` example at the top of this page has dual export enabled.

## Observability backends

ToolHive can export telemetry data to many different observability backends. It
supports exporting traces and metrics to any backend that supports OTLP. Some
common examples are listed below, but specific configurations will vary based on
your environment and requirements.

### OpenTelemetry Collector (recommended)

The OpenTelemetry Collector is a vendor-agnostic way to receive, process, and
export telemetry data. It supports many backend services, scalable deployment
options, and advanced processing capabilities.

```mermaid
graph LR
A[ToolHive] -->|traces & metrics| B[OpenTelemetry Collector]
B --> C[AWS CloudWatch]
B --> D[Splunk]
B --> E[New Relic]
B <--> F[Prometheus]
B --> G[Other OTLP backends]
```

You can run the OpenTelemetry Collector inside of a Kubernetes cluster; see the
[OpenTelemetry Collector documentation](https://opentelemetry.io/docs/collector/)
for more information.

To export data to a collector running in your cluster, set your OTLP endpoint to
the collector's OTLP HTTP receiver port (the default is `4318`):

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: gofetch
  namespace: toolhive-system
spec:
  ...
  ...
  telemetry:
    openTelemetry:
      enabled: true
      endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
      serviceName: mcp-fetch-server
      insecure: true
      metrics:
        enabled: true
```

### Prometheus

To collect metrics using Prometheus, run your MCP server with
`spec.telemetry.prometheus.enabled` set to `true` and add the following to your
Prometheus configuration:

```yaml title="prometheus.yml"
scrape_configs:
  - job_name: 'toolhive-mcp-proxy'
    static_configs:
      - targets: ['<MCP_SERVER_PROXY_SVC_URL>:<MCP_SERVER_PORT>']
    scrape_interval: 15s
    metrics_path: /metrics
```

You can add multiple MCP servers to the `targets` list. Replace
`<MCP_SERVER_PROXY_SVC_URL>` with the name of the ProxyRunner Service and
`<MCP_SERVER_PORT>` with the port number exposed by that Service.
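
For example, with the `gofetch` server from the examples above, a filled-in
scrape config might look like the following sketch. The Service name and port
are hypothetical; check the actual Service created in your cluster (e.g., with
`kubectl get svc -n toolhive-system`):

```yaml title="prometheus.yml"
scrape_configs:
  - job_name: 'gofetch-mcp-proxy'
    static_configs:
      # Hypothetical ProxyRunner Service address and port
      - targets: ['mcp-gofetch-proxy.toolhive-system.svc.cluster.local:8080']
    scrape_interval: 15s
    metrics_path: /metrics
```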

### Jaeger

[Jaeger](https://www.jaegertracing.io) is a popular open-source distributed
tracing system. You can run it inside of a Kubernetes cluster to store tracing
telemetry data exported by the ToolHive proxy.

You can export traces to Jaeger by setting the OTLP endpoint to an OpenTelemetry
Collector, and then configuring the collector to export tracing data to Jaeger:

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: gofetch
  namespace: toolhive-system
spec:
  ...
  ...
  telemetry:
    openTelemetry:
      enabled: true
      endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
      serviceName: mcp-fetch-server
      insecure: true
      tracing:
        enabled: true
```

Then, inside your OpenTelemetry Collector configuration, route the received
traces to Jaeger:

```yaml
config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  processors:
    # Declared explicitly so the pipeline's `batch` reference resolves
    batch: {}

  exporters:
    otlp/jaeger:
      endpoint: http://jaeger-all-in-one-collector.monitoring:4317

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [otlp/jaeger]
```

### Honeycomb

Coming soon.

You'll need your Honeycomb API key, which you can find in your
[Honeycomb account settings](https://ui.honeycomb.io/account).

### Datadog

Datadog has [multiple options](https://docs.datadoghq.com/opentelemetry/) for
collecting OpenTelemetry data:

- The
  [**OpenTelemetry Collector**](https://docs.datadoghq.com/opentelemetry/setup/collector_exporter/)
  is recommended for existing OpenTelemetry users or users wanting a
  vendor-neutral solution (a configuration sketch follows this list).

- The [**Datadog Agent**](https://docs.datadoghq.com/opentelemetry/setup/agent)
  is recommended for existing Datadog users.
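
As a minimal sketch of the Collector route, this assumes the `datadog` exporter
from the OpenTelemetry Collector contrib distribution, a `DD_API_KEY`
environment variable set on the collector, and the `otlp` receiver and `batch`
processor from the Jaeger example above; adjust `site` for your Datadog region:

```yaml
config:
  exporters:
    datadog:
      api:
        # Assumes DD_API_KEY is set in the collector's environment
        key: ${env:DD_API_KEY}
        site: datadoghq.com

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [datadog]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [datadog]
```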

### Grafana Cloud

Coming soon.

## Performance considerations

### Sampling rates

Adjust the `spec.telemetry.openTelemetry.tracing.samplingRate` based on your
environment:

- **Development**: `samplingRate: "1.0"` (100% sampling)
- **Production**: `samplingRate: "0.01"` (1% sampling for high-traffic systems)
- **Default**: `samplingRate: "0.05"` (5% sampling)

### Network overhead

Telemetry adds minimal overhead when properly configured:

- Use appropriate sampling rates for your traffic volume
- Monitor your observability backend costs and adjust sampling accordingly

## Related information

- [Kubernetes CRD reference](../reference/crd-spec.mdx) - Reference for the
  `MCPServer` Custom Resource Definition (CRD)
- [Deploy the operator using Helm](./deploy-operator-helm.md) - Install the
  ToolHive operator
