Commit 2c40923

ChrisJBurns, Copilot, and danbarr authored
Adds Docs for OpenTelemetry Support in ToolHive Operator & Kubernetes (#166)
* adds otel kubernetes docs Signed-off-by: ChrisJBurns <[email protected]>
* adds sidebar menu for telemetry docs Signed-off-by: ChrisJBurns <[email protected]>
* Update docs/toolhive/guides-k8s/telemetry-and-metrics.md Co-authored-by: Copilot <[email protected]>
* Update docs/toolhive/guides-k8s/telemetry-and-metrics.md Co-authored-by: Copilot <[email protected]>
* Update docs/toolhive/guides-k8s/telemetry-and-metrics.md Co-authored-by: Copilot <[email protected]>
* fixes format issues Signed-off-by: ChrisJBurns <[email protected]>
* modify mermaid diagram Signed-off-by: ChrisJBurns <[email protected]>
* amends CLI docs with new telemetry flags Signed-off-by: ChrisJBurns <[email protected]>
* adds missing flags Signed-off-by: ChrisJBurns <[email protected]>
* removes unneeded flags Signed-off-by: ChrisJBurns <[email protected]>
* disables metrics for jaeger example Signed-off-by: ChrisJBurns <[email protected]>
* oppsie, meant to add to jaeger section Signed-off-by: ChrisJBurns <[email protected]>
* Update docs/toolhive/guides-cli/telemetry-and-metrics.md Co-authored-by: Dan Barr <[email protected]>

---------

Signed-off-by: ChrisJBurns <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Dan Barr <[email protected]>
1 parent 14d16c3 commit 2c40923

File tree

3 files changed

+302
-9
lines changed

docs/toolhive/guides-cli/telemetry-and-metrics.md

Lines changed: 17 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -132,20 +132,23 @@ when running an MCP server with the `thv run` command:
132132

133133
```bash
134134
thv run [--otel-endpoint <URL>] [--otel-service-name <NAME>] \
135+
[--otel-metrics-enabled=<true|false>] [--otel-tracing-enabled=<true|false>] \
135136
[--otel-sampling-rate <RATE>] [--otel-headers <KEY=VALUE>] \
136137
[--otel-insecure] [--otel-enable-prometheus-metrics-path] \
137138
<SERVER>
138139
```
139140

140-
| Flag | Description | Default |
141-
| --------------------------------------- | ----------------------------------------------------------- | -------------------- |
142-
| `--otel-endpoint` | OTLP endpoint (e.g., `api.honeycomb.io`) | None |
143-
| `--otel-service-name` | Service name for telemetry | `toolhive-mcp-proxy` |
144-
| `--otel-sampling-rate` | Trace sampling rate (0.0-1.0) | `0.1` (10%) |
145-
| `--otel-headers` | Authentication headers in `key=value` format | None |
146-
| `--otel-env-vars` | List of environment variables to include in telemetry spans | None |
147-
| `--otel-insecure` | Connect using HTTP instead of HTTPS | `false` |
148-
| `--otel-enable-prometheus-metrics-path` | Enable `/metrics` endpoint | `false` |
141+
| Flag | Description | Default |
142+
| --------------------------------------- | ------------------------------------------------------------- | -------------------- |
143+
| `--otel-endpoint` | OTLP endpoint (e.g., `api.honeycomb.io`) | None |
144+
| `--otel-metrics-enabled` | Enable OTLP metrics export (when OTLP endpoint is configured) | `true` |
145+
| `--otel-tracing-enabled` | Enable distributed tracing (when OTLP endpoint is configured) | `true` |
146+
| `--otel-service-name` | Service name for telemetry | `toolhive-mcp-proxy` |
147+
| `--otel-sampling-rate` | Trace sampling rate (0.0-1.0) | `0.1` (10%) |
148+
| `--otel-headers` | Authentication headers in `key=value` format | None |
149+
| `--otel-env-vars` | List of environment variables to include in telemetry spans | None |
150+
| `--otel-insecure` | Connect using HTTP instead of HTTPS | `false` |
151+
| `--otel-enable-prometheus-metrics-path` | Enable `/metrics` endpoint | `false` |
149152

150153
### Global configuration
151154

@@ -162,7 +165,11 @@ rate:
162165

163166
```bash
164167
thv config otel set-endpoint api.honeycomb.io
168+
thv config otel set-metrics-enabled true
169+
thv config otel set-tracing-enabled true
165170
thv config otel set-sampling-rate 0.25
171+
thv config otel set-enable-prometheus-metrics-path true
172+
thv config otel set-insecure true
166173
```
167174

168175
Each command has a corresponding `get` and `unset` command to retrieve or remove
@@ -240,6 +247,7 @@ by setting the OTLP endpoint to Jaeger's collector:
240247
```bash
241248
thv run \
242249
--otel-endpoint localhost:4318 \
250+
--otel-metrics-enabled=false \
243251
--otel-insecure \
244252
<SERVER>
245253
```
Lines changed: 284 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,284 @@
1+
---
2+
title: Telemetry (metrics and traces)
3+
description:
4+
How to enable OpenTelemetry (metrics and traces) and Prometheus
5+
instrumentation for ToolHive MCP servers inside of Kubernetes using the
6+
ToolHive Operator
7+
---
8+
9+
ToolHive includes built-in instrumentation using OpenTelemetry, which gives you
10+
comprehensive observability for your MCP server interactions. You can export
11+
traces and metrics to popular observability backends like Jaeger, Honeycomb,
12+
Datadog, and Grafana Cloud, or expose Prometheus metrics directly.
13+
14+
## What you can monitor
15+
16+
ToolHive's telemetry captures detailed information about MCP interactions
17+
including traces, metrics, and performance data. For a comprehensive overview of
18+
the telemetry architecture, metrics collection, and monitoring capabilities, see
19+
the [observability overview](../concepts/observability.md).
20+
21+
## Enable telemetry
22+
23+
You can enable telemetry when deploying an MCP server by specifying telemetry
24+
configuration in the `MCPServer` custom resource.
25+
26+
This example runs the Fetch MCP server and exports traces to a deployed instance
27+
of the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/):
28+
29+
```yaml
30+
apiVersion: toolhive.stacklok.dev/v1alpha1
31+
kind: MCPServer
32+
metadata:
33+
name: gofetch
34+
namespace: toolhive-system
35+
spec:
36+
image: ghcr.io/stackloklabs/gofetch/server
37+
transport: streamable-http
38+
port: 8080
39+
targetPort: 8080
40+
...
41+
...
42+
telemetry:
43+
openTelemetry:
44+
enabled: true
45+
endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
46+
serviceName: mcp-fetch-server
47+
insecure: true
48+
metrics:
49+
enabled: true
50+
tracing:
51+
enabled: true
52+
samplingRate: "0.05"
53+
prometheus:
54+
enabled: true
55+
```
56+
57+
The `spec.telemetry.openTelemetry.endpoint` will be the OpenTelemetry collector
58+
that is deployed inside your infrastructure, and the
59+
`spec.telemetry.openTelemetry.serviceName` will be what you can use to identify
60+
your MCP server in your observability stack.
61+
62+
### Export metrics to an OTLP endpoint
63+
64+
If you want to enable ToolHive to export metrics to your OTel collector, you can
65+
enable the `spec.telemetry.openTelemetry.metrics.enabled` flag.
66+
67+
### Export traces to an OTLP endpoint
68+
69+
If you want to enable ToolHive to export tracing information, you can enable the
70+
`spec.telemetry.openTelemetry.tracing.enabled` flag.
71+
72+
You can also set the sampling rate of your traces by setting the
73+
`spec.telemetry.openTelemetry.tracing.samplingRate` option to a number between 0
74+
and 1.0. By default this will be `0.05` which equates to 5% of all requests.
75+
76+
:::note
77+
78+
The `spec.telemetry.openTelemetry.endpoint` is provided as a hostname and
79+
optional port, without a scheme or path (e.g., use `api.honeycomb.io` or
80+
`api.honeycomb.io:443`, not `https://api.honeycomb.io`). ToolHive automatically
81+
uses HTTPS unless `spec.telemetry.openTelemetry.insecure` is set to `true`.
82+
83+
:::
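
As a minimal sketch of the endpoint format described in the note, the fragment
below reuses only fields from the `MCPServer` example earlier on this page. It
assumes that omitting `insecure` leaves the connection on HTTPS, as the note
states; the Honeycomb hostname is just the illustrative value from the note.

```yaml
spec:
  telemetry:
    openTelemetry:
      enabled: true
      # Correct: hostname and optional port, no scheme or path
      endpoint: api.honeycomb.io:443
      # Incorrect: includes a scheme
      # endpoint: https://api.honeycomb.io
      # `insecure` is omitted here, so HTTPS is assumed
```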
84+
85+
By default, the service name is set to `toolhive-mcp-proxy`, and the sampling
86+
rate is `0.05` (5%).
87+
88+
:::tip[Recommendation]
89+
90+
Set the `spec.telemetry.openTelemetry.serviceName` flag to a meaningful name for
91+
each MCP server. This helps you identify the server in your observability
92+
backend.
93+
94+
:::
95+
96+
### Enable Prometheus metrics
97+
98+
You can expose Prometheus-style metrics at `/metrics` on the main transport port
99+
for local scraping by enabling the `spec.telemetry.prometheus.enabled` flag.
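
For illustration, a Prometheus-only fragment might look like the sketch below.
It uses only the `prometheus` block from the `MCPServer` example earlier on this
page and assumes the endpoint can be enabled without configuring an OTLP
endpoint; check the CRD reference before relying on that.

```yaml
spec:
  telemetry:
    # Expose /metrics on the main transport port
    prometheus:
      enabled: true
```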
100+
101+
To access the metrics, you can use `curl` or any Prometheus-compatible scraper.
102+
The metrics are available at `http://<HOST>:<PORT>/metrics`, where `<HOST>` is
103+
the resolvable address of the ToolHive ProxyRunner fronting your MCP server pod and
104+
`<PORT>` is the port that the ProxyRunner service is configured to expose
105+
for traffic.
106+
107+
### Dual export
108+
109+
You can export to both an OTLP endpoint and expose Prometheus metrics
110+
simultaneously.
111+
112+
The `MCPServer` example at the top of this page has dual export enabled.
113+
114+
## Observability backends
115+
116+
ToolHive can export telemetry data to many different observability backends. It
117+
supports exporting traces and metrics to any backend that implements the OTLP
118+
protocol. Some common examples are listed below, but specific configurations
119+
will vary based on your environment and requirements.
120+
121+
### OpenTelemetry Collector (recommended)
122+
123+
The OpenTelemetry Collector is a vendor-agnostic way to receive, process and
124+
export telemetry data. It supports many backend services, scalable deployment
125+
options, and advanced processing capabilities.
126+
127+
```mermaid
128+
graph LR
129+
A[ToolHive] -->|traces & metrics| B[OpenTelemetry Collector]
130+
B --> C[AWS CloudWatch]
131+
B --> D[Splunk]
132+
B --> E[New Relic]
133+
B <--> F[Prometheus]
134+
B --> G[Other OTLP backends]
135+
```
136+
137+
You can run the OpenTelemetry Collector inside a Kubernetes cluster. Follow
138+
the
139+
[OpenTelemetry Collector documentation](https://opentelemetry.io/docs/collector/)
140+
for more information.
141+
142+
To export data to a local OpenTelemetry Collector, set your OTLP endpoint to the
143+
OTLP HTTP receiver port (default is `4318`):
144+
145+
```yaml
146+
apiVersion: toolhive.stacklok.dev/v1alpha1
147+
kind: MCPServer
148+
metadata:
149+
name: gofetch
150+
namespace: toolhive-system
151+
spec:
152+
...
153+
...
154+
telemetry:
155+
openTelemetry:
156+
enabled: true
157+
endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
158+
serviceName: mcp-fetch-server
159+
insecure: true
160+
metrics:
161+
enabled: true
162+
```
163+
164+
### Prometheus
165+
166+
To collect metrics using Prometheus, run your MCP server with the
167+
`spec.telemetry.prometheus.enabled` flag enabled and add the following to your
168+
Prometheus configuration:
169+
170+
```yaml title="prometheus.yml"
171+
scrape_configs:
172+
- job_name: 'toolhive-mcp-proxy'
173+
static_configs:
174+
- targets: ['<MCP_SERVER_PROXY_SVC_URL>:<MCP_SERVER_PORT>']
175+
scrape_interval: 15s
176+
metrics_path: /metrics
177+
```
178+
179+
You can add multiple MCP servers to the `targets` list. Replace
180+
`<MCP_SERVER_PROXY_SVC_URL>` with the ProxyRunner SVC name and
181+
`<MCP_SERVER_PORT>` with the port number exposed by the SVC.
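
As a sketch of scraping more than one MCP server from a single job, the
configuration below lists two targets. Both Service names and the port are
hypothetical placeholders, not names ToolHive is known to generate; substitute
the ProxyRunner Services from your own cluster.

```yaml title="prometheus.yml"
scrape_configs:
  - job_name: 'toolhive-mcp-proxy'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          # Hypothetical ProxyRunner Service addresses; replace with your own
          - 'mcp-gofetch-proxy.toolhive-system.svc.cluster.local:8080'
          - 'mcp-other-proxy.toolhive-system.svc.cluster.local:8080'
```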
182+
183+
### Jaeger
184+
185+
[Jaeger](https://www.jaegertracing.io) is a popular open-source distributed
186+
tracing system. You can run it inside of a Kubernetes cluster in order to store
187+
tracing telemetry data exported by the ToolHive proxy.
188+
189+
You can export traces to Jaeger by setting the OTLP endpoint to an OpenTelemetry
190+
collector, and then configuring the collector to export tracing data to Jaeger.
191+
192+
```yaml
193+
apiVersion: toolhive.stacklok.dev/v1alpha1
194+
kind: MCPServer
195+
metadata:
196+
name: gofetch
197+
namespace: toolhive-system
198+
spec:
199+
...
200+
...
201+
telemetry:
202+
openTelemetry:
203+
enabled: true
204+
endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
205+
serviceName: mcp-fetch-server
206+
insecure: true
207+
tracing:
208+
enabled: true
209+
```
210+
211+
Then, in your OpenTelemetry Collector configuration, add an OTLP exporter that
points to Jaeger and reference it from the traces pipeline:
212+
213+
```yaml
214+
config:
215+
receivers:
216+
otlp:
217+
protocols:
218+
grpc:
219+
endpoint: 0.0.0.0:4317
220+
http:
221+
endpoint: 0.0.0.0:4318
222+
223+
exporters:
224+
otlp/jaeger:
225+
endpoint: http://jaeger-all-in-one-collector.monitoring:4317
226+
227+
service:
228+
pipelines:
229+
traces:
230+
receivers: [otlp]
231+
processors: [batch]
232+
exporters: [otlp/jaeger]
233+
```
234+
235+
### Honeycomb
236+
237+
Coming soon.
238+
239+
You'll need your Honeycomb API key, which you can find in your
240+
[Honeycomb account settings](https://ui.honeycomb.io/account).
241+
242+
### Datadog
243+
244+
Datadog has [multiple options](https://docs.datadoghq.com/opentelemetry/) for
245+
collecting OpenTelemetry data:
246+
247+
- The
248+
[**OpenTelemetry Collector**](https://docs.datadoghq.com/opentelemetry/setup/collector_exporter/)
249+
is recommended for existing OpenTelemetry users or users wanting a
250+
vendor-neutral solution.
251+
252+
- The [**Datadog Agent**](https://docs.datadoghq.com/opentelemetry/setup/agent)
253+
is recommended for existing Datadog users.
254+
255+
### Grafana Cloud
256+
257+
Coming soon.
258+
259+
## Performance considerations
260+
261+
### Sampling rates
262+
263+
Adjust sampling rates based on your environment:
264+
265+
- **Development**: `spec.telemetry.openTelemetry.tracing.samplingRate: 1.0`
266+
(100% sampling)
267+
- **Production**: `spec.telemetry.openTelemetry.tracing.samplingRate: 0.01` (1%
268+
sampling for high-traffic systems)
269+
- **Default**: `spec.telemetry.openTelemetry.tracing.samplingRate: 0.05` (5%
270+
sampling)
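
For example, a development setup that samples every request could look like the
fragment below. It is a sketch that reuses the collector endpoint from the
earlier examples on this page, and it quotes the rate as a string because the
`MCPServer` example above does; adjust both to your environment.

```yaml
spec:
  telemetry:
    openTelemetry:
      enabled: true
      # Collector address taken from the earlier examples; yours may differ
      endpoint: otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4318
      insecure: true
      tracing:
        enabled: true
        # 100% sampling for development; use a lower value such as "0.01" in production
        samplingRate: "1.0"
```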
271+
272+
### Network overhead
273+
274+
Telemetry adds minimal overhead when properly configured:
275+
276+
- Use appropriate sampling rates for your traffic volume
277+
- Monitor your observability backend costs and adjust sampling accordingly
278+
279+
## Related information
280+
281+
- [Kubernetes CRD reference](../reference/crd-spec.mdx) - Reference for the
282+
`MCPServer` Custom Resource Definition (CRD)
283+
- [Deploy the operator using Helm](./deploy-operator-helm.md) - Install the
284+
ToolHive operator

sidebars.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -121,6 +121,7 @@ const sidebars: SidebarsConfig = {
121121
'toolhive/guides-k8s/intro',
122122
'toolhive/guides-k8s/deploy-operator-helm',
123123
'toolhive/guides-k8s/run-mcp-k8s',
124+
'toolhive/guides-k8s/telemetry-and-metrics',
124125
'toolhive/reference/crd-spec',
125126
],
126127
},
