Commit 0ff10ec

docs: Add logging infrastructure guide for Kubernetes deployments (#167)

- Add comprehensive logging infrastructure documentation covering standard and audit logs
- Document log format specifications with field definitions
- Include configuration examples for operator and MCPServer resources
- Add log collection strategies and enterprise tool integration (ELK, Splunk, Datadog)
- Update sidebar navigation to include the new guide
- Add contextual reference from deploy-operator-helm.md

Addresses: stacklok/toolhive-platform#39

Signed-off-by: Juan Antonio Osorio <[email protected]>
1 parent 63e5855 commit 0ff10ec

3 files changed: 326 additions, 1 deletion

docs/toolhive/guides-k8s/deploy-operator-helm.md

Lines changed: 2 additions & 1 deletion
@@ -70,7 +70,8 @@ kubectl logs -f -n toolhive-system <TOOLHIVE_OPERATOR_POD_NAME>
 ```

 This shows you the logs of the operator pod, which can help you debug any
-issues.
+issues. For comprehensive logging and audit capabilities, see the
+[Logging infrastructure](./logging-infrastructure.md) guide.

 ## Customize the operator

docs/toolhive/guides-k8s/logging-infrastructure.md

Lines changed: 323 additions & 0 deletions
@@ -0,0 +1,323 @@
---
title: Logging infrastructure
description:
  Configure and manage logging for ToolHive in Kubernetes environments
sidebar_label: Logging infrastructure
---

The ToolHive Kubernetes operator provides structured logging for monitoring,
auditing, and troubleshooting MCP servers in production environments. This guide
focuses on the essential pieces: log formats, minimal configuration, and simple
integrations.

## Overview

The ToolHive operator produces two types of logs:

1. **Standard application logs** - Structured operational logs from the ToolHive
   operator and proxy components
2. **Audit logs** - Security and compliance logs tracking all MCP operations

```mermaid
flowchart TB
  subgraph Sources["Log Sources"]
    Op["ToolHive Operator"]
    Proxy["HTTP Proxy Pods"]
    MCP["MCP Server Pods"]
  end

  subgraph Processing["Log Processing"]
    Stdout["Standard Output<br>(Structured JSON)"]
    Audit["Audit Logger<br>(Structured JSON)"]
  end

  subgraph Destinations["Log Destinations"]
    K8s["Kubernetes Logs"]
    File["Audit Log Files"]
    ELK["ELK Stack"]
    Splunk["Splunk"]
    Datadog["Datadog"]
  end

  Op --> Stdout
  Proxy --> Stdout & Audit
  MCP --> Stdout

  Stdout --> K8s
  Audit --> File & K8s

  K8s --> ELK & Splunk & Datadog
```

## Log formats and content specifications

### Standard application logs

ToolHive writes standard application logs as structured JSON with a consistent
set of fields, which makes them easy to parse and analyze.

#### Log format specification

```json
{
  "level": "info",
  "ts": 1704067200.123456,
  "caller": "controllers/mcpserver_controller.go:123",
  "msg": "Starting MCP server",
  "server": "github",
  "transport": "sse",
  "container": "thv-github-abc123",
  "namespace": "default",
  "version": "0.1.0"
}
```

#### Field definitions

| Field       | Type   | Description                                       |
| ----------- | ------ | ------------------------------------------------- |
| `level`     | string | Log level: `debug`, `info`, `warn`, `error`       |
| `ts`        | float  | Unix timestamp in seconds (microsecond precision) |
| `caller`    | string | Source code location                              |
| `msg`       | string | Log message                                       |
| `server`    | string | MCP server name                                   |
| `transport` | string | Transport type: `stdio`, `sse`, `streamable-http` |
| `container` | string | Container name                                    |
| `namespace` | string | Kubernetes namespace                              |
| `version`   | string | Component version                                 |
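
As a rough illustration of how a consumer might use these fields, the Python
sketch below parses one standard log line and converts `ts` into a
timezone-aware UTC datetime. The helper name `parse_standard_log` and the added
`timestamp` key are hypothetical and not part of ToolHive; the sample line
reuses the values from the format example above.

```python
import json
from datetime import datetime, timezone


def parse_standard_log(line: str) -> dict:
    """Parse one standard log line and add a timezone-aware `timestamp` key."""
    record = json.loads(line)
    # `ts` is a Unix timestamp in seconds; the fractional part carries
    # the sub-second precision.
    record["timestamp"] = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return record


# Sample line built from the documented format example.
sample = (
    '{"level": "info", "ts": 1704067200.123456, '
    '"caller": "controllers/mcpserver_controller.go:123", '
    '"msg": "Starting MCP server", "server": "github", "transport": "sse", '
    '"container": "thv-github-abc123", "namespace": "default", '
    '"version": "0.1.0"}'
)

record = parse_standard_log(sample)
print(record["timestamp"].isoformat(), record["level"], record["msg"])
```

Converting to a timezone-aware value at parse time avoids ambiguity when these
records are later compared with audit events, which carry ISO 8601 strings
instead of a float.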

### Audit logs

Audit logs record every MCP operation for security and compliance purposes.
When auditing is enabled, the ToolHive proxy emits one structured audit event
per MCP operation.

#### Audit log format specification

```json
{
  "time": "2024-01-01T12:00:00.123456789Z",
  "level": "INFO+2",
  "msg": "audit_event",
  "audit_id": "550e8400-e29b-41d4-a716-446655440000",
  "type": "mcp_tool_call",
  "logged_at": "2024-01-01T12:00:00.123456Z",
  "outcome": "success",
  "component": "github-server",
  "source": {
    "type": "network",
    "value": "10.0.1.5",
    "extra": {
      "user_agent": "node"
    }
  },
  "subjects": {
    "user": "[email protected]",
    "user_id": "user-123"
  },
  "target": {
    "endpoint": "/messages",
    "method": "tools/call",
    "name": "search_issues",
    "type": "tool"
  },
  "metadata": {
    "extra": {
      "duration_ms": 245,
      "transport": "http"
    }
  }
}
```

#### Audit field definitions

| Field                        | Type   | Description                                                             |
| ---------------------------- | ------ | ----------------------------------------------------------------------- |
| `time`                       | string | Timestamp when the log was generated                                    |
| `level`                      | string | Log level (INFO+2 for audit events)                                     |
| `msg`                        | string | Always "audit_event" for audit logs                                     |
| `audit_id`                   | string | Unique identifier for the audit event                                   |
| `type`                       | string | Type of MCP operation (see event types below)                           |
| `logged_at`                  | string | UTC timestamp of the event                                              |
| `outcome`                    | string | Result of the operation: `success` or `failure`                         |
| `component`                  | string | Name of the MCP server                                                  |
| `source`                     | object | Request source information                                              |
| `source.type`                | string | Source type (e.g., "network")                                           |
| `source.value`               | string | Source identifier (e.g., IP address)                                    |
| `source.extra`               | object | Additional source metadata                                              |
| `subjects`                   | object | User/identity information                                               |
| `subjects.user`              | string | User display name (from JWT claims: name, preferred_username, or email) |
| `subjects.user_id`           | string | User identifier (from JWT sub claim)                                    |
| `subjects.client_name`       | string | Optional: Client application name (if present in JWT claims)            |
| `subjects.client_version`    | string | Optional: Client version (if present in JWT claims)                     |
| `target`                     | object | Target resource information                                             |
| `target.endpoint`            | string | API endpoint path                                                       |
| `target.method`              | string | MCP method called                                                       |
| `target.name`                | string | Tool/resource name                                                      |
| `target.type`                | string | Target type (e.g., "tool")                                              |
| `metadata`                   | object | Additional metadata                                                     |
| `metadata.extra.duration_ms` | number | Operation duration in milliseconds                                      |
| `metadata.extra.transport`   | string | Transport protocol used                                                 |
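
To show how the nested fields in this table fit together, here is a minimal
Python sketch that turns one audit event into a single summary line. The
function name `summarize_audit_event`, the fallback strings, and the user value
`alice@example.com` are assumptions made for illustration only; the remaining
values come from the format example above.

```python
import json


def summarize_audit_event(event: dict) -> str:
    """Build a one-line summary from the documented audit fields."""
    # Nested objects are read defensively so a missing one does not raise.
    subjects = event.get("subjects", {})
    target = event.get("target", {})
    extra = event.get("metadata", {}).get("extra", {})

    user = subjects.get("user", "unknown")  # fallback string is an assumption
    duration = extra.get("duration_ms")
    duration_text = f" in {duration} ms" if duration is not None else ""

    return (
        f'{event["logged_at"]} {event["type"]} {event["outcome"]}: '
        f'{user} -> {event["component"]}/{target.get("name", "-")}'
        f"{duration_text}"
    )


event = json.loads(
    """
    {
      "msg": "audit_event",
      "audit_id": "550e8400-e29b-41d4-a716-446655440000",
      "type": "mcp_tool_call",
      "logged_at": "2024-01-01T12:00:00.123456Z",
      "outcome": "success",
      "component": "github-server",
      "subjects": {"user": "alice@example.com", "user_id": "user-123"},
      "target": {"method": "tools/call", "name": "search_issues", "type": "tool"},
      "metadata": {"extra": {"duration_ms": 245, "transport": "http"}}
    }
    """
)

print(summarize_audit_event(event))
```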

#### Audit event types

| Event Type           | Description               |
| -------------------- | ------------------------- |
| `mcp_initialize`     | MCP server initialization |
| `mcp_tool_call`      | Tool execution request    |
| `mcp_tools_list`     | List available tools      |
| `mcp_resource_read`  | Resource access           |
| `mcp_resources_list` | List available resources  |
| `mcp_prompt_get`     | Prompt retrieval          |
| `mcp_prompts_list`   | List available prompts    |
| `mcp_notification`   | MCP notifications         |
| `mcp_ping`           | Health check pings        |
| `mcp_completion`     | Request completion        |
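
One way to put the event types to work is to tally audit events by `type` and
`outcome`, for example to spot a spike in failed tool calls. The sketch below
is an assumption about how a consumer might do this, not a ToolHive feature:
`count_audit_events` is a hypothetical helper and the sample lines are made up.
It relies only on the documented fact that `msg` is always `audit_event` for
audit logs.

```python
import json
from collections import Counter
from typing import Iterable


def count_audit_events(lines: Iterable[str]) -> Counter:
    """Count audit events per (type, outcome); ignore everything else."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict) or record.get("msg") != "audit_event":
            continue  # skip standard application logs
        counts[(record.get("type"), record.get("outcome"))] += 1
    return counts


# Made-up sample input mixing audit events, a standard log, and noise.
sample_lines = [
    '{"msg": "audit_event", "audit_id": "a1", "type": "mcp_tool_call", "outcome": "success"}',
    '{"msg": "audit_event", "audit_id": "a2", "type": "mcp_tool_call", "outcome": "failure"}',
    '{"msg": "audit_event", "audit_id": "a3", "type": "mcp_tools_list", "outcome": "success"}',
    '{"level": "info", "msg": "Starting MCP server"}',
    "not json at all",
]

counts = count_audit_events(sample_lines)
for (event_type, outcome), total in sorted(counts.items()):
    print(f"{event_type:<20} {outcome:<8} {total}")
```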

## Configuration

### Operator-level logging

Configure logging for the ToolHive operator in the Helm values:

```yaml
# values.yaml
operator:
  # Log level is controlled by the debug flag
  debug: false # Production: use info level (set to true for debug level)
```

### MCPServer logging configuration

Configure audit logging for individual MCP servers in the MCPServer resource:

```yaml
apiVersion: mcp.toolhive.io/v1alpha1
kind: MCPServer
metadata:
  name: github-server
spec:
  image: ghcr.io/stacklok/toolhive/servers/github:latest

  # Audit logging configuration
  audit:
    enabled: true # Audit logs are output to stdout alongside standard logs
```

:::info

Audit logs are output to stdout alongside standard application logs. Log
collectors can differentiate between standard and audit logs by checking for the
presence of the `audit_id` field.

User information in the `subjects` field is populated from JWT claims when OIDC
authentication is configured. The system uses the `name`, `preferred_username`,
or `email` claim (in that order) for the display name. Without authentication
middleware, the user appears as "anonymous".

:::
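
The two rules in this note can be sketched in a few lines of Python. This is an
illustration of the described behavior, not ToolHive's implementation: the
function names `split_records` and `display_name` are hypothetical, and falling
back to "anonymous" when claims are present but none of the three is set is an
assumption the guide does not confirm.

```python
import json
from typing import Iterable, List, Optional, Tuple


def split_records(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split JSON log lines into (standard, audit) using the `audit_id` field."""
    standard: List[dict] = []
    audit: List[dict] = []
    for line in lines:
        record = json.loads(line)
        (audit if "audit_id" in record else standard).append(record)
    return standard, audit


def display_name(claims: Optional[dict]) -> str:
    """Pick the display name from JWT claims in the documented order."""
    if not claims:
        return "anonymous"  # no authentication middleware configured
    for key in ("name", "preferred_username", "email"):
        value = claims.get(key)
        if value:
            return value
    return "anonymous"  # assumption: behavior for this case is not documented


standard, audit = split_records(
    [
        '{"level": "info", "msg": "Starting MCP server"}',
        '{"msg": "audit_event", "audit_id": "a1", "type": "mcp_ping"}',
    ]
)
print(len(standard), len(audit))
print(display_name({"preferred_username": "jdoe", "email": "jdoe@example.com"}))
print(display_name(None))
```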

## Minimal collectors

The examples below show minimal configurations only. They assume a standard
Kubernetes setup where container stdout/stderr is written to files under
`/var/log/containers/`. They do not include deployment manifests.
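
Keep in mind that lines in those files are wrapped by the container runtime, so
a collector usually has to unwrap them before it sees ToolHive's JSON. The
Python sketch below shows the idea for the two common layouts: the CRI format
used by containerd and CRI-O (`<timestamp> <stream> <P|F> <message>`) and the
Docker `json-file` format (application output in a `log` key). The helper
`extract_payload` is hypothetical, and it assumes each entry fits on one line,
so partial CRI lines (tag `P`) are not reassembled.

```python
import json


def extract_payload(raw_line: str) -> dict:
    """Return the application's JSON record from one container log file line."""
    raw_line = raw_line.rstrip("\n")
    if raw_line.startswith("{"):
        # Docker json-file driver: application output is in the "log" key.
        wrapper = json.loads(raw_line)
        return json.loads(wrapper["log"])
    # CRI format: "<timestamp> <stream> <P|F> <message>".
    # maxsplit=3 keeps spaces inside the message intact.
    _timestamp, _stream, _tag, message = raw_line.split(" ", 3)
    return json.loads(message)


cri_line = (
    "2024-01-01T12:00:00.123456789Z stdout F "
    '{"level": "info", "msg": "Starting MCP server"}'
)
docker_line = json.dumps(
    {
        "log": json.dumps({"msg": "audit_event", "audit_id": "a1"}) + "\n",
        "stream": "stdout",
        "time": "2024-01-01T12:00:00.123456789Z",
    }
)

print(extract_payload(cri_line)["msg"])
print(extract_payload(docker_line)["audit_id"])
```

Collectors generally ship their own parsers for these wrappers, so in practice
this step is configuration rather than code.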

### Fluentd (minimal)

```text
# fluentd.conf
<source>
  @type tail
  path /var/log/containers/*toolhive*.log
  tag toolhive
  read_from_head true
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

# Retag entries so audit logs (those that contain audit_id) can be routed
# separately. Requires the fluent-plugin-rewrite-tag-filter plugin.
<match toolhive>
  @type rewrite_tag_filter
  <rule>
    key audit_id
    pattern /.+/
    tag toolhive.audit
  </rule>
  <rule>
    key audit_id
    pattern /.+/
    invert true
    tag toolhive.standard
  </rule>
</match>

# Route audit logs to a separate index
<match toolhive.audit>
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  index_name toolhive-audit
</match>

# Route standard logs
<match toolhive.standard>
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  index_name toolhive
</match>
```
266+
267+
### Filebeat (minimal)
268+
269+
```yaml
270+
filebeat.inputs:
271+
- type: container
272+
paths:
273+
- /var/log/containers/*toolhive*.log
274+
json.keys_under_root: true
275+
json.add_error_key: true
276+
277+
output.elasticsearch:
278+
hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
279+
indices:
280+
- index: 'toolhive-audit-%{+yyyy.MM.dd}'
281+
when.has_fields: ['audit_id']
282+
- index: 'toolhive-%{+yyyy.MM.dd}'
283+
```
284+
285+
### Splunk (minimal)
286+
287+
```ini
288+
# inputs.conf
289+
[monitor:///var/log/containers/*toolhive*]
290+
sourcetype = _json
291+
index = toolhive
292+
293+
# props.conf
294+
[_json]
295+
KV_MODE = json
296+
SHOULD_LINEMERGE = false
297+
TRANSFORMS-route_audit = route_audit
298+
299+
# transforms.conf
300+
[route_audit]
301+
REGEX = "audit_id":\s*".+"
302+
DEST_KEY = _MetaData:Index
303+
FORMAT = toolhive_audit
304+
```

## Best practices

### Security considerations

- Encrypt audit logs at rest and in transit.
- Implement RBAC to restrict access to pod logs.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader
  namespace: toolhive-system
rules:
  - apiGroups: ['']
    # 'pods' is included because kubectl logs reads the Pod object before
    # requesting its log
    resources: ['pods', 'pods/log']
    verbs: ['get', 'list']
```

sidebars.ts

Lines changed: 1 addition & 0 deletions
@@ -123,6 +123,7 @@ const sidebars: SidebarsConfig = {
 'toolhive/guides-k8s/deploy-operator-helm',
 'toolhive/guides-k8s/run-mcp-k8s',
 'toolhive/guides-k8s/telemetry-and-metrics',
+'toolhive/guides-k8s/logging-infrastructure',
 'toolhive/reference/crd-spec',
 ],
 },
