feat(claude-self-obs): self-observability plugin for Claude Code (#150)

ANcpLua · claude · web-flow · commit c745d8c8a5b9 · 2026-02-28T23:28:11.000+01:00
* docs: add spec-0002 qyl Claude Code observability

Comprehensive spec for building Claude Code session observability
into qyl's AI telemetry dashboard. Zero-instrumentation approach
using Claude Code's native OTLP telemetry export (4 env vars).

Covers: OTLP data flow, DuckDB schema, 5 API endpoints, React
hooks, 4 dashboard components, SSE live streaming, and 4-phase
implementation plan. Correlation via prompt.id across all events.

Also fixes CLAUDE.md skill count (6 → 4 after council/invoke and
feature-dev/code-review skill deletions).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(qyl-instrumentation): rebuild as Teams API orchestration (v2.0.0)

Restructure from 3 standalone agents to 1 Opus captain + 4 Sonnet specialists.
Captain pre-reads otelwiki bundled semconv docs before spawning specialists —
eliminates runtime web search/fetch. New /observe command implements full
TeamCreate → spawn → cross-pollinate → synthesize → TeamDelete lifecycle.

New: opus-captain agent, qyl-platform-specialist agent, /observe command.
Changed: removed WebSearch/WebFetch from all specialist tools, added Team Protocol
sections with SendMessage coordination patterns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(qyl-instrumentation): competition-ready polish (v2.1.0)

Add hero scenario (proactive secretary), 8-layer trace example,
attribute decision tree, multi-turn agent traces, GenAI failure
modes, TypeSpec-to-dashboard flow, MCP/SSE patterns, SEMCONV_CONTEXT
shape, verification checklists, and example run walkthrough across
all 6 agent/command files. All stay under 250 lines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(claude-self-obs): add self-observability plugin for Claude Code

Every tool call (Read, Edit, Bash, Grep, WebSearch, Task, …) becomes
an OTLP span posted to the configured collector. Agent lifecycle events
(SubagentStart/Stop) emit trace boundary spans.
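
Spans are tied together by deterministic IDs. The following is a minimal
sketch of that derivation, modelled on the hook scripts in this commit.
`hex_id` is a hypothetical helper name (the scripts inline this logic), the
sample keys are made up, and this sketch probes for `md5sum` first whereas
the scripts try `md5 -q` first:

```bash
# hex_id KEY LEN: print the first LEN hex characters of the MD5 digest of KEY.
# Hypothetical helper for illustration; not part of the plugin.
hex_id() {
  key="$1"; len="$2"
  if command -v md5sum > /dev/null 2>&1; then
    digest=$(printf '%s' "$key" | md5sum)     # Linux / GNU coreutils
  else
    digest=$(printf '%s' "$key" | md5 -q)     # macOS fallback
  fi
  # md5sum appends "  -" after the digest; cut keeps only the hex prefix.
  printf '%s\n' "$digest" | cut -c1-"$len"
}

TRACE_ID=$(hex_id "example-session-id" 32)    # 32 hex chars, shared by the session
SPAN_ID=$(hex_id "example-tool-use-id" 16)    # 16 hex chars, one per tool call
echo "traceId=$TRACE_ID spanId=$SPAN_ID"
```

Because the IDs are pure functions of their inputs, every hook invocation in
the same session lands in the same trace without any shared state.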

Zero config: when no collector is running, each hook's POST fails quietly
(curl is capped at 2 seconds and its errors are discarded), so a missing
collector never fails the agent. Enable by starting any OTLP HTTP collector
on :5100 (or set QYL_COLLECTOR_URL). Disable by stopping the collector.
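
The no-op behaviour can be sketched as a best-effort POST. `post_span` is a
hypothetical wrapper name (the hook scripts inline the same curl call), and
the empty payload is only a placeholder:

```bash
# post_span URL PAYLOAD: best-effort POST that always exits 0.
# Hypothetical wrapper for illustration; not part of the plugin.
post_span() {
  curl -s -X POST "$1" \
    -H "Content-Type: application/json" \
    -d "$2" \
    --max-time 2 > /dev/null 2>&1 || true   # discard output and any failure
}

COLLECTOR_URL="${QYL_COLLECTOR_URL:-http://localhost:5100}/v1/traces"
post_span "$COLLECTOR_URL" '{"resourceSpans":[]}'
echo "exit status: $?"
```

The exit status is 0 whether or not a collector is listening; the trade-off
is that a dropped span leaves no trace in the hook's output.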

Files:
- hooks/emit-span.sh         PostToolUse   → OTLP span
- hooks/emit-agent-start.sh  SubagentStart → agent/start span
- hooks/emit-agent-stop.sh   SubagentStop  → agent/stop span (linked to start)
- hooks/hooks.json           auto-loaded hook registration
- commands/status.md         /claude-self-obs:status command

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -30,6 +30,24 @@ and the project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
 
 ### Changed
 
+- **`qyl-instrumentation` (2.0.0 → 2.1.0)**: Competition-ready polish for all 6 agent/command files. Added hero scenario (proactive secretary notification handler), 8-layer trace example, attribute decision tree, performance profile, multi-turn agent trace, GenAI failure modes with 15-second async window, TypeSpec-to-dashboard end-to-end flow, MCP tool pattern, SSE consumption pattern, SEMCONV_CONTEXT shape, spawn/synthesis verification checklists, and example run walkthrough. All files stay under 250 lines
+
+### Added
+
+- **`docs/specs/spec-0002-qyl-claude-code-observability.md`**: Comprehensive spec for building Claude Code session observability into qyl's AI telemetry dashboard. Covers OTLP data flow (native `claude_code.*` metrics + events), DuckDB schema, 5 API endpoints, React hooks, 4 dashboard components, SSE live streaming, and 4-phase implementation plan. Zero-instrumentation approach — uses Claude Code's built-in OTLP telemetry export via 4 env vars
+- **`qyl-instrumentation/commands/observe.md`**: Teams API orchestration command — Opus captain pre-reads otelwiki bundled semconv docs, assembles SEMCONV_CONTEXT + SHARED_AWARENESS, spawns 4 Sonnet specialists in parallel, coordinates cross-pollination via SendMessage, synthesizes. Zero runtime web search
+- **`qyl-instrumentation/agents/opus-captain.md`**: Opus captain agent — orchestrates context assembly and team coordination, reads otelwiki docs before any specialist spawns
+- **`qyl-instrumentation/agents/qyl-platform-specialist.md`**: 4th Sonnet specialist covering MCP server, React dashboard, browser OTLP SDK, SSE streaming, and Copilot extensibility
+
+### Changed
+
+- **`qyl-instrumentation`**: Rebuilt from 3 standalone agents (v1.0.0) to Teams API orchestration (v2.0.0). 1 Opus captain + 4 Sonnet specialists. Captain pre-reads otelwiki bundled docs — specialists receive pre-assembled semconv context in spawn prompts instead of web searching at runtime
+- **`qyl-instrumentation` agents**: Removed `WebSearch` and `WebFetch` from all 3 existing specialist tool lists. Added Team Protocol sections documenting SendMessage coordination patterns and SEMCONV_CONTEXT injection
+- **`qyl-instrumentation/agents/otel-genai-architect.md`**: Convention verification now references captain's SEMCONV_CONTEXT instead of WebSearch
+- **`marketplace.json`**: Updated qyl-instrumentation description and version (1.0.0 → 2.0.0), agent count 17 → 19, command count 23 → 24
+
+### Changed
+
 - **`exodia/skills/hades`**: Migrated from vague Teams references to explicit Teams API. SKILL.md now uses `TeamCreate`, `TeamDelete`, `SendMessage` (shutdown_request/shutdown_response), `TaskCreate`/`TaskList`/`TaskUpdate` with explicit parameters. Removed fallback subagent path and duplicate STEP -1 block. All 4 teammate templates (auditors, eliminators, verifiers, goggles) updated: vague `MESSAGE` → `SendMessage (recipient: "...")`, vague task list → `TaskCreate`/`TaskUpdate`, team context preamble and shutdown protocol added
 - **`exodia/eight-gates` Gate 7 EXECUTE**: Removed dual Mode A (Task subagents) / Mode B (Agent Teams) pattern. Teams API is now the single execution mode. Lane workers coordinate via `SendMessage` and claim work via `TaskCreate`/`TaskUpdate`. Collision avoidance uses teammate messaging
 - **`exodia/skills/hades` allowed-tools**: Added `TeamCreate`, `TeamDelete`, `TaskCreate`, `TaskList`, `TaskUpdate`, `SendMessage` to frontmatter
diff --git a/plugins/claude-self-obs/.claude-plugin/plugin.json b/plugins/claude-self-obs/.claude-plugin/plugin.json
@@ -0,0 +1,21 @@
+{
+  "name": "claude-self-obs",
+  "version": "1.0.0",
+  "description": "Self-observability for Claude Code: every tool call becomes an OTLP span. Watch AI agents build software in real time. Zero config — silently no-ops when no collector is running.",
+  "author": {
+    "name": "ANcpLua",
+    "url": "https://github.com/ANcpLua"
+  },
+  "repository": "https://github.com/ANcpLua/ancplua-claude-plugins",
+  "license": "MIT",
+  "keywords": [
+    "opentelemetry",
+    "otlp",
+    "observability",
+    "tracing",
+    "hooks",
+    "claude-code",
+    "self-observability"
+  ],
+  "commands": "./commands"
+}
diff --git a/plugins/claude-self-obs/README.md b/plugins/claude-self-obs/README.md
@@ -0,0 +1,55 @@
+# claude-self-obs
+
+**Watch AI agents build software in real time.**
+
+Every Claude Code tool call (Read, Edit, Bash, Grep, WebSearch, …) becomes an OTLP span.
+Agent lifecycle events (spawn, stop) become trace boundaries.
+Everything flows to your OTLP collector — zero config, zero code changes.
+
+## How it works
+
+```
+Claude Code tool call
+  → PostToolUse hook fires
+  → emit-span.sh wraps it as OTLP ExportTraceServiceRequest
+  → POST to localhost:5100/v1/traces
+  → Collector stores + streams to dashboard
+```
+
+**Enable:** start your OTLP collector (qyl, Jaeger, any OTLP HTTP endpoint).
+**Disable:** stop the collector. Hook silently no-ops — never blocks the agent.
+
+## Signals captured
+
+| Hook | Span name | Key attributes |
+|------|-----------|----------------|
+| PostToolUse (Read) | `tool/Read` | `file.path` |
+| PostToolUse (Edit) | `tool/Edit` | `file.path` |
+| PostToolUse (Bash) | `tool/Bash` | `bash.command` |
+| PostToolUse (Grep) | `tool/Grep` | `search.pattern` |
+| PostToolUse (WebSearch) | `tool/WebSearch` | `search.query` |
+| PostToolUse (Task) | `tool/Task` | `task.subagent_type`, `task.prompt` |
+| SubagentStart | `agent/start:{name}` | `agent.name`, `agent.type` |
+| SubagentStop | `agent/stop:{name}` | `agent.name`, `agent.type`, `agent.id` |
+
+## Trace model
+
+All spans in a session share one `traceId` (the MD5 hex digest of `session_id`).
+Agent start/stop spans are parent/child pairs.
+Tool call spans are flat: they carry no `parentSpanId` and no duration yet (`startTimeUnixNano == endTimeUnixNano`).
+
+## Commands
+
+| Command | What it does |
+|---------|-------------|
+| `/claude-self-obs:status` | Check if collector is reachable |
+
+## Configuration
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `QYL_COLLECTOR_URL` | `http://localhost:5100` | OTLP HTTP endpoint base URL |
+
+## Dependencies
+
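+`curl` and `jq` are required; `python3` is optional (without it, timestamps fall back to `date` with one-second resolution). `curl` ships with macOS; `jq` and `python3` may need to be installed separately, depending on the OS version.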
diff --git a/plugins/claude-self-obs/commands/status.md b/plugins/claude-self-obs/commands/status.md
@@ -0,0 +1,45 @@
+---
+description: Check whether the OTLP collector is reachable and self-observability is active.
+---
+
+# /claude-self-obs:status
+
+Check if the claude-self-obs plugin is actively sending spans.
+
+## What this does
+
+1. Reads `$QYL_COLLECTOR_URL` (default: `http://localhost:5100`)
+2. Attempts a health check against the collector
+3. Reports: **active** (spans flowing) or **standby** (silently dropping, collector unreachable)
+4. Lists recent sessions (ID prefix and span count) if the collector exposes a sessions API
+
+## Steps
+
+Run this in a Bash tool:
+
+```bash
+COLLECTOR="${QYL_COLLECTOR_URL:-http://localhost:5100}"
+
+echo "Checking collector at $COLLECTOR..."
+
+if curl -sf "$COLLECTOR/health" > /dev/null 2>&1 \
+   || curl -sf "$COLLECTOR/api/v1/claude-code/sessions" > /dev/null 2>&1; then
+  echo "✓ ACTIVE — spans are flowing to $COLLECTOR"
+  echo ""
+  echo "Recent sessions:"
+  curl -sf "$COLLECTOR/api/v1/claude-code/sessions" | jq -r '.[] | "  \(.session_id[:8])...  \(.tool_count) spans"' 2>/dev/null || true
+else
+  echo "◌ STANDBY — collector unreachable at $COLLECTOR"
+  echo "  Spans are silently dropped. Start qyl to begin collecting."
+  echo "  To override URL: export QYL_COLLECTOR_URL=http://your-collector:port"
+fi
+```
+
+## Enable / Disable
+
+| Action | How |
+|--------|-----|
+| **Enable** | Start qyl collector (`dotnet run` in the qyl project) |
+| **Disable** | Stop the collector — hook silently no-ops |
+| **Change URL** | `export QYL_COLLECTOR_URL=http://other-host:5100` |
+| **Uninstall** | Disable plugin in Claude Code settings |
diff --git a/plugins/claude-self-obs/hooks/emit-agent-start.sh b/plugins/claude-self-obs/hooks/emit-agent-start.sh
@@ -0,0 +1,60 @@
+#!/usr/bin/env bash
+# emit-agent-start.sh — SubagentStart hook
+# Creates an "agent/start" span when a subagent is spawned.
+
+set -euo pipefail
+
+COLLECTOR_URL="${QYL_COLLECTOR_URL:-http://localhost:5100}/v1/traces"
+
+INPUT=$(cat)
+
+SESSION_ID=$(echo "$INPUT" | jq -r '.session_id  // "unknown"')
+AGENT_NAME=$(echo "$INPUT" | jq -r '.agent_name  // "unknown"')
+AGENT_TYPE=$(echo "$INPUT" | jq -r '.agent_type  // ""')
+CWD=$(echo "$INPUT"        | jq -r '.cwd         // ""')
+
+TRACE_ID=$(printf '%s' "$SESSION_ID" | md5 -q 2>/dev/null \
+        || printf '%s' "$SESSION_ID" | md5sum | cut -c1-32)
+TRACE_ID="${TRACE_ID:0:32}"
+
+# Span ID from session + agent name (deterministic; repeated spawns of the same agent in one session share this ID)
+SPAN_KEY="${SESSION_ID}:agent_start:${AGENT_NAME}"
+SPAN_ID=$(printf '%s' "$SPAN_KEY" | md5 -q 2>/dev/null \
+       || printf '%s' "$SPAN_KEY" | md5sum | cut -c1-16)
+SPAN_ID="${SPAN_ID:0:16}"
+
+NOW_NS=$(python3 -c "import time; print(int(time.time() * 1e9))" 2>/dev/null \
+      || date +%s000000000)
+
+OTLP_PAYLOAD=$(jq -n \
+  --arg trace_id "$TRACE_ID" --arg span_id "$SPAN_ID" \
+  --arg agent "$AGENT_NAME" --arg type "$AGENT_TYPE" \
+  --arg session "$SESSION_ID" --arg cwd "$CWD" --arg now_ns "$NOW_NS" \
+  '{
+    resourceSpans: [{
+      resource: { attributes: [
+        { key: "service.name", value: { stringValue: "claude-code" } },
+        { key: "session.id",   value: { stringValue: $session } },
+        { key: "process.cwd",  value: { stringValue: $cwd } }
+      ]},
+      scopeSpans: [{
+        scope: { name: "claude-code.hooks", version: "1.0.0" },
+        spans: [{
+          traceId: $trace_id, spanId: $span_id,
+          name: ("agent/start:" + $agent), kind: 1,
+          startTimeUnixNano: $now_ns, endTimeUnixNano: $now_ns,
+          attributes: [
+            { key: "agent.name", value: { stringValue: $agent } },
+            { key: "agent.type", value: { stringValue: $type } },
+            { key: "event",      value: { stringValue: "SubagentStart" } }
+          ],
+          status: { code: 1 }
+        }]
+      }]
+    }]
+  }')
+
+curl -s -X POST "$COLLECTOR_URL" \
+  -H "Content-Type: application/json" \
+  -d "$OTLP_PAYLOAD" \
+  --max-time 2 > /dev/null 2>&1 || true
diff --git a/plugins/claude-self-obs/hooks/emit-agent-stop.sh b/plugins/claude-self-obs/hooks/emit-agent-stop.sh
@@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+# emit-agent-stop.sh — SubagentStop hook
+# Creates an "agent/stop" span when a subagent finishes.
+
+set -euo pipefail
+
+COLLECTOR_URL="${QYL_COLLECTOR_URL:-http://localhost:5100}/v1/traces"
+
+INPUT=$(cat)
+
+SESSION_ID=$(echo "$INPUT" | jq -r '.session_id  // "unknown"')
+AGENT_NAME=$(echo "$INPUT" | jq -r '.agent_name  // "unknown"')
+AGENT_TYPE=$(echo "$INPUT" | jq -r '.agent_type  // ""')
+AGENT_ID=$(echo "$INPUT"   | jq -r '.agent_id    // ""')
+CWD=$(echo "$INPUT"        | jq -r '.cwd         // ""')
+
+TRACE_ID=$(printf '%s' "$SESSION_ID" | md5 -q 2>/dev/null \
+        || printf '%s' "$SESSION_ID" | md5sum | cut -c1-32)
+TRACE_ID="${TRACE_ID:0:32}"
+
+SPAN_KEY="${SESSION_ID}:agent_stop:${AGENT_NAME}:${AGENT_ID}"
+SPAN_ID=$(printf '%s' "$SPAN_KEY" | md5 -q 2>/dev/null \
+       || printf '%s' "$SPAN_KEY" | md5sum | cut -c1-16)
+SPAN_ID="${SPAN_ID:0:16}"
+
+# Parent span = the start span for this agent
+PARENT_KEY="${SESSION_ID}:agent_start:${AGENT_NAME}"
+PARENT_ID=$(printf '%s' "$PARENT_KEY" | md5 -q 2>/dev/null \
+         || printf '%s' "$PARENT_KEY" | md5sum | cut -c1-16)
+PARENT_ID="${PARENT_ID:0:16}"
+
+NOW_NS=$(python3 -c "import time; print(int(time.time() * 1e9))" 2>/dev/null \
+      || date +%s000000000)
+
+OTLP_PAYLOAD=$(jq -n \
+  --arg trace_id "$TRACE_ID" --arg span_id "$SPAN_ID" --arg parent_id "$PARENT_ID" \
+  --arg agent "$AGENT_NAME" --arg type "$AGENT_TYPE" --arg agent_id "$AGENT_ID" \
+  --arg session "$SESSION_ID" --arg cwd "$CWD" --arg now_ns "$NOW_NS" \
+  '{
+    resourceSpans: [{
+      resource: { attributes: [
+        { key: "service.name", value: { stringValue: "claude-code" } },
+        { key: "session.id",   value: { stringValue: $session } },
+        { key: "process.cwd",  value: { stringValue: $cwd } }
+      ]},
+      scopeSpans: [{
+        scope: { name: "claude-code.hooks", version: "1.0.0" },
+        spans: [{
+          traceId: $trace_id, spanId: $span_id, parentSpanId: $parent_id,
+          name: ("agent/stop:" + $agent), kind: 1,
+          startTimeUnixNano: $now_ns, endTimeUnixNano: $now_ns,
+          attributes: [
+            { key: "agent.name", value: { stringValue: $agent } },
+            { key: "agent.type", value: { stringValue: $type } },
+            { key: "agent.id",   value: { stringValue: $agent_id } },
+            { key: "event",      value: { stringValue: "SubagentStop" } }
+          ],
+          status: { code: 1 }
+        }]
+      }]
+    }]
+  }')
+
+curl -s -X POST "$COLLECTOR_URL" \
+  -H "Content-Type: application/json" \
+  -d "$OTLP_PAYLOAD" \
+  --max-time 2 > /dev/null 2>&1 || true
diff --git a/plugins/claude-self-obs/hooks/emit-span.sh b/plugins/claude-self-obs/hooks/emit-span.sh
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+# emit-span.sh — PostToolUse hook
+# Transforms Claude Code tool calls into OTLP spans and POSTs to the collector.
+# Silently no-ops when collector is unreachable. Never blocks the agent.
+
+set -euo pipefail
+
+COLLECTOR_URL="${QYL_COLLECTOR_URL:-http://localhost:5100}/v1/traces"
+
+INPUT=$(cat)
+
+SESSION_ID=$(echo "$INPUT" | jq -r '.session_id  // "unknown"')
+TOOL_NAME=$(echo "$INPUT"  | jq -r '.tool_name   // "unknown"')
+TOOL_USE_ID=$(echo "$INPUT" | jq -r '.tool_use_id // "unknown"')
+CWD=$(echo "$INPUT"        | jq -r '.cwd         // ""')
+AGENT_NAME=$(echo "$INPUT" | jq -r '.agent_name  // ""')
+AGENT_TYPE=$(echo "$INPUT" | jq -r '.agent_type  // ""')
+
+# Derive traceId deterministically from session_id (one trace per session)
+TRACE_ID=$(printf '%s' "$SESSION_ID" | md5 -q 2>/dev/null \
+        || printf '%s' "$SESSION_ID" | md5sum | cut -c1-32)
+TRACE_ID="${TRACE_ID:0:32}"
+
+# Derive spanId from tool_use_id (unique per tool call)
+SPAN_ID=$(printf '%s' "$TOOL_USE_ID" | md5 -q 2>/dev/null \
+       || printf '%s' "$TOOL_USE_ID" | md5sum | cut -c1-16)
+SPAN_ID="${SPAN_ID:0:16}"
+
+NOW_NS=$(python3 -c "import time; print(int(time.time() * 1e9))" 2>/dev/null \
+      || date +%s000000000)
+
+TOOL_ATTRS=$(echo "$INPUT" | jq -c '[
+  if .tool_input.file_path     then { key: "file.path",           value: { stringValue: .tool_input.file_path } }               else empty end,
+  if .tool_input.command       then { key: "bash.command",        value: { stringValue: (.tool_input.command | .[0:500]) } }     else empty end,
+  if .tool_input.pattern       then { key: "search.pattern",      value: { stringValue: .tool_input.pattern } }                 else empty end,
+  if .tool_input.query         then { key: "search.query",        value: { stringValue: .tool_input.query } }                   else empty end,
+  if .tool_input.url           then { key: "http.url",            value: { stringValue: .tool_input.url } }                     else empty end,
+  if .tool_input.content       then { key: "file.size_bytes",     value: { intValue: (.tool_input.content | utf8bytelength | tostring) } } else empty end,
+  if .tool_input.prompt        then { key: "task.prompt",         value: { stringValue: (.tool_input.prompt | .[0:200]) } }     else empty end,
+  if .tool_input.subagent_type then { key: "task.subagent_type",  value: { stringValue: .tool_input.subagent_type } }           else empty end
+]')
+
+AGENT_ATTRS=$(jq -cn \
+  --arg name "$AGENT_NAME" --arg type "$AGENT_TYPE" \
+  '[
+    if $name != "" then { key: "agent.name", value: { stringValue: $name } } else empty end,
+    if $type != "" then { key: "agent.type", value: { stringValue: $type } } else empty end
+  ]')
+
+ALL_ATTRS=$(jq -cn \
+  --arg tool "$TOOL_NAME" \
+  --argjson tool_attrs "$TOOL_ATTRS" \
+  --argjson agent_attrs "$AGENT_ATTRS" \
+  '[{ key: "tool.name", value: { stringValue: $tool } }] + $tool_attrs + $agent_attrs')
+
+OTLP_PAYLOAD=$(jq -n \
+  --arg trace_id "$TRACE_ID" --arg span_id "$SPAN_ID" \
+  --arg tool "$TOOL_NAME" --arg session "$SESSION_ID" --arg cwd "$CWD" \
+  --arg now_ns "$NOW_NS" --argjson attrs "$ALL_ATTRS" \
+  '{
+    resourceSpans: [{
+      resource: { attributes: [
+        { key: "service.name", value: { stringValue: "claude-code" } },
+        { key: "session.id",   value: { stringValue: $session } },
+        { key: "process.cwd",  value: { stringValue: $cwd } }
+      ]},
+      scopeSpans: [{
+        scope: { name: "claude-code.hooks", version: "1.0.0" },
+        spans: [{
+          traceId: $trace_id, spanId: $span_id,
+          name: ("tool/" + $tool), kind: 3,
+          startTimeUnixNano: $now_ns, endTimeUnixNano: $now_ns,
+          attributes: $attrs, status: { code: 1 }
+        }]
+      }]
+    }]
+  }')
+
+curl -s -X POST "$COLLECTOR_URL" \
+  -H "Content-Type: application/json" \
+  -d "$OTLP_PAYLOAD" \
+  --max-time 2 > /dev/null 2>&1 || true
diff --git a/plugins/claude-self-obs/hooks/hooks.json b/plugins/claude-self-obs/hooks/hooks.json