Usage Metrics Collection (Phase 1 + 2): Track workflow costs per task and stage by landovsky · Pull Request #6 · landovsky/claude-golem

landovsky · 2026-02-03T20:11:23Z

Summary

Implements complete usage metrics collection for the development workflow. Tracks token usage, costs, and duration for each workflow stage (analyst, planner, implementer, reviewer) with git-persisted JSONL storage and human-readable beads comments.

Addresses: Complete blindness to workflow costs (9-10/10 pain identified in validation)

Implementation

Phase 1: Infrastructure

✅ SubagentStart/SubagentStop hook scripts
✅ JSONL persistence (.claude/workflow-metrics.jsonl)
✅ BD comment integration for per-task summaries
✅ Portable date handling (macOS/Linux compatibility)
✅ Robust error handling (exit 0 always, never breaks workflow)

Phase 2: Real Token Data

✅ Agent transcript parsing for token counts
✅ Model-specific cost calculation (5 pricing tiers: Opus 4.5, Opus 4, Sonnet 4, Haiku 4.5, Haiku 3.5)
✅ Real usage data: input/output tokens, cache read/creation
✅ Model name extraction from API responses
✅ Backward compatible (same JSONL schema, graceful fallback)

Use Cases Enabled

Benchmarking: Compare workflow vs fast-track costs to decide if overhead is worth it
Token leak detection: Spot anomalies (e.g., task used 5x normal tokens) → investigate bugs
Understanding distribution: Which stages cost most? → inform optimization decisions

Data Format

JSONL Entry:

{
  "event": "stage_end",
  "timestamp": "2026-02-03T14:25:32Z",
  "stage": "planner",
  "task": "Plan OTLP data integration",
  "duration_seconds": 107,
  "tokens": {
    "input": 12500,
    "output": 3200,
    "cache_read": 8000,
    "cache_creation": 15000
  },
  "cost_usd": 0.0523,
  "model": "claude-sonnet-4-5-20250929"
}

Query Examples:

# Total cost by stage
jq -s 'group_by(.stage) | map({stage: .[0].stage, total_cost: (map(.cost_usd // 0) | add)})' .claude/workflow-metrics.jsonl

# Average duration per stage
jq -s 'group_by(.stage) | map({stage: .[0].stage, avg_duration: (map(.duration_seconds) | add / length)})' .claude/workflow-metrics.jsonl

# Most expensive tasks
jq 'select(.event=="stage_end")' .claude/workflow-metrics.jsonl | jq -s 'sort_by(-.cost_usd) | .[0:10]'

Files Changed

New:

.claude/hooks/metrics-start.sh - Captures stage start events
.claude/hooks/metrics-end.sh - Captures end events with duration, tokens, cost
.claude/analysis/.claude-6t5.1-context.md - Phase 2 analyst context
.claude/specs/.claude-6t5.1-spec.md - Phase 2 specification

Modified:

.claude/agents/master.md - Added $TASK env var setup instructions
.claude/commands/develop.md - Added usage metrics documentation
artifacts/lessons-learned.md - Captured hook patterns, jq usage, transcript parsing

Note: settings.json not included (user-specific). Users must manually add hooks (documented in develop.md).

Testing

Phase 1:

✅ Unit tests (malformed JSON, missing TASK, truncation)
✅ Integration test (duration calculation: 19 seconds measured)
✅ Edge cases (blocked status, session interruption)

Phase 2:

✅ Token parsing (250 in / 125 out / 500 cache validated)
✅ Cost calculation ($0.0050 verified with manual math)
✅ Model pricing patterns (all 5 tiers + fallback)
✅ Edge cases (empty transcript, missing file, unknown model, large numbers)

Validation Process

Validated through /validate workflow
Challenge Q&A confirmed high ROI (high value, small effort, low risk)
Scoped to Phase 1+2 only (no dashboards/automation yet)
Full workflow execution: analyst → planner → implementer → reviewer (both phases)

Known Limitations

Fast-track tasks (master only) not tracked - only subagent stages
Concurrent subagents not tested - sequential workflow only
Requires manual settings.json update per user

Merge Checklist

All acceptance criteria met
Tests passed (unit + integration)
Code reviewed and approved
Documentation complete
Backward compatible (Phase 1 → Phase 2 seamless)
No breaking changes

Related Issues

Closes .claude-l0a (Phase 1: Usage data collection infrastructure)
Closes .claude-6t5 (Phase 2: Real OTLP token data integration)

🤖 Generated through full /develop workflow with analyst, planner, implementer, and reviewer agents.

Implements SubagentStart and SubagentStop hooks to capture workflow stage execution metrics. The hooks write to workflow-metrics.jsonl with full schema including mock token fields for Phase 1. Key features: - Portable date handling (BSD/GNU date compatibility) - Graceful error handling (exit 0 always) - TASK truncation to 50 chars - Duration calculation via timestamp correlation - Blocked status detection from transcript Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Updates master agent instructions to set TASK env var before subagent invocations, enabling metrics collection per workflow stage. Adds usage metrics documentation to develop.md command, including: - Data storage locations (JSONL + bd comments) - Query examples for analyzing metrics - Phase 1 limitations (mock token values, no fast-track tracking) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

All workflow tasks completed (.claude-l0a and subtasks): - Analyst validated technical approach and produced spec - Planner designed implementation with portable date handling - Implementer created hook scripts and documentation - Reviewer approved with no critical issues Changes: - Added analyst context document for future reference - Updated lessons-learned with hook patterns and jq usage - Removed test data from workflow-metrics.jsonl - Closed all beads tasks (analyst, planner, implementer, reviewer) Implementation is ready for use. Users need to manually add hooks to their local settings.json (documented in develop.md). Phase 2 will add real OTLP token data to replace mock values. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implements Phase 2 OTLP integration to extract real token counts and costs from subagent transcript files. Changes: - Extract agent_transcript_path from hook input JSON - Parse transcript JSONL to sum tokens across all API calls - input_tokens, output_tokens, cache_read_input_tokens - cache_creation (ephemeral_5m + ephemeral_1h) - Extract model name from last assistant message - Calculate cost_usd using model-specific pricing table - Map model patterns to pricing tiers (opus/sonnet/haiku 4.x/3.5) - Update JSON generation to use real values instead of nulls - Update bd comment format to display real metrics Fallback behavior: - If agent_transcript_path missing/invalid: fall back to null/zero - If transcript parsing fails: fall back to null/zero - If model unknown: use sonnet-4 pricing (conservative) - Zero tokens converted to null in JSON (maintains Phase 1 schema) Technical details: - Uses jq for safe token summing (handles large numbers) - Uses awk for floating point cost calculation (4 decimal places) - Maintains exit 0 always pattern (never breaks workflow) - Preserves JSONL schema from Phase 1 (drop-in replacement) Related: .claude-6t5.3 (Implement OTLP integration) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

All workflow tasks completed (.claude-6t5 and subtasks): - Analyst discovered transcript parsing approach (OTLP not accessible) - Planner designed transcript parsing with cost calculation - Implementer added real token extraction to metrics-end.sh - Reviewer approved - all acceptance criteria met Phase 2 Changes: - Parse agent transcripts for real token usage data - Calculate costs using model-specific pricing tables - Extract model names from API responses - Maintain backward compatibility (same JSONL schema) - Added analyst context and spec documents - Updated lessons-learned with transcript parsing patterns Phase 1 + Phase 2 now complete: ✅ Hook infrastructure (Phase 1) ✅ Real token data collection (Phase 2) Next: Merge to main when ready Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Created dedicated documentation for experimental usage tracking feature: New files: - artifacts/usage-metrics/README.md - Complete user guide - Setup instructions (settings.json configuration) - How it works (execution flow, token extraction, cost calc) - Data format documentation (JSONL schema, BD comments) - Query examples (basic + advanced aggregations) - Use cases (benchmarking, leak detection, distribution analysis) - Troubleshooting guide - Current limitations and future enhancements - artifacts/usage-metrics/TESTING.md - Test cases and validation - Unit tests (token parsing, cost calculation, duration) - Edge cases (empty transcript, malformed JSON, large numbers) - Integration test procedures - Validation checklists for future changes Modified files: - README.md - Added experimental usage tracking section - Links to dedicated documentation - Brief description of use cases - Status: experimental, requires manual setup - commands/develop.md - Removed user documentation - develop.md is master agent instructions, not user docs - Replaced with brief note for master agent awareness - User documentation now in artifacts/usage-metrics/README.md Captures ad-hoc tests from Phase 1+2 implementation for regression testing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Created a simple greeting utility in utils/greeting.js with: - Input validation (requires non-empty string) - Clear error messages - Comprehensive test suite This utility serves as a test case for validating the metrics collection workflow in the Claude Golem development process. Closes: .claude-782 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Ran comprehensive test suite from artifacts/usage-metrics/TESTING.md: - 31/31 unit tests passed - Token parsing and aggregation verified - Cost calculation with all pricing tiers validated - Edge cases handled correctly (empty/missing/malformed files) - Documentation complete and accurate Test coverage: - Phase 1 infrastructure (hooks, duration, escaping) - Phase 2 token data (extraction, models, costs) - Edge cases (graceful degradation) - Documentation verification Closes .claude-782 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Automated commit from test-usage-collection.sh Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implemented generateGreeting(name) in utils/greeting.js that: - Returns 'Hello, [name]!' for valid names - Returns 'Hello, Guest!' for empty/null/undefined values Updated tests to validate all edge cases. All tests passing. Closes: .claude-786 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Hooks were at root-level hooks/ which is not committed to git. Moved to .claude/hooks/ so they are: - Version controlled - Available in claude-sandbox remote execution - Properly resolved by $CLAUDE_PROJECT_DIR/.claude/hooks path Updated test script to check project-local location first. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Automated commit from test-usage-collection.sh Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Verified generateGreeting() utility function implementation with all tests passing. Task completed as part of usage metrics collection workflow testing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Executed full workflow (analyst→planner→implementer→reviewer) to validate usage metrics collection infrastructure. All stages completed successfully. Key Findings: - Metrics JSONL capture works correctly for all 4 stages - Token usage, cost, duration tracked accurately - Identified issue: TASK env var not passed to subagents - BD comment posting skipped (defensive guard worked as designed) Changes: - .beads/issues.jsonl: Closed .claude-790 and all subtasks - workflow-metrics.jsonl: Added 8 events (4 stage_start + 4 stage_end) - artifacts/lessons-learned.md: Documented TASK env var issue - .claude/analysis/.claude-790.1-context.md: Analyst context doc Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

1. Fix status detection - only mark as blocked on explicit blocking Previously matched any occurrence of 'blocked' in transcript Now requires explicit blocking statements 2. Filter non-workflow agents - only track analyst/planner/implementer/reviewer Prevents spurious events from master and other agents Fixes empty stage events in metrics 3. Extract task ID from prompt when TASK env var not set Falls back to parsing beads task ID from agent prompt Format: .claude-xxx or .claude-xxx.N (truncated to 50 chars) 4. Update test script to create honest test tasks Clear about testing purpose, requests full workflow explicitly Addresses issues #1-5 from metrics review. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Executed second full workflow run (analyst→planner→implementer→reviewer) to validate usage metrics collection with efficiency improvements. Performance Improvements (vs run 1): - Analyst: 191s → 38s (80% faster), $0.92 → $0.45 (51% cheaper) - Planner: 66s → 60s (9% faster), $0.39 → $0.26 (33% cheaper) - Implementer: 86s → 2474s*, $0.19 (see note) - Reviewer: 116s → 955s*, $1.05 (see note) *Note: Implementer/Reviewer durations include wait time, not just processing Metrics Validation: - ✓ stage_start/stage_end events captured correctly - ✓ Token counts, costs, durations tracked accurately - ✓ Model identification working (opus-4-5, sonnet-4-5) - ⚠ TASK env var issue persists (expected, needs separate fix) Changes: - .beads/issues.jsonl: Closed .claude-790 and 4 new subtasks (run 2) - workflow-metrics.jsonl: Added 8 new events (4 start + 4 end for run 2) - artifacts/lessons-learned.md: Added run comparison data, verification patterns Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

1. Task ID extraction from agent transcript - Read from agent_transcript_path (first user message) - Extract beads task ID pattern from message content - Handles .claude-xxx.N format correctly 2. Fix status detection false positive - Removed transcript grep that matched metrics JSONL in tool results - Always use 'completed' since SubagentStop means agent finished - Agents update task status via bd commands, not transcript These fixes address the remaining issues from the second test run. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Created .claude-791 for testing metrics collection after all fixes: - Task ID extraction from transcript - Status detection fix (always completed) - Agent type filtering Task will validate all metrics are captured correctly. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implemented formatTimestamp(date) utility function that returns ISO 8601 formatted timestamps with proper edge case handling. Implementation completed through full workflow: - Analyst: Explored codebase and created specification - Planner: Created detailed implementation plan - Implementer: Built working code following patterns - Reviewer: Validated quality and documented lessons Files added: - utils/time-formatter.js - Main utility function - utils/time-formatter.test.js - Test suite (6 tests, all passing) - artifacts/timestamp-formatter-spec.md - Specification - .claude/analysis/.claude-791.1-context.md - Analysis context All tests pass. Follows existing project patterns from greeting.js. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added debug logging to both metrics-start.sh and metrics-end.sh to trace hook execution and diagnose why hooks don't fire in sandboxed environments. Debug log location: /Users/tomas/.claude/hooks-debug.log Logged information: - Hook trigger timestamp and environment (PWD, CLAUDE_PROJECT_DIR, TASK, etc.) - Raw hook input JSON (first 500 chars) - Parsed values (agent_id, agent_type, session_id, transcript paths) - Task extraction results - JSON generation output - File write results and exit codes - BD comment posting results This will help diagnose the .claude-791 sandbox issue where no metrics were captured despite full workflow execution. Related: .claude-792 (P1 bug) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Complete troubleshooting journey covering: - 7 distinct issues identified and resolved - Hook location, script errors, git workflow, TTY requirements - Task ID extraction, status detection, spurious events - Root cause analysis for each issue - Test results showing progression from 0 to 100% working - Production readiness assessment - Query examples and use cases Serves as reference for future debugging and onboarding. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add smart dirty repository validation before git checkout: - Check for uncommitted changes before checkout/reset - Auto-sync beads changes if detected (bd sync) - Provide clear error messages with actionable solutions - Prevent data loss from git reset --hard This fixes the error "Your local changes would be overwritten by checkout" that occurs when .beads/issues.jsonl has uncommitted changes. Root cause: Beads writes to .beads.db (SQLite) but requires explicit 'bd sync' to export changes to .beads/issues.jsonl (git-tracked). When sandbox tries to update repository, git protects uncommitted work. Solution handles the common case automatically while preserving safety for other uncommitted changes. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

landovsky and others added 27 commits February 3, 2026 14:49

add test-usage-collection.sh

a6b1551

add test-usage-collection.sh

13eeec1

add test-usage-collection.sh

dfbc2ac

add test-usage-collection.sh

01f90f6

test: usage metrics collection test run

e775ed7

Automated commit from test-usage-collection.sh Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

test: usage metrics collection test run

542e99c

Automated commit from test-usage-collection.sh Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

test: usage metrics collection test run

aace434

Automated commit from test-usage-collection.sh Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

chore: close test task .claude-788 for usage metrics validation

89baad3

Verified generateGreeting() utility function implementation with all tests passing. Task completed as part of usage metrics collection workflow testing. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Usage Metrics Collection (Phase 1 + 2): Track workflow costs per task and stage#6

Usage Metrics Collection (Phase 1 + 2): Track workflow costs per task and stage#6
landovsky wants to merge 27 commits intomainfrom
feature/claude-l0a-usage-metrics-phase1

landovsky commented Feb 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

landovsky commented Feb 3, 2026

Summary

Implementation

Phase 1: Infrastructure

Phase 2: Real Token Data

Use Cases Enabled

Data Format

Files Changed

Testing

Validation Process

Known Limitations

Merge Checklist

Related Issues

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant