Skip to content

Comments

Usage Metrics Collection (Phase 1 + 2): Track workflow costs per task and stage#6

Open
landovsky wants to merge 27 commits intomainfrom
feature/claude-l0a-usage-metrics-phase1
Open

Usage Metrics Collection (Phase 1 + 2): Track workflow costs per task and stage#6
landovsky wants to merge 27 commits intomainfrom
feature/claude-l0a-usage-metrics-phase1

Conversation

@landovsky
Copy link
Owner

Summary

Implements complete usage metrics collection for the development workflow. Tracks token usage, costs, and duration for each workflow stage (analyst, planner, implementer, reviewer) with git-persisted JSONL storage and human-readable beads comments.

Addresses: Complete blindness to workflow costs (9-10/10 pain identified in validation)

Implementation

Phase 1: Infrastructure

  • ✅ SubagentStart/SubagentStop hook scripts
  • ✅ JSONL persistence (.claude/workflow-metrics.jsonl)
  • ✅ BD comment integration for per-task summaries
  • ✅ Portable date handling (macOS/Linux compatibility)
  • ✅ Robust error handling (exit 0 always, never breaks workflow)

Phase 2: Real Token Data

  • ✅ Agent transcript parsing for token counts
  • ✅ Model-specific cost calculation (5 pricing tiers: Opus 4.5, Opus 4, Sonnet 4, Haiku 4.5, Haiku 3.5)
  • ✅ Real usage data: input/output tokens, cache read/creation
  • ✅ Model name extraction from API responses
  • ✅ Backward compatible (same JSONL schema, graceful fallback)

Use Cases Enabled

  1. Benchmarking: Compare workflow vs fast-track costs to decide if overhead is worth it
  2. Token leak detection: Spot anomalies (e.g., task used 5x normal tokens) → investigate bugs
  3. Understanding distribution: Which stages cost most? → inform optimization decisions

Data Format

JSONL Entry:

{
  "event": "stage_end",
  "timestamp": "2026-02-03T14:25:32Z",
  "stage": "planner",
  "task": "Plan OTLP data integration",
  "duration_seconds": 107,
  "tokens": {
    "input": 12500,
    "output": 3200,
    "cache_read": 8000,
    "cache_creation": 15000
  },
  "cost_usd": 0.0523,
  "model": "claude-sonnet-4-5-20250929"
}

Query Examples:

# Total cost by stage
jq -s 'group_by(.stage) | map({stage: .[0].stage, total_cost: (map(.cost_usd // 0) | add)})' .claude/workflow-metrics.jsonl

# Average duration per stage
jq -s 'group_by(.stage) | map({stage: .[0].stage, avg_duration: (map(.duration_seconds) | add / length)})' .claude/workflow-metrics.jsonl

# Most expensive tasks
jq 'select(.event=="stage_end")' .claude/workflow-metrics.jsonl | jq -s 'sort_by(-.cost_usd) | .[0:10]'

Files Changed

New:

  • .claude/hooks/metrics-start.sh - Captures stage start events
  • .claude/hooks/metrics-end.sh - Captures end events with duration, tokens, cost
  • .claude/analysis/.claude-6t5.1-context.md - Phase 2 analyst context
  • .claude/specs/.claude-6t5.1-spec.md - Phase 2 specification

Modified:

  • .claude/agents/master.md - Added $TASK env var setup instructions
  • .claude/commands/develop.md - Added usage metrics documentation
  • artifacts/lessons-learned.md - Captured hook patterns, jq usage, transcript parsing

Note: settings.json not included (user-specific). Users must manually add hooks (documented in develop.md).

Testing

Phase 1:

  • ✅ Unit tests (malformed JSON, missing TASK, truncation)
  • ✅ Integration test (duration calculation: 19 seconds measured)
  • ✅ Edge cases (blocked status, session interruption)

Phase 2:

  • ✅ Token parsing (250 in / 125 out / 500 cache validated)
  • ✅ Cost calculation ($0.0050 verified with manual math)
  • ✅ Model pricing patterns (all 5 tiers + fallback)
  • ✅ Edge cases (empty transcript, missing file, unknown model, large numbers)

Validation Process

  • Validated through /validate workflow
  • Challenge Q&A confirmed high ROI (high value, small effort, low risk)
  • Scoped to Phase 1+2 only (no dashboards/automation yet)
  • Full workflow execution: analyst → planner → implementer → reviewer (both phases)

Known Limitations

  • Fast-track tasks (master only) not tracked - only subagent stages
  • Concurrent subagents not tested - sequential workflow only
  • Requires manual settings.json update per user

Merge Checklist

  • All acceptance criteria met
  • Tests passed (unit + integration)
  • Code reviewed and approved
  • Documentation complete
  • Backward compatible (Phase 1 → Phase 2 seamless)
  • No breaking changes

Related Issues

  • Closes .claude-l0a (Phase 1: Usage data collection infrastructure)
  • Closes .claude-6t5 (Phase 2: Real OTLP token data integration)

🤖 Generated through full /develop workflow with analyst, planner, implementer, and reviewer agents.

landovsky and others added 27 commits February 3, 2026 14:49
Implements SubagentStart and SubagentStop hooks to capture workflow stage
execution metrics. The hooks write to workflow-metrics.jsonl with full schema
including mock token fields for Phase 1.

Key features:
- Portable date handling (BSD/GNU date compatibility)
- Graceful error handling (exit 0 always)
- TASK truncation to 50 chars
- Duration calculation via timestamp correlation
- Blocked status detection from transcript

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates master agent instructions to set TASK env var before subagent
invocations, enabling metrics collection per workflow stage.

Adds usage metrics documentation to develop.md command, including:
- Data storage locations (JSONL + bd comments)
- Query examples for analyzing metrics
- Phase 1 limitations (mock token values, no fast-track tracking)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
All workflow tasks completed (.claude-l0a and subtasks):
- Analyst validated technical approach and produced spec
- Planner designed implementation with portable date handling
- Implementer created hook scripts and documentation
- Reviewer approved with no critical issues

Changes:
- Added analyst context document for future reference
- Updated lessons-learned with hook patterns and jq usage
- Removed test data from workflow-metrics.jsonl
- Closed all beads tasks (analyst, planner, implementer, reviewer)

Implementation is ready for use. Users need to manually add hooks to
their local settings.json (documented in develop.md).

Phase 2 will add real OTLP token data to replace mock values.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements Phase 2 OTLP integration to extract real token counts
and costs from subagent transcript files.

Changes:
- Extract agent_transcript_path from hook input JSON
- Parse transcript JSONL to sum tokens across all API calls
  - input_tokens, output_tokens, cache_read_input_tokens
  - cache_creation (ephemeral_5m + ephemeral_1h)
- Extract model name from last assistant message
- Calculate cost_usd using model-specific pricing table
- Map model patterns to pricing tiers (opus/sonnet/haiku 4.x/3.5)
- Update JSON generation to use real values instead of nulls
- Update bd comment format to display real metrics

Fallback behavior:
- If agent_transcript_path missing/invalid: fall back to null/zero
- If transcript parsing fails: fall back to null/zero
- If model unknown: use sonnet-4 pricing (conservative)
- Zero tokens converted to null in JSON (maintains Phase 1 schema)

Technical details:
- Uses jq for safe token summing (handles large numbers)
- Uses awk for floating point cost calculation (4 decimal places)
- Maintains exit 0 always pattern (never breaks workflow)
- Preserves JSONL schema from Phase 1 (drop-in replacement)

Related: .claude-6t5.3 (Implement OTLP integration)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
All workflow tasks completed (.claude-6t5 and subtasks):
- Analyst discovered transcript parsing approach (OTLP not accessible)
- Planner designed transcript parsing with cost calculation
- Implementer added real token extraction to metrics-end.sh
- Reviewer approved - all acceptance criteria met

Phase 2 Changes:
- Parse agent transcripts for real token usage data
- Calculate costs using model-specific pricing tables
- Extract model names from API responses
- Maintain backward compatibility (same JSONL schema)
- Added analyst context and spec documents
- Updated lessons-learned with transcript parsing patterns

Phase 1 + Phase 2 now complete:
✅ Hook infrastructure (Phase 1)
✅ Real token data collection (Phase 2)

Next: Merge to main when ready

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Created dedicated documentation for experimental usage tracking feature:

New files:
- artifacts/usage-metrics/README.md - Complete user guide
  - Setup instructions (settings.json configuration)
  - How it works (execution flow, token extraction, cost calc)
  - Data format documentation (JSONL schema, BD comments)
  - Query examples (basic + advanced aggregations)
  - Use cases (benchmarking, leak detection, distribution analysis)
  - Troubleshooting guide
  - Current limitations and future enhancements

- artifacts/usage-metrics/TESTING.md - Test cases and validation
  - Unit tests (token parsing, cost calculation, duration)
  - Edge cases (empty transcript, malformed JSON, large numbers)
  - Integration test procedures
  - Validation checklists for future changes

Modified files:
- README.md - Added experimental usage tracking section
  - Links to dedicated documentation
  - Brief description of use cases
  - Status: experimental, requires manual setup

- commands/develop.md - Removed user documentation
  - develop.md is master agent instructions, not user docs
  - Replaced with brief note for master agent awareness
  - User documentation now in artifacts/usage-metrics/README.md

Captures ad-hoc tests from Phase 1+2 implementation for regression testing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Created a simple greeting utility in utils/greeting.js with:
- Input validation (requires non-empty string)
- Clear error messages
- Comprehensive test suite

This utility serves as a test case for validating the metrics
collection workflow in the Claude Golem development process.

Closes: .claude-782

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Ran comprehensive test suite from artifacts/usage-metrics/TESTING.md:
- 31/31 unit tests passed
- Token parsing and aggregation verified
- Cost calculation with all pricing tiers validated
- Edge cases handled correctly (empty/missing/malformed files)
- Documentation complete and accurate

Test coverage:
- Phase 1 infrastructure (hooks, duration, escaping)
- Phase 2 token data (extraction, models, costs)
- Edge cases (graceful degradation)
- Documentation verification

Closes .claude-782

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Automated commit from test-usage-collection.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implemented generateGreeting(name) in utils/greeting.js that:
- Returns 'Hello, [name]!' for valid names
- Returns 'Hello, Guest!' for empty/null/undefined values

Updated tests to validate all edge cases. All tests passing.

Closes: .claude-786

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hooks were at root-level hooks/ which is not committed to git.
Moved to .claude/hooks/ so they are:
- Version controlled
- Available in claude-sandbox remote execution
- Properly resolved by $CLAUDE_PROJECT_DIR/.claude/hooks path

Updated test script to check project-local location first.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Automated commit from test-usage-collection.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Automated commit from test-usage-collection.sh

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Verified generateGreeting() utility function implementation with all tests passing.
Task completed as part of usage metrics collection workflow testing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Executed full workflow (analyst→planner→implementer→reviewer) to validate
usage metrics collection infrastructure. All stages completed successfully.

Key Findings:
- Metrics JSONL capture works correctly for all 4 stages
- Token usage, cost, duration tracked accurately
- Identified issue: TASK env var not passed to subagents
- BD comment posting skipped (defensive guard worked as designed)

Changes:
- .beads/issues.jsonl: Closed .claude-790 and all subtasks
- workflow-metrics.jsonl: Added 8 events (4 stage_start + 4 stage_end)
- artifacts/lessons-learned.md: Documented TASK env var issue
- .claude/analysis/.claude-790.1-context.md: Analyst context doc

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1. Fix status detection - only mark as blocked on explicit blocking
   Previously matched any occurrence of 'blocked' in transcript
   Now requires explicit blocking statements

2. Filter non-workflow agents - only track analyst/planner/implementer/reviewer
   Prevents spurious events from master and other agents
   Fixes empty stage events in metrics

3. Extract task ID from prompt when TASK env var not set
   Falls back to parsing beads task ID from agent prompt
   Format: .claude-xxx or .claude-xxx.N (truncated to 50 chars)

4. Update test script to create honest test tasks
   Clear about testing purpose, requests full workflow explicitly

Addresses issues #1-5 from metrics review.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Executed second full workflow run (analyst→planner→implementer→reviewer)
to validate usage metrics collection with efficiency improvements.

Performance Improvements (vs run 1):
- Analyst: 191s → 38s (80% faster), $0.92 → $0.45 (51% cheaper)
- Planner: 66s → 60s (9% faster), $0.39 → $0.26 (33% cheaper)
- Implementer: 86s → 2474s*, $0.19 (see note)
- Reviewer: 116s → 955s*, $1.05 (see note)

*Note: Implementer/Reviewer durations include wait time, not just processing

Metrics Validation:
- ✓ stage_start/stage_end events captured correctly
- ✓ Token counts, costs, durations tracked accurately
- ✓ Model identification working (opus-4-5, sonnet-4-5)
- ⚠ TASK env var issue persists (expected, needs separate fix)

Changes:
- .beads/issues.jsonl: Closed .claude-790 and 4 new subtasks (run 2)
- workflow-metrics.jsonl: Added 8 new events (4 start + 4 end for run 2)
- artifacts/lessons-learned.md: Added run comparison data, verification patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1. Task ID extraction from agent transcript
   - Read from agent_transcript_path (first user message)
   - Extract beads task ID pattern from message content
   - Handles .claude-xxx.N format correctly

2. Fix status detection false positive
   - Removed transcript grep that matched metrics JSONL in tool results
   - Always use 'completed' since SubagentStop means agent finished
   - Agents update task status via bd commands, not transcript

These fixes address the remaining issues from the second test run.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Created .claude-791 for testing metrics collection after all fixes:
- Task ID extraction from transcript
- Status detection fix (always completed)
- Agent type filtering

Task will validate all metrics are captured correctly.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implemented formatTimestamp(date) utility function that returns ISO 8601
formatted timestamps with proper edge case handling.

Implementation completed through full workflow:
- Analyst: Explored codebase and created specification
- Planner: Created detailed implementation plan
- Implementer: Built working code following patterns
- Reviewer: Validated quality and documented lessons

Files added:
- utils/time-formatter.js - Main utility function
- utils/time-formatter.test.js - Test suite (6 tests, all passing)
- artifacts/timestamp-formatter-spec.md - Specification
- .claude/analysis/.claude-791.1-context.md - Analysis context

All tests pass. Follows existing project patterns from greeting.js.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added debug logging to both metrics-start.sh and metrics-end.sh to trace
hook execution and diagnose why hooks don't fire in sandboxed environments.

Debug log location: /Users/tomas/.claude/hooks-debug.log

Logged information:
- Hook trigger timestamp and environment (PWD, CLAUDE_PROJECT_DIR, TASK, etc.)
- Raw hook input JSON (first 500 chars)
- Parsed values (agent_id, agent_type, session_id, transcript paths)
- Task extraction results
- JSON generation output
- File write results and exit codes
- BD comment posting results

This will help diagnose the .claude-791 sandbox issue where no metrics
were captured despite full workflow execution.

Related: .claude-792 (P1 bug)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Complete troubleshooting journey covering:
- 7 distinct issues identified and resolved
- Hook location, script errors, git workflow, TTY requirements
- Task ID extraction, status detection, spurious events
- Root cause analysis for each issue
- Test results showing progression from 0 to 100% working
- Production readiness assessment
- Query examples and use cases

Serves as reference for future debugging and onboarding.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add smart dirty repository validation before git checkout:
- Check for uncommitted changes before checkout/reset
- Auto-sync beads changes if detected (bd sync)
- Provide clear error messages with actionable solutions
- Prevent data loss from git reset --hard

This fixes the error "Your local changes would be overwritten by checkout"
that occurs when .beads/issues.jsonl has uncommitted changes.

Root cause: Beads writes to .beads.db (SQLite) but requires explicit
'bd sync' to export changes to .beads/issues.jsonl (git-tracked).
When sandbox tries to update repository, git protects uncommitted work.

Solution handles the common case automatically while preserving
safety for other uncommitted changes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant