Conversation

@kenjudy (Contributor) commented Dec 14, 2025

No description provided.

kenjudy and others added 30 commits December 9, 2025 08:09
- Created verify_analysis_report.py to regenerate all statistics from trace data
- Supports optional --expected-values parameter for PASS/FAIL verification
- Full test coverage (16 tests) with mocked analysis functions
- Type-safe implementation with mypy strict mode
- All CI checks pass (black, ruff, mypy, bandit)
- Updated README with usage documentation

Verification tool regenerates latency distribution, bottleneck analysis,
and parallel execution metrics deterministically for audit purposes.
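
For example, verification could be run standalone or against expected values (the expected-values file name and format below are assumptions; the flag itself is the one described above):

  python verify_analysis_report.py traces.json
  python verify_analysis_report.py traces.json --expected-values expected_metrics.json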

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements configurable cost analysis for LangSmith traces following
strict test-driven development methodology.

- PricingConfig: Configurable dataclass for any LLM provider pricing
- TokenUsage: Extract token data from trace outputs/inputs
- CostBreakdown: Calculate costs with input/output/cache breakdown
- Full test coverage: 12 tests passing

Test-first approach (RED-GREEN-REFACTOR):
- Tests written before implementation
- Minimal implementation to pass tests
- All validations and edge cases covered

Phase 3B implementation plan documented in plans/ directory.
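
As a rough illustration of how these building blocks fit together, a minimal sketch follows; the field names, rates, and the calculate_cost() helper are assumptions rather than the exact analyze_cost.py API:

```python
# Hypothetical sketch of the Phase 3B building blocks; the real
# PricingConfig/TokenUsage/CostBreakdown definitions may differ.
from dataclasses import dataclass


@dataclass
class PricingConfig:
    input_cost_per_million: float    # USD per 1M input tokens
    output_cost_per_million: float   # USD per 1M output tokens
    cache_read_cost_per_million: float = 0.0


@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cache_read_tokens: int = 0


@dataclass
class CostBreakdown:
    input_cost: float
    output_cost: float
    cache_cost: float

    @property
    def total(self) -> float:
        return self.input_cost + self.output_cost + self.cache_cost


def calculate_cost(usage: TokenUsage, pricing: PricingConfig) -> CostBreakdown:
    """Price a single trace's token usage under a configurable pricing model."""
    return CostBreakdown(
        input_cost=usage.prompt_tokens / 1_000_000 * pricing.input_cost_per_million,
        output_cost=usage.completion_tokens / 1_000_000 * pricing.output_cost_per_million,
        cache_cost=usage.cache_read_tokens / 1_000_000 * pricing.cache_read_cost_per_million,
    )
```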

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implemented WorkflowCostAnalysis and calculate_workflow_cost()
following test-first approach. Tests passing: 14/14.

Features:
- Aggregate costs across all traces in workflow
- Track node-level cost breakdowns
- Sum total tokens across workflow
- Handle workflows with no token data gracefully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implemented ScalingProjection and project_scaling_costs()
following test-first approach. Tests passing: 17/17.

Features:
- Project costs at 1x, 10x, 100x, 1000x scale factors
- Calculate monthly cost estimates if provided
- Handle zero-cost scenarios gracefully
- Configurable scaling factors
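
A minimal sketch of what the projection step might look like; the dataclass fields, default factors, and the monthly-volume handling are assumptions:

```python
# Illustrative scaling projection; the real ScalingProjection dataclass
# and project_scaling_costs() signature may differ.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingProjection:
    scale_factor: int
    projected_cost: float
    monthly_estimate: Optional[float] = None  # only set when a monthly volume is given


def project_scaling_costs(
    avg_workflow_cost: float,
    scale_factors: tuple[int, ...] = (1, 10, 100, 1000),
    monthly_workflows: Optional[int] = None,
) -> list[ScalingProjection]:
    projections = []
    for factor in scale_factors:
        cost = avg_workflow_cost * factor  # a zero-cost baseline projects to zero
        monthly = cost * monthly_workflows if monthly_workflows is not None else None
        projections.append(ScalingProjection(factor, cost, monthly))
    return projections
```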

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Completed Phase 3B cost analysis implementation following TDD.
Tests passing: 20/20.

Features:
- NodeCostSummary and CostAnalysisResults dataclasses
- aggregate_node_costs() - aggregate by node type with percentages
- analyze_costs() - main orchestration function
- Complete end-to-end cost analysis workflow
- Configurable pricing model (PricingConfig)
- Scaling projections at 1x, 10x, 100x, 1000x
- Node-level cost breakdowns
- Data quality tracking

Phase 3B COMPLETE - Ready for real data analysis!
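
For instance, the node-level aggregation with percentage shares could look roughly like the sketch below; NodeCostSummary's exact fields are assumed:

```python
# Sketch of per-node cost aggregation with percentage-of-total shares;
# illustrative only, not the repo's exact aggregate_node_costs().
from dataclasses import dataclass


@dataclass
class NodeCostSummary:
    node_name: str
    total_cost: float
    percentage_of_total: float


def aggregate_node_costs(node_costs: dict[str, float]) -> list[NodeCostSummary]:
    grand_total = sum(node_costs.values())
    summaries = [
        NodeCostSummary(
            node_name=name,
            total_cost=cost,
            percentage_of_total=(cost / grand_total * 100) if grand_total else 0.0,
        )
        for name, cost in node_costs.items()
    ]
    # Highest-cost nodes first so the top cost drivers are easy to read off
    return sorted(summaries, key=lambda s: s.total_cost, reverse=True)
```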

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed unused imports and applied Black auto-formatting.
All quality checks passing:
- ✅ Ruff: No linting issues
- ✅ Black: Formatted to standard
- ✅ Mypy: No type errors
- ✅ Bandit: Only expected test assertions (low severity)
- ✅ Tests: 20/20 passing

Ready for CI/CD pipeline.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implemented core failure analysis functions following TDD.
Tests passing: 13/13 for failure detection and retry sequences.

Features:
- FailureInstance, RetrySequence data structures
- detect_failures() - identify failures from trace status
- classify_error() - regex-based error classification
- detect_retry_sequences() - heuristic retry detection
- calculate_retry_success_rate() - retry effectiveness metric

Phase 3C foundation complete - ready for node analysis.
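
To make the approach concrete, here is a sketch of regex-based classification and the retry-effectiveness metric; the error categories, patterns, and the retry-sequence shape are all assumptions:

```python
# Illustrative error classification and retry success rate; the real
# classify_error() categories and patterns may differ.
import re

ERROR_PATTERNS = {
    "rate_limit": re.compile(r"rate.?limit|429", re.IGNORECASE),
    "timeout": re.compile(r"time.?out|deadline exceeded", re.IGNORECASE),
    "validation": re.compile(r"validation|schema", re.IGNORECASE),
}


def classify_error(error_message: str) -> str:
    for category, pattern in ERROR_PATTERNS.items():
        if pattern.search(error_message):
            return category
    return "unknown"


def calculate_retry_success_rate(retry_sequences: list[dict]) -> float:
    """Fraction of retry sequences whose final attempt succeeded."""
    if not retry_sequences:
        return 0.0
    successes = sum(1 for seq in retry_sequences if seq.get("final_status") == "success")
    return successes / len(retry_sequences)
```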

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…(TDD GREEN)

Implemented node failure analysis and main orchestration function.
Tests passing: 15/15 for complete Phase 3C.

Features:
- analyze_node_failures() - aggregate failures by node type
- Node-level stats: execution count, failure rate, retry sequences
- Error type tracking per node
- analyze_failures() - main orchestration function
- Overall success rate calculation
- Error distribution aggregation
- Retry success rate analysis

Phase 3C COMPLETE with code quality checks passing:
- ✅ Ruff: No linting issues
- ✅ Black: Formatted
- ✅ Mypy: No type errors
- ✅ Tests: 15/15 passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added verify_cost_analysis() and verify_failure_analysis() functions
to verify_analysis_report.py with command-line control.

Features:
- verify_cost_analysis() - Verify Phase 3B cost calculations
  * Workflow cost statistics (avg, median, range)
  * Top 3 cost drivers by node
  * Scaling projections (1x, 10x, 100x, 1000x)
  * Cache effectiveness if available

- verify_failure_analysis() - Verify Phase 3C failure calculations
  * Overall success/failure rates
  * Top 5 nodes by failure rate
  * Error distribution analysis
  * Retry sequence analysis
  * Validator effectiveness

- New CLI arguments:
  * --phases: Select 3a, 3b, 3c, or all (default: 3a)
  * --pricing-model: Choose pricing model for cost analysis

Usage examples:
  python verify_analysis_report.py traces.json --phases all
  python verify_analysis_report.py traces.json --phases 3b
  python verify_analysis_report.py traces.json --phases 3c

All quality checks passing (Ruff, Black, Mypy).
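
A rough sketch of how the new arguments might be wired up; the flag names, choices, and defaults come from this change, while the parser structure itself is assumed:

```python
# Illustrative CLI wiring for verify_analysis_report.py; the actual
# argument parser in the repo may differ in detail.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Regenerate analysis report statistics from trace data"
    )
    parser.add_argument("traces", help="Path to exported trace JSON")
    parser.add_argument(
        "--phases",
        choices=["3a", "3b", "3c", "all"],
        default="3a",
        help="Which analysis phases to verify",
    )
    parser.add_argument(
        "--pricing-model",
        default=None,
        help="Pricing model to use for Phase 3B cost analysis",
    )
    parser.add_argument(
        "--expected-values",
        default=None,
        help="Optional expected values for PASS/FAIL verification",
    )
    return parser
```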

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update README with comprehensive Phase 3B cost analysis docs
- Update README with comprehensive Phase 3C failure analysis docs
- Update verification tool CLI examples with --phases argument
- Update test counts: 99 total tests (33 + 31 + 20 + 15)
- Update project structure with new modules
- Fix mypy type errors:
  - Add return type annotation to PricingConfig.__post_init__()
  - Filter None start_time traces in retry detection
  - Fix max() key function for error distribution
- All 35 tests passing (20 cost + 15 failure)
- All quality checks passing (Ruff, Black, Mypy, Bandit)
- Create run_phase3bc_analysis.py for automated Phase 3B/3C analysis
- Fix analyze_cost.py to handle None outputs/inputs gracefully
- Generates intermediate JSON data files for Assessment
- Reports limitations when token usage data is unavailable
- All tests still passing (20 cost + 15 failure)
kenjudy and others added 10 commits December 13, 2025 21:50
- Removed run_phase3bc_analysis.py (client-specific naming and paths)
- Analysis tools remain generic and reusable
- Client-specific analysis scripts should live in client repos
Implemented using TDD (RED-GREEN-REFACTOR):
- RED: Added 2 failing tests for token export
- GREEN: Added total_tokens, prompt_tokens, completion_tokens to trace export
- REFACTOR: Fixed integration tests to include token fields in mocks

Changes:
- export_langsmith_traces.py: Extract token fields from Run objects
- test_export_langsmith_traces.py: Add token usage tests + update mocks

Token fields exported:
- total_tokens: Total tokens used (LLM runs only)
- prompt_tokens: Input/prompt tokens (LLM runs only)
- completion_tokens: Generated/output tokens (LLM runs only)
- All fields gracefully handle None for non-LLM runs

All 133 tests passing.

Enables Phase 3B cost analysis with real token usage data.
Modified extract_token_usage() to check top-level trace fields first:
- total_tokens
- prompt_tokens
- completion_tokens

This supports the updated export format where token data is exported
at the trace level (not nested in outputs/usage_metadata).

Maintains backwards compatibility with legacy format in outputs/inputs.

All 20 cost analysis tests still passing.
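
The fallback order described above could look roughly like this; the function body is simplified and the nested field names are assumptions:

```python
# Sketch of top-level-first token extraction with a legacy fallback;
# illustrative only, the real extract_token_usage() handles more cases.
from typing import Optional


def extract_token_usage(trace) -> Optional[dict]:
    # Preferred: token counts exported at the trace level
    if getattr(trace, "total_tokens", None) is not None:
        return {
            "prompt_tokens": trace.prompt_tokens or 0,
            "completion_tokens": trace.completion_tokens or 0,
            "total_tokens": trace.total_tokens,
        }
    # Legacy: nested usage_metadata inside outputs, then inputs (assumed keys)
    for payload in (getattr(trace, "outputs", None), getattr(trace, "inputs", None)):
        usage = (payload or {}).get("usage_metadata")
        if usage:
            return {
                "prompt_tokens": usage.get("input_tokens", 0),
                "completion_tokens": usage.get("output_tokens", 0),
                "total_tokens": usage.get("total_tokens", 0),
            }
    return None  # no token data available for this trace
```
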
Extended Trace dataclass with token fields:
- total_tokens: Total tokens used (None for non-LLM traces)
- prompt_tokens: Input/prompt tokens (None for non-LLM traces)
- completion_tokens: Output/completion tokens (None for non-LLM traces)

Updated _build_trace_from_dict() to load token fields from JSON.

This completes the end-to-end token tracking chain:
1. Export: Token data exported at trace level
2. Loading: Token fields loaded into Trace objects
3. Analysis: Cost analysis extracts and calculates costs

Verified with test showing ~$0.14 avg cost per workflow.

All 133 tests passing.
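
A minimal sketch of the extended dataclass and loader; the Trace class in the repo carries many more fields than shown here:

```python
# Illustrative subset of the Trace dataclass with the new token fields.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    id: str
    name: str
    total_tokens: Optional[int] = None       # None for non-LLM traces
    prompt_tokens: Optional[int] = None      # None for non-LLM traces
    completion_tokens: Optional[int] = None  # None for non-LLM traces


def _build_trace_from_dict(data: dict) -> Trace:
    return Trace(
        id=data["id"],
        name=data["name"],
        total_tokens=data.get("total_tokens"),
        prompt_tokens=data.get("prompt_tokens"),
        completion_tokens=data.get("completion_tokens"),
    )
```
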
Add cache_read_tokens and cache_creation_tokens fields to trace export
to enable cache effectiveness measurement in Phase 3B cost analysis.

Changes:
- Extract cache tokens from nested outputs/inputs["usage_metadata"]["input_token_details"]
- Support both cache_creation and cache_creation_input_tokens field names
- Multi-level fallback: top-level -> outputs -> inputs
- Preserve 0 values correctly (explicit None checks)
- Add 3 comprehensive tests for nested extraction and backward compatibility
- All 46 tests passing

Implements TDD approach with test-first development following PDCA framework.
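
The key detail is the explicit None check, so a legitimate value of 0 is preserved instead of being treated as missing. A sketch, with the surrounding run structure assumed:

```python
# Sketch of cache-token extraction with multi-level fallback and
# explicit None checks; field names follow the commit, structure assumed.
from typing import Optional


def _extract_cache_creation_tokens(run: dict) -> Optional[int]:
    # 1. Top-level field on the exported run
    if run.get("cache_creation_tokens") is not None:
        return run["cache_creation_tokens"]
    # 2./3. Nested usage metadata in outputs, then inputs
    for payload in (run.get("outputs"), run.get("inputs")):
        details = ((payload or {}).get("usage_metadata") or {}).get("input_token_details") or {}
        for key in ("cache_creation", "cache_creation_input_tokens"):
            if details.get(key) is not None:  # keeps an explicit 0
                return details[key]
    return None
```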

Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed cache token extraction to look in the correct location for LangSmith
exports that use LangChain-serialized AIMessage format.

Changes:
- Added Fallback 1: Extract from outputs.generations[0][0].message.kwargs.usage_metadata
- Updated test to use correct LangChain message structure
- Verified with 1000-trace export: 684 runs now have cache_read_tokens

The LangSmith Python SDK exports token metadata in LangChain AIMessage
format under generations[0][0].message.kwargs.usage_metadata, not
directly under outputs.usage_metadata.
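
Navigating that nested structure defensively might look like the sketch below; the helper name is hypothetical:

```python
# Sketch of digging usage_metadata out of a LangChain-serialized AIMessage at
# outputs["generations"][0][0]["message"]["kwargs"]["usage_metadata"].
from typing import Any, Optional


def _usage_from_generations(outputs: Optional[dict]) -> Optional[dict[str, Any]]:
    try:
        message = (outputs or {})["generations"][0][0]["message"]
        return message["kwargs"].get("usage_metadata")
    except (KeyError, IndexError, TypeError):
        return None  # not an LLM run, or a different serialization format
```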

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Applied black code formatter to maintain consistent code style.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added test file exclusion pattern to .bandit config to avoid
B101 (assert_used) warnings in pytest test files where asserts
are expected and appropriate.

For CI/CD, run: bandit -r export_langsmith_traces.py

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove client-specific references from codebase:
  - Replace specific node names with generic examples in docs
  - Update test data to use generic node names (process_data, transform_output)
  - Delete temporary debug scripts with client file paths

- Fix mypy type errors in cache effectiveness functions:
  - Add explicit None checks for cached_tokens in analyze_cost.py
  - Ensure type safety in cache calculations

- Code quality improvements:
  - Apply black formatting
  - All 146 tests passing
  - Mypy strict mode passing on all source files
  - No security issues (bandit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…export-langsmith-data into add-report-verification-tool
@github-actions

📊 PR Metrics Analysis

Size Classification

EXTRA-LARGE (1479 production lines)

| Metric | Value | Target |
| --- | --- | --- |
| Production files changed | 5 | - |
| Test files changed | 4 | - |
| Production lines added | 1479 | <200 for easy review |
| Test lines added | 1774 | - |
| Test-to-production ratio | 1.20 | 0.5-2.0 |

Commit Quality

30 commits analyzed

| Metric | Count | % | Target |
| --- | --- | --- | --- |
| Large commits (>100 lines) | 22 | 73% | <20% |
| Sprawling commits (>5 files) | 1 | 3% | <10% |

Recommendations

  • ⚠️ Extra-large PR: Consider breaking into smaller PRs for easier review
  • ⚠️ High % of large commits: Aim for smaller, atomic commits

Automated analysis via PDCA Framework

@kenjudy kenjudy merged commit ffe2361 into main Dec 14, 2025
5 checks passed
@kenjudy kenjudy deleted the add-report-verification-tool branch December 14, 2025 02:53