feat: Consolidate all monitoring agents (BE Cost, API Runtime, Enterprise, Revenue, Usage) #51
Open
Daniel Beer (DanielB945) wants to merge 25 commits into main from
Transform all monitoring agents to autonomous problem detectors and implement
a production-ready usage monitoring system with statistically-derived alerting
thresholds based on 60-day analysis.
Key Changes:
1. Autonomous Monitoring Across All Agents
- Updated all 5 monitoring agents (usage, be-cost, revenue, enterprise, api-runtime)
- Changed Step 1 from "Gather Requirements" to "Run Comprehensive Analysis"
- Auto-analyze ALL metrics, segments, and time windows without user prompting
- Auto-detect problems using statistical thresholds (Z-score, DoD, WoW, baseline)
2. Production Usage Monitoring Implementation
- Data-driven thresholds per segment based on 60-day volatility analysis
- 14-day same-day-of-week baseline methodology (handles weekday/weekend patterns)
- Segment-specific thresholds:
* Enterprise Contract/Pilot: -50% DAU, -60% Image Gens, -70% Tokens (weekday only)
* Heavy Users: -25% DAU, -30% Image Gens, -25% Tokens
* Paying non-Enterprise: -20% DAU, -25% Image Gens, -20% Tokens
* Free: -25% DAU, -35% Image Gens, -20% Tokens
- Two-tier severity: WARNING (drop > threshold), CRITICAL (drop > 1.5x threshold)
- Weekend suppression for Enterprise (weekday-only alerting)
- Skip Enterprise video generation alerts (CV > 100%, too volatile)
3. Root Cause Investigation Workflow
- Enterprise segments: Drill down to organization level to identify which clients drove drops
- Other segments: Analyze by tier distribution (Standard vs Pro vs Lite)
- Alert format includes current vs baseline, drop %, threshold, and recommended actions
4. Full Segmentation Enforcement
- Updated usage monitor to reference full segmentation CTE from shared/bq-schema.md
- Enforces proper hierarchy: Enterprise → Heavy → Paying → Free
- Consistent segmentation across all usage monitoring queries
Technical Details:
- Alert logic: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
- 14-day same-DOW baseline: AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
- Data source: ltx-dwh-prod-processed.web.ltxstudio_agg_user_date
- Partition pruning on dt (DATE) for performance
Anti-Patterns Addressed:
- Generic thresholds replaced with segment-specific, data-driven values
- Day-of-week effects handled via same-DOW comparison (not DoD on weekends)
- Enterprise weekend alerts suppressed (6-7 DAU is 18% of weekday, too noisy)
- Enterprise video gen alerts skipped (single-user dominated, CV > 100%)
- Small segment noise handled via production threshold calibration
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Keep only production monitoring implementation (usage_monitoring_v2.py)
Reference data-driven thresholds from 60-day analysis:
- Tier 1 High Priority: Idle cost spike, Inference cost spike, Idle-to-inference ratio
- Tier 2 Medium Priority: Failure rate, Cost-per-request drift, DoD cost jump
- Tier 3 Low Priority: Volume drop, Overhead spike
- Per-vertical thresholds for LTX API and LTX Studio
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Cost data needs time to finalize, so analyze data from 3 days ago instead of yesterday. Use DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in queries.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Include comprehensive shared knowledge files:
- product-context.md for business model and user types
- bq-schema.md for subscription tables and segmentation
- metric-standards.md for revenue metric definitions
- event-registry.yaml for feature-driven revenue analysis
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Expand Step 2 to include all shared knowledge files:
- product-context.md for LTX products and API context
- bq-schema.md for API tables and GPU cost data
- metric-standards.md for performance metrics
- event-registry.yaml for event-driven metrics
- gpu-cost-query-templates.md for cost-related performance
- gpu-cost-analysis-patterns.md for cost analysis patterns
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria
Changes:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules + Anti-Patterns into Constraints section
- Moved Reference Files + Production Scripts to Context section
- Added completion criteria
- Kept under 500 lines (402 lines total)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rmat
Convert all monitoring agents to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria
Updated skills:
- be-cost-monitoring (314 lines) - GPU cost with production thresholds
- revenue-monitor (272 lines) - Revenue/subscription monitoring
- enterprise-monitor (341 lines) - Enterprise account health
- api-runtime-monitor (359 lines) - API performance monitoring
All skills:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules into Constraints section (DO/DO NOT)
- Moved Reference Files to Context section
- Added completion criteria
- Kept under 500 lines per Agent Skills spec
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Replace separate SQL + Python files with combined usage_monitor.py
- Update from percentage thresholds to statistical 3σ method
- Embed SQL query as string in Python script
- Execute BigQuery directly (no CSV intermediate)
- Add --date parameter for flexible date selection
- Update SKILL.md to reference combined script
Benefits:
- Single file to maintain (353 lines vs 413 lines)
- No intermediate CSV files needed
- Easier to run and schedule
- Self-contained SQL + alerting logic
Change focus from 'detecting drops' to 'detecting data spikes' to reflect that the statistical method detects anomalies in both directions:
- Increases (e.g., +46.6% token spike on 2026-03-09)
- Decreases (e.g., churn, engagement drops)
Updated:
- Overview: Problem solved now mentions both increases and decreases
- Requirements: Changed 'drops' to 'spikes (increases or decreases)'
- Description: Changed 'detecting usage drops' to 'detecting usage anomalies'
Drop paragraph about segment/day-of-week variance details. Keep Overview focused on the solution (statistical anomaly detection) rather than the detailed problem context.
…isclosure
Major changes:
- Reduced from 335 lines to 182 lines (-45%)
- Removed duplicate SQL query (already in usage_monitor.py)
- Removed duplicate alert format examples
- Consolidated overlapping phases (Phase 4 + 5 → Phase 4)
- Simplified DO/DO NOT section (removed repetitive rules)
- Applied progressive disclosure (method → run → analyze → present)
- Kept only essential information, reference scripts for details
Benefits:
- Clearer, more focused instructions
- Less maintenance (single source of truth in Python script)
- Easier to scan and understand
- Follows Agent Skills spec better (<500 lines, minimal duplication)
1. Change date example from specific date to 'yesterday'
2. Remove '(skip for Enterprise - too volatile)' from video generation line
The exception details are still preserved in Phase 1 where they belong.
… exceptions
- Lower alert threshold from 3σ to 2σ with new NOTICE severity (2 < |z| ≤ 3)
- Compare yesterday's metrics instead of today's
- Remove enterprise video gen exceptions
- Add event-registry.yaml to shared knowledge
- Remove Phase 6, Context & References section
- Add Free segment to tier distribution checks
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major updates:
- Add production thresholds from 60-day statistical analysis
- Tier-based alerting: Tier 1 (High), Tier 2 (Medium), Tier 3 (Low)
- Change timing: analyze data from 3 days ago (cost data needs time to finalize)
- Restructure to 6-part Agent Skills format
- Add per-vertical thresholds (API vs Studio)
Tier 1 alerts:
- Idle cost spike > $15,600/day
- Inference cost spike > $5,743/day
- Idle-to-inference ratio > 4:1
Benefits:
- Data-driven thresholds (not guesses)
- Prioritized alerts by tier
- Proper timing (3-day lookback)
- Clear structure (6-part spec)
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Add all shared knowledge files to Phase 2 references
- Add comprehensive revenue metrics (MRR, ARR, churn, refunds, new subs)
- Add segment-level analysis (tier + plan type)
- Add baseline comparisons (7-day rolling average)
- Add alert thresholds (generic, pending production analysis)
Monitoring coverage:
- Revenue drops by segment
- MRR and ARR trends
- Churn rate increases
- Refund rate spikes
- New subscription volume
- Tier movements (upgrades/downgrades)
- Enterprise contract renewals
Benefits:
- Comprehensive monitoring across all revenue metrics
- Segment-level root cause analysis
- Clear structure (6-part spec)
- Baseline-driven alerting
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Emphasize EXACT segmentation CTE usage from bq-schema.md (no modifications)
- Add org-specific baseline comparisons (each org vs its 30-day average)
- Add McCann split logic (McCann_NY vs McCann_Paris)
- Add Contract vs Pilot account separation
- Add power user tracking within orgs (top 20% by token usage)
- Add quota monitoring and underutilization detection
Enterprise-specific monitoring:
- DAU/WAU/MAU drops per org (> 30% vs org baseline)
- Token consumption vs contracted quota (< 50% utilization)
- User activation (% of seats active)
- Video/image generation engagement per org
- Power user drops (> 20% decline)
- Zero activity for 7+ consecutive days
Benefits:
- Org-specific baselines (not generic thresholds)
- Churn risk detection
- Customer success actionable alerts
- Exact segmentation compliance
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Add all 6 shared knowledge files to Phase 2 references
- Add detailed data source explanation (ltxvapi tables and GPU cost table)
- Add percentile calculations (P50/P95/P99 latency)
- Add error type separation (infrastructure vs applicative)
- Add baseline comparisons (7-day rolling average)
- Add alert routing by error type (Engineering vs API/Product team)
Performance monitoring coverage:
- P95 latency spikes (> 2x baseline or > 60s)
- Error rate increases (> 5% or DoD > 50%)
- Throughput drops (> 30% DoD/WoW)
- Queue time issues (> 50% of processing time)
- Infrastructure errors (> 10 requests/hour)
Benefits:
- Comprehensive API performance monitoring
- Endpoint/model/org breakdown for root cause
- Error type routing to appropriate teams
- Clear structure (6-part spec)
- Baseline-driven alerting
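The latency and error-rate rules listed in that commit can be sketched as two predicates. Thresholds come from the commit message; function names and the relative-change formulation of the DoD check are illustrative assumptions.

```python
def p95_alert(p95_latency_s: float, baseline_p95_s: float) -> bool:
    """P95 latency spike rule: above 2x the 7-day baseline,
    or above 60 seconds absolute."""
    return p95_latency_s > 2 * baseline_p95_s or p95_latency_s > 60.0

def error_rate_alert(today_rate: float, yesterday_rate: float) -> bool:
    """Error-rate rule: above 5% absolute, or a day-over-day
    increase of more than 50%."""
    if today_rate > 0.05:
        return True
    if yesterday_rate > 0 and (today_rate - yesterday_rate) / yesterday_rate > 0.5:
        return True
    return False
```

Routing (infrastructure errors to Engineering, applicative errors to the API/Product team) would branch on error type after these checks fire.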
- Remove duplications (data nuances, analysis patterns, routing)
- Consolidate 7 phases into 3 clean phases
- Apply statistical 2σ method like usage monitor
- Reduce from 200 to 111 lines (-44.5%)
- Update progress tracker to match 3-phase structure
- Clean up DO rules (19 → 7 essential rules)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary
Consolidates 5 separate monitoring agent PRs into a single unified PR for easier review and deployment.
Consolidated PRs
This PR combines:
What's Included
1. BE Cost Monitoring (agents/monitoring/be-cost/)
2. API Runtime Monitoring (agents/monitoring/api-runtime/)
3. Enterprise Account Monitoring (agents/monitoring/enterprise/)
4. Revenue Monitoring (agents/monitoring/revenue/)
5. Usage Monitoring (agents/monitoring/usage/)
Key Features (All Agents)
✅ Statistical anomaly detection - Data-driven thresholds, not manual tuning
✅ Segment-aware - Different thresholds for Enterprise vs self-serve
✅ Day-of-week patterns - Accounts for weekly variance
✅ Prioritized alerts - High/Medium/Low severity tiers
✅ Autonomous execution - Designed for scheduled runs
✅ Shared knowledge integration - References bq-schema.md, metric-standards.md
Architecture
Test Plan
Supersedes
This PR supersedes and should close:
Related
- shared/bq-schema.md for table schemas
- shared/metric-standards.md for metric definitions
- shared/gpu-cost-query-templates.md for GPU cost query patterns
- shared/enterprise-token-pools.md for token pool logic