
feat: Consolidate all monitoring agents (BE Cost, API Runtime, Enterprise, Revenue, Usage)#51

Open
Daniel Beer (DanielB945) wants to merge 25 commits into main from consolidate-monitoring-agents

Conversation

@DanielB945
Collaborator

Summary

Consolidates 5 separate monitoring agent PRs into a single unified PR for easier review and deployment.

Consolidated PRs

This PR combines:

What's Included

1. BE Cost Monitoring (agents/monitoring/be-cost/)

  • Autonomous cost monitoring with statistical anomaly detection
  • Data-driven thresholds from 60-day analysis
  • Monitors: idle costs, inference costs, utilization ratio, failure rate, cost-per-request
  • Analyzes data from 3 days ago (cost data finalization delay)
  • Vertical-specific thresholds (API vs Studio)

2. API Runtime Monitoring (agents/monitoring/api-runtime/)

  • Latency, error rate, and throughput monitoring
  • Request volume tracking by endpoint/model
  • Performance degradation detection
  • SLA violation alerts

3. Enterprise Account Monitoring (agents/monitoring/enterprise/)

  • Account health scoring and churn risk assessment
  • Quota consumption and overage tracking
  • User activation and feature adoption
  • QBR preparation and customer success insights

4. Revenue Monitoring (agents/monitoring/revenue/)

  • MRR, ARR, subscription, and churn tracking
  • Refund and payment failure monitoring
  • Revenue anomaly detection
  • Cohort analysis and retention metrics

5. Usage Monitoring (agents/monitoring/usage/)

  • Statistical anomaly detection (2σ threshold)
  • Same-day-of-week baseline comparison (10 data points)
  • Segment-specific monitoring (Enterprise, Pilot, Pro, Standard, Lite, Free)
  • Autonomous Python script with BigQuery integration
  • NEW: Complete README with usage instructions and architecture
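The statistical method above can be sketched as a small z-score check. This is an illustrative sketch only, not the actual contents of usage_monitor.py; the function name, metric values, and return shape are hypothetical.

```python
# Hypothetical sketch of the 2-sigma same-day-of-week check described above.
# Names and values are illustrative; see usage_monitor.py for the real logic.
from statistics import mean, stdev

def detect_anomaly(today_value, same_dow_history, z_threshold=2.0):
    """Compare today's metric to prior same-day-of-week values (~10 points)."""
    baseline = mean(same_dow_history)
    spread = stdev(same_dow_history)
    if spread == 0:
        # No variance in history: cannot compute a z-score.
        return {"anomaly": False, "z": 0.0, "baseline": baseline}
    z = (today_value - baseline) / spread
    return {"anomaly": abs(z) > z_threshold, "z": z, "baseline": baseline}

# e.g. ten prior Mondays of DAU, then today's Monday value
history = [1200, 1180, 1215, 1190, 1205, 1198, 1210, 1185, 1202, 1195]
print(detect_anomaly(900, history))  # large drop relative to baseline -> anomaly
```

The 2σ threshold flags anomalies in both directions (spikes and drops), which matters for the bidirectional detection noted in the commit history below.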

Key Features (All Agents)

  • Statistical anomaly detection - data-driven thresholds, not manual tuning
  • Segment-aware - different thresholds for Enterprise vs self-serve
  • Day-of-week patterns - accounts for weekly variance
  • Prioritized alerts - High/Medium/Low severity tiers
  • Autonomous execution - designed for scheduled runs
  • Shared knowledge integration - references bq-schema.md and metric-standards.md

Architecture

agents/monitoring/
  ├── SKILL.md                    ← Router agent (dispatches to sub-agents)
  ├── be-cost/SKILL.md           ← GPU cost monitoring
  ├── api-runtime/SKILL.md       ← API performance monitoring
  ├── enterprise/SKILL.md        ← Account health monitoring
  ├── revenue/SKILL.md           ← Financial metrics monitoring
  └── usage/
      ├── SKILL.md               ← Usage anomaly detection
      ├── README.md              ← NEW: Setup and usage guide
      ├── usage_monitor.py       ← Autonomous Python script
      ├── investigate_root_cause.sql ← Drill-down query
      └── references/
          └── query-templates.md ← SQL patterns

Test Plan

  • Review all monitoring agent workflows for consistency
  • Verify statistical thresholds match documented baselines
  • Test BigQuery access and permissions
  • Confirm alert severity prioritization logic
  • Validate segment definitions across all agents

Supersedes

This PR supersedes and should close:

Related

  • See shared/bq-schema.md for table schemas
  • See shared/metric-standards.md for metric definitions
  • See shared/gpu-cost-query-templates.md for GPU cost query patterns
  • See shared/enterprise-token-pools.md for token pool logic

Daniel Beer (DanielB945) and others added 25 commits March 8, 2026 23:05
Transform all monitoring agents to autonomous problem detectors and implement
a production-ready usage monitoring system with statistically-derived alerting
thresholds based on 60-day analysis.

Key Changes:

1. Autonomous Monitoring Across All Agents
   - Updated all 5 monitoring agents (usage, be-cost, revenue, enterprise, api-runtime)
   - Changed Step 1 from "Gather Requirements" to "Run Comprehensive Analysis"
   - Auto-analyze ALL metrics, segments, and time windows without user prompting
   - Auto-detect problems using statistical thresholds (Z-score, DoD, WoW, baseline)

2. Production Usage Monitoring Implementation
   - Data-driven thresholds per segment based on 60-day volatility analysis
   - 14-day same-day-of-week baseline methodology (handles weekday/weekend patterns)
   - Segment-specific thresholds:
     * Enterprise Contract/Pilot: -50% DAU, -60% Image Gens, -70% Tokens (weekday only)
     * Heavy Users: -25% DAU, -30% Image Gens, -25% Tokens
     * Paying non-Enterprise: -20% DAU, -25% Image Gens, -20% Tokens
     * Free: -25% DAU, -35% Image Gens, -20% Tokens
   - Two-tier severity: WARNING (drop > threshold), CRITICAL (drop > 1.5x threshold)
   - Weekend suppression for Enterprise (weekday-only alerting)
   - Skip Enterprise video generation alerts (CV > 100%, too volatile)

3. Root Cause Investigation Workflow
   - Enterprise segments: Drill down to organization level to identify which clients drove drops
   - Other segments: Analyze by tier distribution (Standard vs Pro vs Lite)
   - Alert format includes current vs baseline, drop %, threshold, and recommended actions

4. Full Segmentation Enforcement
   - Updated usage monitor to reference full segmentation CTE from shared/bq-schema.md
   - Enforces proper hierarchy: Enterprise → Heavy → Paying → Free
   - Consistent segmentation across all usage monitoring queries

Technical Details:
- Alert logic: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
- 14-day same-DOW baseline: AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
- Data source: ltx-dwh-prod-processed.web.ltxstudio_agg_user_date
- Partition pruning on dt (DATE) for performance
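The alert logic and baseline window described above can be combined into a sketch like the following, mirroring the project's pattern of embedding SQL in Python. The table name is the one cited above, but the column names (dau, segment) and variable names are assumptions.

```python
# Hedged sketch of the baseline pattern from the Technical Details above.
# Table name is from the commit message; column names are assumptions.
BASELINE_SQL = """
SELECT
  dt,
  segment,
  dau,
  AVG(dau) OVER (
    PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt)
  ) AS same_dow_baseline
FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date`
WHERE dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY)  -- partition pruning on dt
"""

def is_alert(today_value, baseline, threshold_pct):
    # Alert rule from the commit: today below baseline by more than threshold_pct.
    return today_value < baseline * (1 - threshold_pct)

print(is_alert(40.0, 100.0, 0.5))  # 60% drop vs a 50% threshold -> True
```

Partitioning the window by both segment and day-of-week is what keeps weekday and weekend baselines separate, avoiding the DoD-on-weekends anti-pattern listed below.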

Anti-Patterns Addressed:
- Generic thresholds replaced with segment-specific, data-driven values
- Day-of-week effects handled via same-DOW comparison (not DoD on weekends)
- Enterprise weekend alerts suppressed (6-7 DAU is 18% of weekday, too noisy)
- Enterprise video gen alerts skipped (single-user dominated, CV > 100%)
- Small segment noise handled via production threshold calibration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Keep only production monitoring implementation (usage_monitoring_v2.py)
Reference data-driven thresholds from 60-day analysis:
- Tier 1 High Priority: Idle cost spike, Inference cost spike, Idle-to-inference ratio
- Tier 2 Medium Priority: Failure rate, Cost-per-request drift, DoD cost jump
- Tier 3 Low Priority: Volume drop, Overhead spike
- Per-vertical thresholds for LTX API and LTX Studio

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Cost data needs time to finalize, so analyze data from 3 days ago instead of yesterday.
Use DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in queries.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Include comprehensive shared knowledge files:
- product-context.md for business model and user types
- bq-schema.md for subscription tables and segmentation
- metric-standards.md for revenue metric definitions
- event-registry.yaml for feature-driven revenue analysis

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Expand Step 2 to include all shared knowledge files:
- product-context.md for LTX products and API context
- bq-schema.md for API tables and GPU cost data
- metric-standards.md for performance metrics
- event-registry.yaml for event-driven metrics
- gpu-cost-query-templates.md for cost-related performance
- gpu-cost-analysis-patterns.md for cost analysis patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Changes:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules + Anti-Patterns into Constraints section
- Moved Reference Files + Production Scripts to Context section
- Added completion criteria
- Kept under 500 lines (402 lines total)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rmat

Convert all monitoring agents to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Updated skills:
- be-cost-monitoring (314 lines) - GPU cost with production thresholds
- revenue-monitor (272 lines) - Revenue/subscription monitoring
- enterprise-monitor (341 lines) - Enterprise account health
- api-runtime-monitor (359 lines) - API performance monitoring

All skills:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules into Constraints section (DO/DO NOT)
- Moved Reference Files to Context section
- Added completion criteria
- Kept under 500 lines per Agent Skills spec

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Replace separate SQL + Python files with combined usage_monitor.py
- Update from percentage thresholds to statistical 3σ method
- Embed SQL query as string in Python script
- Execute BigQuery directly (no CSV intermediate)
- Add --date parameter for flexible date selection
- Update SKILL.md to reference combined script

Benefits:
- Single file to maintain (353 lines vs 413 lines)
- No intermediate CSV files needed
- Easier to run and schedule
- Self-contained SQL + alerting logic
Change focus from 'detecting drops' to 'detecting data spikes' to reflect
that the statistical method detects anomalies in both directions:
- Increases (e.g., +46.6% token spike on 2026-03-09)
- Decreases (e.g., churn, engagement drops)

Updated:
- Overview: Problem solved now mentions both increases and decreases
- Requirements: Changed 'drops' to 'spikes (increases or decreases)'
- Description: Changed 'detecting usage drops' to 'detecting usage anomalies'
Drop paragraph about segment/day-of-week variance details.
Keep Overview focused on the solution (statistical anomaly detection)
rather than the detailed problem context.
…isclosure

Major changes:
- Reduced from 335 lines to 182 lines (-45%)
- Removed duplicate SQL query (already in usage_monitor.py)
- Removed duplicate alert format examples
- Consolidated overlapping phases (Phase 4 + 5 → Phase 4)
- Simplified DO/DO NOT section (removed repetitive rules)
- Applied progressive disclosure (method → run → analyze → present)
- Kept only essential information, reference scripts for details

Benefits:
- Clearer, more focused instructions
- Less maintenance (single source of truth in Python script)
- Easier to scan and understand
- Follows Agent Skills spec better (<500 lines, minimal duplication)
1. Change date example from specific date to 'yesterday'
2. Remove '(skip for Enterprise - too volatile)' from video generation line

The exception details are still preserved in Phase 1 where they belong.
… exceptions

- Lower alert threshold from 3σ to 2σ with new NOTICE severity (2 < |z| ≤ 3)
- Compare yesterday's metrics instead of today's
- Remove enterprise video gen exceptions
- Add event-registry.yaml to shared knowledge
- Remove Phase 6, Context & References section
- Add Free segment to tier distribution checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
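The severity bands implied by this change can be sketched as below. The NOTICE band (2 < |z| ≤ 3) is stated in the commit message; the label used for |z| > 3 here ("ALERT") is an assumption.

```python
# Sketch of the z-score severity bands implied above. The NOTICE band is from
# the commit message; the "ALERT" label for |z| > 3 is an assumption.
def severity(z: float):
    if abs(z) > 3:
        return "ALERT"   # beyond the original 3-sigma threshold
    if abs(z) > 2:
        return "NOTICE"  # new band introduced by this commit: 2 < |z| <= 3
    return None          # within normal variance

print(severity(2.5))  # falls in the new NOTICE band
```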
Major updates:
- Add production thresholds from 60-day statistical analysis
- Tier-based alerting: Tier 1 (High), Tier 2 (Medium), Tier 3 (Low)
- Change timing: analyze data from 3 days ago (cost data needs time to finalize)
- Restructure to 6-part Agent Skills format
- Add per-vertical thresholds (API vs Studio)

Tier 1 alerts:
- Idle cost spike > $15,600/day
- Inference cost spike > $5,743/day
- Idle-to-inference ratio > 4:1

Benefits:
- Data-driven thresholds (not guesses)
- Prioritized alerts by tier
- Proper timing (3-day lookback)
- Clear structure (6-part spec)
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Add all shared knowledge files to Phase 2 references
- Add comprehensive revenue metrics (MRR, ARR, churn, refunds, new subs)
- Add segment-level analysis (tier + plan type)
- Add baseline comparisons (7-day rolling average)
- Add alert thresholds (generic, pending production analysis)

Monitoring coverage:
- Revenue drops by segment
- MRR and ARR trends
- Churn rate increases
- Refund rate spikes
- New subscription volume
- Tier movements (upgrades/downgrades)
- Enterprise contract renewals

Benefits:
- Comprehensive monitoring across all revenue metrics
- Segment-level root cause analysis
- Clear structure (6-part spec)
- Baseline-driven alerting
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Emphasize EXACT segmentation CTE usage from bq-schema.md (no modifications)
- Add org-specific baseline comparisons (each org vs its 30-day average)
- Add McCann split logic (McCann_NY vs McCann_Paris)
- Add Contract vs Pilot account separation
- Add power user tracking within orgs (top 20% by token usage)
- Add quota monitoring and underutilization detection

Enterprise-specific monitoring:
- DAU/WAU/MAU drops per org (> 30% vs org baseline)
- Token consumption vs contracted quota (< 50% utilization)
- User activation (% of seats active)
- Video/image generation engagement per org
- Power user drops (> 20% decline)
- Zero activity for 7+ consecutive days

Benefits:
- Org-specific baselines (not generic thresholds)
- Churn risk detection
- Customer success actionable alerts
- Exact segmentation compliance
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Add all 6 shared knowledge files to Phase 2 references
- Add detailed data source explanation (ltxvapi tables and GPU cost table)
- Add percentile calculations (P50/P95/P99 latency)
- Add error type separation (infrastructure vs applicative)
- Add baseline comparisons (7-day rolling average)
- Add alert routing by error type (Engineering vs API/Product team)

Performance monitoring coverage:
- P95 latency spikes (> 2x baseline or > 60s)
- Error rate increases (> 5% or DoD > 50%)
- Throughput drops (> 30% DoD/WoW)
- Queue time issues (> 50% of processing time)
- Infrastructure errors (> 10 requests/hour)
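The P95 latency rule above (alert when P95 exceeds 2x the baseline or 60s absolute) reduces to a simple predicate. This is a sketch; the function name and units are hypothetical.

```python
# Illustrative check for the P95 latency rule listed above; names are hypothetical.
def p95_latency_alert(p95_s: float, baseline_p95_s: float) -> bool:
    """Alert when P95 latency exceeds 2x the 7-day baseline OR 60s absolute."""
    return p95_s > 2 * baseline_p95_s or p95_s > 60.0

print(p95_latency_alert(45.0, 20.0))  # 45s > 2 * 20s baseline -> True
print(p95_latency_alert(30.0, 20.0))  # neither condition met -> False
```

The absolute 60s cap catches cases where a slow baseline would otherwise mask a severe regression.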

Benefits:
- Comprehensive API performance monitoring
- Endpoint/model/org breakdown for root cause
- Error type routing to appropriate teams
- Clear structure (6-part spec)
- Baseline-driven alerting
- Remove duplications (data nuances, analysis patterns, routing)
- Consolidate 7 phases into 3 clean phases
- Apply statistical 2σ method like usage monitor
- Reduce from 200 to 111 lines (-44.5%)
- Update progress tracker to match 3-phase structure
- Clean up DO rules (19 → 7 essential rules)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>