
feat: Consolidate all monitoring agents (BE Cost, API Runtime, Enterprise, Revenue, Usage)#51

Open
Daniel Beer (DanielB945) wants to merge 25 commits into main from consolidate-monitoring-agents

Conversation

@DanielB945
Collaborator

Summary

Consolidates 5 separate monitoring agent PRs into a single unified PR for easier review and deployment.

Consolidated PRs

This PR combines:

What's Included

1. BE Cost Monitoring (agents/monitoring/be-cost/)

  • Autonomous cost monitoring with statistical anomaly detection
  • Data-driven thresholds from 60-day analysis
  • Monitors: idle costs, inference costs, utilization ratio, failure rate, cost-per-request
  • Analyzes data from 3 days ago (cost data finalization delay)
  • Vertical-specific thresholds (API vs Studio)

2. API Runtime Monitoring (agents/monitoring/api-runtime/)

  • Latency, error rate, and throughput monitoring
  • Request volume tracking by endpoint/model
  • Performance degradation detection
  • SLA violation alerts

3. Enterprise Account Monitoring (agents/monitoring/enterprise/)

  • Account health scoring and churn risk assessment
  • Quota consumption and overage tracking
  • User activation and feature adoption
  • QBR preparation and customer success insights

4. Revenue Monitoring (agents/monitoring/revenue/)

  • MRR, ARR, subscription, and churn tracking
  • Refund and payment failure monitoring
  • Revenue anomaly detection
  • Cohort analysis and retention metrics

5. Usage Monitoring (agents/monitoring/usage/)

  • Statistical anomaly detection (2σ threshold)
  • Same-day-of-week baseline comparison (10 data points)
  • Segment-specific monitoring (Enterprise, Pilot, Pro, Standard, Lite, Free)
  • Autonomous Python script with BigQuery integration
  • NEW: Complete README with usage instructions and architecture
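The statistical method above can be sketched as a small z-score check. This is an illustrative sketch only, not the actual contents of usage_monitor.py; the function name, metric values, and return shape are hypothetical.

```python
# Hypothetical sketch of the 2-sigma same-day-of-week check described above.
# Names and values are illustrative; see usage_monitor.py for the real logic.
from statistics import mean, stdev

def detect_anomaly(today_value, same_dow_history, z_threshold=2.0):
    """Compare today's metric to prior same-day-of-week values (~10 points)."""
    baseline = mean(same_dow_history)
    spread = stdev(same_dow_history)
    if spread == 0:
        # No variance in history: cannot compute a z-score.
        return {"anomaly": False, "z": 0.0, "baseline": baseline}
    z = (today_value - baseline) / spread
    return {"anomaly": abs(z) > z_threshold, "z": z, "baseline": baseline}

# e.g. ten prior Mondays of DAU, then today's Monday value
history = [1200, 1180, 1215, 1190, 1205, 1198, 1210, 1185, 1202, 1195]
print(detect_anomaly(900, history))  # large drop relative to baseline -> anomaly
```

The 2σ threshold flags anomalies in both directions (spikes and drops), which matters for the bidirectional detection noted in the commit history below.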

Key Features (All Agents)

  • Statistical anomaly detection - data-driven thresholds, not manual tuning
  • Segment-aware - different thresholds for Enterprise vs self-serve
  • Day-of-week patterns - accounts for weekly variance
  • Prioritized alerts - High/Medium/Low severity tiers
  • Autonomous execution - designed for scheduled runs
  • Shared knowledge integration - references bq-schema.md and metric-standards.md

Architecture

agents/monitoring/
  ├── SKILL.md                    ← Router agent (dispatches to sub-agents)
  ├── be-cost/SKILL.md           ← GPU cost monitoring
  ├── api-runtime/SKILL.md       ← API performance monitoring
  ├── enterprise/SKILL.md        ← Account health monitoring
  ├── revenue/SKILL.md           ← Financial metrics monitoring
  └── usage/
      ├── SKILL.md               ← Usage anomaly detection
      ├── README.md              ← NEW: Setup and usage guide
      ├── usage_monitor.py       ← Autonomous Python script
      ├── investigate_root_cause.sql ← Drill-down query
      └── references/
          └── query-templates.md ← SQL patterns

Test Plan

  • Review all monitoring agent workflows for consistency
  • Verify statistical thresholds match documented baselines
  • Test BigQuery access and permissions
  • Confirm alert severity prioritization logic
  • Validate segment definitions across all agents

Supersedes

This PR supersedes and should close:

Related

  • See shared/bq-schema.md for table schemas
  • See shared/metric-standards.md for metric definitions
  • See shared/gpu-cost-query-templates.md for GPU cost query patterns
  • See shared/enterprise-token-pools.md for token pool logic

Daniel Beer (DanielB945) and others added 25 commits March 8, 2026 23:05
Transform all monitoring agents to autonomous problem detectors and implement
a production-ready usage monitoring system with statistically-derived alerting
thresholds based on 60-day analysis.

Key Changes:

1. Autonomous Monitoring Across All Agents
   - Updated all 5 monitoring agents (usage, be-cost, revenue, enterprise, api-runtime)
   - Changed Step 1 from "Gather Requirements" to "Run Comprehensive Analysis"
   - Auto-analyze ALL metrics, segments, and time windows without user prompting
   - Auto-detect problems using statistical thresholds (Z-score, DoD, WoW, baseline)

2. Production Usage Monitoring Implementation
   - Data-driven thresholds per segment based on 60-day volatility analysis
   - 14-day same-day-of-week baseline methodology (handles weekday/weekend patterns)
   - Segment-specific thresholds:
     * Enterprise Contract/Pilot: -50% DAU, -60% Image Gens, -70% Tokens (weekday only)
     * Heavy Users: -25% DAU, -30% Image Gens, -25% Tokens
     * Paying non-Enterprise: -20% DAU, -25% Image Gens, -20% Tokens
     * Free: -25% DAU, -35% Image Gens, -20% Tokens
   - Two-tier severity: WARNING (drop > threshold), CRITICAL (drop > 1.5x threshold)
   - Weekend suppression for Enterprise (weekday-only alerting)
   - Skip Enterprise video generation alerts (CV > 100%, too volatile)

3. Root Cause Investigation Workflow
   - Enterprise segments: Drill down to organization level to identify which clients drove drops
   - Other segments: Analyze by tier distribution (Standard vs Pro vs Lite)
   - Alert format includes current vs baseline, drop %, threshold, and recommended actions

4. Full Segmentation Enforcement
   - Updated usage monitor to reference full segmentation CTE from shared/bq-schema.md
   - Enforces proper hierarchy: Enterprise → Heavy → Paying → Free
   - Consistent segmentation across all usage monitoring queries

Technical Details:
- Alert logic: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
- 14-day same-DOW baseline: AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
- Data source: ltx-dwh-prod-processed.web.ltxstudio_agg_user_date
- Partition pruning on dt (DATE) for performance
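The alert logic and baseline window described above can be combined into a sketch like the following, mirroring the project's pattern of embedding SQL in Python. The table name is the one cited above, but the column names (dau, segment) and variable names are assumptions.

```python
# Hedged sketch of the baseline pattern from the Technical Details above.
# Table name is from the commit message; column names are assumptions.
BASELINE_SQL = """
SELECT
  dt,
  segment,
  dau,
  AVG(dau) OVER (
    PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt)
  ) AS same_dow_baseline
FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date`
WHERE dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY)  -- partition pruning on dt
"""

def is_alert(today_value, baseline, threshold_pct):
    # Alert rule from the commit: today below baseline by more than threshold_pct.
    return today_value < baseline * (1 - threshold_pct)

print(is_alert(40.0, 100.0, 0.5))  # 60% drop vs a 50% threshold -> True
```

Partitioning the window by both segment and day-of-week is what keeps weekday and weekend baselines separate, avoiding the DoD-on-weekends anti-pattern listed below.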

Anti-Patterns Addressed:
- Generic thresholds replaced with segment-specific, data-driven values
- Day-of-week effects handled via same-DOW comparison (not DoD on weekends)
- Enterprise weekend alerts suppressed (6-7 DAU is 18% of weekday, too noisy)
- Enterprise video gen alerts skipped (single-user dominated, CV > 100%)
- Small segment noise handled via production threshold calibration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Keep only production monitoring implementation (usage_monitoring_v2.py)
Reference data-driven thresholds from 60-day analysis:
- Tier 1 High Priority: Idle cost spike, Inference cost spike, Idle-to-inference ratio
- Tier 2 Medium Priority: Failure rate, Cost-per-request drift, DoD cost jump
- Tier 3 Low Priority: Volume drop, Overhead spike
- Per-vertical thresholds for LTX API and LTX Studio

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Cost data needs time to finalize, so analyze data from 3 days ago instead of yesterday.
Use DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in queries.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Include comprehensive shared knowledge files:
- product-context.md for business model and user types
- bq-schema.md for subscription tables and segmentation
- metric-standards.md for revenue metric definitions
- event-registry.yaml for feature-driven revenue analysis

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Expand Step 2 to include all shared knowledge files:
- product-context.md for LTX products and API context
- bq-schema.md for API tables and GPU cost data
- metric-standards.md for performance metrics
- event-registry.yaml for event-driven metrics
- gpu-cost-query-templates.md for cost-related performance
- gpu-cost-analysis-patterns.md for cost analysis patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Changes:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules + Anti-Patterns into Constraints section
- Moved Reference Files + Production Scripts to Context section
- Added completion criteria
- Kept under 500 lines (402 lines total)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rmat

Convert all monitoring agents to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Updated skills:
- be-cost-monitoring (314 lines) - GPU cost with production thresholds
- revenue-monitor (272 lines) - Revenue/subscription monitoring
- enterprise-monitor (341 lines) - Enterprise account health
- api-runtime-monitor (359 lines) - API performance monitoring

All skills:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules into Constraints section (DO/DO NOT)
- Moved Reference Files to Context section
- Added completion criteria
- Kept under 500 lines per Agent Skills spec

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Replace separate SQL + Python files with combined usage_monitor.py
- Update from percentage thresholds to statistical 3σ method
- Embed SQL query as string in Python script
- Execute BigQuery directly (no CSV intermediate)
- Add --date parameter for flexible date selection
- Update SKILL.md to reference combined script

Benefits:
- Single file to maintain (353 lines vs 413 lines)
- No intermediate CSV files needed
- Easier to run and schedule
- Self-contained SQL + alerting logic
Change focus from 'detecting drops' to 'detecting data spikes' to reflect
that the statistical method detects anomalies in both directions:
- Increases (e.g., +46.6% token spike on 2026-03-09)
- Decreases (e.g., churn, engagement drops)

Updated:
- Overview: Problem solved now mentions both increases and decreases
- Requirements: Changed 'drops' to 'spikes (increases or decreases)'
- Description: Changed 'detecting usage drops' to 'detecting usage anomalies'
Drop paragraph about segment/day-of-week variance details.
Keep Overview focused on the solution (statistical anomaly detection)
rather than the detailed problem context.
…isclosure

Major changes:
- Reduced from 335 lines to 182 lines (-45%)
- Removed duplicate SQL query (already in usage_monitor.py)
- Removed duplicate alert format examples
- Consolidated overlapping phases (Phase 4 + 5 → Phase 4)
- Simplified DO/DO NOT section (removed repetitive rules)
- Applied progressive disclosure (method → run → analyze → present)
- Kept only essential information, reference scripts for details

Benefits:
- Clearer, more focused instructions
- Less maintenance (single source of truth in Python script)
- Easier to scan and understand
- Follows Agent Skills spec better (<500 lines, minimal duplication)
1. Change date example from specific date to 'yesterday'
2. Remove '(skip for Enterprise - too volatile)' from video generation line

The exception details are still preserved in Phase 1 where they belong.
… exceptions

- Lower alert threshold from 3σ to 2σ with new NOTICE severity (2 < |z| ≤ 3)
- Compare yesterday's metrics instead of today's
- Remove enterprise video gen exceptions
- Add event-registry.yaml to shared knowledge
- Remove Phase 6, Context & References section
- Add Free segment to tier distribution checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
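The severity bands implied by this change can be sketched as below. The NOTICE band (2 < |z| ≤ 3) is stated in the commit message; the label used for |z| > 3 here ("ALERT") is an assumption.

```python
# Sketch of the z-score severity bands implied above. The NOTICE band is from
# the commit message; the "ALERT" label for |z| > 3 is an assumption.
def severity(z: float):
    if abs(z) > 3:
        return "ALERT"   # beyond the original 3-sigma threshold
    if abs(z) > 2:
        return "NOTICE"  # new band introduced by this commit: 2 < |z| <= 3
    return None          # within normal variance

print(severity(2.5))  # falls in the new NOTICE band
```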
Major updates:
- Add production thresholds from 60-day statistical analysis
- Tier-based alerting: Tier 1 (High), Tier 2 (Medium), Tier 3 (Low)
- Change timing: analyze data from 3 days ago (cost data needs time to finalize)
- Restructure to 6-part Agent Skills format
- Add per-vertical thresholds (API vs Studio)

Tier 1 alerts:
- Idle cost spike > $15,600/day
- Inference cost spike > $5,743/day
- Idle-to-inference ratio > 4:1

Benefits:
- Data-driven thresholds (not guesses)
- Prioritized alerts by tier
- Proper timing (3-day lookback)
- Clear structure (6-part spec)
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Add all shared knowledge files to Phase 2 references
- Add comprehensive revenue metrics (MRR, ARR, churn, refunds, new subs)
- Add segment-level analysis (tier + plan type)
- Add baseline comparisons (7-day rolling average)
- Add alert thresholds (generic, pending production analysis)

Monitoring coverage:
- Revenue drops by segment
- MRR and ARR trends
- Churn rate increases
- Refund rate spikes
- New subscription volume
- Tier movements (upgrades/downgrades)
- Enterprise contract renewals

Benefits:
- Comprehensive monitoring across all revenue metrics
- Segment-level root cause analysis
- Clear structure (6-part spec)
- Baseline-driven alerting
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Emphasize EXACT segmentation CTE usage from bq-schema.md (no modifications)
- Add org-specific baseline comparisons (each org vs its 30-day average)
- Add McCann split logic (McCann_NY vs McCann_Paris)
- Add Contract vs Pilot account separation
- Add power user tracking within orgs (top 20% by token usage)
- Add quota monitoring and underutilization detection

Enterprise-specific monitoring:
- DAU/WAU/MAU drops per org (> 30% vs org baseline)
- Token consumption vs contracted quota (< 50% utilization)
- User activation (% of seats active)
- Video/image generation engagement per org
- Power user drops (> 20% decline)
- Zero activity for 7+ consecutive days

Benefits:
- Org-specific baselines (not generic thresholds)
- Churn risk detection
- Customer success actionable alerts
- Exact segmentation compliance
Major updates:
- Restructure to 6-part Agent Skills format (Overview → Constraints)
- Add all 6 shared knowledge files to Phase 2 references
- Add detailed data source explanation (ltxvapi tables and GPU cost table)
- Add percentile calculations (P50/P95/P99 latency)
- Add error type separation (infrastructure vs applicative)
- Add baseline comparisons (7-day rolling average)
- Add alert routing by error type (Engineering vs API/Product team)

Performance monitoring coverage:
- P95 latency spikes (> 2x baseline or > 60s)
- Error rate increases (> 5% or DoD > 50%)
- Throughput drops (> 30% DoD/WoW)
- Queue time issues (> 50% of processing time)
- Infrastructure errors (> 10 requests/hour)
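The P95 latency rule above (alert when P95 exceeds 2x the baseline or 60s absolute) reduces to a simple predicate. This is a sketch; the function name and units are hypothetical.

```python
# Illustrative check for the P95 latency rule listed above; names are hypothetical.
def p95_latency_alert(p95_s: float, baseline_p95_s: float) -> bool:
    """Alert when P95 latency exceeds 2x the 7-day baseline OR 60s absolute."""
    return p95_s > 2 * baseline_p95_s or p95_s > 60.0

print(p95_latency_alert(45.0, 20.0))  # 45s > 2 * 20s baseline -> True
print(p95_latency_alert(30.0, 20.0))  # neither condition met -> False
```

The absolute 60s cap catches cases where a slow baseline would otherwise mask a severe regression.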

Benefits:
- Comprehensive API performance monitoring
- Endpoint/model/org breakdown for root cause
- Error type routing to appropriate teams
- Clear structure (6-part spec)
- Baseline-driven alerting
- Remove duplications (data nuances, analysis patterns, routing)
- Consolidate 7 phases into 3 clean phases
- Apply statistical 2σ method like usage monitor
- Reduce from 200 to 111 lines (-44.5%)
- Update progress tracker to match 3-phase structure
- Clean up DO rules (19 → 7 essential rules)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>