Commit 4f9ee74

Merge BE cost monitoring agent (PR #43)

2 parents 9a15908 + a77114e

1 file changed: +69 −120 lines

agents/monitoring/be-cost/SKILL.md

Lines changed: 69 additions & 120 deletions
@@ -1,6 +1,6 @@
 ---
 name: be-cost-monitoring
-description: Monitor and analyze backend GPU costs for LTX API and LTX Studio. Use when analyzing cost trends, detecting anomalies, breaking down costs by endpoint/model/org/process, monitoring utilization, or investigating cost efficiency drift.
+description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org."
 tags: [monitoring, costs, gpu, infrastructure, alerts]
 compatibility:
 - BigQuery access (ltx-dwh-prod-processed)
@@ -9,154 +9,103 @@ compatibility:

 # Backend Cost Monitoring

-## When to use
+## 1. Overview (Why?)

-- "Monitor GPU costs"
-- "Analyze cost trends (daily/weekly/monthly)"
-- "Detect cost anomalies or spikes"
-- "Break down API costs by endpoint/model/org"
-- "Break down Studio costs by process/workspace"
-- "Compare API vs Studio cost distribution"
-- "Investigate cost-per-request efficiency drift"
-- "Monitor GPU utilization and idle costs"
-- "Alert on cost budget breaches"
-- "Day-over-day or week-over-week cost comparisons"
+This skill provides **autonomous cost monitoring** using statistical anomaly detection. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns.

-## Steps
+**Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms.

-### 1. Gather Requirements
+**Critical timing**: Analyzes data from **3 days ago** (cost data needs time to finalize).

-Ask the user:
-- **What to monitor**: Total costs, cost by product/feature, cost efficiency, utilization, anomalies
-- **Scope**: LTX API, LTX Studio, or both? Specific endpoint/org/process?
-- **Time window**: Daily, weekly, monthly? How far back?
-- **Analysis type**: Trends, comparisons (DoD/WoW), anomaly detection, breakdowns
-- **Alert threshold** (if setting up alerts): Absolute ($X/day) or relative (spike > X% vs baseline)
+## 2. Requirements (What?)

-### 2. Read Shared Knowledge
+Monitor these outcomes autonomously:

-Before writing SQL:
-- **`shared/product-context.md`** — LTX products and business context
+- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown)
+- [ ] Inference cost spikes (volume surge, costlier model, heavy customer)
+- [ ] Idle-to-inference ratio degradation (utilization dropping)
+- [ ] Failure rate spikes (wasted compute + service quality issue)
+- [ ] Cost-per-request drift (model regression or resolution creep)
+- [ ] Day-over-day cost jumps (early warning signals)
+- [ ] Volume drops (potential outage)
+- [ ] Overhead spikes (system anomalies)
+- [ ] Alerts prioritized by tier (High/Medium/Low)
+- [ ] Vertical-specific thresholds (API vs Studio)
+
+## 3. Progress Tracker
+
+* [ ] Read shared knowledge (schema, query templates, analysis patterns)
+* [ ] Execute monitoring query for 3 days ago
+* [ ] Calculate z-scores and classify severity (NOTICE/WARNING/CRITICAL)
+* [ ] Identify root cause (endpoint, model, org, process)
+* [ ] Present findings with statistical details and recommendations
+
+## 4. Implementation Plan
+
+### Phase 1: Read Shared Knowledge
+
+Before analyzing, read:
 - **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615)
-- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13)
-- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs)
 - **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries
 - **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks

-Key learnings:
-- Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
-- Partitioned by `dt` (DATE) — always filter for performance
-- `cost_category`: inference (requests), idle, overhead, unused
-- Total cost = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only)
-- For infrastructure cost: `SUM(row_cost)` across all categories
+**Statistical Method**: Last 10 same-day-of-week comparison with 2σ threshold

-### 3. Run Query Templates
+**Alert logic**: `|current_value - μ| > 2σ`

-**If user didn't specify anything:** Run all 11 query templates from `shared/gpu-cost-query-templates.md` to provide comprehensive cost overview.
+Where:
+- **μ (mean)** = average of last 10 same-day-of-week values
+- **σ (stddev)** = standard deviation of last 10 same-day-of-week values
+- **z-score** = (current - μ) / σ

-**If user specified specific analysis:** Select appropriate template:
+**Severity levels**:
+- **NOTICE**: `2 < |z| ≤ 3` - Minor deviation, monitor
+- **WARNING**: `3 < |z| ≤ 4.5` - Significant anomaly, investigate
+- **CRITICAL**: `|z| > 4.5` - Extreme anomaly, immediate action
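As an aside, the z-score and severity logic the new SKILL.md introduces can be sketched in a few lines of Python. This is an illustration only, not code from the repository; the `classify` helper and its signature are hypothetical, and it uses the sample standard deviation, which the skill text does not specify.

```python
from statistics import mean, stdev

def classify(current_cost, baseline):
    """Classify a day's cost against the last 10 same-day-of-week values.

    Returns (z_score, severity); severity is None when |z| <= 2,
    mirroring the NOTICE/WARNING/CRITICAL bands above.
    """
    mu = mean(baseline)       # μ: average of the baseline window
    sigma = stdev(baseline)   # σ: spread of the baseline window
    z = (current_cost - mu) / sigma
    if abs(z) > 4.5:
        return z, "CRITICAL"
    if abs(z) > 3:
        return z, "WARNING"
    if abs(z) > 2:
        return z, "NOTICE"
    return z, None
```

For example, with a baseline hovering around $100/day, a $110 day lands deep in CRITICAL territory while a $104 day is only a NOTICE.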

-| User asks... | Use template |
-|-------------|-------------|
-| "What are total costs?" | Daily Total Cost (API + Studio) |
-| "Break down API costs" | API Cost by Endpoint & Model |
-| "Break down Studio costs" | Studio Cost by Process |
-| "Yesterday vs day before" | Day-over-Day Comparison |
-| "This week vs last week" | Week-over-Week Comparison |
-| "Detect cost spikes" | Anomaly Detection (Z-Score) |
-| "GPU utilization breakdown" | Utilization by Cost Category |
-| "Cost efficiency by model" | Cost per Request by Model |
-| "Which orgs cost most?" | API Cost by Organization |
+### Phase 2: Execute Monitoring Query

-See `shared/gpu-cost-query-templates.md` for all 11 query templates.
+Run query for **3 days ago** (cost data needs time to finalize):

-### 4. Execute Query
-
-Run query using:
 ```bash
 bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
-<query>
+<query from gpu-cost-query-templates.md>
 "
 ```
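As an aside, the "3 days ago" target (the SQL equivalent in the new rules is `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`) combined with the last-10-same-weekday baseline window can be sketched in Python. An illustration only; the `monitoring_dates` helper is hypothetical and computes dates, not the skill's actual query.

```python
from datetime import date, timedelta

def monitoring_dates(today=None):
    """Target date is 3 days back (cost data needs time to finalize);
    the baseline is the 10 prior dates falling on the same weekday."""
    today = today or date.today()
    target = today - timedelta(days=3)
    # Stepping back whole weeks preserves the day of week.
    baseline = [target - timedelta(weeks=w) for w in range(1, 11)]
    return target, baseline
```

Running from a Friday, for instance, yields a Tuesday target and ten prior Tuesdays as the comparison window.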

-Or use BigQuery console with project `ltx-dwh-explore`.
-
-### 5. Analyze Results
-
-**For cost trends:**
-- Compare current period vs baseline (7-day avg, prior week, prior month)
-- Calculate % change and flag significant shifts (>15-20%)
-
-**For anomaly detection:**
-- Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg)
-- Investigate root cause: specific endpoint/model/org, error rate spike, billing type change
+**PREFERRED: Run all 11 templates for comprehensive overview**

-**For breakdowns:**
-- Identify top cost drivers (endpoint, model, org, process)
-- Calculate cost per request to spot efficiency issues
-- Check failure costs (wasted spend on errors)
+### Phase 3: Analyze and Present Results

-### 6. Present Findings
+Format findings with:
+- **Summary**: Key finding (e.g., "Idle costs spiked 45% on March 5")
+- **Severity**: NOTICE / WARNING / CRITICAL
+- **Z-score**: Statistical deviation from baseline
+- **Root cause**: Top contributors by endpoint/model/org/process
+- **Recommendation**: Action to take, team to notify

-Format results with:
-- **Summary**: Key finding (e.g., "GPU costs spiked 45% yesterday")
-- **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%")
-- **Breakdown**: Top contributors by dimension
-- **Recommendation**: Action to take (investigate org X, optimize model Y, alert team)
+## 5. Constraints & Done

-### 7. Set Up Alert (if requested)
+### DO NOT

-For ongoing monitoring:
-1. Save SQL query
-2. Set up in BigQuery scheduled query or Hex Thread
-3. Configure notification threshold
-4. Route alerts to Slack channel or Linear issue
+- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize)
+- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting
+- **DO NOT** skip partition filtering — always filter on `dt`
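As an aside, the double-counting rule follows from how the table attributes idle and overhead spend back into inference rows (per the removed "Key learnings" lines: total cost is the three columns summed on inference rows only; infrastructure cost is `SUM(row_cost)` over all categories). A small sketch with hypothetical rows, not the skill's actual query:

```python
# Hypothetical rows mirroring the cost columns described in the diff:
# cost_category, row_cost, attributed_idle_cost, attributed_overhead_cost.
rows = [
    {"cost_category": "inference", "row_cost": 10.0,
     "attributed_idle_cost": 2.0, "attributed_overhead_cost": 1.0},
    {"cost_category": "idle", "row_cost": 2.0,
     "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
    {"cost_category": "overhead", "row_cost": 1.0,
     "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
]

# Total infrastructure cost: SUM(row_cost) across all categories.
total_infra = sum(r["row_cost"] for r in rows)

# Fully loaded cost: all three columns, inference rows only.
# Matches total_infra, because idle and overhead spend are already
# attributed into the inference rows.
fully_loaded = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows if r["cost_category"] == "inference"
)

# WRONG: summing all three columns across every category counts the
# idle and overhead spend twice.
double_counted = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows
)
```

Here `total_infra` and `fully_loaded` both come to 13.0, while the incorrect aggregation inflates the figure to 16.0.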

-## Schema Reference
+### DO

-For detailed table schema including all dimensions, columns, and cost calculations, see `references/schema-reference.md`.
-
-## Reference Files
-
-| File | Read when |
-|------|-----------|
-| `references/schema-reference.md` | GPU cost table dimensions, columns, and cost calculations |
-| `shared/bq-schema.md` | Understanding GPU cost table schema (lines 418-615) |
-| `shared/metric-standards.md` | GPU cost metric SQL patterns (section 13) |
-| `shared/gpu-cost-query-templates.md` | Selecting query template for analysis (11 production-ready queries) |
-| `shared/gpu-cost-analysis-patterns.md` | Interpreting results, workflows, benchmarks, investigation playbooks |
-
-## Rules
-
-### Query Best Practices
-
-- **DO** always filter on `dt` partition column for performance
+- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as target date
 - **DO** filter `cost_category = 'inference'` for request-level analysis
-- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis
-- **DO** include LT team requests only when analyzing total infrastructure spend or debugging
-- **DO** use `ltx-dwh-explore` as execution project
-- **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero
-- **DO** compare against baseline (7-day avg, prior period) for trends
-- **DO** round cost values to 2 decimal places for readability
-
-### Cost Calculation
-
+- **DO** exclude LT team with `is_lt_team IS FALSE` for customer-facing cost
+- **DO** use execution project `ltx-dwh-explore`
 - **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request
-- **DO** use `SUM(row_cost)` across all rows for total infrastructure cost
-- **DO NOT** sum row_cost + attributed_* across all cost_categories (double-counting)
-- **DO NOT** mix inference and non-inference rows in same aggregation without filtering
-
-### Analysis
-
-- **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 std devs)
-- **DO** investigate failure costs (wasted spend on errors)
-- **DO** break down by endpoint/model for API, by process for Studio
-- **DO** check cost per request trends to spot efficiency degradation
-- **DO** validate results against total infrastructure spend
+- **DO** flag anomalies with Z-score > 2 (statistical threshold)
+- **DO** route alerts: API team (API costs), Studio team (Studio costs), Engineering (infrastructure)
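As an aside, the removed rules mention computing cost per request with `SAFE_DIVIDE` and rounding to 2 decimal places. That pattern can be sketched in Python; an illustration only, where `safe_divide` is a hypothetical stand-in for BigQuery's `SAFE_DIVIDE`, which returns NULL instead of erroring on division by zero.

```python
def safe_divide(numerator, denominator):
    """Mimic BigQuery SAFE_DIVIDE: return None instead of raising on /0."""
    return None if denominator == 0 else numerator / denominator

def cost_per_request(row_cost, attributed_idle, attributed_overhead, requests):
    """Fully loaded cost per request, rounded to 2 decimals for readability."""
    fully_loaded = row_cost + attributed_idle + attributed_overhead
    per_request = safe_divide(fully_loaded, requests)
    return None if per_request is None else round(per_request, 2)
```

The zero-request guard matters for low-traffic slices (a new endpoint, an org with no usage on the target day), where a plain division would abort the whole aggregation.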

-### Alerts
+### Completion Criteria

-- **DO** set thresholds based on historical baseline, not absolute values
-- **DO** alert engineering team for cost spikes > 30% vs baseline
-- **DO** include cost breakdown and root cause in alerts
-- **DO** route API cost alerts to API team, Studio alerts to Studio team
+✅ Query executed for 3 days ago with partition filtering
+✅ Statistical analysis applied (2σ threshold)
+✅ Alerts classified by severity (NOTICE/WARNING/CRITICAL)
+✅ Root cause identified (endpoint/model/org/process)
+✅ Results presented with cost breakdown and z-scores
