diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md index c2f177e..88f486e 100644 --- a/agents/monitoring/be-cost/SKILL.md +++ b/agents/monitoring/be-cost/SKILL.md @@ -1,6 +1,6 @@ --- name: be-cost-monitoring -description: Monitor and analyze backend GPU costs for LTX API and LTX Studio. Use when analyzing cost trends, detecting anomalies, breaking down costs by endpoint/model/org/process, monitoring utilization, or investigating cost efficiency drift. +description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org." tags: [monitoring, costs, gpu, infrastructure, alerts] compatibility: - BigQuery access (ltx-dwh-prod-processed) @@ -9,154 +9,103 @@ compatibility: # Backend Cost Monitoring -## When to use +## 1. Overview (Why?) -- "Monitor GPU costs" -- "Analyze cost trends (daily/weekly/monthly)" -- "Detect cost anomalies or spikes" -- "Break down API costs by endpoint/model/org" -- "Break down Studio costs by process/workspace" -- "Compare API vs Studio cost distribution" -- "Investigate cost-per-request efficiency drift" -- "Monitor GPU utilization and idle costs" -- "Alert on cost budget breaches" -- "Day-over-day or week-over-week cost comparisons" +This skill provides **autonomous cost monitoring** using statistical anomaly detection. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns. -## Steps +**Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms. -### 1. Gather Requirements +**Critical timing**: Analyzes data from **3 days ago** (cost data needs time to finalize). 
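The 3-day lag above can be derived client-side when scripting a monitoring run. A minimal sketch, assuming nothing beyond the Python standard library (the function and variable names are illustrative, not part of the skill's tooling); it mirrors `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` from the query templates:

```python
from datetime import date, timedelta
from typing import Optional

def monitoring_target_date(today: Optional[date] = None) -> date:
    """Return the date to analyze: 3 days back, mirroring
    DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in the BigQuery templates."""
    today = today or date.today()
    return today - timedelta(days=3)

# Example: build the partition filter for the WHERE clause.
target = monitoring_target_date(date(2024, 3, 8))
partition_filter = f"dt = DATE '{target.isoformat()}'"  # dt = DATE '2024-03-05'
```

Passing `today` explicitly keeps the computation testable; in a scheduled run you would call it with no argument.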
-Ask the user: -- **What to monitor**: Total costs, cost by product/feature, cost efficiency, utilization, anomalies -- **Scope**: LTX API, LTX Studio, or both? Specific endpoint/org/process? -- **Time window**: Daily, weekly, monthly? How far back? -- **Analysis type**: Trends, comparisons (DoD/WoW), anomaly detection, breakdowns -- **Alert threshold** (if setting up alerts): Absolute ($X/day) or relative (spike > X% vs baseline) +## 2. Requirements (What?) -### 2. Read Shared Knowledge +Monitor these outcomes autonomously: -Before writing SQL: -- **`shared/product-context.md`** — LTX products and business context +- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown) +- [ ] Inference cost spikes (volume surge, costlier model, heavy customer) +- [ ] Idle-to-inference ratio degradation (utilization dropping) +- [ ] Failure rate spikes (wasted compute + service quality issue) +- [ ] Cost-per-request drift (model regression or resolution creep) +- [ ] Day-over-day cost jumps (early warning signals) +- [ ] Volume drops (potential outage) +- [ ] Overhead spikes (system anomalies) +- [ ] Alerts prioritized by tier (High/Medium/Low) +- [ ] Vertical-specific thresholds (API vs Studio) + +## 3. Progress Tracker + +* [ ] Read shared knowledge (schema, query templates, analysis patterns) +* [ ] Execute monitoring query for 3 days ago +* [ ] Calculate z-scores and classify severity (NOTICE/WARNING/CRITICAL) +* [ ] Identify root cause (endpoint, model, org, process) +* [ ] Present findings with statistical details and recommendations + +## 4. 
Implementation Plan + +### Phase 1: Read Shared Knowledge + +Before analyzing, read: - **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) -- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) -- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) - **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries - **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks -Key learnings: -- Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` -- Partitioned by `dt` (DATE) — always filter for performance -- `cost_category`: inference (requests), idle, overhead, unused -- Total cost = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only) -- For infrastructure cost: `SUM(row_cost)` across all categories +**Statistical Method**: Last 10 same-day-of-week comparison with 2σ threshold -### 3. Run Query Templates +**Alert logic**: `|current_value - μ| > 2σ` -**If user didn't specify anything:** Run all 11 query templates from `shared/gpu-cost-query-templates.md` to provide comprehensive cost overview. +Where: +- **μ (mean)** = average of last 10 same-day-of-week values +- **σ (stddev)** = standard deviation of last 10 same-day-of-week values +- **z-score** = (current - μ) / σ -**If user specified specific analysis:** Select appropriate template: +**Severity levels**: +- **NOTICE**: `2 < |z| ≤ 3` - Minor deviation, monitor +- **WARNING**: `3 < |z| ≤ 4.5` - Significant anomaly, investigate +- **CRITICAL**: `|z| > 4.5` - Extreme anomaly, immediate action -| User asks... | Use template | -|-------------|-------------| -| "What are total costs?" 
| Daily Total Cost (API + Studio) | -| "Break down API costs" | API Cost by Endpoint & Model | -| "Break down Studio costs" | Studio Cost by Process | -| "Yesterday vs day before" | Day-over-Day Comparison | -| "This week vs last week" | Week-over-Week Comparison | -| "Detect cost spikes" | Anomaly Detection (Z-Score) | -| "GPU utilization breakdown" | Utilization by Cost Category | -| "Cost efficiency by model" | Cost per Request by Model | -| "Which orgs cost most?" | API Cost by Organization | +### Phase 2: Execute Monitoring Query -See `shared/gpu-cost-query-templates.md` for all 11 query templates. +Run query for **3 days ago** (cost data needs time to finalize): -### 4. Execute Query - -Run query using: ```bash bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " - + " ``` -Or use BigQuery console with project `ltx-dwh-explore`. - -### 5. Analyze Results - -**For cost trends:** -- Compare current period vs baseline (7-day avg, prior week, prior month) -- Calculate % change and flag significant shifts (>15-20%) - -**For anomaly detection:** -- Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg) -- Investigate root cause: specific endpoint/model/org, error rate spike, billing type change +✅ **PREFERRED: Run all 11 templates for comprehensive overview** -**For breakdowns:** -- Identify top cost drivers (endpoint, model, org, process) -- Calculate cost per request to spot efficiency issues -- Check failure costs (wasted spend on errors) +### Phase 3: Analyze and Present Results -### 6. 
Present Findings +Format findings with: +- **Summary**: Key finding (e.g., "Idle costs spiked 45% on March 5") +- **Severity**: NOTICE / WARNING / CRITICAL +- **Z-score**: Statistical deviation from baseline +- **Root cause**: Top contributors by endpoint/model/org/process +- **Recommendation**: Action to take, team to notify -Format results with: -- **Summary**: Key finding (e.g., "GPU costs spiked 45% yesterday") -- **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%") -- **Breakdown**: Top contributors by dimension -- **Recommendation**: Action to take (investigate org X, optimize model Y, alert team) +## 5. Constraints & Done -### 7. Set Up Alert (if requested) +### DO NOT -For ongoing monitoring: -1. Save SQL query -2. Set up in BigQuery scheduled query or Hex Thread -3. Configure notification threshold -4. Route alerts to Slack channel or Linear issue +- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize) +- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting +- **DO NOT** skip partition filtering — always filter on `dt` -## Schema Reference +### DO -For detailed table schema including all dimensions, columns, and cost calculations, see `references/schema-reference.md`. 
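The alert logic this skill describes — z-score against the last 10 same-day-of-week values, with NOTICE/WARNING/CRITICAL bands — can be sketched as follows. This is a hedged illustration: the function name and the toy cost history are invented, only the thresholds come from the skill:

```python
import statistics

def classify_anomaly(current: float, history: list[float]) -> tuple[float, str]:
    """Z-score of `current` against its baseline, plus a severity tier.

    `history` should hold the last 10 same-day-of-week values; the bands
    mirror the skill's NOTICE / WARNING / CRITICAL thresholds.
    """
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)  # sample stddev; assumes sigma > 0
    z = (current - mu) / sigma
    if abs(z) > 4.5:
        severity = "CRITICAL"
    elif abs(z) > 3:
        severity = "WARNING"
    elif abs(z) > 2:
        severity = "NOTICE"
    else:
        severity = "OK"
    return z, severity

# Ten prior same-weekday idle costs (toy numbers), then a spike:
history = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
z, severity = classify_anomaly(130, history)  # z = 15.0, severity = "CRITICAL"
```

Anything at or below 2σ falls through to "OK", which is how expected day-to-day noise is suppressed without manual threshold tuning.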
- -## Reference Files - -| File | Read when | -|------|-----------| -| `references/schema-reference.md` | GPU cost table dimensions, columns, and cost calculations | -| `shared/bq-schema.md` | Understanding GPU cost table schema (lines 418-615) | -| `shared/metric-standards.md` | GPU cost metric SQL patterns (section 13) | -| `shared/gpu-cost-query-templates.md` | Selecting query template for analysis (11 production-ready queries) | -| `shared/gpu-cost-analysis-patterns.md` | Interpreting results, workflows, benchmarks, investigation playbooks | - -## Rules - -### Query Best Practices - -- **DO** always filter on `dt` partition column for performance +- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as target date - **DO** filter `cost_category = 'inference'` for request-level analysis -- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis -- **DO** include LT team requests only when analyzing total infrastructure spend or debugging -- **DO** use `ltx-dwh-explore` as execution project -- **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero -- **DO** compare against baseline (7-day avg, prior period) for trends -- **DO** round cost values to 2 decimal places for readability - -### Cost Calculation - +- **DO** exclude LT team with `is_lt_team IS FALSE` for customer-facing cost +- **DO** use execution project `ltx-dwh-explore` - **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request -- **DO** use `SUM(row_cost)` across all rows for total infrastructure cost -- **DO NOT** sum row_cost + attributed_* across all cost_categories (double-counting) -- **DO NOT** mix inference and non-inference rows in same aggregation without filtering - -### Analysis - -- **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 std devs) -- **DO** investigate failure costs (wasted spend on errors) -- **DO** break down by 
endpoint/model for API, by process for Studio -- **DO** check cost per request trends to spot efficiency degradation -- **DO** validate results against total infrastructure spend +- **DO** flag anomalies with |z-score| > 2 (the 2σ statistical threshold) +- **DO** route alerts: API team (API costs), Studio team (Studio costs), Engineering (infrastructure) -### Alerts +### Completion Criteria -- **DO** set thresholds based on historical baseline, not absolute values -- **DO** alert engineering team for cost spikes > 30% vs baseline -- **DO** include cost breakdown and root cause in alerts -- **DO** route API cost alerts to API team, Studio alerts to Studio team +✅ Query executed for the target date (3 days ago) with partition filtering +✅ Statistical analysis applied (2σ threshold) +✅ Alerts classified by severity (NOTICE/WARNING/CRITICAL) +✅ Root cause identified (endpoint/model/org/process) +✅ Results presented with cost breakdown and z-scores
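The two cost formulas in the rules above — fully loaded cost (`row_cost + attributed_idle_cost + attributed_overhead_cost`, inference rows only) versus total infrastructure cost (`SUM(row_cost)` across all categories) — and the double-counting pitfall can be illustrated with toy rows. The column names follow the `gpu_request_attribution_and_cost` schema as described in this skill; the numbers are invented:

```python
# Toy rows mimicking gpu_request_attribution_and_cost; values are invented.
rows = [
    {"cost_category": "inference", "row_cost": 10.0, "attributed_idle_cost": 2.0, "attributed_overhead_cost": 1.0},
    {"cost_category": "inference", "row_cost": 8.0,  "attributed_idle_cost": 1.5, "attributed_overhead_cost": 0.5},
    {"cost_category": "idle",      "row_cost": 3.5,  "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
    {"cost_category": "overhead",  "row_cost": 1.5,  "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
]

# Fully loaded cost: sum all three cost columns, inference rows ONLY.
fully_loaded = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows
    if r["cost_category"] == "inference"
)

# Total infrastructure cost: SUM(row_cost) across ALL categories.
infrastructure = sum(r["row_cost"] for r in rows)

# The DO NOT case: summing attributed_* across all categories double-counts,
# because idle/overhead spend already appears once as its own row_cost and
# again as the attributed_* columns on inference rows.
double_counted = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows
)
```

With fully attributed idle/overhead, `fully_loaded` equals `infrastructure` (23.0 here), while the forbidden aggregation inflates the total to 28.0.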