Commit db9816d

feat: add production GPU cost thresholds and 3-day lookback

Major updates:
- Add production thresholds from 60-day statistical analysis
- Tier-based alerting: Tier 1 (High), Tier 2 (Medium), Tier 3 (Low)
- Change timing: analyze data from 3 days ago (cost data needs time to finalize)
- Restructure to 6-part Agent Skills format
- Add per-vertical thresholds (API vs Studio)

Tier 1 alerts:
- Idle cost spike > $15,600/day
- Inference cost spike > $5,743/day
- Idle-to-inference ratio > 4:1

Benefits:
- Data-driven thresholds (not guesses)
- Prioritized alerts by tier
- Proper timing (3-day lookback)
- Clear structure (6-part spec)

1 parent fdd2727 commit db9816d

File tree

1 file changed: +139 −72 lines changed

agents/monitoring/be-cost/SKILL.md

Lines changed: 139 additions & 72 deletions
---
name: be-cost-monitoring
description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org."
tags: [monitoring, costs, gpu, infrastructure, alerts]
compatibility:
  - BigQuery access (ltx-dwh-prod-processed)

# Backend Cost Monitoring

## 1. Overview (Why?)

LTX GPU infrastructure spending is dominated by idle costs (~72% of total), with actual inference representing only ~27%. Cost patterns vary significantly by vertical (API vs Studio) and day of week, so generic alerting thresholds miss true anomalies while flagging normal variance.

This skill provides **autonomous cost monitoring** with data-driven thresholds derived from a 60-day statistical analysis. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns.

**Problem solved**: detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms.

**Critical timing**: analyze data from **3 days ago** (cost data needs time to finalize).

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown)
- [ ] Inference cost spikes (volume surge, costlier model, heavy customer)
- [ ] Idle-to-inference ratio degradation (utilization dropping)
- [ ] Failure rate spikes (wasted compute + service quality issue)
- [ ] Cost-per-request drift (model regression or resolution creep)
- [ ] Day-over-day cost jumps (early warning signals)
- [ ] Volume drops (potential outage)
- [ ] Overhead spikes (system anomalies)
- [ ] Alerts prioritized by tier (High/Medium/Low)
- [ ] Vertical-specific thresholds (API vs Studio)

## 3. Progress Tracker

* [ ] Read production thresholds and shared knowledge
* [ ] Select appropriate query template(s)
* [ ] Execute query for 3 days ago
* [ ] Analyze results by tier priority
* [ ] Identify root cause (endpoint, model, org, process)
* [ ] Present findings with cost breakdown
* [ ] Route alerts to appropriate team (API/Studio/Engineering)

## 4. Implementation Plan

### Phase 1: Read Production Thresholds

Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`:

#### Tier 1 — High Priority (daily monitoring)

| Alert | Threshold | Signal |
|-------|-----------|--------|
| **Idle cost spike** | > $15,600/day | Autoscaler over-provisioning or traffic drop without scaledown |
| **Inference cost spike** | > $5,743/day | Volume surge, costlier model, or heavy new customer |
| **Idle-to-inference ratio** | > 4:1 | GPU utilization degrading (baseline ~2.7:1) |

#### Tier 2 — Medium Priority (daily monitoring)

| Alert | Threshold | Signal |
|-------|-----------|--------|
| **Failure rate spike** | > 20.4% overall | Wasted compute + service quality issue |
| **Cost-per-request drift** | API > $0.70, Studio > $0.46 | Model regression or duration/resolution creep |
| **DoD cost jump** | API > 30%, Studio > 15% | Early warning before absolute thresholds breach |

#### Tier 3 — Low Priority (weekly review)

| Alert | Threshold | Signal |
|-------|-----------|--------|
| **Volume drop** | < 18,936 requests/day | Possible outage or upstream feed issue |
| **Overhead spike** | > $138/day | System overhead anomaly (review weekly) |

**Per-Vertical Thresholds:**

- **LTX API**: Total daily > $5,555, Failure rate > 5.7%, DoD change > 30%
- **LTX Studio**: Total daily > $11,928, Failure rate > 22.6%, DoD change > 15%

**Alert logic**: Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold
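To make the rule concrete, here is a minimal Python sketch of the alert logic. The avg+2σ/avg+3σ computation and the Tier 1 numbers come from the tables above; the function and constant names are illustrative, not part of the production pipeline.

```python
import statistics

# WARNING = avg + 2*sigma, CRITICAL = avg + 3*sigma over a trailing window of daily costs.
def alert_thresholds(daily_costs: list[float]) -> tuple[float, float]:
    avg = statistics.mean(daily_costs)
    sd = statistics.pstdev(daily_costs)
    return avg + 2 * sd, avg + 3 * sd

# Tier 1 absolute thresholds from the 60-day analysis (see table above).
def tier1_alerts(idle_cost: float, inference_cost: float) -> list[str]:
    alerts = []
    if idle_cost > 15_600:
        alerts.append("idle cost spike")
    if inference_cost > 5_743:
        alerts.append("inference cost spike")
    if inference_cost > 0 and idle_cost / inference_cost > 4.0:
        alerts.append("idle-to-inference ratio degraded")
    return alerts
```

For example, a day with $16,000 idle and $3,000 inference cost trips both the idle spike and the ratio check (16,000 / 3,000 ≈ 5.3:1).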

### Phase 2: Read Shared Knowledge

Before writing SQL, read:
- **`shared/product-context.md`** — LTX products, user types, business model
- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615)
- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13)
- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries
- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks
- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs)

**Data nuances**:
- Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `dt` (DATE) — always filter on it for performance
- **Target date**: `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` (cost data from 3 days ago)
- Cost categories: inference, idle, overhead, unused
- Total cost per request = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only)
- Infrastructure cost = `SUM(row_cost)` across all categories
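The two formulas above are easy to conflate. A toy sketch of the distinction, with column and category names matching the schema and row values invented for illustration:

```python
# Sketch: fully loaded cost per request vs. total infrastructure cost.
# Rows mimic gpu_request_attribution_and_cost; the numbers are made up.
rows = [
    {"cost_category": "inference", "row_cost": 1.00, "attributed_idle_cost": 2.50, "attributed_overhead_cost": 0.05},
    {"cost_category": "inference", "row_cost": 0.80, "attributed_idle_cost": 2.00, "attributed_overhead_cost": 0.04},
    {"cost_category": "idle",      "row_cost": 4.50, "attributed_idle_cost": 0.00, "attributed_overhead_cost": 0.00},
    {"cost_category": "overhead",  "row_cost": 0.09, "attributed_idle_cost": 0.00, "attributed_overhead_cost": 0.00},
]

# Fully loaded cost: sum all three columns, but ONLY over inference rows.
# Summing attributed_* across every category would double-count idle/overhead.
fully_loaded = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows
    if r["cost_category"] == "inference"
)

# Infrastructure cost: SUM(row_cost) across ALL categories.
infra = sum(r["row_cost"] for r in rows)
```

When the idle/overhead attribution is complete, the two totals agree (here both come to 6.39), which is why the spec says to validate results against total infrastructure spend.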

### Phase 3: Select Query Template

**Preferred**: if the user didn't specify an analysis, run all 11 templates from `shared/gpu-cost-query-templates.md` for a comprehensive cost overview.

**If the user asked for a specific analysis**, select the matching template:

| User asks... | Use template |
|-------------|-------------|
| "Cost efficiency by model" | Cost per Request by Model |
| "Which orgs cost most?" | API Cost by Organization |

### Phase 4: Execute Query

Run the query using:

```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<SQL from the selected template>
"
```

Or use the BigQuery console with project `ltx-dwh-explore`.

**Critical**: Always filter on `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`

### Phase 5: Analyze Results

**For cost trends**:
- Compare current period vs baseline (7-day avg, prior week, prior month)
- Calculate % change and flag significant shifts (>15-20%)
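A minimal sketch of that comparison; the 15% default is the lower end of the range above, and the function names are illustrative:

```python
# Sketch: flag a significant shift vs a baseline (e.g. the 7-day average).
def pct_change(current: float, baseline: float) -> float:
    return 100.0 * (current - baseline) / baseline

def significant_shift(current: float, baseline: float, cutoff_pct: float = 15.0) -> bool:
    return abs(pct_change(current, baseline)) > cutoff_pct
```

For example, $14,000 against a $12,000 baseline is a +16.7% shift and would be flagged.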

**For anomaly detection**:
- Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg)
- Investigate root cause: specific endpoint/model/org, error rate spike, billing type change
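The Z-score flag can be sketched as follows (a rolling window of recent daily values is assumed as input; `stdev` is the sample standard deviation):

```python
import statistics

# Sketch: flag a day whose cost/volume deviates > z_cutoff std devs
# from the rolling window.
def is_anomaly(value: float, window: list[float], z_cutoff: float = 2.0) -> bool:
    mean = statistics.mean(window)
    sd = statistics.stdev(window)
    if sd == 0:
        return value != mean  # flat window: any deviation is anomalous
    return abs(value - mean) / sd > z_cutoff
```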

**For breakdowns**:
- Identify top cost drivers (endpoint, model, org, process)
- Calculate cost per request to spot efficiency issues
- Check failure costs (wasted spend on errors)

### Phase 6: Present Findings

Format results with:
- **Summary**: Key finding (e.g., "GPU costs spiked 45% on the target date, 3 days ago")
- **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%")
- **Breakdown**: Top contributors by dimension (endpoint, model, org, process)
- **Recommendation**: Action to take (investigate org X, optimize model Y, alert team)
- **Priority tier**: Tier 1 (High), Tier 2 (Medium), or Tier 3 (Low)

### Phase 7: Set Up Alert (Optional)

For ongoing monitoring:
1. Save the SQL query
2. Set it up as a BigQuery scheduled query or Hex Thread
3. Configure notification thresholds by tier
4. Route alerts to the appropriate team:
   - API cost spikes → API team
   - Studio cost spikes → Studio team
   - Infrastructure issues → Engineering team
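The routing in step 4 can be table-driven; a hypothetical sketch (the team names come from the list above, but the category keys and the default-to-Engineering behavior are assumptions):

```python
# Sketch: route an alert to a team by alert category.
ROUTES = {
    "api": "API team",
    "studio": "Studio team",
    "infrastructure": "Engineering team",
}

def route_alert(category: str) -> str:
    # Unknown categories default to Engineering for triage (assumption).
    return ROUTES.get(category.lower(), "Engineering team")
```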

## 5. Context & References

### Production Thresholds

- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`** — Production thresholds for the LTX division (60-day analysis)

### Shared Knowledge

- **`shared/product-context.md`** — LTX products and business context
- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615)
- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13)
- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries
- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows, benchmarks, investigation playbooks
- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs)

### Data Source

Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `dt` (DATE)
- Division: `LTX` (LTX API + LTX Studio)
- Cost categories: inference (27%), idle (72%), overhead (0.3%)
- Key columns: `cost_category`, `row_cost`, `attributed_idle_cost`, `attributed_overhead_cost`, `result`, `endpoint`, `model_type`, `org_name`, `process_name`

### Query Templates

See `shared/gpu-cost-query-templates.md` for the 11 production-ready queries.

## 6. Constraints & Done

### DO NOT

- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize)
- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting
- **DO NOT** mix inference and non-inference rows in the same aggregation without filtering
- **DO NOT** use ad-hoc absolute thresholds — compare against the statistical baseline (avg+2σ / avg+3σ)
- **DO NOT** skip partition filtering — always filter on `dt` for performance

### DO

- **DO** always filter on the `dt` partition column for performance
- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date
- **DO** filter `cost_category = 'inference'` for request-level analysis
- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis
- **DO** include LT team requests only when analyzing total infrastructure spend or debugging
- **DO** use `ltx-dwh-explore` as the execution project
- **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero
- **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request
- **DO** use `SUM(row_cost)` across all rows for total infrastructure cost
- **DO** compare against a baseline (7-day avg, prior period) for trends
- **DO** round cost values to 2 decimal places for readability
- **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 std devs)
- **DO** investigate failure costs (wasted spend on errors)
- **DO** break down by endpoint/model for API, by process for Studio
- **DO** check cost-per-request trends to spot efficiency degradation
- **DO** validate results against total infrastructure spend
- **DO** set thresholds based on the historical baseline, not arbitrary absolute values
- **DO** alert the engineering team for cost spikes > 30% vs baseline
- **DO** include cost breakdown and root cause in alerts
- **DO** route API cost alerts to the API team, Studio alerts to the Studio team

### Completion Criteria

✅ All cost categories analyzed (idle, inference, overhead)
✅ Production thresholds applied by tier (High/Medium/Low)
✅ Alerts prioritized and routed to appropriate teams
✅ Root cause investigation completed (endpoint/model/org/process)
✅ Findings presented with cost breakdown and recommendations
✅ Query uses 3-day lookback (not yesterday)
✅ Partition filtering applied for performance
