189 changes: 69 additions & 120 deletions agents/monitoring/be-cost/SKILL.md
---
name: be-cost-monitoring
description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org."
tags: [monitoring, costs, gpu, infrastructure, alerts]
compatibility:
- BigQuery access (ltx-dwh-prod-processed)
---

# Backend Cost Monitoring

## 1. Overview (Why?)

This skill provides **autonomous cost monitoring** using statistical anomaly detection. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns.

**Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms.

**Critical timing**: Analyzes data from **3 days ago** (cost data needs time to finalize).

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown)
- [ ] Inference cost spikes (volume surge, costlier model, heavy customer)
- [ ] Idle-to-inference ratio degradation (utilization dropping)
- [ ] Failure rate spikes (wasted compute + service quality issue)
- [ ] Cost-per-request drift (model regression or resolution creep)
- [ ] Day-over-day cost jumps (early warning signals)
- [ ] Volume drops (potential outage)
- [ ] Overhead spikes (system anomalies)
- [ ] Alerts prioritized by tier (High/Medium/Low)
- [ ] Vertical-specific thresholds (API vs Studio)

## 3. Progress Tracker

- [ ] Read shared knowledge (schema, query templates, analysis patterns)
- [ ] Execute monitoring query for 3 days ago
- [ ] Calculate z-scores and classify severity (NOTICE/WARNING/CRITICAL)
- [ ] Identify root cause (endpoint, model, org, process)
- [ ] Present findings with statistical details and recommendations

## 4. Implementation Plan

### Phase 1: Read Shared Knowledge

Before analyzing, read:
- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615)
- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13)
- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs)
- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries
- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks

**Statistical method**: compare the current value against the last 10 same-day-of-week values, alerting beyond a 2σ threshold

**Alert logic**: `|current_value - μ| > 2σ`

Where:
- **μ (mean)** = average of last 10 same-day-of-week values
- **σ (stddev)** = standard deviation of last 10 same-day-of-week values
- **z-score** = (current - μ) / σ

**Severity levels**:
- **NOTICE**: `2 < |z| ≤ 3` - Minor deviation, monitor
- **WARNING**: `3 < |z| ≤ 4.5` - Significant anomaly, investigate
- **CRITICAL**: `|z| > 4.5` - Extreme anomaly, immediate action
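The alert logic and severity bands above can be sketched in Python. This is a minimal illustration, not part of the skill's tooling; the function name and return shape are assumptions, and sample standard deviation is used (population stddev is an equally valid reading of "stddev of the last 10 values"):

```python
import statistics

def classify_anomaly(history, current):
    """Classify a metric value against its same-day-of-week baseline.

    history: the last 10 same-day-of-week values (current day excluded).
    Returns (z_score, severity); severity is None when |z| <= 2.
    """
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)  # sample stddev of the baseline
    z = (current - mu) / sigma
    if abs(z) > 4.5:
        severity = "CRITICAL"   # extreme anomaly, immediate action
    elif abs(z) > 3:
        severity = "WARNING"    # significant anomaly, investigate
    elif abs(z) > 2:
        severity = "NOTICE"     # minor deviation, monitor
    else:
        severity = None         # within normal variation
    return z, severity
```

In practice this classification runs on the query output (daily cost, volume, idle ratio), one metric at a time.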

### Phase 2: Execute Monitoring Query

See `shared/gpu-cost-query-templates.md` for all 11 query templates.
Run query for **3 days ago** (cost data needs time to finalize):

```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<query from gpu-cost-query-templates.md>
"
```

Or use BigQuery console with project `ltx-dwh-explore`.
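The date arithmetic behind the query — a target of 3 days ago plus the 10 prior same-day-of-week baseline dates — can be sketched as follows; the helper name is illustrative and not part of the skill's tooling:

```python
from datetime import date, timedelta

def monitoring_dates(today=None):
    """Return (target, baseline): the analysis date (3 days ago) and the
    10 prior same-day-of-week dates used for the 2-sigma comparison."""
    today = today or date.today()
    target = today - timedelta(days=3)
    # Stepping back in whole weeks keeps the weekday fixed.
    baseline = [target - timedelta(weeks=w) for w in range(1, 11)]
    return target, baseline
```

These dates can then be interpolated into the `dt` partition filter of the query template.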

✅ **PREFERRED: Run all 11 templates for comprehensive overview**

### Phase 3: Analyze and Present Results

Format findings with:
- **Summary**: Key finding (e.g., "Idle costs spiked 45% on March 5")
- **Severity**: NOTICE / WARNING / CRITICAL
- **Z-score**: Statistical deviation from baseline
- **Root cause**: Top contributors by endpoint/model/org/process
- **Recommendation**: Action to take, team to notify

## 5. Constraints & Done

### DO NOT

- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize)
- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting
- **DO NOT** skip partition filtering — always filter on `dt`

### DO

- **DO** exclude LT team with `is_lt_team IS FALSE` for customer-facing cost
- **DO** use execution project `ltx-dwh-explore`
- **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request
- **DO** use `SUM(row_cost)` across all rows for total infrastructure cost
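A toy illustration of why the two cost formulas must not be mixed; the rows and numbers are made up, and only the column names follow the GPU cost table schema:

```python
# Toy rows mimicking gpu_request_attribution_and_cost (made-up numbers).
ROWS = [
    {"cost_category": "inference", "row_cost": 10.0,
     "attributed_idle_cost": 2.0, "attributed_overhead_cost": 1.0},
    {"cost_category": "idle", "row_cost": 2.0,
     "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
    {"cost_category": "overhead", "row_cost": 1.0,
     "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
]

def fully_loaded_cost(rows):
    # Attributed columns already fold idle/overhead spend into
    # inference rows, so only inference rows are summed.
    return sum(r["row_cost"] + r["attributed_idle_cost"]
               + r["attributed_overhead_cost"]
               for r in rows if r["cost_category"] == "inference")

def infrastructure_cost(rows):
    # Raw spend: row_cost across every category, no attributed columns.
    return sum(r["row_cost"] for r in rows)

def naive_double_count(rows):
    # WRONG: adds attributed idle/overhead on top of the idle/overhead
    # rows they were derived from.
    return sum(r["row_cost"] + r["attributed_idle_cost"]
               + r["attributed_overhead_cost"] for r in rows)
```

With attribution that exactly redistributes idle and overhead, the two correct formulas agree, while the naive sum inflates the total.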
- **DO** flag anomalies with Z-score > 2 (statistical threshold)
- **DO** route alerts: API team (API costs), Studio team (Studio costs), Engineering (infrastructure)

### Completion Criteria

✅ Query executed for 3 days ago with partition filtering
✅ Statistical analysis applied (2σ threshold)
✅ Alerts classified by severity (NOTICE/WARNING/CRITICAL)
✅ Root cause identified (endpoint/model/org/process)
✅ Results presented with cost breakdown and z-scores