Commit 4f9ee74

Merge BE cost monitoring agent (PR #43)

2 parents 9a15908 + a77114e

1 file changed: +69 −120 lines

agents/monitoring/be-cost/SKILL.md

Lines changed: 69 additions & 120 deletions
@@ -1,6 +1,6 @@
 ---
 name: be-cost-monitoring
-description: Monitor and analyze backend GPU costs for LTX API and LTX Studio. Use when analyzing cost trends, detecting anomalies, breaking down costs by endpoint/model/org/process, monitoring utilization, or investigating cost efficiency drift.
+description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org."
 tags: [monitoring, costs, gpu, infrastructure, alerts]
 compatibility:
 - BigQuery access (ltx-dwh-prod-processed)
@@ -9,154 +9,103 @@ compatibility:

 # Backend Cost Monitoring

-## When to use
+## 1. Overview (Why?)

-- "Monitor GPU costs"
-- "Analyze cost trends (daily/weekly/monthly)"
-- "Detect cost anomalies or spikes"
-- "Break down API costs by endpoint/model/org"
-- "Break down Studio costs by process/workspace"
-- "Compare API vs Studio cost distribution"
-- "Investigate cost-per-request efficiency drift"
-- "Monitor GPU utilization and idle costs"
-- "Alert on cost budget breaches"
-- "Day-over-day or week-over-week cost comparisons"
+This skill provides **autonomous cost monitoring** using statistical anomaly detection. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns.

-## Steps
+**Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms.

-### 1. Gather Requirements
+**Critical timing**: Analyzes data from **3 days ago** (cost data needs time to finalize).

-Ask the user:
-- **What to monitor**: Total costs, cost by product/feature, cost efficiency, utilization, anomalies
-- **Scope**: LTX API, LTX Studio, or both? Specific endpoint/org/process?
-- **Time window**: Daily, weekly, monthly? How far back?
-- **Analysis type**: Trends, comparisons (DoD/WoW), anomaly detection, breakdowns
-- **Alert threshold** (if setting up alerts): Absolute ($X/day) or relative (spike > X% vs baseline)
+## 2. Requirements (What?)

-### 2. Read Shared Knowledge
+Monitor these outcomes autonomously:

-Before writing SQL:
-- **`shared/product-context.md`** — LTX products and business context
+- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown)
+- [ ] Inference cost spikes (volume surge, costlier model, heavy customer)
+- [ ] Idle-to-inference ratio degradation (utilization dropping)
+- [ ] Failure rate spikes (wasted compute + service quality issue)
+- [ ] Cost-per-request drift (model regression or resolution creep)
+- [ ] Day-over-day cost jumps (early warning signals)
+- [ ] Volume drops (potential outage)
+- [ ] Overhead spikes (system anomalies)
+- [ ] Alerts prioritized by tier (High/Medium/Low)
+- [ ] Vertical-specific thresholds (API vs Studio)
+
+## 3. Progress Tracker
+
+* [ ] Read shared knowledge (schema, query templates, analysis patterns)
+* [ ] Execute monitoring query for 3 days ago
+* [ ] Calculate z-scores and classify severity (NOTICE/WARNING/CRITICAL)
+* [ ] Identify root cause (endpoint, model, org, process)
+* [ ] Present findings with statistical details and recommendations
+
+## 4. Implementation Plan
+
+### Phase 1: Read Shared Knowledge
+
+Before analyzing, read:
 - **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615)
-- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13)
-- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs)
 - **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries
 - **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks

-Key learnings:
-- Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
-- Partitioned by `dt` (DATE) — always filter for performance
-- `cost_category`: inference (requests), idle, overhead, unused
-- Total cost = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only)
-- For infrastructure cost: `SUM(row_cost)` across all categories
+**Statistical Method**: Last 10 same-day-of-week comparison with 2σ threshold

-### 3. Run Query Templates
+**Alert logic**: `|current_value - μ| > 2σ`

-**If user didn't specify anything:** Run all 11 query templates from `shared/gpu-cost-query-templates.md` to provide comprehensive cost overview.
+Where:
+- **μ (mean)** = average of last 10 same-day-of-week values
+- **σ (stddev)** = standard deviation of last 10 same-day-of-week values
+- **z-score** = (current - μ) / σ

-**If user specified specific analysis:** Select appropriate template:
+**Severity levels**:
+- **NOTICE**: `2 < |z| ≤ 3` - Minor deviation, monitor
+- **WARNING**: `3 < |z| ≤ 4.5` - Significant anomaly, investigate
+- **CRITICAL**: `|z| > 4.5` - Extreme anomaly, immediate action
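As an aside, the z-score and severity logic the new SKILL.md introduces can be sketched in a few lines of Python. This is an illustration only, not code from the repository; the `classify` helper and its signature are hypothetical, and it uses the sample standard deviation, which the skill text does not specify.

```python
from statistics import mean, stdev

def classify(current_cost, baseline):
    """Classify a day's cost against the last 10 same-day-of-week values.

    Returns (z_score, severity); severity is None when |z| <= 2,
    mirroring the NOTICE/WARNING/CRITICAL bands above.
    """
    mu = mean(baseline)       # μ: average of the baseline window
    sigma = stdev(baseline)   # σ: spread of the baseline window
    z = (current_cost - mu) / sigma
    if abs(z) > 4.5:
        return z, "CRITICAL"
    if abs(z) > 3:
        return z, "WARNING"
    if abs(z) > 2:
        return z, "NOTICE"
    return z, None
```

For example, with a baseline hovering around $100/day, a $110 day lands deep in CRITICAL territory while a $104 day is only a NOTICE.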

-| User asks... | Use template |
-|-------------|-------------|
-| "What are total costs?" | Daily Total Cost (API + Studio) |
-| "Break down API costs" | API Cost by Endpoint & Model |
-| "Break down Studio costs" | Studio Cost by Process |
-| "Yesterday vs day before" | Day-over-Day Comparison |
-| "This week vs last week" | Week-over-Week Comparison |
-| "Detect cost spikes" | Anomaly Detection (Z-Score) |
-| "GPU utilization breakdown" | Utilization by Cost Category |
-| "Cost efficiency by model" | Cost per Request by Model |
-| "Which orgs cost most?" | API Cost by Organization |
+### Phase 2: Execute Monitoring Query

-See `shared/gpu-cost-query-templates.md` for all 11 query templates.
+Run query for **3 days ago** (cost data needs time to finalize):

-### 4. Execute Query
-
-Run query using:
 ```bash
 bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
-<query>
+<query from gpu-cost-query-templates.md>
 "
 ```
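As an aside, the "3 days ago" target (the SQL equivalent in the new rules is `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`) combined with the last-10-same-weekday baseline window can be sketched in Python. An illustration only; the `monitoring_dates` helper is hypothetical and computes dates, not the skill's actual query.

```python
from datetime import date, timedelta

def monitoring_dates(today=None):
    """Target date is 3 days back (cost data needs time to finalize);
    the baseline is the 10 prior dates falling on the same weekday."""
    today = today or date.today()
    target = today - timedelta(days=3)
    # Stepping back whole weeks preserves the day of week.
    baseline = [target - timedelta(weeks=w) for w in range(1, 11)]
    return target, baseline
```

Running from a Friday, for instance, yields a Tuesday target and ten prior Tuesdays as the comparison window.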

-Or use BigQuery console with project `ltx-dwh-explore`.
-
-### 5. Analyze Results
-
-**For cost trends:**
-- Compare current period vs baseline (7-day avg, prior week, prior month)
-- Calculate % change and flag significant shifts (>15-20%)
-
-**For anomaly detection:**
-- Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg)
-- Investigate root cause: specific endpoint/model/org, error rate spike, billing type change
+**PREFERRED: Run all 11 templates for comprehensive overview**

-**For breakdowns:**
-- Identify top cost drivers (endpoint, model, org, process)
-- Calculate cost per request to spot efficiency issues
-- Check failure costs (wasted spend on errors)
+### Phase 3: Analyze and Present Results

-### 6. Present Findings
+Format findings with:
+- **Summary**: Key finding (e.g., "Idle costs spiked 45% on March 5")
+- **Severity**: NOTICE / WARNING / CRITICAL
+- **Z-score**: Statistical deviation from baseline
+- **Root cause**: Top contributors by endpoint/model/org/process
+- **Recommendation**: Action to take, team to notify

-Format results with:
-- **Summary**: Key finding (e.g., "GPU costs spiked 45% yesterday")
-- **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%")
-- **Breakdown**: Top contributors by dimension
-- **Recommendation**: Action to take (investigate org X, optimize model Y, alert team)
+## 5. Constraints & Done

-### 7. Set Up Alert (if requested)
+### DO NOT

-For ongoing monitoring:
-1. Save SQL query
-2. Set up in BigQuery scheduled query or Hex Thread
-3. Configure notification threshold
-4. Route alerts to Slack channel or Linear issue
+- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize)
+- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting
+- **DO NOT** skip partition filtering — always filter on `dt`
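As an aside, the double-counting rule follows from how the table attributes idle and overhead spend back into inference rows (per the removed "Key learnings" lines: total cost is the three columns summed on inference rows only; infrastructure cost is `SUM(row_cost)` over all categories). A small sketch with hypothetical rows, not the skill's actual query:

```python
# Hypothetical rows mirroring the cost columns described in the diff:
# cost_category, row_cost, attributed_idle_cost, attributed_overhead_cost.
rows = [
    {"cost_category": "inference", "row_cost": 10.0,
     "attributed_idle_cost": 2.0, "attributed_overhead_cost": 1.0},
    {"cost_category": "idle", "row_cost": 2.0,
     "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
    {"cost_category": "overhead", "row_cost": 1.0,
     "attributed_idle_cost": 0.0, "attributed_overhead_cost": 0.0},
]

# Total infrastructure cost: SUM(row_cost) across all categories.
total_infra = sum(r["row_cost"] for r in rows)

# Fully loaded cost: all three columns, inference rows only.
# Matches total_infra, because idle and overhead spend are already
# attributed into the inference rows.
fully_loaded = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows if r["cost_category"] == "inference"
)

# WRONG: summing all three columns across every category counts the
# idle and overhead spend twice.
double_counted = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows
)
```

Here `total_infra` and `fully_loaded` both come to 13.0, while the incorrect aggregation inflates the figure to 16.0.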

-## Schema Reference
+### DO

-For detailed table schema including all dimensions, columns, and cost calculations, see `references/schema-reference.md`.
-
-## Reference Files
-
-| File | Read when |
-|------|-----------|
-| `references/schema-reference.md` | GPU cost table dimensions, columns, and cost calculations |
-| `shared/bq-schema.md` | Understanding GPU cost table schema (lines 418-615) |
-| `shared/metric-standards.md` | GPU cost metric SQL patterns (section 13) |
-| `shared/gpu-cost-query-templates.md` | Selecting query template for analysis (11 production-ready queries) |
-| `shared/gpu-cost-analysis-patterns.md` | Interpreting results, workflows, benchmarks, investigation playbooks |
-
-## Rules
-
-### Query Best Practices
-
-- **DO** always filter on `dt` partition column for performance
+- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as target date
 - **DO** filter `cost_category = 'inference'` for request-level analysis
-- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis
-- **DO** include LT team requests only when analyzing total infrastructure spend or debugging
-- **DO** use `ltx-dwh-explore` as execution project
-- **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero
-- **DO** compare against baseline (7-day avg, prior period) for trends
-- **DO** round cost values to 2 decimal places for readability
-
-### Cost Calculation
-
+- **DO** exclude LT team with `is_lt_team IS FALSE` for customer-facing cost
+- **DO** use execution project `ltx-dwh-explore`
 - **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request
-- **DO** use `SUM(row_cost)` across all rows for total infrastructure cost
-- **DO NOT** sum row_cost + attributed_* across all cost_categories (double-counting)
-- **DO NOT** mix inference and non-inference rows in same aggregation without filtering
-
-### Analysis
-
-- **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 std devs)
-- **DO** investigate failure costs (wasted spend on errors)
-- **DO** break down by endpoint/model for API, by process for Studio
-- **DO** check cost per request trends to spot efficiency degradation
-- **DO** validate results against total infrastructure spend
+- **DO** flag anomalies with Z-score > 2 (statistical threshold)
+- **DO** route alerts: API team (API costs), Studio team (Studio costs), Engineering (infrastructure)
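As an aside, the removed rules mention computing cost per request with `SAFE_DIVIDE` and rounding to 2 decimal places. That pattern can be sketched in Python; an illustration only, where `safe_divide` is a hypothetical stand-in for BigQuery's `SAFE_DIVIDE`, which returns NULL instead of erroring on division by zero.

```python
def safe_divide(numerator, denominator):
    """Mimic BigQuery SAFE_DIVIDE: return None instead of raising on /0."""
    return None if denominator == 0 else numerator / denominator

def cost_per_request(row_cost, attributed_idle, attributed_overhead, requests):
    """Fully loaded cost per request, rounded to 2 decimals for readability."""
    fully_loaded = row_cost + attributed_idle + attributed_overhead
    per_request = safe_divide(fully_loaded, requests)
    return None if per_request is None else round(per_request, 2)
```

The zero-request guard matters for low-traffic slices (a new endpoint, an org with no usage on the target day), where a plain division would abort the whole aggregation.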

-### Alerts
+### Completion Criteria

-- **DO** set thresholds based on historical baseline, not absolute values
-- **DO** alert engineering team for cost spikes > 30% vs baseline
-- **DO** include cost breakdown and root cause in alerts
-- **DO** route API cost alerts to API team, Studio alerts to Studio team
+✅ Query executed for 3 days ago with partition filtering
+✅ Statistical analysis applied (2σ threshold)
+✅ Alerts classified by severity (NOTICE/WARNING/CRITICAL)
+✅ Root cause identified (endpoint/model/org/process)
+✅ Results presented with cost breakdown and z-scores
