Commit db9816d

feat: add production GPU cost thresholds and 3-day lookback

Major updates:
- Add production thresholds from 60-day statistical analysis
- Tier-based alerting: Tier 1 (High), Tier 2 (Medium), Tier 3 (Low)
- Change timing: analyze data from 3 days ago (cost data needs time to finalize)
- Restructure to 6-part Agent Skills format
- Add per-vertical thresholds (API vs Studio)

Tier 1 alerts:
- Idle cost spike > $15,600/day
- Inference cost spike > $5,743/day
- Idle-to-inference ratio > 4:1

Benefits:
- Data-driven thresholds (not guesses)
- Prioritized alerts by tier
- Proper timing (3-day lookback)
- Clear structure (6-part spec)

1 parent fdd2727 commit db9816d

File tree

1 file changed: +139 −72 lines changed

agents/monitoring/be-cost/SKILL.md

Lines changed: 139 additions & 72 deletions
---
name: be-cost-monitoring
description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org."
tags: [monitoring, costs, gpu, infrastructure, alerts]
compatibility:
  - BigQuery access (ltx-dwh-prod-processed)

# Backend Cost Monitoring

## 1. Overview (Why?)

LTX GPU infrastructure spending is dominated by idle costs (~72% of total), with actual inference representing only ~27%. Cost patterns vary significantly by vertical (API vs Studio) and day of week, so generic alerting thresholds miss true anomalies while flagging normal variance.

This skill provides **autonomous cost monitoring** with data-driven thresholds derived from a 60-day statistical analysis. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns.

**Problem solved**: detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms.

**Critical timing**: analyze data from **3 days ago** (cost data needs time to finalize).

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown)
- [ ] Inference cost spikes (volume surge, costlier model, heavy customer)
- [ ] Idle-to-inference ratio degradation (utilization dropping)
- [ ] Failure rate spikes (wasted compute + service quality issue)
- [ ] Cost-per-request drift (model regression or resolution creep)
- [ ] Day-over-day cost jumps (early warning signals)
- [ ] Volume drops (potential outage)
- [ ] Overhead spikes (system anomalies)
- [ ] Alerts prioritized by tier (High/Medium/Low)
- [ ] Vertical-specific thresholds (API vs Studio)

## 3. Progress Tracker

* [ ] Read production thresholds and shared knowledge
* [ ] Select appropriate query template(s)
* [ ] Execute query for 3 days ago
* [ ] Analyze results by tier priority
* [ ] Identify root cause (endpoint, model, org, process)
* [ ] Present findings with cost breakdown
* [ ] Route alerts to appropriate team (API/Studio/Engineering)

## 4. Implementation Plan

### Phase 1: Read Production Thresholds

Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`:

#### Tier 1 — High Priority (daily monitoring)

| Alert | Threshold | Signal |
|-------|-----------|--------|
| **Idle cost spike** | > $15,600/day | Autoscaler over-provisioning or traffic drop without scaledown |
| **Inference cost spike** | > $5,743/day | Volume surge, costlier model, or heavy new customer |
| **Idle-to-inference ratio** | > 4:1 | GPU utilization degrading (baseline ~2.7:1) |

#### Tier 2 — Medium Priority (daily monitoring)

| Alert | Threshold | Signal |
|-------|-----------|--------|
| **Failure rate spike** | > 20.4% overall | Wasted compute + service quality issue |
| **Cost-per-request drift** | API > $0.70, Studio > $0.46 | Model regression or duration/resolution creep |
| **DoD cost jump** | API > 30%, Studio > 15% | Early warning before absolute thresholds breach |

#### Tier 3 — Low Priority (weekly review)

| Alert | Threshold | Signal |
|-------|-----------|--------|
| **Volume drop** | < 18,936 requests/day | Possible outage or upstream feed issue |
| **Overhead spike** | > $138/day | System overhead anomaly (review weekly) |

**Per-Vertical Thresholds:**

- **LTX API**: Total daily > $5,555, Failure rate > 5.7%, DoD change > 30%
- **LTX Studio**: Total daily > $11,928, Failure rate > 22.6%, DoD change > 15%

**Alert logic**: Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold
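To make the rule concrete, here is a minimal Python sketch of the alert logic. The avg+2σ/avg+3σ computation and the Tier 1 numbers come from the tables above; the function and constant names are illustrative, not part of the production pipeline.

```python
import statistics

# WARNING = avg + 2*sigma, CRITICAL = avg + 3*sigma over a trailing window of daily costs.
def alert_thresholds(daily_costs: list[float]) -> tuple[float, float]:
    avg = statistics.mean(daily_costs)
    sd = statistics.pstdev(daily_costs)
    return avg + 2 * sd, avg + 3 * sd

# Tier 1 absolute thresholds from the 60-day analysis (see table above).
def tier1_alerts(idle_cost: float, inference_cost: float) -> list[str]:
    alerts = []
    if idle_cost > 15_600:
        alerts.append("idle cost spike")
    if inference_cost > 5_743:
        alerts.append("inference cost spike")
    if inference_cost > 0 and idle_cost / inference_cost > 4.0:
        alerts.append("idle-to-inference ratio degraded")
    return alerts
```

For example, a day with $16,000 idle and $3,000 inference cost trips both the idle spike and the ratio check (16,000 / 3,000 ≈ 5.3:1).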

### Phase 2: Read Shared Knowledge

Before writing SQL, read:
- **`shared/product-context.md`** — LTX products, user types, business model
- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615)
- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13)
- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries
- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks
- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs)

**Data nuances**:
- Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `dt` (DATE) — always filter on it for performance
- **Target date**: `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` (cost data from 3 days ago)
- Cost categories: inference, idle, overhead, unused
- Total cost per request = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only)
- Infrastructure cost = `SUM(row_cost)` across all categories
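The two formulas above are easy to conflate. A toy sketch of the distinction, with column and category names matching the schema and row values invented for illustration:

```python
# Sketch: fully loaded cost per request vs. total infrastructure cost.
# Rows mimic gpu_request_attribution_and_cost; the numbers are made up.
rows = [
    {"cost_category": "inference", "row_cost": 1.00, "attributed_idle_cost": 2.50, "attributed_overhead_cost": 0.05},
    {"cost_category": "inference", "row_cost": 0.80, "attributed_idle_cost": 2.00, "attributed_overhead_cost": 0.04},
    {"cost_category": "idle",      "row_cost": 4.50, "attributed_idle_cost": 0.00, "attributed_overhead_cost": 0.00},
    {"cost_category": "overhead",  "row_cost": 0.09, "attributed_idle_cost": 0.00, "attributed_overhead_cost": 0.00},
]

# Fully loaded cost: sum all three columns, but ONLY over inference rows.
# Summing attributed_* across every category would double-count idle/overhead.
fully_loaded = sum(
    r["row_cost"] + r["attributed_idle_cost"] + r["attributed_overhead_cost"]
    for r in rows
    if r["cost_category"] == "inference"
)

# Infrastructure cost: SUM(row_cost) across ALL categories.
infra = sum(r["row_cost"] for r in rows)
```

When the idle/overhead attribution is complete, the two totals agree (here both come to 6.39), which is why the spec says to validate results against total infrastructure spend.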

### Phase 3: Select Query Template

**Preferred**: if the user didn't specify an analysis, run all 11 templates from `shared/gpu-cost-query-templates.md` for a comprehensive cost overview.

**If the user asked for a specific analysis**, select the matching template:

| User asks... | Use template |
|-------------|-------------|
| "Cost efficiency by model" | Cost per Request by Model |
| "Which orgs cost most?" | API Cost by Organization |

### Phase 4: Execute Query

Run the query using:

```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<SQL from the selected template>
"
```

Or use the BigQuery console with project `ltx-dwh-explore`.

**Critical**: Always filter on `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)`

### Phase 5: Analyze Results

**For cost trends**:
- Compare current period vs baseline (7-day avg, prior week, prior month)
- Calculate % change and flag significant shifts (>15-20%)
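A minimal sketch of that comparison; the 15% default is the lower end of the range above, and the function names are illustrative:

```python
# Sketch: flag a significant shift vs a baseline (e.g. the 7-day average).
def pct_change(current: float, baseline: float) -> float:
    return 100.0 * (current - baseline) / baseline

def significant_shift(current: float, baseline: float, cutoff_pct: float = 15.0) -> bool:
    return abs(pct_change(current, baseline)) > cutoff_pct
```

For example, $14,000 against a $12,000 baseline is a +16.7% shift and would be flagged.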

**For anomaly detection**:
- Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg)
- Investigate root cause: specific endpoint/model/org, error rate spike, billing type change
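The Z-score flag can be sketched as follows (a rolling window of recent daily values is assumed as input; `stdev` is the sample standard deviation):

```python
import statistics

# Sketch: flag a day whose cost/volume deviates > z_cutoff std devs
# from the rolling window.
def is_anomaly(value: float, window: list[float], z_cutoff: float = 2.0) -> bool:
    mean = statistics.mean(window)
    sd = statistics.stdev(window)
    if sd == 0:
        return value != mean  # flat window: any deviation is anomalous
    return abs(value - mean) / sd > z_cutoff
```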

**For breakdowns**:
- Identify top cost drivers (endpoint, model, org, process)
- Calculate cost per request to spot efficiency issues
- Check failure costs (wasted spend on errors)

### Phase 6: Present Findings

Format results with:
- **Summary**: Key finding (e.g., "GPU costs spiked 45% on the target date, 3 days ago")
- **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%")
- **Breakdown**: Top contributors by dimension (endpoint, model, org, process)
- **Recommendation**: Action to take (investigate org X, optimize model Y, alert team)
- **Priority tier**: Tier 1 (High), Tier 2 (Medium), or Tier 3 (Low)

### Phase 7: Set Up Alert (Optional)

For ongoing monitoring:
1. Save the SQL query
2. Set it up as a BigQuery scheduled query or Hex Thread
3. Configure notification thresholds by tier
4. Route alerts to the appropriate team:
   - API cost spikes → API team
   - Studio cost spikes → Studio team
   - Infrastructure issues → Engineering team
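The routing in step 4 can be table-driven; a hypothetical sketch (the team names come from the list above, but the category keys and the default-to-Engineering behavior are assumptions):

```python
# Sketch: route an alert to a team by alert category.
ROUTES = {
    "api": "API team",
    "studio": "Studio team",
    "infrastructure": "Engineering team",
}

def route_alert(category: str) -> str:
    # Unknown categories default to Engineering for triage (assumption).
    return ROUTES.get(category.lower(), "Engineering team")
```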

## 5. Context & References

### Production Thresholds

- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`** — Production thresholds for the LTX division (60-day analysis)

### Shared Knowledge

- **`shared/product-context.md`** — LTX products and business context
- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615)
- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13)
- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries
- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows, benchmarks, investigation playbooks
- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs)

### Data Source

Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `dt` (DATE)
- Division: `LTX` (LTX API + LTX Studio)
- Cost categories: inference (27%), idle (72%), overhead (0.3%)
- Key columns: `cost_category`, `row_cost`, `attributed_idle_cost`, `attributed_overhead_cost`, `result`, `endpoint`, `model_type`, `org_name`, `process_name`

### Query Templates

See `shared/gpu-cost-query-templates.md` for the 11 production-ready queries.

## 6. Constraints & Done

### DO NOT

- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize)
- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting
- **DO NOT** mix inference and non-inference rows in the same aggregation without filtering
- **DO NOT** use ad-hoc absolute thresholds — compare against the statistical baseline (avg+2σ / avg+3σ)
- **DO NOT** skip partition filtering — always filter on `dt` for performance

### DO

- **DO** always filter on the `dt` partition column for performance
- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date
- **DO** filter `cost_category = 'inference'` for request-level analysis
- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis
- **DO** include LT team requests only when analyzing total infrastructure spend or debugging
- **DO** use `ltx-dwh-explore` as the execution project
- **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero
- **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request
- **DO** use `SUM(row_cost)` across all rows for total infrastructure cost
- **DO** compare against a baseline (7-day avg, prior period) for trends
- **DO** round cost values to 2 decimal places for readability
- **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 std devs)
- **DO** investigate failure costs (wasted spend on errors)
- **DO** break down by endpoint/model for API, by process for Studio
- **DO** check cost-per-request trends to spot efficiency degradation
- **DO** validate results against total infrastructure spend
- **DO** set thresholds based on the historical baseline, not arbitrary absolute values
- **DO** alert the engineering team for cost spikes > 30% vs baseline
- **DO** include cost breakdown and root cause in alerts
- **DO** route API cost alerts to the API team, Studio alerts to the Studio team

### Completion Criteria

✅ All cost categories analyzed (idle, inference, overhead)
✅ Production thresholds applied by tier (High/Medium/Low)
✅ Alerts prioritized and routed to appropriate teams
✅ Root cause investigation completed (endpoint/model/org/process)
✅ Findings presented with cost breakdown and recommendations
✅ Query uses 3-day lookback (not yesterday)
✅ Partition filtering applied for performance
